By Tim McNamara
File extensions for data sharing sometimes lie about their contents.
Here is an algorithm to infer the actual delimiter of a CSV, TSV or any related format:
- Assume that alphanumeric characters (A-Z, a-z, 0-9) and the period/full stop (.) cannot be delimiters.
- Begin with input text a.
- Store a short sample of a as b, by copying n lines from the input text a.
- Rank every character that appears in b by frequency, creating a list of candidate delimiters c.
- For every candidate delimiter d in c, taken in rank order, split each line in b on d. If every line yields the same number of fields, d is the delimiter.
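The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the published implementation: the function name `infer_delimiter`, the default sample size `n=10`, and the choice to skip blank lines when sampling are all assumptions made for the example.

```python
from collections import Counter
import string

# Characters the algorithm assumes can never be delimiters.
NON_DELIMITERS = set(string.ascii_letters + string.digits + ".")


def infer_delimiter(text, n=10):
    """Return the inferred delimiter of `text`, or None if no candidate fits.

    Hypothetical helper: the name, the default n and the blank-line
    handling are assumptions of this sketch.
    """
    # b: a short sample made of the first n non-blank lines of a.
    sample = [line for line in text.splitlines() if line][:n]
    if not sample:
        return None

    # c: characters in the sample counted by frequency,
    # leaving out the ones assumed not to be delimiters.
    counts = Counter(
        ch for line in sample for ch in line if ch not in NON_DELIMITERS
    )

    # Try candidates from most to least frequent.
    for candidate, _ in counts.most_common():
        fields_per_line = {len(line.split(candidate)) for line in sample}
        # Every sampled line split into the same number of fields.
        if len(fields_per_line) == 1:
            return candidate
    return None
```

For example, `infer_delimiter("name;age\nAnn;31\n")` returns `";"`. The sketch splits lines naively, so a quoted field that contains the delimiter would be counted as an extra split; handling quoting is outside what the steps above describe.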
An implementation of this idea is available on Zenodo and within the Python Package Index.