By Tim McNamara

File extensions for data sharing sometimes lie about their contents.

Here is an algorithm to infer the actual delimiter of a CSV, TSV or any related format:

  • Assume that alphanumeric characters (A-Z, a-z, 0-9) and the period/full stop (.) cannot be delimiters.
  • Begin with input text a.
  • Store a short sample of a as b, by copying n lines from the input text a.
  • Rank every remaining character that appears in b by frequency, creating the list of candidate delimiters c.
  • For every candidate delimiter d in c, taken in rank order, split each line in b on d. If every line yields the same number of fields, d is the delimiter.
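
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the published implementation. The function name infer_delimiter, the default n=10, taking the first n lines as the sample, skipping empty lines, and returning None when no candidate fits are all assumptions made for the example.

```python
import string
from collections import Counter

# Characters assumed never to be delimiters: A-Z, a-z, 0-9 and the period.
NON_DELIMITERS = set(string.ascii_letters + string.digits + ".")


def infer_delimiter(text, n=10):
    """Return the inferred delimiter of `text`, or None if no candidate fits.

    Hypothetical helper: the name, the default n and the None fallback
    are illustrative choices, not part of the original description.
    """
    # b: a short sample of the input (assumption: the first n non-empty lines).
    sample = [line for line in text.splitlines() if line][:n]
    if not sample:
        return None

    # c: every permitted character in the sample, counted for ranking.
    counts = Counter(
        ch for line in sample for ch in line if ch not in NON_DELIMITERS
    )

    # Try candidates from most to least frequent.
    for candidate, _ in counts.most_common():
        # Number of fields each sampled line splits into on this candidate.
        field_counts = {len(line.split(candidate)) for line in sample}
        if len(field_counts) == 1:
            return candidate
    return None
```

For example, infer_delimiter("a|b|c\n1|2|3\n") returns "|". The sketch follows the steps literally, so it inherits their limits: in a file where every field is quoted, the quote character can outrank the real delimiter and still split every line evenly.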

An implementation of this idea is available on Zenodo and within the Python Package Index.

Metadata

Zenodo.1402269

Published: 29 Jul, 2018

License: CC BY