By Tim McNamara

File extensions for data sharing sometimes lie about their contents.

Here is an algorithm to infer the actual delimiter of a CSV, TSV or any related format:

  • Assume that alphanumeric characters (A-Z, a-z, 0-9) and the period/full stop (.) cannot be delimiters.
  • Begin with input text a.
  • Store a short sample of a as b, by copying n lines from the input text a.
  • Rank every character that appears in b by frequency, creating candidate delimiters c.
  • For every candidate delimiter d in c, in rank order, split each line in b on d. If every line yields the same number of fields, d is the delimiter.
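
The steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the published implementation: the function name infer_delimiter, the default sample size n=10, and the choice to skip blank lines when sampling are all assumptions made for this example.

```python
from collections import Counter
import string

# Characters assumed never to be delimiters (first step of the algorithm).
NON_DELIMITERS = set(string.ascii_letters + string.digits + ".")


def infer_delimiter(text, n=10):
    """Return the inferred delimiter, or None if no candidate fits.

    `text` is the input a; `n` is the number of lines to sample
    (the default of 10 is an arbitrary choice for this sketch).
    """
    # b: a short sample of a. Skipping blank lines is an assumption,
    # made so that a trailing newline does not break the comparison.
    sample = [line for line in text.splitlines() if line][:n]
    if not sample:
        return None

    # c: every permitted character in b, counted so it can be ranked.
    counts = Counter(
        ch for line in sample for ch in line if ch not in NON_DELIMITERS
    )

    # Try candidates from most to least frequent.
    for candidate, _ in counts.most_common():
        fields_per_line = {len(line.split(candidate)) for line in sample}
        # d is accepted when every sampled line splits into the same
        # number of fields.
        if len(fields_per_line) == 1:
            return candidate
    return None
```

For example, infer_delimiter("a\tb\n1\t2\n") returns "\t". Note that the sketch inherits the limits of the algorithm: it returns None when no candidate splits every sampled line evenly, and it does not handle quoted fields that contain the delimiter.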

An implementation of this idea is available on Zenodo and within the Python Package Index.

Comments

A period can be a delimiter.

In German, a comma rather than a period is used as the decimal separator.
I've already had a CSV where this F*ed up my document.
e.g.: 3,1415 or 3,142

Dennis M. · 18 Oct, 2020

The input text can contain numbers, e.g.: "3-Way intersections in the U.S".
They might usually have no space before a letter appears, but I'm not sure that holds in all cases.

The algorithm at the end is quite good.

Dennis M. · 18 Oct, 2020
Authors

Tim McNamara

Metadata

Zenodo.1402269

Published: 29 Jul, 2018

CC BY