Detecting the delimiter in CSVs that lie
The file extension on a shared data file sometimes lies about its contents.
Here is an algorithm to infer the actual delimiter of a CSV, TSV or any related format:
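The algorithm itself is not reproduced in this listing. As a rough sketch of one common heuristic, and not necessarily the author's method, the delimiter can be guessed by counting each candidate character per line and preferring the one whose count is most consistent across lines. The candidate set, the function name `guess_delimiter`, and the scoring rule are all assumptions made for illustration; the sketch also ignores quoting, so a delimiter inside a quoted field will skew the counts.

```python
from collections import Counter

# Assumed candidate set; extend it for other formats.
CANDIDATES = [",", "\t", ";", "|"]


def guess_delimiter(text, candidates=CANDIDATES, max_lines=50):
    """Return the candidate with the most consistent per-line count, or None."""
    # Look only at the first few non-blank lines.
    lines = [ln for ln in text.splitlines() if ln.strip()][:max_lines]
    if not lines:
        return None

    best, best_score = None, (0.0, 0)
    for delim in candidates:
        counts = [ln.count(delim) for ln in lines]
        # Most frequent per-line count and how many lines share it.
        mode_count, freq = Counter(counts).most_common(1)[0]
        if mode_count == 0:
            continue  # the candidate is absent from most lines
        # Prefer high consistency first, then more columns.
        score = (freq / len(lines), mode_count)
        if score > best_score:
            best, best_score = delim, score
    return best
```

For example, a file named `data.csv` whose lines are `a;b;c`, `1,5;2;3` and `2,5;4;6` yields `;`, because the semicolon appears exactly twice on every line while the comma does not.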
Extending run-length encoding to support arbitrary byte streams with hexadecimal notation
Conventionally, run-length encoding is restricted to a fixed alphabet. Because the encoded form is often transmitted over human-readable media, this alphabet is usually restricted to printable characters. That makes run-length encoding unsuitable...
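The teaser is cut off before the encoding is described, so the following is only a minimal sketch of what the title suggests, under an assumed format: each run is written as two hexadecimal digits for the run length (1 to 255) followed by two hexadecimal digits for the byte value. The function names `rle_hex_encode` and `rle_hex_decode` and the pair layout are hypothetical, not taken from the source.

```python
def rle_hex_encode(data: bytes) -> str:
    """Encode bytes as (count, value) pairs, each written as two hex digits."""
    out = []
    i = 0
    while i < len(data):
        run = 1
        # Extend the run while the byte repeats, capped at 255 per pair.
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out.append(f"{run:02x}{data[i]:02x}")
        i += run
    return "".join(out)


def rle_hex_decode(text: str) -> bytes:
    """Invert rle_hex_encode; raises ValueError on malformed input."""
    raw = bytes.fromhex(text)
    if len(raw) % 2:
        raise ValueError("expected an even number of hex-encoded bytes")
    # raw[k] is the run length, raw[k + 1] is the byte value.
    return b"".join(bytes([raw[k + 1]]) * raw[k] for k in range(0, len(raw), 2))
```

Under this assumed layout the output contains only the characters `0-9` and `a-f`, so any byte value, printable or not, survives a human-readable channel.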
Calculating Shannon entropy on octet streams can be confusing, especially for those who lack a background in mathematics. It's also not necessary in many cases. A more intuitive calculation, which is also easier to implement, involves calculating...
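For reference, the standard Shannon entropy of an octet stream, the calculation the teaser proposes to replace, can be sketched as follows. The alternative calculation is elided in this listing and is not reconstructed here; the function name `shannon_entropy` is chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    # Sum -p * log2(p) over the relative frequency p of each byte value.
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(data).values()
    )
```

A stream of one repeated byte scores 0 bits per byte, while a stream in which all 256 byte values are equally frequent scores the maximum of 8.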
This is the Journal of Brief Ideas - citable ideas in fewer than 200 words.