Detecting the delimiter in CSVs that lie

File extensions for data sharing sometimes lie about their contents.

Here is an algorithm to infer the actual delimiter of a CSV, TSV or any related format:

  • Assume that alpha-numeric characters (A-Z, a-z, 0-9) and the period/full stop (.)... continue reading

Extending run-length encoding to support arbitrary byte streams with hexadecimal notation

Conventionally, run-length encoding is restricted to a fixed alphabet. Because the encoded form is often transmitted over human-readable media, this alphabet is usually restricted to printed characters. That makes run-length encoding unsuitable... continue reading

Average difference entropy

Calculating Shannon entropy on octet streams can be confusing, especially with those who lack a background in mathematics. It's also not necessary in many cases. A more intuitive calculation, which is also easier to implement, involves calculating... continue reading

Welcome

This is the Journal of Brief Ideas - citable ideas in fewer than 200 words.

Before you can create a new idea, you'll need to log in using the link above. You also can't vote on existing ideas without signing in too.

Voting

Click on the icon to vote on an idea. You can't vote on your own ideas.