Extending run-length encoding to support arbitrary byte streams with hexadecimal notation
By Tim McNamara
Conventionally, run-length encoding is restricted to a fixed alphabet. Because the encoded form is often transmitted over human-readable media, this alphabet is usually restricted to printed characters. That makes run-length encoding unsuitable for transmitting arbitrary byte strings.
To address this, run-length encoding can be extended by converting the numerical value of each byte to its hexadecimal form. Hexadecimal numbers are often prefixed with 0x. We can replacing 0 with the actual count, in decimal, to allow for parsing to be fast and unambiguous.
Consider the input string Hello. The following two lines illustrate the different representations in conventional run-length encoding and the extended variant proposed here:
1H 1e 2l 1o1x48 1x65 2x6c 1x6fAn example implementation in Python:
def run_length_encode(text, delimiter=''):
count = 1
hex_digits = [hex(ord(x)) for x in text]
encoded_fragments = []
for (i, current) in enumerate(hex_digits):
try:
next = hex_digits[i+1]
except IndexError:
next = None
if current != next:
fragment = str(count) + current[1:]
encoded_fragments.append(fragment)
count = 0
count += 1
return delimiter.join(encoded_fragments)