Extending run-length encoding to support arbitrary byte streams with hexadecimal notation
By Tim McNamara
Conventionally, run-length encoding is restricted to a fixed alphabet. Because the encoded form is often transmitted over human-readable media, this alphabet is usually restricted to printed characters. That makes run-length encoding unsuitable for transmitting arbitrary byte strings.
To address this, run-length encoding can be extended by converting the numerical value of each byte to its hexadecimal form. Hexadecimal numbers are often prefixed with 0x
. We can replacing 0
with the actual count, in decimal, to allow for parsing to be fast and unambiguous.
Consider the input string Hello
. The following two lines illustrate the different representations in conventional run-length encoding and the extended variant proposed here:
1H 1e 2l 1o
1x48 1x65 2x6c 1x6f
An example implementation in Python:
def run_length_encode(text, delimiter=''):
count = 1
hex_digits = [hex(ord(x)) for x in text]
encoded_fragments = []
for (i, current) in enumerate(hex_digits):
try:
next = hex_digits[i+1]
except IndexError:
next = None
if current != next:
fragment = str(count) + current[1:]
encoded_fragments.append(fragment)
count = 0
count += 1
return delimiter.join(encoded_fragments)