LZW Compression
All right, as promised last week, today's topic is LZW compression. As we talked about in our networking class, LZW gets about as close as you can get to an optimal compression rate on large inputs. Its single biggest advantage is that, unlike the traditional Huffman encoding discussed in class, very little information needs to be transmitted from the compressor to the decompressor. Moreover, that information (the set of valid characters) is known ahead of time, before any data has even been generated.
Getting into the mechanics: like Huffman encoding, LZW is a variable-length encoding, meaning that each transmitted symbol can stand for a different number of input characters. In my implementation I'm just going to represent the compressed data as a list of integers for simplicity, but it's easy to imagine a smaller symbol type for large compressions.
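(As a quick illustration of that aside, here's one way the integer symbols could be packed into a fixed-width binary form. This is just a sketch of my own, assuming the dictionary never grows past 65535 entries; pack_codes and unpack_codes are hypothetical helpers, not part of the implementation below.)
import struct
'''
Hypothetical helpers (not from the original post): pack a list of
integer codes into fixed-width 16-bit symbols and back again.
Assumes no code ever exceeds 65535.
'''
def pack_codes(codes):
    return struct.pack(">%dH" % len(codes), *codes)
def unpack_codes(packed):
    return list(struct.unpack(">%dH" % (len(packed) // 2), packed))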
So let's get started on how LZW works. Let's call the party compressing and sending the data 'S', and the receiver decompressing the message 'R'. LZW relies on the fact that both S and R keep a dictionary of strings they have already seen in the transmission. As S builds its dictionary and sends further symbols to R, R not only adds the decoded symbols to its output, but also builds up a "mirrored dictionary" identical to the one S has already built. In this way the sender and receiver stay in sync, yet never need to send dictionary updates to one another, because those updates are inherent to the algorithm!
I've shown the code to build the initial dictionaries below; it's pretty straightforward. It takes in a string of valid characters and returns both the sender and receiver dictionaries built from that initial set. At any given time you only need one of the two, so the other can just be assigned to a dummy variable. Initially the dictionaries assign an index to each single character; later, whole strings of characters will receive an index of their own, and that is where the compression comes in.
'''
Creates the initial compression and decompression
dictionaries, and returns both.
'''
def dict_init(characters):
    compress = dict()
    decompress = dict()
    index = 0
    for char in characters:
        compress[char] = index
        decompress[index] = char
        index += 1
    return compress, decompress
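Just to make the return values concrete, here's a tiny usage example (mine, not part of the original write-up):
# For a two-character alphabet, each character gets an index,
# and the decompression dictionary is simply the inverse mapping.
compress_dict, decompress_dict = dict_init("ab")
assert compress_dict == {"a": 0, "b": 1}
assert decompress_dict == {0: "a", 1: "b"}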
Once we have our dictionaries, it's time to compress the message. This step is pretty simple: we move along the string of characters until we reach a character that extends our current string to something not in the dictionary. At that point we output the index of the longest string that was in the dictionary, and create a new dictionary entry for that string plus the new character. Thus we build our dictionary and our compressed message at the same time.
'''
Takes in a string of characters along with
a string of valid characters and returns their
compressed output as a list of integers.
'''
def compress(string, valid_characters):
    # Initialize our bookkeeping
    record, empty = dict_init(valid_characters)
    best = ""
    index = len(valid_characters)
    compressed_output = []
    for char in string:
        # Look for the longest string in record
        if best + char not in record:
            # Output the best so far and make new index
            compressed_output.append(record[best])
            record[best+char] = index
            index += 1
            best = char
        # Otherwise, keep looking for longer string
        else:
            best += char
    compressed_output.append(record[best])
    return compressed_output
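To see the dictionary-building in action, here's a small trace I worked out by hand (a hypothetical example, not from the post):
# Compressing six characters over the alphabet "ab" yields four symbols.
codes = compress("ababab", "ab")
assert codes == [0, 1, 2, 2]   # decodes as 'a', 'b', 'ab', 'ab'
# Along the way the compressor added 'ab' -> 2, 'ba' -> 3 and 'aba' -> 4.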
Alright, now comes the tricky part: decompression. The catch is that at each symbol, the receiver is essentially "one step behind" the sender, because while the receiver can guess at the dictionary entry the sender just added, it's impossible to know for certain until the next symbol is read.
For instance, say the first thing the sender sends is a 4, where 'T' maps to 4 in our dictionary. Because 'T' was already in the dictionary (it is a valid character), the sender kept compressing until it came to an 'h'. At that point the index 4 is sent, and 'Th' is given a new index in the dictionary. Yet when the receiver decodes that first symbol as 'T', it has no idea which character completed the entry the sender just added. The receiver therefore has to keep a 'guess' about the trailing character of the entry to be added to its own dictionary. You can see in the following code that on each iteration this guess is resolved using the first character of the newly decoded string, the character which ended the previous string and has since been added to the dictionary.
'''
Takes in a list of integers along with
a string of valid characters and returns the
decompressed output as a string of characters.
'''
def decompress(compressed, valid_characters):
    # Initialize our bookkeeping
    empty, record = dict_init(valid_characters)
    prev = record[compressed[0]]
    index = len(valid_characters)
    decompressed_output = prev
    record[index] = prev # Initial guess is first character
    for i in compressed[1:]:
        record[index] += record[i][0] # Update guess
        index += 1
        record[index] = record[i] # Insert new record
        decompressed_output += record[index]
    return decompressed_output
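A quick way to convince yourself that the mirrored dictionaries really do stay in sync is a round trip; this little test is my own, not from the post:
message = "tobeornottobetobeornot"
alphabet = "tobern"   # every character of the message is a valid character
codes = compress(message, alphabet)
# The decompressor rebuilds the exact original from 15 symbols
# instead of 22 characters.
assert decompress(codes, alphabet) == message
assert len(codes) < len(message)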
And there you have it! So how much compression did all that work get us? Counting each sent index as one 'symbol' in the output, we can see how much smaller each message has become. As the message grows, it becomes much more likely that strings repeat, and thus we get a better compression rate.
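(For reference, the rates reported below are just the number of output symbols divided by the number of input characters. Here's a minimal sketch of how that could be computed; the helper name is my own, not part of the original code.)
'''
Hypothetical helper: measures the compression rate of a text,
counting each output integer as one symbol.
'''
def compression_rate(text, valid_characters):
    compressed = compress(text, valid_characters)
    return float(len(compressed)) / len(text)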
For Shakespeare's Julius Caesar: original size 112472, compressed size 28725, compression rate 0.255396898784.
For Joyce's Ulysses: original size 1539989, compressed size 305737, compression rate 0.198531937566.
And for Tolstoy's War and Peace: original size 3223402, compressed size 513291, compression rate 0.159238903494.
Clearly the bigger they are, the harder they... compress. Generally speaking anyway. If you'd like the full source along with some of my tests, you can download them from here, specially zipped with none other than LZW!