All right, as promised last week today's topic is LZW compression. As we talked about in our networking class, LZW compression is about as close as you can get to an optimal compression rate for large compression objects. The single biggest advantage of LZW is that unlike the traditional Huffman Encoding that was discussed in class, very little information needs to be transmitted from the compressor to the decompressor. Moreover, this information is known ahead of time, before any data has even been generated.

Getting into the mechanics, like Huffman encoding, LZW is a variable length-encoding, meaning that transmitted symbols can be of different sizes. In my implementation I'm just going to represent the compressed data as a list of integers for simplicity, but it's easy to imagine a smaller symbol type for large compressions.

So let's get started on how LZW works: let's call the person compressing and then sending the compressed data, 'S', and the receiver to decompress the message 'R'. LZW relies on the fact that both S and R keep a dictionary of strings that they have already seen in the transmission. As S builds its dictionary and sends further symbols to R, R will not only add the received symbols to its output, but will also build up a "mirrored dictionary" that is identical to the one S has already built. In this way both sender and receiver are kept in sync yet do not need to propagate these changes in the dictionary to one another because they are inherent to the algorithm!

I've shown the code to build the initial dictionaries, it's pretty straightforward. It takes in a string of valid characters and returns both the sender and receiver dictionaries from the initial set. At any given time, you only need one dictionary so the other can just be assigned to a dummy variable. The dictionaries initially assign an index to each character - though later strings of characters will receive an index and that is where the compression comes in.

    Creates the initial compression and decompression
    dictionaries, and returns both.
def dict_init(characters):
    compress = dict()
    decompress = dict()
    index = 0
    for char in characters:
        compress[char] = index
        decompress[index] = char
        index += 1
    return compress, decompress

Once we have our dictionaries, it's time to compress the message. This step is pretty simple, we simply move along the list of characters until we arrive at some string of characters that are not in our dictionary. We then send the longest string of characters that were found in the dictionary, and create a new entry in our dictionary for that string plus this new character, outputting the index of the old string. Thus we build our dictionary and our compressed message at the same time.

    Takes in a string of characters along with
    a string of valid characters and returns their
    compressed output as a list of integers.
def compress(string, valid_characters):
    # Initialize our bookkeeping
    record, empty = dict_init(valid_characters)
    best = ""
    index = len(valid_characters)
    compressed_output = []
    for char in string:
        # Look for the longest string in record
        if best + char not in record:
            # Output the best so far and make new index
            record[best+char] = index
            index += 1
            best = char
        # Otherwise, keep looking for longer string
            best += char

    return compressed_output

Alright, now comes the tricky part, decompression. The deal with this is that at each character, the receiver is essentially "one step behind" the sender, because while the receiver can make a guess as to what the sender sent, it's impossible to know for certain until the next symbol is read.

For instance, say the first thing the sender sends is a '4' where 'T:4' in our dictionary. Because 'T' had already been added to the dictionary (as it is a valid symbol), the sender continued in compression but then it came to an 'h'. In this case, the index 4 is sent, and 'Th' is given a new index in the dictionary. Yet when the receiver gets that the first symbol is 'T', it has no idea what the next symbol that has been added to the dictionary is. Therefore the receiver has to keep some 'guesswork' as to what the trailing character of the item to be added to the dictionary is. You can see in the following code that on each iteration this guess is updated with the first character of the new symbol, the one which caused the last symbol to terminate and has since been added to the dictionary.

    Takes in a list of integers along with
    a string of valid characters and returns the
    decompressed output as a string of characters.
def decompress(compressed, valid_characters):
    # Initialize our bookkeeping
    empty, record = dict_init(valid_characters)
    prev = record[compressed[0]]
    index = len(valid_characters)
    decompressed_output = prev

    record[index] = prev # Initial guess is first character

    for i in compressed[1:]:
        record[index] += record[i][0] # Update guess
        index += 1
        record[index] = record[i] # Insert new record
        decompressed_output += record[index]
    return decompressed_output

And there you have it! So how much compression did all that work get us? Assuming that each sent index counts for a 'symbol' in our message, we can see how compressed each message had become. As we increase the size of the message, it is much more likely that symbols will be repeated and thus we get a better compression rate.

For Shakespeare's Julius Caesar:
Original size: 112472 Compressed Size: 28725
Compression rate: 0.255396898784

For Joyce's Ulysses:
Original size: 1539989 Compressed Size: 305737
Compression rate: 0.198531937566

And for Tolstoy's War and Peace:
Original size: 3223402 Compressed Size: 513291
Compression rate: 0.159238903494

Clearly the bigger they are, the harder they... compress. Generally speaking anyway. If you'd like the full source along with some of my tests, you can download them from here, specially zipped with none other than LZW!