From www.cs.sfu.ca/.../LZW.html:
LZW compression has its roots in the work of Jacob Ziv and Abraham Lempel. In 1977, they published a paper on "sliding-window" compression, and followed it with another paper in 1978 on "dictionary" based compression. These algorithms were named LZ77 and LZ78, respectively. Then in 1984, Terry Welch made a modification to LZ78 which became very popular and was dubbed LZW (guess why). The LZW algorithm is what we are going to talk about here.
The Concept:
Many files, especially text files, have certain strings that repeat very often, for example "the" in english text. With the spaces, the string takes 5 bytes, or 40 bits to encode. But what if we were to add the whole string to the list of characters after the last one, at 256. Then every time we came across "the", we could send the code 256 instead of 32,116,104,101,32. This would take 9 bits instead of 40 (since 256 does not fit into 8 bits).
This is exactly the approach that LZW compression takes. It starts with a "dictionary" of all the single character with indexes 0..255. It then starts to expand the dictionary as information gets sent through. Pretty soon, redundant strings will be coded as a single bit, and compression has occured.
The Algorithm:
Ok, so how is this done? Here is the basic algorithm:
STRING = get input character WHILE there are still input characters DO CHARACTER = get input character IF STRING+CHARACTER is in the string table then STRING = STRING+character ELSE output the code for STRING add STRING+CHARACTER to the string table STRING = CHARACTER END of IF END of WHILE output the code for STRING
So what happens here? The program reads one character at a time. If the code is in the dictionary, then it adds the character to the current work string, and waits for the next one. This occurs on the first character as well. If the work string is not in the dictionary, (such as when the second character comes across), it adds the work string to the dictionary and outputs the work string without the new character. It then sets the work string to the new character.
How about decompression?
Note that when a new entry is added to the dictionary, then its root is given by the code that is being sent and the suffix character is the first chacter for the code which is send next.
So in order to decompress we have to build up the dictionary by adding a new entry every time a code is received. The entries root is just the translation of the code being received and its suffix character is the first character of the translation of the next code which gets received.
Read OLD_CODE output OLD_CODE WHILE there are still input characters DO Read NEW_CODE STRING = get translation of NEW_CODE output STRING CHARACTER = first character in STRING add OLD_CODE + CHARACTER to the translation table OLD_CODE = NEW_CODE END of WHILE
There is one problem, namely if the next code being sent is identically to the one just being added to the dictionary. So this happends if "root" was already found, "root"+'suffix' appears first and 'suffix' + what follows equals "root"+'suffix'. So the input looks like:
In this case the decoder doesn't know yet the translation of this code but since the its first character has to be equal to the first character of the current code, in can use this one instead.
CHARACTER = read first code output CHARACTER OLD_CODE = CHARACTER WHILE there are still input characters DO NEW_CODE = read next code IF NEW_CODE is not in the translation table THEN STRING = get translation of OLD_CODE STRING = STRING + CHARACTER ELSE STRING = get translation of NEW_CODE END of IF output STRING CHARACTER = first character in STRING add OLD_CODE + CHARACTER to the translation table OLD_CODE = NEW_CODE END of WHILE
The nice thing is that the decompressor builds its own dictionary on its side, that matches exactly the compressor's, so that only the codes need to be sent.
For an explanation concerning the gif variant see:
www.dcs.ed.ac.uk/.../GIF-comp.txt
Andreas Kriegl 2003-07-23