3.3.2 Huffman
Cf.
Basic Huffman encoding goes as follows:
- Count the number of times each symbol appears.
- Now built a binary tree by
assigning each symbol to a node and
repeating connecting nodes in the following way until all nodes are connected:
- Find two nodes with lowest counts.
- Create a parent node for them with the sum of their counts as count and
connect the two children to the parent node.
- Now every symbol is encoded by starting at the root of the tree and appending 0 or 1
depending whether you have to go right or left to reach the node of the symbol.
- Note that since all the symbols
are at the leafs (the ends of the branches) of the tree, there is never a chance that one
code will be the prefix of another one. This property ensures that decoding
can be done by collecting bits from the coded string until
the corresponding sequence reaches a leaf and output the associated symbol.
There are a few shortcomings to the basic Huffman compression. First of all,
one has to send the Huffman tree at the beginning of the compressed file, or
the decompressor will not be able to decode it.
Furthermore, Huffman compression looks at the statistics of the whole file, so that if
a part of the code uses some character more frequently, no adjustment
is made. Even worse, sometimes (like for live information)
the whole input stream is not available when compression should start.
Because of these reasons adaptive Huffman encoding has been invented, but we will not treat
this topic.
Andreas Kriegl 2003-07-23