Huffman encoding uses some interesting insights about data to achieve compression. It’s reasonably simple, and it raises some good questions about data representation and optimization.
simplehuffman.c (300 lines of C) is a program I wrote in an afternoon to do Huffman encoding and decoding of files. It will work on any type of file and can compress and decompress a 15MB text file in under 2 seconds on current hardware.
The basics
Huffman encoding is a variable-length encoding scheme for data. By using shorter encodings for more frequent values and longer encodings for less frequent ones, it can reduce the amount of space needed to represent data, especially if the values making up the data are not uniformly distributed. For example, most computers use 8 bits to represent characters, making 256 possible values, but the vast majority of typed text consists of lowercase letters and spaces. By using shorter encodings for those 27 values, even at the expense of longer encodings for the other values, text can be effectively compressed.
To determine the encodings to use in a Huffman scheme, you must first find the distribution of values in the data you’re going to encode. Once the distribution is known, a binary tree is built with each value becoming a leaf node. To build the tree, nodes are placed in a priority queue of trees (each node starts out as a one-node tree), sorted by frequency. The two lowest-frequency trees get combined as the children of a new node whose frequency is equal to the sum of the two children’s. This new tree is placed back into the queue, and the procedure is repeated until there is only one tree in the queue. The tree constructed by this process is called a Huffman tree.
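To make that concrete, here is a minimal sketch of the tree construction in C. The struct layout and the `new_node`/`build_tree` names are mine for illustration, not simplehuffman.c’s, and instead of a real priority queue this version simply scans a small array for the two lowest-frequency trees, which is perfectly adequate for at most 256 distinct values (error checking is omitted for brevity):

```c
#include <stdlib.h>

/* A Huffman tree node: leaves carry a byte value, internal nodes just a
   combined frequency. Names and layout are illustrative. */
typedef struct Node {
    unsigned long freq;
    int value;                 /* byte value for leaves, -1 for internal nodes */
    struct Node *left, *right;
} Node;

static Node *new_node(unsigned long freq, int value, Node *left, Node *right)
{
    Node *n = malloc(sizeof *n);
    n->freq = freq;
    n->value = value;
    n->left = left;
    n->right = right;
    return n;
}

/* freq[i] is how often byte value i occurs in the input (a first pass over
   the file with fgetc() can fill it in). Repeatedly merge the two
   lowest-frequency trees until only one remains; scanning an array of at
   most 256 entries stands in for a real priority queue here. */
Node *build_tree(const unsigned long freq[256])
{
    Node *queue[256];
    int n = 0;

    for (int i = 0; i < 256; i++)
        if (freq[i] > 0)
            queue[n++] = new_node(freq[i], i, NULL, NULL);

    while (n > 1) {
        /* find the indices of the two smallest frequencies */
        int a = 0, b = 1;
        if (queue[b]->freq < queue[a]->freq) { a = 1; b = 0; }
        for (int i = 2; i < n; i++) {
            if (queue[i]->freq < queue[a]->freq)      { b = a; a = i; }
            else if (queue[i]->freq < queue[b]->freq) { b = i; }
        }
        if (a > b) { int t = a; a = b; b = t; }

        /* merge them under a new internal node and put it back in the queue */
        queue[a] = new_node(queue[a]->freq + queue[b]->freq, -1,
                            queue[a], queue[b]);
        queue[b] = queue[--n];
    }
    return n ? queue[0] : NULL;
}
```

Scanning the whole array on every merge is quadratic, but with no more than 256 starting trees that is nowhere near worth optimizing.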
The encoding assigned to each value is described by the path to that value’s node in the tree. For example, if you want to encode an ‘a’ and the path to the ‘a’ node involves going left, then right, then left from the root node, the encoding (in binary) would be 010 (0 = left, 1 = right). Decoding is thus done by walking the tree. Encoding directly from the tree is less convenient, so a lookup table is built by walking the tree and saving the path followed to each leaf node in that table.
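Reusing the `Node` type from the sketch above, building such a table might look like the following. The `Code` struct and `build_table` name are again just illustrative, and the code bits are stored as ‘0’/‘1’ characters for readability rather than packed bits:

```c
#include <string.h>

/* One entry per byte value: the code as a string of '0'/'1' characters plus
   its length (packed bits would be more compact, but less readable). */
typedef struct {
    char bits[256];   /* with at most 256 leaves a code never exceeds 255 bits */
    int  length;
} Code;

/* Walk the tree depth-first, remembering the path taken so far; when a leaf
   is reached, that path is the leaf value's encoding. Call initially with
   depth == 0 and a scratch buffer of at least 256 chars for path. */
void build_table(const Node *node, Code table[256], char *path, int depth)
{
    if (node->left == NULL && node->right == NULL) {       /* leaf */
        memcpy(table[node->value].bits, path, depth);
        table[node->value].length = depth;
        return;
    }
    path[depth] = '0';                            /* 0 = go left */
    build_table(node->left, table, path, depth + 1);
    path[depth] = '1';                            /* 1 = go right */
    build_table(node->right, table, path, depth + 1);
}
```

Encoding a byte is then just a lookup of `table[value]` followed by writing out its bits one at a time.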
Implementing Huffman in C
In terms of the algorithm, Huffman is fairly simple. The most sophisticated part of the operation is building the Huffman tree and the value-encoding lookup table, but the code for both is actually rather short. The most difficult part in practice is the input/output. Reading and writing fixed-length data like ASCII characters is quite straightforward in C, but reading and writing variable-length bit strings (i.e. the Huffman encodings) is not. Indeed, there are no bit I/O functions in the standard C library.
Probably the easiest solution involves using a single 8-bit character as a bit buffer. When writing, bits are pushed in one at a time and packed, left to right. When the 8th bit arrives, the whole byte is written to disk, and the process continues. When reading, a full 8-bit character is read and bits are pulled off (also from the left) one at a time. When the last bit is pulled off, the next 8-bit quantity is read, and so on. All this requires some bookkeeping to know where we are in the buffer, but on the whole it’s not that tricky.
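Here is roughly what that looks like. This is a sketch of the idea rather than simplehuffman.c’s exact code, and it keeps the buffers in static variables for brevity:

```c
#include <stdio.h>

/* Write bits one at a time, packing them left to right into a single byte
   and flushing it to the output stream whenever it fills up. */
static unsigned char out_buf = 0;
static int out_count = 0;          /* bits currently held in out_buf */

void write_bit(FILE *out, int bit)
{
    out_buf = (unsigned char)((out_buf << 1) | (bit & 1));
    if (++out_count == 8) {
        fputc(out_buf, out);
        out_buf = 0;
        out_count = 0;
    }
}

/* Read bits one at a time, pulling them off a buffered byte from the left
   and refilling the buffer when it runs out. Returns 0, 1, or -1 on EOF. */
static unsigned char in_buf = 0;
static int in_count = 0;           /* bits left in in_buf */

int read_bit(FILE *in)
{
    if (in_count == 0) {
        int c = fgetc(in);
        if (c == EOF)
            return -1;
        in_buf = (unsigned char)c;
        in_count = 8;
    }
    in_count--;
    return (in_buf >> in_count) & 1;
}
```

One detail the sketch glosses over: when writing finishes, any bits still sitting in the buffer need to be padded out and flushed as one last byte.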
There are some other wrinkles. For example, since the Huffman encoding depends on the Huffman tree used, we need to be able to reconstruct that tree in order to decode. Here it suffices to include a header in each encoded file that lists all the values and frequencies used to make the encoding. Similarly, when decoding we need a way to know when we’ve reached the last decodable value; otherwise (unless the encoding just happens to end on a byte boundary) we will try to read what is essentially garbage. For this I chose to create a pseudo-EOF value when building the Huffman tree, and then to write its encoding as the final bit string of each file. Reading an encoded file stops when a bit string decodes to the pseudo-EOF.
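The decoding loop then walks the tree one bit at a time and stops as soon as it lands on the pseudo-EOF leaf. A sketch, reusing the earlier `Node` and `read_bit`, with `PSEUDO_EOF` standing in for whatever sentinel value simplehuffman actually uses (any value outside the 0–255 byte range works):

```c
#include <stdio.h>

#define PSEUDO_EOF 256   /* assumed sentinel value outside the byte range 0-255 */

/* Walk the tree bit by bit: 0 goes left, 1 goes right, and reaching a leaf
   emits that leaf's value. Decoding stops when the pseudo-EOF leaf is hit,
   so any padding bits in the final byte are never interpreted. */
void decode(FILE *in, FILE *out, const Node *root)
{
    const Node *node = root;
    int bit;

    while ((bit = read_bit(in)) != -1) {
        node = bit ? node->right : node->left;
        if (node->left == NULL && node->right == NULL) {   /* leaf */
            if (node->value == PSEUDO_EOF)
                return;                  /* the last real value is already out */
            fputc(node->value, out);
            node = root;                 /* start over for the next code */
        }
    }
}
```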
simplehuffman follows these ideas (well, mostly!).