A friend of mine develops a Bible application for the iPhone, BibleXpress. Since the app bundles several translations, he once mentioned to me the possibility of compressing them to save space. Any compression must achieve a good ratio, but more importantly, decompression must be fast. I took it upon myself to find a compression algorithm that could fit the bill.
In my test case, I worked with the NASB translation of the Bible. The raw text of this translation, minus formatting and book/chapter/verse identifiers, is 3.965MB. Since the iPhone already ships with zlib, gzip compression is an obvious choice; it brings the file down to 1.189MB, a significant savings. Even though bzip2 is not readily available on the iPhone (at least not that I could find), I tested its compression as well, which produced a file of 0.8548MB. While these mechanisms yield a significantly smaller file, both are stream formats: to read a given portion of the file, one must first decompress everything up to that point. That is an expensive operation on a small device such as the iPhone.
One compromise is to compress each book individually. This yields a file size of 1.242MB. However, some books, such as Psalms, are still quite large, requiring a long decompression operation to read some of their chapters. Since the application displays one chapter at a time, a logical compression block is a single chapter, which yields a file size of 1.628MB. While this is a large increase in file size, a single chapter can be decompressed without touching any other part of the file.
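The per-chapter scheme is straightforward to sketch with zlib, the same library the iPhone ships with: compress each chapter on its own and keep a small index of offsets, so any one chapter can be inflated without touching the rest. Here is a minimal Python sketch (the app itself would use Objective-C and the C zlib API; the function names and in-memory layout here are my own illustration, not BibleXpress's actual format):

```python
import zlib

def build_archive(chapters):
    """Compress each chapter independently and record an (offset, length)
    pair for each one, so a chapter can later be inflated alone."""
    index, blob = [], b""
    for text in chapters:
        data = zlib.compress(text.encode("utf-8"))
        index.append((len(blob), len(data)))
        blob += data
    return index, blob

def read_chapter(index, blob, n):
    """Inflate chapter n by itself, using its stored offset and length."""
    offset, length = index[n]
    return zlib.decompress(blob[offset:offset + length]).decode("utf-8")
```

The trade-off described above falls out of this structure: each chapter's compressor starts from scratch, so no shared context is exploited across chapters, and the total size grows from 1.189MB to 1.628MB.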
I was not happy with these compression ratios and was determined to find a better way. I looked into Huffman encoding, which uses a variable-length string of bits to represent each symbol. Its compression ratio does not change with the order of symbols, which typically means it won’t provide as high a ratio as a context-aware scheme. However, given the bit offset at which a symbol’s code begins, you may start decoding there without examining any of the file that precedes it. In addition, decompression only requires walking a binary tree, which makes it quite fast as well. So, if Huffman encoding can be made to provide a good compression ratio, it will work well for this problem.
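The tree walk that makes Huffman decoding fast can be sketched in a few lines of Python: build the tree by repeatedly merging the two least-frequent nodes, then decode by following one branch per bit and emitting a symbol each time a leaf is reached. This is a generic illustration of the algorithm, not the app's actual decoder:

```python
import heapq

def build_tree(freqs):
    """Standard Huffman construction: repeatedly merge the two
    least-frequent nodes. A leaf is a 1-tuple (symbol,); an internal
    node is a 2-tuple (left, right). The counter breaks frequency ties."""
    heap = [(f, i, (sym,)) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (a, b)))
        count += 1
    return heap[0][2]

def codes(tree, prefix=()):
    """Derive the bit string for each symbol by walking to every leaf."""
    if len(tree) == 1:                     # leaf: (symbol,)
        return {tree[0]: prefix}
    table = {}
    for bit, child in enumerate(tree):     # 0 = left, 1 = right
        table.update(codes(child, prefix + (bit,)))
    return table

def decode(tree, bits):
    """Walk the tree one bit at a time; emit a symbol at each leaf
    and restart at the root. No preceding data is ever needed."""
    out, node = [], tree
    for bit in bits:
        node = node[bit]
        if len(node) == 1:
            out.append(node[0])
            node = tree
    return out
```

Note that `decode` never looks at anything except the bits it is handed, which is exactly the property that makes random access possible.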
When using Huffman encoding, the question becomes how to decompose the text into symbols. One typical decomposition is to make each character a symbol. While this is an easy representation, it doesn’t provide a good compression ratio. Instead, I chose to make each word a symbol. After adding in punctuation, the tree contained 16,246 words. This resulted in a dictionary of 0.1364MB. When compressed, the entire Bible was represented by a Huffman stream of 0.9947MB, for a total of 1.131MB including the dictionary.
After this experiment, I concluded that Huffman encoding of words is an excellent fit for a large quantity of text that needs random access. It yielded a compression ratio better than gzip’s (though not as good as bzip2’s), while still allowing individual parts of the file to be decompressed without reading anything that precedes them. Under this scheme, a single chapter can be decompressed very quickly, and the compression ratio remains high.