SCS Undergraduate Thesis Topics

2003-2004
Student: Warren Hunt
Advisor(s): Danny Sleator
Thesis Topic: A Fast Counting Data Compression Algorithm

I explore a new type of data compression based on storing frequency counts. Traditional compression algorithms use either pattern-building or statistical methods. Statistical methods have proven to be the most effective (although also the slowest), and much is known about their theoretical performance. Little can be said about the theoretical performance of pattern-building programs, although in practice they work well and are widely used. Data compression is used heavily in audio, video, and anywhere that datasets are large and/or bandwidth and storage media are limited. Improvements in compression algorithms and theory help make the most of resources that are already in place.

This project focuses on compressing data by storing and using byte frequency counts. The frequency counts give you the ability to use statistical methods for free, because knowing the number of times each byte occurs in the file gives you the probabilities of occurrence. Moreover, you know those probabilities exactly at each stage of the compression: after writing a byte, you decrement its counter and re-compute the probabilities for the remaining data. In short, by providing more information than a straight statistical method, greater compression can be achieved.
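The following is a minimal sketch, not the thesis's actual implementation, of the counting idea described above under the assumption that the exact byte counts of the file are stored up front and an ideal entropy coder consumes the resulting probabilities. The function names are hypothetical.

```python
# Sketch of an exact-count model: the byte counts of the whole file are
# known in advance; after each byte is coded, its count is decremented,
# so the model's probabilities stay exact for the data that remains.

import math
from collections import Counter


def exact_count_probabilities(data: bytes):
    """Yield, for each byte in order, its probability among the bytes left."""
    counts = Counter(data)            # the stored frequency table (at most 256 entries)
    remaining = len(data)
    for b in data:
        yield counts[b] / remaining   # exact probability given the remaining counts
        counts[b] -= 1                # "write a byte, decrement the counter"
        remaining -= 1


def ideal_code_length_bits(data: bytes) -> float:
    """Bits an ideal entropy coder would need when driven by this model."""
    return sum(-math.log2(p) for p in exact_count_probabilities(data))


if __name__ == "__main__":
    sample = b"abracadabra"
    print(f"{ideal_code_length_bits(sample):.1f} bits for {len(sample)} bytes")
```

Because the counts describe the file exactly rather than estimating a source distribution, the per-symbol probabilities sharpen as the file is consumed, which is where the extra compression over a static statistical model comes from.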

A first-order method proved to be about 10% more effective in practice than Huffman encoding, and higher orders only improve the effectiveness. Theoretically, it should also outperform both statistical and pattern-matching algorithms in all cases. This method of compression has provided some insight into how files are constructed statistically, and has led to some interesting file-analysis and type-guessing tools. Another interesting result is that it reduces general file compression to the compression of many simple collections of 256 numbers that obey many nice properties.

