
xPress – An Experiment In Data Compression

A long time ago I became interested in learning how to compress data in an efficient and recoverable manner. Of course, the problem of data compression was solved long ago, by David A. Huffman of MIT way back in 1952 and more recently by PKZip’s Phil Katz, although the things they learned back then haven’t really stood the test of time.

Everyone, even casual computer users, knows what data compression is. They know that they can select a bunch of files, zip them up, and the resulting “archive” will be smaller and more portable than the files they started with. Beyond that, though, data compression is largely a mystery to most.

I’m someone who likes to “roll my own” stuff. I’d rather write a library for an application than rely on one made by Google or Facebook. When I take on a project, I like to start as close to the supply chain as possible. I usually try to start with a couple of very low-level dependencies and build up from there. It helps me gain perspective and insight into why things are done the way they are. Sure, I’m duplicating some effort, but there’s no comprehension to be gained from copy/pasting 30-year-old code.

Even before I research a topic I’ll spend some time thinking about it on my own, with no preconceived notions of how it’s typically done. I’ll try to dream up my own solution to a well-understood problem without reaching for any off-the-shelf solutions. This gives me the ability to understand why existing solutions are the way they are. Knowledge transcends many generations, but rationale is less commonly documented and often lost over time. By quickly re-inventing a well-understood wheel I can see exactly why the wheel is round, rather than just knowing that most wheels are round. Sometimes the only way to know why the wheel is round is to make a square wheel and see how it rolls.

Without knowing much about existing compression algorithms, I’m free to write my own and develop knowledge in my own voice, rather than trying to transcribe existing knowledge in someone else’s voice that I have no context for.

So that’s what I did. I settled on Python for the language because it’s fairly platform agnostic, it’s popular, and it’s versatile. Plus Python is fun! Then I took some random data and started looking for patterns. What I found is that patterns in data are extremely prevalent, and with some heuristic analysis almost any file can be reduced by identifying the redundant parts of the file and representing those parts with a smaller piece of data. So any archive I create is going to need some area where it can store the original data and a small unique key identifying where that original data belongs in the compressed file. This “dictionary” of indices and corresponding values will need to be embedded in the archive so the original data can be recovered and rebuilt later. There also has to be an area to store the data itself, in either compressed or uncompressed form. Already our archive is taking shape!
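To make that idea concrete, here’s a toy sketch (not the actual xPress code): a redundant piece of data gets swapped out for a small placeholder key, and the key-to-data mapping goes into a dictionary so the swap can be undone later.

```python
# Toy illustration of the dictionary idea (not the actual xPress code):
# a redundant byte sequence is swapped for a small placeholder key, and the
# key -> original-data mapping is kept so the swap can be undone later.

data = b"the cat sat on the cat food, near the cat tree"

pattern = b"the cat"   # a redundant piece of the original data
key = b"\x01\x30"      # a small placeholder key (format invented for this example)

compressed = data.replace(pattern, key)
dictionary = {key: pattern}   # index -> original data, stored inside the archive

# Extraction is just the reverse substitution.
restored = compressed
for k, v in dictionary.items():
    restored = restored.replace(k, v)

assert restored == data
print(len(data), "->", len(compressed), "bytes")
```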

We now have a data area and a dictionary area. We need to separate these two areas somehow so they can be distinguished during extraction. We can do this by creating some kind of prefix and suffix combination that goes before and after the dictionary. This will allow us to separate the compressed+uncompressed data from the dictionary later. We want the prefix/suffix combo to be distinctive enough that it never gets mixed up with real data and slips past the extractor to pollute the rebuilt output. We also want the prefix/suffix to be small in terms of bytes. After all, we are trying to shrink files, not grow them!
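Writing and splitting an archive along those markers could look something like this. The marker bytes, the JSON encoding of the dictionary, and the function names are all invented for this sketch; they aren’t necessarily what xPress actually uses.

```python
import json

# Hypothetical markers: small, and unlikely to show up in ordinary data.
# The real bytes xPress uses (if any) may be different.
DICT_PREFIX = b"<@xDICT@>"
DICT_SUFFIX = b"</@xDICT@>"

def write_archive(path, body, dictionary):
    """Write the (compressed + uncompressed) data area, then the dictionary
    bracketed by the prefix/suffix so the extractor can split them apart."""
    payload = json.dumps({k.hex(): v.hex() for k, v in dictionary.items()}).encode()
    with open(path, "wb") as f:
        f.write(body)
        f.write(DICT_PREFIX + payload + DICT_SUFFIX)

def read_archive(path):
    """Split an archive back into its data area and its dictionary."""
    with open(path, "rb") as f:
        blob = f.read()
    body, _, rest = blob.partition(DICT_PREFIX)
    payload, _, _ = rest.partition(DICT_SUFFIX)
    entries = json.loads(payload.decode())
    return body, {bytes.fromhex(k): bytes.fromhex(v) for k, v in entries.items()}
```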

OK! So our archive file now has an area for compressed and uncompressed data, a separation mechanism, and a dictionary made up of key/value pairs. The “key” is a small identifier and the value is the original redundant data. But how do we identify the patterns in the original data and build a dictionary out of them? We know that patterns exist inside almost every file, but we need a way to pull them out so we can put them in our dictionary.
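One simple way to do that, and not necessarily the way xPress scans, is to slide a fixed-length window across the data and count how often each window repeats. Anything that repeats often enough becomes a dictionary candidate.

```python
from collections import Counter

def find_candidates(chunk: bytes, entry_length: int, min_count: int = 3):
    """Slide a window of `entry_length` bytes across the chunk, count how
    often each window repeats, and keep the ones worth a dictionary entry."""
    counts = Counter(chunk[i:i + entry_length]
                     for i in range(len(chunk) - entry_length + 1))
    return [(seq, n) for seq, n in counts.most_common() if n >= min_count]

# The 8-byte window b"the cat " repeats three times in this sample chunk.
sample = b"the cat sat by the cat food near the cat tree"
print(find_candidates(sample, entry_length=8, min_count=2)[:3])
```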

We obviously want to put the contents of our input file into memory so we can perform this work on it. But files these days can easily be larger than the memory our system has! How are we going to handle large files if we can’t fit them in memory?

Simple! We chop them into pieces! We just need to decide how many pieces, and how big each piece should be. We can figure out the right size for our pieces by checking how much memory is available and how big our files are. We’ll need to store one copy of the original data, plus a copy of the compressed data and our dictionary. So our pieces should be a little smaller than 50% of available memory, since we need to fit roughly two copies of a file chunk at the same time (plus the dictionary and the rest of our Python code).
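Here’s one way that sizing rule could look in Python. Reading available memory with the psutil package is just one option I picked for the sketch, and the 40% budget is a conservative stand-in for “a little under half”.

```python
import os
import psutil  # used here to read available memory; an assumption, not necessarily what xPress does

def pick_chunk_size(path: str, fraction: float = 0.4) -> int:
    """Choose a chunk size a little under 50% of available memory, since we
    need roughly two copies of a chunk in memory at once (original plus
    compressed) as well as the dictionary. 0.4 leaves some headroom."""
    budget = int(psutil.virtual_memory().available * fraction)
    size = os.path.getsize(path)
    return size if size <= budget else budget
```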

When a file is small enough, we process it all at once. When a file is too large, we chop it into chunks and process one chunk at a time. This way we don’t run out of memory on large files! Now we can select some data from our chunk and see if it is redundant.
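Reading the file chunk by chunk is straightforward. A small generator like this (again, just a sketch) hands back one piece at a time, and a small file simply comes back as a single chunk.

```python
def iter_chunks(path: str, chunk_size: int):
    """Yield the whole file as one chunk if it fits, otherwise yield it one
    chunk_size-sized piece at a time so memory use stays bounded."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk
```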

First we need to look at how big the file is and decide how large our dictionary entries should be. Files with a lot of redundant data will benefit from larger dictionary entries, while files with less redundant data will benefit more from smaller dictionary entries. So we will decide how big the dictionary entries should be based on file size.

We also decide when to stop compressing, or, to put it another way, we decide what is worth compressing. We may find redundant data that is so small we would actually increase its size by representing it with a dictionary index. We need a way to measure our compression performance to know if we’re actually making worthwhile progress or just wasting CPU cycles. We can do this by setting a threshold for compression depending on the file size. Larger files will benefit from a lower threshold, while smaller files should have a higher one. Basically, we look at the size of the data we’re compressing, and if it doesn’t shrink by [threshold]% then we leave that data in an uncompressed form. For files with large swaths of redundant data, like a text file, we don’t want to waste our time shrinking 4 copies of the word “cat” when we could get better compression by shrinking 3 copies of the phrase “cat food” instead. If we were happy simply compressing “cat” we would probably also wind up creating a separate dictionary entry for “food” anyway, and using 2 dictionary entries to do the work of 1 takes more CPU time and more storage space.
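As a rough sketch, those heuristics might look like the functions below. The specific sizes and percentages are made-up placeholders, not the values xPress actually uses.

```python
def predict_entry_length(file_size: int) -> int:
    """Guess a starting dictionary entry length from the file size; bigger
    files tend to reward longer entries. (Cutoffs are made up for the sketch.)"""
    if file_size < 64 * 1024:          # under 64 KiB
        return 4
    if file_size < 16 * 1024 * 1024:   # under 16 MiB
        return 8
    return 16

def predict_threshold(file_size: int) -> float:
    """Minimum worthwhile shrinkage, as a fraction of the original size.
    Larger files get a lower bar, smaller files a higher one."""
    return 0.05 if file_size > 16 * 1024 * 1024 else 0.20

def worth_compressing(original: bytes, compressed: bytes, threshold: float) -> bool:
    """Accept the substitution only if it shrinks the data by at least
    `threshold`; otherwise the data stays uncompressed."""
    return len(compressed) <= len(original) * (1.0 - threshold)
```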

So if our compression isn’t yielding results, we don’t compress anything. But if we left it at that, our compression algorithm wouldn’t be much of a compression algorithm, now would it? Instead of giving up or creating a wasteful number of dictionary entries, we can adjust the dictionary entry length and try again. Of course there has to be an upper limit where we decide we’re done trying, or else our program would never end! Currently xPress will try adjusting the dictionary entry length 9 times before giving up. It will try increasing and then decreasing the length of dictionary entries until it finds one that yields compression results better than the threshold we set earlier.
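Here’s a self-contained sketch of that retry loop. The substitution pass is deliberately simplistic (it only replaces the single most common substring), and the increase-then-decrease schedule is my guess at what the adjustment order could look like, so treat it as an illustration of the idea rather than xPress’s actual algorithm.

```python
from collections import Counter

MAX_ATTEMPTS = 9  # xPress adjusts the entry length up to nine times before giving up

def one_pass(chunk: bytes, entry_length: int, key: bytes):
    """A deliberately simple substitution pass: replace the single most common
    substring of `entry_length` bytes with `key`. Purely illustrative."""
    if entry_length < 2 or entry_length > len(chunk) or key in chunk:
        return chunk, {}
    counts = Counter(chunk[i:i + entry_length]
                     for i in range(len(chunk) - entry_length + 1))
    pattern, hits = counts.most_common(1)[0]
    if hits < 2:
        return chunk, {}
    return chunk.replace(pattern, key), {key: pattern}

def compress_with_retries(chunk: bytes, start_length: int, threshold: float,
                          key: bytes = b"\x01\x30"):
    """Try the predicted entry length first, then longer and shorter lengths,
    until a pass shrinks the chunk by at least `threshold` or the attempts
    run out, in which case the chunk is left uncompressed."""
    lengths = [start_length]
    for step in range(1, MAX_ATTEMPTS):
        delta = (step + 1) // 2
        lengths.append(start_length + delta if step % 2 else start_length - delta)
    for length in lengths:
        candidate, entries = one_pass(chunk, length, key)
        if len(candidate) <= len(chunk) * (1.0 - threshold):
            return candidate, entries
    return chunk, {}   # nothing beat the threshold; store this chunk as-is
```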

The logic flow of our program currently looks something like this…..

For LARGE files…

LARGE_FILE.txt
-> File is larger than 50% of memory
-> Chunk file into pieces
-> Predict dictionary entry length
-> Scan chunk for redundant data in [dictionary length] increments
-> Redundant data found
-> Redundant data does not meet threshold when compressed
-> Write uncompressed data
-> Adjust dictionary entry length
-> Scan chunk for redundant data
-> Redundant data found
-> Redundant data DOES meet threshold when compressed
-> Write compressed data
-> Update dictionary
-> Scan chunk for redundant data in [dictionary length] increments
-> Redundant data NOT found
-> Adjust dictionary entry length
-> END OF CHUNK
-> START NEW CHUNK
-> Scan chunk for redundant data…
…REPEAT UNTIL END OF FILE OR COMPRESSION RESULTS > THRESHOLD
-> Append dictionary to output file.

For SMALL files…


SMALL_FILE.txt
-> File is smaller than 50% of memory
-> Predict dictionary entry length
-> Scan file for redundant data in [dictionary length] increments
-> Redundant data found
-> Redundant data does not meet threshold when compressed
-> Write uncompressed data
-> Adjust dictionary entry length
-> Scan file for redundant data
-> Redundant data found
-> Redundant data DOES meet threshold when compressed
-> Write compressed data
-> Update dictionary
-> Scan file for redundant data in [dictionary length] increments
-> Redundant data NOT found
-> Adjust dictionary entry length
-> Scan file for redundant data…
…REPEAT UNTIL END OF FILE OR COMPRESSION RESULTS > THRESHOLD
-> Append dictionary to output file.

This process repeats until the file is complete or xPress is unable to yield compression results that satisfy the threshold after 9 dictionary adjustments.
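Tying the earlier sketches together, a top-level driver might look roughly like this. It leans on the hypothetical helpers from the previous snippets (pick_chunk_size, iter_chunks, predict_entry_length, predict_threshold, compress_with_retries, and the dictionary markers), so it’s a picture of the flow rather than drop-in code.

```python
import json
import os

def compress_file(src: str, dst: str):
    """Rough top-level flow, assuming the hypothetical helpers sketched above
    (pick_chunk_size, iter_chunks, predict_entry_length, predict_threshold,
    compress_with_retries, DICT_PREFIX/DICT_SUFFIX)."""
    file_size = os.path.getsize(src)
    chunk_size = pick_chunk_size(src)               # a little under 50% of free memory
    entry_length = predict_entry_length(file_size)  # starting dictionary entry length
    threshold = predict_threshold(file_size)        # minimum worthwhile shrinkage

    dictionary = {}
    with open(dst, "wb") as out:
        for i, chunk in enumerate(iter_chunks(src, chunk_size)):
            # Give each chunk its own placeholder key; a real implementation
            # would need a proper scheme for generating unique keys.
            key = b"\x01" + bytes([0x30 + (i % 200)])
            body, entries = compress_with_retries(chunk, entry_length, threshold, key=key)
            out.write(body)             # compressed or uncompressed data area
            dictionary.update(entries)

        # The dictionary goes at the very end, bracketed by the markers, so
        # the extractor can split it off later.
        payload = json.dumps({k.hex(): v.hex() for k, v in dictionary.items()}).encode()
        out.write(DICT_PREFIX + payload + DICT_SUFFIX)
```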

Extracting data from an xPress archive is a lot easier. We simply separate the dictionary from the end of the archive and iterate through it, replacing every instance of a dictionary index with the corresponding original data from the dictionary. When there are no more dictionary matches, the archive is done extracting.
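A minimal extractor along those lines, reusing the same made-up markers and dictionary encoding as the earlier sketches, could look like this:

```python
import json

DICT_PREFIX = b"<@xDICT@>"    # the same made-up markers as in the earlier sketch
DICT_SUFFIX = b"</@xDICT@>"

def extract(archive_path: str, output_path: str):
    """Split the dictionary off the end of the archive, then keep replacing
    dictionary keys with their original data until no more keys are found."""
    with open(archive_path, "rb") as f:
        blob = f.read()
    body, _, rest = blob.partition(DICT_PREFIX)
    payload, _, _ = rest.partition(DICT_SUFFIX)
    dictionary = {bytes.fromhex(k): bytes.fromhex(v)
                  for k, v in json.loads(payload.decode()).items()}

    changed = True
    while changed:                      # stop once no dictionary key matches
        changed = False
        for key, original in dictionary.items():
            if key in body:
                body = body.replace(key, original)
                changed = True

    with open(output_path, "wb") as f:
        f.write(body)
```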

This is how xPress currently works. As a result there are no headers or offsets embedded in the archive like there are with other compression utilities. That means xPress only supports individual files rather than folders. I have several ideas for implementing directory trees within xPress archives, and I’m excited to see the effect those changes will have on overall compression quality. In time this method for encoding data could possibly be used for other applications as well, like representing large datasets for machine learning. I’m still quite a long way from testing anything of that nature with xPress, but every project has to start somewhere.

And that’s the story of how I made my own compression algorithm without knowing anything about compression. Am I doing it right? Am I way off base? How would you do it better? Head on over to the official GitHub repo and let me know, or leave your comments below!