All posts by Justin Grimes

(@zelon88) - The founder, lead developer, and DPO for HonestRepair.

xPress – An Experiment In Data Compression

A long time ago I became interested in learning how to compress data in an efficient and recoverable manner. Of course, the problem of data compression has long been solved, by David A. Huffman of MIT way back in 1952 and more recently by PKZip’s Phil Katz, although the things they learned back then haven’t really stood the test of time.

Everyone, even casual computer users, knows what data compression is. They know that they can select a bunch of files, zip them up, and the resulting “archive” will be smaller and more portable than the files they started with. Beyond that, though, data compression is largely a mystery to most.

I’m someone who likes to “roll my own” stuff. I’d rather write a library for an application than rely on one made by Google or Facebook. When I take on a project, I like to start as close to the supply chain as possible. I usually try to start with a couple of very low-level dependencies and build up from there. It helps me gain perspective and insight into why things are done the way they are. Sure, I’m duplicating some effort, but there’s no comprehension to be gained from copy/pasting 30-year-old code.

Even before I research a topic I’ll spend some time thinking about it on my own, with no preconceived notions of how something is typically done. I’ll try to dream up my own solution to a well-understood problem without using any off-the-shelf solutions. This gives me the ability to understand why existing solutions are the way they are. Knowledge transcends many generations, but rationale is less commonly documented and often lost over time. By quickly reinventing a well-understood wheel I can see exactly why the wheel is round, rather than just knowing that most wheels are round. Sometimes the only way to know why the wheel is round is to make a square wheel and see how it rolls.

Without knowing much about existing compression algorithms, I’m free to write my own and develop knowledge in my own voice, rather than trying to transcribe existing knowledge in someone else’s voice that I have no context for.

So that’s what I did. I settled on Python for the language because it’s fairly platform agnostic, it’s popular, and it’s versatile. Plus Python is fun! Then I took some random data and started looking for patterns. What I found is that patterns in data are extremely prevalent, and with some heuristic analysis almost any file can be reduced by extracting the redundant parts and representing them with a smaller piece of data. So any archive I create is going to need an area where it can store that original data along with a small unique key identifying where the data belongs in the compressed file. This “dictionary” of indices and corresponding values will need to be embedded in the archive so the original data can be recovered and rebuilt later. There also has to be an area to store the data itself, in either compressed or uncompressed form. Already our archive is taking shape!
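
To make that concrete, here’s a minimal sketch of the idea. This is not the actual xPress source; the fixed pattern length and the “@n@” index format are assumptions I’m using purely for illustration:

from collections import Counter

def find_redundant_patterns(data: bytes, length: int, min_count: int = 2):
    # Count every substring of `length` bytes and keep the ones that repeat.
    counts = Counter(data[i:i + length] for i in range(len(data) - length + 1))
    return {pattern: count for pattern, count in counts.items() if count >= min_count}

def build_dictionary(data: bytes, length: int):
    # Assign a short index to each repeated pattern, most frequent first.
    # The "@n@" index format is hypothetical; the real archive may use something else.
    patterns = find_redundant_patterns(data, length)
    ranked = sorted(patterns, key=patterns.get, reverse=True)
    return {b"@%d@" % i: pattern for i, pattern in enumerate(ranked)}

sample = b"cat food, cat food, cat food, dog food"
print(build_dictionary(sample, 8))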

We now have a data area and a dictionary area. We need to separate these two areas somehow so they can be distinguished during extraction. We can do this by creating some kind of prefix and suffix combination that goes before and after the dictionary. This will allow us to separate the compressed and uncompressed data from the dictionary later. We want the prefix/suffix combo to be distinctive enough that it never gets mixed up with data, slips past the extractor, and pollutes the rebuilt data. We also want the prefix and suffix to be small in terms of bytes. After all, we are trying to shrink files, not grow them!
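
As a rough illustration of that layout (the markers and the key=value; dictionary encoding below are hypothetical; this doesn’t reproduce the actual byte sequences xPress uses):

# Hypothetical markers; the real xPress delimiters are different.
DICT_PREFIX = b"<<DICT-START>>"
DICT_SUFFIX = b"<<DICT-END>>"

def pack_archive(body: bytes, dictionary: dict) -> bytes:
    # Data area first, then the dictionary wrapped in its prefix/suffix.
    # No escaping is done here, so this only works when the markers and the
    # "=" / ";" separators never appear in the data. Illustrative only.
    entries = b"".join(key + b"=" + value + b";" for key, value in dictionary.items())
    return body + DICT_PREFIX + entries + DICT_SUFFIX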

OK! So our archive file now has an area for compressed and uncompressed data, a separation mechanism, and a dictionary made up of key/value pairs. The “key” is a small identifier and the “value” is the original redundant data. But how do we identify the patterns in the original data and build a dictionary out of them? We know that patterns exist inside almost every file, but we need a way to pull them out so we can put them in our dictionary.

We obviously want to put the contents of our input file into memory so we can perform this work on it. But files these days can easily be much larger than our system has memory! How are we going to handle large files if we can’t fit them in memory?

Simple! We chop them into pieces! We just need to decide how many pieces, and how big our pieces should be. We can figure out the right size for our pieces by checking how much memory is available and how big our files are. We’ll need to store one copy of the original data, plus a copy of the compressed data and our dictionary. So our pieces should be a little smaller than 50% of available memory, since we need to fit roughly two copies of the file chunk at the same time (plus the dictionary and the rest of our Python code).

When a file is small enough, we process it all at once. When a file is too large, we chop it into chunks and process one chunk at a time. This way we don’t run out of memory on large files! Now we can select some data from our chunk and see if it is redundant.
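
One way to pick the chunk size is sketched below. It assumes the psutil package for reading available memory; xPress may measure this differently, and the 40% budget is just a placeholder for “a little under 50%”:

import os
import psutil  # assumed here just to read available memory

def choose_chunk_size(path: str, memory_fraction: float = 0.4) -> int:
    # Stay a little under 50% of available memory to leave room for the
    # compressed copy, the dictionary, and the interpreter itself.
    budget = int(psutil.virtual_memory().available * memory_fraction)
    return min(os.path.getsize(path), budget)

def read_chunks(path: str):
    # Yield the file one chunk at a time so large files never fill memory.
    chunk_size = choose_chunk_size(path)
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            yield chunk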

First we need to look at how big the file is and decide how large our dictionary entries should be. Files with a lot of redundant data will benefit from larger dictionary entries, while files with less redundant data will benefit more from smaller ones. So we decide how big the dictionary entries should be based on file size. We also decide when to stop compressing, or to put it another way, we decide what is worth compressing. We may find redundant data that is so small we would actually increase its size by representing it with a dictionary index.

We need a way to measure our compression performance to know if we’re actually making worthwhile progress or just wasting CPU cycles. We can do this by setting a compression threshold that depends on the file size. Larger files benefit from a lower threshold, while smaller files should have a higher threshold. Basically, we look at the size of the data we’re compressing, and if it doesn’t shrink by [threshold]% we leave that data in uncompressed form. For files with large swaths of redundant data, like a text file, we don’t want to waste our time shrinking 4 copies of the word “cat” when we could get better compression by shrinking 3 copies of the phrase “cat food” instead. If we were happy simply compressing “cat” we would probably also wind up creating a separate dictionary entry for “food” anyway. Using 2 dictionary entries to do the work of 1 takes more CPU time and more storage space.
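
In code, the “is this worth compressing?” test might look something like this. The actual thresholds xPress uses aren’t documented here, so the numbers below are placeholders:

def compression_threshold(file_size: int) -> float:
    # Placeholder values: big files accept smaller relative gains,
    # small files demand a larger payoff before we bother.
    return 0.05 if file_size > 100 * 1024 * 1024 else 0.20

def worth_compressing(original: bytes, compressed: bytes, threshold: float) -> bool:
    # Only keep the compressed form if it shrinks the data by at least `threshold`.
    saved = len(original) - len(compressed)
    return saved > 0 and saved / len(original) >= threshold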

So if our compression isn’t yielding results we don’t compress anything. But if we left it at that, our compression algorithm wouldn’t be much of a compression algorithm, now would it? Instead of giving up or making a wasteful number of dictionary entries, we can adjust the dictionary entry length and try again. Of course there has to be an upper limit where we decide we’re done trying, or else our program would never end! Currently xPress will try adjusting the dictionary entry length 9 times before giving up. It will try increasing and then decreasing the length of dictionary entries until it finds one that yields compression results better than the threshold we set earlier.
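
Putting those pieces together, the adjust-and-retry loop could be sketched like this. It’s only an approximation: compress_with_length is a hypothetical stand-in for the real scan-and-replace step, and the exact order of the length adjustments is a guess.

MAX_ADJUSTMENTS = 9  # give up after nine dictionary-length adjustments

def compress_chunk(chunk: bytes, predicted_length: int, threshold: float):
    # One attempt at the predicted length, then up to MAX_ADJUSTMENTS more,
    # alternating above and below the prediction: +1, -1, +2, -2, ...
    offsets = ([0] + [sign * step
                      for step in range(1, MAX_ADJUSTMENTS + 1)
                      for sign in (1, -1)])[:MAX_ADJUSTMENTS + 1]
    for offset in offsets:
        length = max(2, predicted_length + offset)
        compressed, dictionary = compress_with_length(chunk, length)  # hypothetical helper
        if len(compressed) <= len(chunk) * (1 - threshold):
            return compressed, dictionary
    return chunk, {}  # nothing met the threshold; keep the chunk uncompressed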

The logic flow of our program currently looks something like this:

For LARGE files…

LARGE_FILE.txt
-> File is larger than 50% of memory
-> Chunk file into pieces
-> Predict dictionary entry length
-> Scan chunk for redundant data in [dictionary length] increments
-> Redundant data found
-> Redundant data does not meet threshold when compressed
-> Write uncompressed data
-> Adjust dictionary entry length
-> Scan chunk for redundant data
-> Redundant data found
-> Redundant data DOES meet threshold when compressed
-> Write compressed data
-> Update dictionary
-> Scan chunk for redundant data in [dictionary length] increments
-> Redundant data NOT found
-> Adjust dictionary entry length
-> END OF CHUNK
-> START NEW CHUNK
-> Scan chunk for redundant data...
...REPEAT UNTIL END OF FILE OR COMPRESSION RESULTS > THRESHOLD
-> Append dictionary to output file.

For SMALL files…


SMALL_FILE.txt
-> File is smaller than 50% of memory
-> Predict dictionary entry length
-> Scan file for redundant data in [dictionary length] increments
-> Redundant data found
-> Redundant data does not meet threshold when compressed
-> Write uncompressed data
-> Adjust dictionary entry length
-> Scan file for redundant data
-> Redundant data found
-> Redundant data DOES meet threshold when compressed
-> Write compressed data
-> Update dictionary
-> Scan file for redundant data in [dictionary length] increments
-> Redundant data NOT found
-> Adjust dictionary entry length
-> Scan file for redundant data...
...REPEAT UNTIL END OF FILE OR COMPRESSION RESULTS > THRESHOLD
-> Append dictionary to output file.

This process repeats until the file is complete or xPress is unable to yield compression results that satisfy the threshold after 9 dictionary adjustments.

Extracting data from an xPress archive is a lot easier. We simply separate the dictionary from the end of the archive and iterate through it, replacing every instance of a dictionary index with the corresponding original data. When there are no more dictionary matches, the archive is done extracting.
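
A sketch of that extraction pass, using the same hypothetical markers and key=value; encoding from the packing example above (not the real xPress format):

DICT_PREFIX = b"<<DICT-START>>"  # same hypothetical markers as before
DICT_SUFFIX = b"<<DICT-END>>"

def unpack_archive(archive: bytes) -> bytes:
    # Split the dictionary off the end of the archive, then substitute every
    # dictionary index back to the original data it represents.
    body, _, tail = archive.rpartition(DICT_PREFIX)
    if tail.endswith(DICT_SUFFIX):
        tail = tail[:-len(DICT_SUFFIX)]
    for entry in tail.split(b";"):
        if entry:
            key, _, value = entry.partition(b"=")
            body = body.replace(key, value)
    return body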

This is how xPress currently works. As a result there are no headers or offsets embedded in the archive like there are with other compression utilities. That means xPress only supports individual files rather than folders. I have several ideas for implementing directory trees within xPress archives, and I’m excited to see the effect those changes will have on overall compression quality. In time this method of encoding data could possibly be used for other applications as well, like representing large datasets for machine learning. I’m still quite a long way from testing anything of that nature with xPress, but every project has to start somewhere.

And that’s the story of how I made my own compression algorithm without knowing anything about compression. Am I doing it right? Am I way off base? How would you do it better? Head on over to the official Github repo and let me know, or leave your comments below!

HRCloud2 v3.0.7 – Improve Drive & Shared indexes

-v3.0.7. (1) (2) (3) 
-Drive and shared indexes now truncate long filenames.
-Truncation involves chopping the middle out of long names and replacing it with '... '.
-This allows enough room for favorite and shared icons as well as filename, and gives extra long names a breakpoint.
-Add title and alt text to all table row items.
-I need to figure out a way to purge Shared index files from a user's Cloud drive.

Affordable Cybersecurity Practices for Small Business

Today’s blog post is a guest post by Lindsey Weiss from Outbounding.com. Thanks Lindsey!

Data privacy has become a huge concern for business owners small and large in recent years. Even with a growing emphasis on data protection, the number of exposed records continues to rise. In fact, 2018 saw 446.5 million exposed records, an enormous jump from the approximately 197.6 million records exposed throughout 2017.

Enterprises are taking significant steps to protect their data, but small businesses have been slower to catch up — only 14 percent of small businesses are highly confident in their cybersecurity. Because breaches targeting large enterprises are the ones that generally receive the most coverage, small business owners make the faulty assumption that they’re less vulnerable to a cyber attack. However, that couldn’t be further from the truth: 43 percent of all cyberattacks are aimed at small businesses.

If you store customer data, including credit card data, email addresses, billing addresses, and phone numbers, your business needs to be concerned about cybersecurity. Even if you don’t store customer data, data security should be on your radar: If a malicious actor injects ransomware into your system, you could be charged a ransom just to resume operations.

Protecting yourself against data breaches doesn’t require an enormous financial investment. There are many cost-effective ways small businesses can guard their data.

Train Employees to Recognize Social Engineering

Employee training offers the best ROI when it comes to small business data protection. That’s because employee and contractor negligence is behind nearly half of all data breaches. If an employee unwittingly clicks on a malicious attachment or shares passwords or files with a cybercriminal posing as a colleague, the integrity of your business is compromised. Social engineering attacks are constantly evolving, so business owners and managers should stay abreast of the most frequently used techniques and train employees how to recognize attacks and avoid falling victim. A few minutes of research and a meeting with your staff could save thousands in data breach recovery costs.

Step Up Your Password Policy

Are your employees using weak passwords like their birthdates, or worse, “123456” or “password”? If you reflexively answered “no,” ask yourself how confident you really are that your staff is using passwords that can’t be cracked. A strong password policy doesn’t simply require a mixture of letters, numbers, and symbols. Rather, it obligates users to create complex passwords that expire on a predetermined schedule, don’t employ common words, and are never used for multiple accounts. If you don’t want to babysit your employees’ password practices, consider using a password manager.

Keep Firewalls and Antivirus Current

Firewall protection prevents malicious actors from entering your system, whereas antivirus and anti-malware software detects and removes threats. These security solutions make up the foundation of any network’s data protection, but too often business owners let them fall out of date. Firewall and antivirus software providers regularly release updates to block new types of malware, but if you don’t update your software, your systems aren’t protected.

Back Up Your Data, Then Back It Up Again

If your data is held ransom, will your business be forced to shut down? Data backups keep your business up and running when data is compromised due to a data breach, natural disaster, or another threat. A basic backup strategy for small businesses is a 3-2-1 backup. The 3-2-1 rule dictates that you keep three copies of your data (including the primary copy) and use two different mediums to store them, with one backup stored off-site. Many small businesses accomplish this by storing one backup on an on-site external hard drive and a second backup in the cloud. Both backups must be updated regularly to preserve data integrity.

These steps greatly reduce the risk to your small business’s data, but they don’t eliminate it. If you are the target of a data breach, make sure you take the appropriate steps to recover. Dealing with the fallout from a data breach isn’t pleasant, but addressing it is necessary for the continued success of your small business.

Image via Pexels

HRCloud2 v3.0.6 – Improve security, consistency of codebase

-v3.0.6. 
-Add a check to ensure a generated .AppData user-data package actually belongs to the user it's being delivered to.
-Removed redundant code from compatibilityCore.
-Add the backupUserDataNow and downloadUserData API calls to sanitizeCore (their values are never used, but sanitizeCore contains the official API reference so these should really be in there).
-Changed to an exec function from shell_exec.

Statement on 7z encryption bug & PHP-Pear Supply Chain attack

At HonestRepair we monitor and test our dependencies for vulnerabilities regularly. It helps ensure that our platform is capable of meeting your needs for privacy and security.

This past week we were made aware of a bug within one of our dependencies, and a possible backdoor in the supply chain of another dependency.

The first dependency affected was 7zipper, which was found to use a weak random number generation technique that affects the integrity of archives encrypted and password protected with 7z. Details about the discovery can be found in this blog post.

Since our products and services don’t utilize password protection features of 7z archives, this bug doesn’t affect our services or our software. Still, 7z is a dependency of our products so users should be aware of these vulnerabilities and update 7z as soon as a patch becomes available.

There were also backdoors found in the PHP-Pear package, described in this blog post. Users who installed PHP-Pear on a server in the last 6 months should download the latest update and scrutinize their servers for remote access trojans (RATs). We have done the same and found no evidence to suggest that our servers were compromised.

ISP Contract Renewal/Network Upgrade

The time has come for me to renew the contract for HonestRepair’s ISP. As a result, we get to take advantage of some infrastructure improvements on their end at no additional cost on ours.

On paper the new contract increases our incoming bandwidth by a factor of 6.33. Our outgoing bandwidth increased by a factor of 5.6. We have observed real-world increases by a factor of 4.5 – 5.5 so far.

I’ve also revisited the website’s performance and managed to reduce most page request sizes to under 310 KB, with 8 requests for the homepage and 9 for the average blog post. All requests for scripts, styles, and resources are hosted by us, and we average a 97-99% (A) PageSpeed score from Google and a 94% (A) rating from Yahoo’s YSlow on the eastern seaboard of the United States. We’re still tracker free and using in-house (Google-free) analytics.

Why Does IT Suck With Computers?

I’m pretty sure everyone has had an experience with tech support where they called into question, either verbally or mentally, the qualifications of the person on the other end of the line. It’s even more frustrating when that person is struggling through the same problems you were struggling through and seems to know little about the problem or how to even begin fixing it.

The truth is, your IT guy doesn’t already have all the answers in his head. He has to understand the problem and develop a solution. Technology is as powerful as it is today because it compounds the knowledge of thousands of programmers over decades of development. That knowledge comes together in our technology to create an ecosystem. In order to diagnose problems in this ecosystem, one must understand what makes it tick.

Your IT guy knows these ecosystems. He knows how the different components react with one another. He knows what components require other components. He knows what components conflict with each other. But no two ecosystems are alike, and it takes your IT guy time to build a representative model of that ecosystem in his mind. You don’t pay your IT guy for fixing your problems. You pay him for knowing how to fix your problems. If you were paying him for simple action without knowledge you wouldn’t need an IT guy at all. Anyone could do it!

So What’s Taking So Long?

Your IT guy is going to want to verify the problem for himself. This doesn’t mean he doesn’t believe your description of it. This means he needs to see for himself all the factors that go into the problem, and what about the result was undesirable. He will want to see the affected system. He will want as much context about what’s going on as possible.

He may want to set up his own ecosystem where he can tweak some of the variables and recreate the problem on his own. This will allow him to measure the problem and the effectiveness of his solutions. It will also help to identify anything about the problem that you might have missed. Maybe the problem you’re reporting is a side effect of a larger problem.

He’s also probably going to search online for other instances of the problem. As much as this can be seen by non-techies as a weakness, it is most certainly a strength. There are millions of people out there searching for solutions online. Why duplicate work when someone else could already have the answer? If you’re considering shaming IT for using an internet browser because you don’t consider it work, you’re costing yourself productivity. It’s like telling a mechanic to use a ratchet instead of a pneumatic wrench because he might use the air compressor to paint his car instead.

Besides, learning is work. If your IT person didn’t have a knack and passion for learning they wouldn’t be in IT in the first place. You should encourage your IT personnel to set aside time on a regular basis for learning new skills and researching things they aren’t confident about. This will pay off in the long run when your IT staff are better prepared to solve problems and better equipped to tackle projects. It also increases morale, which improves productivity and effectiveness.

Are You Sure He’s Not Just An Idiot?

That depends. I know I certainly don’t have a 100% solution rate, and I sleep just fine at night. Some problems require expertise outside what you’ve got, which is why it’s important to allow your IT people to learn and expand their skills on the job. You can consider a lack of specific expertise the fault of the IT person, but keep in mind that NO IT person on Earth possesses expertise in every subset of IT. It’s just too much for one person to know. Take me, for example. I can write programs and scripts in 6 Turing-complete languages. I know 4 markup languages and am familiar with 2 database engines and their query languages. On any given workday I’ll use 4 different PC operating systems. While that sounds impressive, and I’m quite proud of myself, there are still dozens (if not hundreds) of common languages, databases, and OSes that other shops require and I’ve never seen before. So expertise is a matter of context. Plain and simple.

Of course some bugs are just outside the scope of IT in the first place. Even the best IT person can’t solve a bug in proprietary enterprise software. The best they can do is communicate the problem to your software vendors and hope that they take it seriously. Perhaps an even larger subset of problems are user-created, meaning that if your other users were more knowledgeable they would never have experienced the problem in the first place. As tempting as it might be to lay all of your organization’s problems at the feet of your IT department come review time, you must realize that not every problem has a solution. There are other metrics you should use to measure the performance of your IT group.

What differentiates a “good” IT person from a “bad” IT person, from a problem-solving standpoint, is:

  • How well they organize themselves and their tool-chain,
  • How well they keep track of the problems they encounter and the solutions they deploy,
  • How proactive they are about avoiding problems in the future,
  • How automated their operations are,
  • How willing they are to change the way they operate once they realize improvements are possible,
  • How well they communicate with the people they’re helping,
  • Whether they follow up.

How Can I Help IT Help Me?

Report the issue promptly and include plenty of information. Don’t be offended if they seem to re-verify what you’ve given them. It’s nothing personal, and it’s not because they don’t trust you. This story about a 500-mile email reminds me to always take a client at their word. After that, you will simply want to give them space.

After your ticket is closed you should be sure to report any recurrences of the issue to the same person you reported the original issue to. If a developed solution stops working and IT isn’t made aware they may continue deploying an ineffective patch and making things worse rather than better.

Ultimately, understanding that solving IT problems isn’t an exact science is the best thing you can do for your IT department. Sometimes just trusting that they want to solve your problem as much as you do is all it takes. Sure, it’s hard watching someone fumble around with the same buttons you were just fumbling with, but that’s the process just about any IT professional is going to follow.

HRCloud2 v3.0.3 – Add backupCore, enable on-demand backups

-v3.0.3. (2)
-Fix typo in commonCore.php
-Add manual backup feature to Admin settings page.
-Add backupCore.
-Still working on a Cron script for auto backups (cron will call backupCore using a shell script).
-Still working on instructions to include auto backup information.
-Added GUI elements to settingsCore for changing backup settings & performing on-demand backups.
-Existing users will need to update their config.php file to contain an $EnableBackups = '1'; line and a $BackupLoc = '/path/to/backup/dir'; line. 

See commit messages on Github for additional details.