Using File Hashes to Reduce Forensic Analysis

The "hashkeeper" paradigm, or model, was first introduced a number of years ago by Brian Deering of the National Drug Intelligence Center (NDIC).

Since then, computer forensic analysts have come to use the term hashkeeper when they discuss ways of using the hash values of files to assist in forensic analysis.

I will not attempt to explain or define the various hashing algorithms, nor discuss the robustness of one over another. One informative source on hashing algorithms is a white paper written by Gary Fisher and Tim Boland, which describes in scientific and technical detail the characteristics of the algorithms. Suffice it to say that the MD5 algorithm (from RSA Data Security Inc.) is the one most used by computer forensic analysts. MD5 is strong: it produces a 128-bit hash value, which means there are 2^128, or roughly 3.4 x 10^38, possible values.
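To make this concrete, here is a minimal sketch of computing an MD5 file hash using Python's standard hashlib module; reading in chunks keeps memory use flat even on very large files.

```python
import hashlib

def md5_of_file(path):
    """Return the hexadecimal MD5 digest of a file, reading it in
    64 KB chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting 32-character hex string encodes the 128-bit value discussed above.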

Now let's look at how file hashes may be used to identify duplicate or unique files and thus reduce the number of files an examiner must review.

Let's set a scenario. The analyst has a computer with about 100,000 files and has to identify files containing specific content, or lacking it. The analyst could save time if it were possible to eliminate upwards of 80 to 90 percent of those 100,000 files as already 'known' and not have to process them. Subsequent steps would then need to process only 10,000-20,000 files instead of 100,000.

This is where the hashkeeper paradigm comes in. Because of the way the MD5 algorithm distributes its values, it is generally agreed that no two files with dissimilar content will have the same hash value. Mathematicians will be more than happy to show how smart they are and prove that, theoretically, two dissimilar files can have identical hashes (a collision). Research the papers and decide for yourself. For now, let's assume that no two dissimilar files will produce the same hash, and conversely, that if two hashes are the same, then the file contents are the same. If you wish to use a stronger algorithm in your analysis, use the SHA algorithm.

What the hashkeeper paradigm comes down to is this:

1. Assume that no two dissimilar files produce the same hash value.
2. Obtain a list of hash values from 'known' files.
3. Obtain the hashes of the suspect files.
4. Compare the two hash lists to
     a. identify files matching the known files, or
     b. identify the unknown files.
5. Eliminate the 'known' files from the search, and
6. Identify and review the unknown files.
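The comparison at the heart of these steps can be sketched as a simple set operation. The hash values below are shortened placeholders for illustration, not real MD5 digests.

```python
def triage(known_hashes, suspect_files):
    """Split a {path: hash} mapping of suspect files into those whose
    hash matches a 'known' value (safe to eliminate) and those that
    remain unknown and must be reviewed."""
    matched = {p: h for p, h in suspect_files.items() if h in known_hashes}
    unknown = {p: h for p, h in suspect_files.items() if h not in known_hashes}
    return matched, unknown

# Hypothetical, abbreviated hash values for illustration only.
known = {"a1b2", "c3d4"}
files = {r"C:\WINDOWS\notepad.exe": "a1b2",
         r"C:\docs\letter.txt": "ff99"}
matched, unknown = triage(known, files)
```

After this split, only the files in `unknown` need further processing.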

Performing these steps could drastically reduce the universe of files the analyst would search through. Some analysts report that with tweaking they have as much as a 90 percent reduction in suspect files.

What are known files and known file hashes? A known file is one received directly from the manufacturer or author, or the equivalent of the shrink-wrapped file: there is a high level of confidence that it is not corrupted. The known file hash is the hash of that file. In the Windows environment, there may be two versions of a known file. The first is located on the distribution CD, often as part of one of the .CAB files. The second version of that same file is the extracted file, which may have been modified by an install program as it was copied to the hard drive. You will see later why this difference may be important. For now, just assume we have a known set of files and their hash values.

Next, obtain the hashes of the files on the 'suspect' drive. It is preferable that the hashing program output its results in a format that is easy to work with in a spreadsheet or database; this usually means fixed-length or delimited records. Fixed-length is more generic and easier to work with, but takes up more space.
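As a sketch of that step, here is one way to produce a comma-delimited hash list (path and MD5 per record) for an entire directory tree; the chunked read mirrors the earlier hashing example.

```python
import csv
import hashlib
import os

def hash_tree(root, out_csv):
    """Walk a directory tree and write one 'path,md5' record per file,
    a delimited format that loads easily into a spreadsheet or database."""
    with open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "md5"])
        for dirpath, _dirs, names in os.walk(root):
            for name in sorted(names):
                path = os.path.join(dirpath, name)
                digest = hashlib.md5()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(65536), b""):
                        digest.update(chunk)
                writer.writerow([path, digest.hexdigest()])
```

Write the output file somewhere outside the tree being hashed, so the list does not end up hashing itself.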

After you have the known hashes and the suspect hashes, compare the two hash files. Once the comparison is made, the analyst will have identified anywhere from 20 to 90 percent of the files as 'known.' At this point the analyst needs some way of isolating the suspect files so additional research can be done on them. One suggestion is to copy all the suspect files to a clean drive, then 'point' your forensic software at the new drive, which contains only suspect files. The output from that process will cover only suspect files and, hopefully, will provide useful information.

The question at this point: where do I get the 'known' hashes? One source, containing about 700,000 hashes, is the hashkeeper set from NDIC. Another is the set of about 1,000,000 hash values that can be purchased from the U.S. National Institute of Standards and Technology (NIST). And finally, you can build your own set. This last option comes in handy if you are dealing with specific files that may not show up in either of the other two sources.

Here is a little more information about the hashkeeper and NIST hash sets.

I downloaded the hashkeeper sets and purchased the NIST set. I figured that if both agencies were attempting to accomplish the same task, both sets should contain a lot of identical values. Was I wrong! I found 703,544 hashes in the hashkeeper set and 1,014,594 in the NIST set. The difference in size alone could be considered acceptable if you take into account that NIST may simply have found more files than the people at NDIC.

Maybe I was misinterpreting what each organization was trying to accomplish, so I must give both the benefit of the doubt. I found that NDIC usually included file hashes taken after the programs had been 'installed,' while NIST apparently included the hashes of the 'uninstalled' files as found in the .CAB files. Even if this were the case, I would expect a lot of matching values, assuming that a file's contents, and hence its hash, do not change when it is installed. Yet only 141,042 hashes were common to both sets. Quite a substantial difference from what might be expected, and food for thought. The other comparisons I ran relate to specific packages I had on hand.

Here are some other situations where the use of hash sets might help.

For system administrators: how often do you get a software patch and wish you knew which file(s) the upgrade really altered? One way to find out is to create a hash set of your system before implementing any patch or upgrade. Then, immediately after the upgrade, run another hash set and compare the two.
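A minimal sketch of that before/after comparison, assuming each snapshot is a {path: hash} mapping such as a hashing pass over the system would produce:

```python
def changed_files(before, after):
    """Compare two {path: hash} snapshots taken before and after a
    patch; report which files were added, removed, or altered
    (same path, different hash)."""
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "altered": sorted(p for p in before.keys() & after.keys()
                          if before[p] != after[p]),
    }
```

The file names and hash values in any real run would come from your own two snapshots; the report immediately shows what the patch actually touched.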

How do you handle a number of 'suspect' office computers? Assume they were set up by the staff with very similar file structures, and may even have been cloned with Norton Ghost for simplicity and standardization.

Perform the hash of the first computer. Find out which files are suspect and process them accordingly. Then treat all the files on the first computer as 'known' files and add them to a composite list of known files. Because every file common to two computers can be added to the 'known' list, your list of 'knowns' grows with each machine, and the number of new suspect files from subsequent computers decreases accordingly. By the time you get to computer X, you will probably have accounted for almost 90 percent of the files. This process can be automated with batch files and can produce a significant reduction in work.
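The rolling 'known' list described above can be sketched like this, where each machine's files are again a hypothetical {path: hash} mapping and hashes seen on earlier machines are eliminated from later ones:

```python
def triage_fleet(machines):
    """For a list of {path: hash} mappings (one per computer), return
    the files still unknown on each machine. Every hash seen on an
    earlier machine joins the composite 'known' set, so the review
    workload shrinks as more computers are processed."""
    known = set()
    unknown_per_machine = []
    for files in machines:
        unknown = {p: h for p, h in files.items() if h not in known}
        unknown_per_machine.append(unknown)
        known.update(files.values())
    return unknown_per_machine
```

On cloned machines, nearly every hash from the second computer onward is already in the composite set, which is where the 90 percent figure comes from.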

Finally, find some way of identifying duplicate hash values, since a duplicate hash means a duplicate file. In most cases you can eliminate the suspected duplicates and reduce processing time.
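One simple way to spot those duplicates, sketched in Python: group file paths by hash and keep only the groups with more than one member.

```python
from collections import defaultdict

def find_duplicates(files):
    """Given a {path: hash} mapping, return {hash: [paths]} for every
    hash shared by two or more files; only one copy from each group
    needs to be examined."""
    groups = defaultdict(list)
    for path, h in files.items():
        groups[h].append(path)
    return {h: sorted(paths) for h, paths in groups.items() if len(paths) > 1}
```

Everything past the first path in each group is a candidate for elimination.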

Dan Mares is a forensic analyst, author of forensic software, and owner of Mares and Company, LLC.
