Chet Hosmer, chief scientist, WetStone Technologies, Inc. --
Autonomous hashing and live discovery technologies are advancing rapidly and provide value and expediency for forensic investigators. As we advance these solutions, it is important that we consider not only what we collect, but also engineer solutions that can prove what we collected, where we collected it, when we collected it, and by whom it was collected.

Traditionally, hashing is performed during postmortem forensic investigations and is used to maintain evidence integrity, as well as to identify known files (known good or known hostile). Digital investigators commonly use one-way hash algorithms such as MD5 or the SHA family to generate effectively unique mathematical signatures of known files.
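As a minimal sketch of what generating such a signature involves, the following Python function reads a file in fixed-size chunks and returns its digest. The function name `hash_file` and its parameters are illustrative assumptions, not part of any particular forensic product.

```python
import hashlib


def hash_file(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of the file at `path`.

    `hash_file` is a hypothetical helper for illustration. The file is
    read in chunks so that large evidence files never have to fit in
    memory all at once.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `hash_file(path, "md5")` yields an MD5 signature instead; any algorithm name accepted by `hashlib.new` can be used.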

Autonomous hashing (over the wire, or during direct overt or covert interactions) – the process of collecting hash values from live running systems – can significantly speed the identification of known threats and known files that users should or shouldn’t possess.

Performance is improved by running the hashing function on the target machine’s own computing resources, in other words, by off-loading the processing to the target. This approach has two important benefits: the content of the files, directories or drives being hashed doesn’t pass over the network, which could potentially expose unencrypted proprietary data; and performance is dramatically improved, especially if multiple targets are being processed simultaneously, because network traffic congestion is reduced.

Autonomous hashing is accomplished by pushing a small software agent to the target machine (credentialed access to the target under investigation is required to accomplish this, or the agent must be installed a priori). The hashing agent is then instructed to gather hashes from the target machine and report back results when completed.

The agent can be instructed to collect hashes from all drives and devices permanently or temporarily attached, including USB or Firewire drives, local or remote network drives, and mounted or encrypted file systems. Searches can be further restricted to specific directories or file types.
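The collection step described above might look roughly like the following sketch, which walks one directory tree, optionally filters by file extension, and records a hash plus basic attributes for each file. The function `collect_hashes` and the record field names are assumptions made for illustration; a production agent would also need the self-inspection and auditing safeguards discussed later.

```python
import hashlib
import os


def collect_hashes(root, extensions=None, algorithm="sha256"):
    """Walk `root` and return one record per file (hypothetical layout).

    `extensions` is an optional iterable such as [".doc", ".jpg"];
    when omitted, every file under `root` is hashed.
    """
    wanted = {ext.lower() for ext in extensions} if extensions else None
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if wanted is not None and os.path.splitext(name)[1].lower() not in wanted:
                continue
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)  # attributes captured before reading
                digest = hashlib.new(algorithm)
                with open(path, "rb") as handle:
                    for chunk in iter(lambda: handle.read(65536), b""):
                        digest.update(chunk)
            except OSError:
                # Unreadable file: skipped in this sketch; a real agent
                # would record the failure in its audit trail.
                continue
            records.append({
                "path": path,
                "size": info.st_size,
                "modified": info.st_mtime,
                "algorithm": algorithm,
                "hash": digest.hexdigest(),
            })
    return records
```

Note that simply opening a file for reading can update its access time on some file systems, which is one reason the auditing concerns raised later in this article matter.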

Once the collection of hashes (and associated file attributes) is completed, the agent delivers a report back to the investigator workstation with the results. In most cases this report is delivered as a compressed and encrypted XML document that is ready for post processing by the investigator. The document is encrypted to prevent the disclosure of file system data collected by the agent. Even though the file contents are not included in this report, file system information contained in the report still may contain proprietary data that requires protection.
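To make the report format more concrete, here is a hedged sketch that serializes collected records to XML and compresses the result. The element and attribute names (`discovery`, `file`, `hash`) are invented for this example and do not reflect any vendor's schema. Encryption is deliberately left out because the Python standard library provides no symmetric cipher; a real agent would encrypt the compressed bytes with a vetted cryptographic library before transmission.

```python
import gzip
import xml.etree.ElementTree as ET


def build_report(target, records):
    """Serialize records to XML and gzip-compress them (no encryption here)."""
    root = ET.Element("discovery", {"target": target})
    for rec in records:
        item = ET.SubElement(
            root, "file", {"path": rec["path"], "size": str(rec["size"])}
        )
        hash_node = ET.SubElement(item, "hash", {"algorithm": rec["algorithm"]})
        hash_node.text = rec["hash"]
    return gzip.compress(ET.tostring(root, encoding="utf-8"))


def read_report(blob):
    """Investigator side: decompress and parse a report built above."""
    root = ET.fromstring(gzip.decompress(blob))
    return [
        {
            "path": item.get("path"),
            "size": int(item.get("size")),
            "algorithm": item.find("hash").get("algorithm"),
            "hash": item.find("hash").text,
        }
        for item in root.findall("file")
    ]
```

Keeping the report as structured XML means the investigator workstation can parse it directly during post processing, without ever receiving the file contents themselves.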

Post processing of the resulting discovery provides investigators with a wealth of data regarding the target.

Obviously, a file system inventory may reveal recent documents and the population of images, audio files, movies, application data and other documents. In addition, the collected hash values can be compared against known good (operating system programs, application files, development tools) or known bad (rootkits, password crackers, botnet files, Trojan horses, encryption, steganography, key loggers, etc.) programs and applications. Beyond the known good or bad files identified in such a discovery, files containing proprietary data could be identified based on hash values, known file names or known partial hashes.
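The comparison against known good and known bad hash sets can be sketched as a simple set lookup. The function name `classify` and the record layout are assumptions for illustration; real reference sets would be loaded from a maintained hash database rather than hard-coded.

```python
def classify(records, known_good, known_bad):
    """Sort records into known_bad, known_good and unknown buckets.

    `known_good` and `known_bad` are collections of hex digests.
    Known-bad is checked first so that a hash appearing in both sets
    is still flagged for the investigator.
    """
    good = {h.lower() for h in known_good}
    bad = {h.lower() for h in known_bad}
    result = {"known_bad": [], "known_good": [], "unknown": []}
    for rec in records:
        digest = rec["hash"].lower()
        if digest in bad:
            result["known_bad"].append(rec)
        elif digest in good:
            result["known_good"].append(rec)
        else:
            result["unknown"].append(rec)
    return result
```

Files that land in the unknown bucket are typically the ones that deserve further review, since they match neither reference set.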

One of the criticisms of utilizing autonomous agents that execute on the target platform is the potential untrustworthiness of the Operating System (OS) of the target.
Developers of autonomous discovery technologies certainly are aware of the threats posed by rootkits and other malicious code that can intercept OS calls and circumvent the discovery of hidden directories or files.

Without revealing the specific details of the countermeasures that developers employ to overcome these hooks, it is safe to say that self-inspection of the operating environment is critical to effective autonomous hashing software. This implies that the software must perform a thorough inspection and determine whether the core API calls it will use can be judged safe.
In addition to trustworthiness concerns, there is anxiety over agent modifications of target evidence that would bring into question the efficacy of the discovery in court. This is a valid concern, and those engaged in the development of such agents must take responsibility for addressing it from the top down.

For example, great care must be taken to audit every operation and potential modification that the agent may cause. In addition, time stamping (from a trusted source) should be included in robust solutions in order to prove the exact time the “snapshot” of the file system was taken and when collection of the hash values occurred. Because the target machine is running before, during and after the discovery, the file system is likely to have changed by the very next moment. Trusted time stamps are especially important when collecting hashes across multiple targets that may be located in different time zones.
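One way to bind a collection time to a set of hashes is sketched below. It normalizes the time to UTC, so that snapshots from targets in different time zones are directly comparable, and computes a digest over the time and the hashes together. The function `seal_snapshot` and its fields are hypothetical. This sketch does not itself provide a trusted time source: the clock value is simply passed in, and a robust solution would obtain it from, or have the seal countersigned by, an independent time-stamping service (for example one following the RFC 3161 Time-Stamp Protocol).

```python
import hashlib
import json
from datetime import datetime, timezone


def seal_snapshot(target, hashes, collected_at=None):
    """Bind a UTC collection time to a list of hex digests.

    `collected_at` must be a timezone-aware datetime; it defaults to
    the local system clock expressed in UTC, which is NOT a trusted source.
    """
    if collected_at is None:
        collected_at = datetime.now(timezone.utc)
    if collected_at.tzinfo is None:
        raise ValueError("collected_at must be timezone-aware")
    stamp = collected_at.astimezone(timezone.utc).isoformat()
    # Sorting makes the seal independent of the order of collection.
    payload = json.dumps(
        {"target": target, "collected_at_utc": stamp, "hashes": sorted(hashes)},
        sort_keys=True,
    ).encode("utf-8")
    return {
        "target": target,
        "collected_at_utc": stamp,
        "file_count": len(hashes),
        "seal": hashlib.sha256(payload).hexdigest(),
    }
```

Any later change to the recorded time or to a single hash value produces a different seal, which gives the investigator a simple consistency check on the snapshot record.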