It's an essential question for security teams following a cyber attack: Where did the threat originate? In the days and weeks following the WannaCry ransomware attack—which swept through 150 countries, infecting hundreds of thousands of computers—reports emerged pointing to various potential actors. But none of the insights came soon enough to help defend against the attack. Unfortunately, the type of analysis used to derive them just doesn't work that fast. The good news is there are other approaches that do.
Traditional static analysis of WannaCry and its possible origins required hours of manual code inspection. As a result, the first clues took several days to emerge, and further insights took weeks. The problem is that the process entails manually comparing thousands of code segments from dozens of known malicious actors. As the volume of new malware threats grows (the AV-TEST Institute reports registering over 390,000 new malicious programs daily), that problem is only going to get worse. Manual static analysis simply can't scale to compare code quickly enough to identify the origins of a new piece of malware in a timely way.
Dynamic analysis can help determine the runtime effects of a piece of malware, but with tools for sandbox detection and evasion becoming increasingly common, its value is limited. Besides, knowing what a piece of malware does won't help with file similarity analysis, as there may be dozens of ways to implement the same behavior. Comparing file hashes has never really been useful, either, since attackers routinely leverage code polymorphism to ensure each piece of malware has a unique hash. What about fuzzy hashing as a tool for file similarity analysis? It's increasingly being used to measure how similar two binaries are. The challenge is that fuzzy hashing tools like ssdeep are applied to the entire file, so they can't catch similarities more complex than one file being related to another.
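To make the hashing point concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how ssdeep or any RSA tool works: the "binary" is synthetic data, and the similarity score is a plain Jaccard index over 4-byte n-grams, used here only as a simple stand-in for a fuzzy hash. The function names are hypothetical.

```python
import hashlib

def ngrams(data: bytes, n: int = 4) -> set:
    """Return the set of overlapping n-byte windows found in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def jaccard(a: bytes, b: bytes, n: int = 4) -> float:
    """Jaccard similarity of the two inputs' n-gram sets, from 0.0 to 1.0."""
    sa, sb = ngrams(a, n), ngrams(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

# Synthetic stand-in for a malware binary (assumption: real samples are far larger).
original = bytes(range(256)) * 4
# A "polymorphic" variant that differs from the original by a single byte.
variant = original[:500] + b"\x90" + original[501:]

sha_original = hashlib.sha256(original).hexdigest()
sha_variant = hashlib.sha256(variant).hexdigest()

# The cryptographic hashes no longer match, so a hash lookup misses the variant.
hashes_match = sha_original == sha_variant
# The similarity score still reports that the two files are nearly identical.
similarity = jaccard(original, variant)

print("SHA-256 match:", hashes_match)
print("n-gram similarity:", round(similarity, 3))
```

A single changed byte is enough to defeat an exact hash comparison, while a similarity measure still flags the two files as close. Like a whole-file fuzzy hash, though, this score says only that the files are related, not which parts they share.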
But what if fuzzy hashing could be applied to pick up code similarity at a more granular level? That thinking has led RSA to a new static analysis technique for detecting complex similarities and, moreover, identifying similarities from multiple pieces of malware. Through this approach, we can create a malware genome, if you will, that provides an understanding of how malware evolved, even when it's an amalgamation of multiple malicious tools. Beyond mapping out code capabilities, this genealogy may shine some light on the malicious infrastructure and exchange of tools happening on the attacker side.
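The idea of more granular similarity can be sketched in a few lines of Python. This is a hypothetical illustration, not the RSA technique and not how WhatsThisFile works: it assumes fixed 64-byte chunks (a real tool would more likely split on functions or sections), uses n-gram Jaccard similarity as a stand-in for a fuzzy hash, and runs on synthetic "families" generated from SHA-256 output. All names, sizes and thresholds are assumptions chosen for the example.

```python
import hashlib

CHUNK_SIZE = 64   # assumed chunk size in bytes
THRESHOLD = 0.6   # assumed minimum similarity to count as a match

def fake_code(seed: str, length: int) -> bytes:
    """Generate deterministic pseudo-random bytes standing in for real code."""
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(f"{seed}:{counter}".encode()).digest()
        counter += 1
    return out[:length]

def ngrams(data: bytes, n: int = 4) -> set:
    """Return the set of overlapping n-byte windows found in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def chunk_signatures(data: bytes) -> list:
    """Split data into fixed-size chunks and return one n-gram set per chunk."""
    return [ngrams(data[i:i + CHUNK_SIZE]) for i in range(0, len(data), CHUNK_SIZE)]

def jaccard(sa: set, sb: set) -> float:
    """Jaccard similarity of two n-gram sets; empty sets never match."""
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def attribute_chunks(sample: bytes, known: dict) -> list:
    """For each chunk of sample, name the best-matching known family, or None."""
    known_sigs = {name: chunk_signatures(blob) for name, blob in known.items()}
    labels = []
    for sig in chunk_signatures(sample):
        best_name, best_score = None, THRESHOLD
        for name, sigs in known_sigs.items():
            score = max((jaccard(sig, s) for s in sigs), default=0.0)
            if score >= best_score:
                best_name, best_score = name, score
        labels.append(best_name)
    return labels

# Two synthetic "known families" and a sample stitched together from both,
# plus a chunk of previously unseen code.
known = {"family_a": fake_code("family_a", 256), "family_b": fake_code("family_b", 256)}
sample = known["family_a"][:128] + known["family_b"][64:192] + fake_code("novel", 64)

labels = attribute_chunks(sample, known)
print(labels)
```

A whole-file comparison would report only a partial resemblance to each family. The per-chunk labels show which regions of the sample trace back to which source and which are new. The fixed chunk boundaries are the weak point of this sketch: code shifted by a few bytes would no longer line up, which is one reason a practical approach needs a smarter way to segment the file.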
As a service to others engaged in threat investigation, we're freely sharing the tool we've been using to explore this approach. Our hope is WhatsThisFile will help defenders evaluate unknown files faster, discover similarities to known malware and quickly gain the insights needed to better defend their enterprises.
By Kevin Bowers
Kevin Bowers has been with RSA for 10 years and has served in a number of research roles in that time. He joined RSA Labs after graduating from Carnegie Mellon University with a Master's in Computer Science with a focus on Security and now leads the Data Science Research team at RSA. By analyzing large volumes of data, and working with security experts to identify relevant features, data science can pinpoint malicious or anomalous behavior and provide enterprise defenders a more accurate and focused alert from which to begin their investigation and remediation efforts.