University College London campus. Researchers identified a number of promising machine learning techniques that may help improve detection of untracked or zero-day malware. (University College London)

An academic-private sector partnership reported favorable results from research exploring how machine learning models could be used to improve static malware analysis to better detect zero-day exploits and untracked malware.

The research was conducted through a four-month partnership between doctoral students at University College London’s Centre for Doctoral Training in Data Intensive Science and U.K.-based cybersecurity company NCC Group. Students and researchers set out to develop a machine learning model capable of examining a Windows binary and determining whether it is malicious. They used more than 74,000 malware samples and another 32,000 benign samples for multiple Windows operating systems to train a number of models to spot subtle differences in binary characteristics and distinguish malware from legitimate code.

The project set out to find alternatives to the two most popular forms of malware detection, static and dynamic analysis, both of which have limitations or workarounds that threat actors can use to evade notice. While dynamically testing code in a sandbox allows researchers to observe how a suspicious program interacts with a system or network over time, it is also resource-intensive, and threat actors are increasingly adding components to their malware to detect these virtual environments.

Static testing can take advantage of the vast ecosystem of malware samples and detection signatures collected and published by threat intelligence organizations, but malware developers have built in ever more sophisticated code obfuscation techniques, and such analysis performs poorly against zero-day exploits or previously untracked malware. While more advanced analyses can pull in other data to compensate, this also ends up being too data- and resource-intensive for many organizations.

It is on this second front, static analysis, that the researchers focused, looking for ways to use machine learning to improve the detection of new malware or zero-day exploits.

For example, the researchers found ways to extract metadata from binary code by leveraging the Portable Executable (PE) file format. They focused on PE files for Windows operating systems, which they say make up more than half of all files submitted to VirusTotal, a popular website often used to analyze and cross-reference suspicious files or URLs against signatures from dozens of threat intelligence and antivirus products.

This metadata is both informative about how the program is designed to execute and difficult for a threat actor to manipulate or obfuscate. Other features, such as byte sequences, control flow graphs and API calls, can also be fed into a detection model.
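To give a sense of what PE header metadata looks like in practice, here is a minimal sketch in Python that reads a few fields from the COFF file header using only the standard library. It is not the researchers' feature extractor. The function name `pe_header_features`, the helper `make_minimal_pe` and the choice of fields are assumptions made for illustration, and a real pipeline would likely read many more fields. The sketch returns `None` when the headers are not valid.

```python
import struct
from typing import Optional

# COFF file header layout (20 bytes, little-endian), per the PE format:
# Machine, NumberOfSections, TimeDateStamp, PointerToSymbolTable,
# NumberOfSymbols, SizeOfOptionalHeader, Characteristics
COFF_FORMAT = "<HHIIIHH"
COFF_SIZE = struct.calcsize(COFF_FORMAT)


def pe_header_features(data: bytes) -> Optional[dict]:
    """Hypothetical helper: return a few header fields, or None if invalid."""
    # The DOS header is 64 bytes long and starts with the magic bytes "MZ".
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    # The 4-byte field at offset 0x3C (e_lfanew) points at the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    coff_offset = pe_offset + 4
    if data[pe_offset:coff_offset] != b"PE\x00\x00":
        return None
    if len(data) < coff_offset + COFF_SIZE:
        return None
    (machine, n_sections, timestamp, _sym_ptr, _n_syms,
     opt_size, characteristics) = struct.unpack_from(COFF_FORMAT, data, coff_offset)
    return {
        "machine": machine,
        "number_of_sections": n_sections,
        "timestamp": timestamp,
        "size_of_optional_header": opt_size,
        "is_executable_image": bool(characteristics & 0x0002),
        "is_dll": bool(characteristics & 0x2000),
    }


def make_minimal_pe() -> bytes:
    """Build a tiny synthetic header (not a runnable program) for the demo."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)  # PE signature follows the DOS header
    coff = struct.pack(COFF_FORMAT, 0x8664, 3, 1_600_000_000, 0, 0, 240, 0x0022)
    return bytes(dos) + b"PE\x00\x00" + coff


if __name__ == "__main__":
    print(pe_header_features(make_minimal_pe()))
    print(pe_header_features(b"not a PE file"))
```

Each sample's dictionary could then become one row of a feature table passed to a classifier. Files whose headers fail these checks produce no row at all.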

“From this we conclude that PE headers with [open-source software library] XGBoost or other tree-based ensembles… provide an excellent method for filtering malware,” wrote University College London doctoral students Emily Lewis, Toni Mlinarevic, and Alex Wilkinson. “A limitation to bear in mind for PE metadata models in general is that they rely on valid PE headers being available for each sample which is not always the case.”

The results, particularly for the models that relied primarily on data extracted from Portable Executable files, were promising though not foolproof, with scores between 97% and 98% on precision and recall. Other models scored in the low to mid-90% range, though the researchers warned that the imbalanced dataset they relied on, which contained roughly twice as many malicious samples as benign ones, is probably inflating the overall percentages.
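A short worked example shows why class imbalance can flatter these metrics. The sketch below, in Python, scores a trivial baseline that labels every sample as malicious. It uses the approximate sample counts reported above (74,000 malicious, 32,000 benign) purely for illustration. The researchers' actual train and test splits are not described here, and the function names are hypothetical.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def always_malicious_baseline(n_malicious: int, n_benign: int) -> tuple:
    """Score a classifier that flags everything as malware.

    Every malicious sample is a true positive, every benign sample is a
    false positive, and there are no false negatives.
    """
    return precision_recall(tp=n_malicious, fp=n_benign, fn=0)


# Approximate counts from the article (assumed here to be the evaluation mix).
imbalanced = always_malicious_baseline(74_000, 32_000)
# The same baseline on a hypothetical balanced set.
balanced = always_malicious_baseline(32_000, 32_000)

if __name__ == "__main__":
    print("imbalanced: precision=%.3f recall=%.3f" % imbalanced)
    print("balanced:   precision=%.3f recall=%.3f" % balanced)
```

On the imbalanced mix, this baseline reaches a precision of about 0.70 with perfect recall, compared with 0.50 on a balanced set, even though it has learned nothing. This is the kind of inflation the researchers caution about.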

The models also work better at identifying some malware families, such as Lamar, CRCF and DownloadGuide, than others, where performance “spans from good to poor,” but they ultimately show improved detection across a broad spectrum of malicious software. The authors argued that “the near perfect classification of some of the families demonstrate the high discriminative power that can be achieved by representing binaries with graphs.” Some of the successful detections were on ransomware samples, and the researchers believe the method could hold promise for improving detection and mitigation of future ransomware attacks.