The goal in life of Venture Capitalists (VCs) these days is to find ‘cyber security unicorns’. These are rare creatures. Much like a Cyndaquil Pokemon (for those not acquainted with Pokemon species, these are rare in the wild), unicorns have common traits that they must exhibit in order to attract the attention of VCs:
1. Your company's product must reference the use of Artificial Intelligence (AI) and Machine Learning (ML), preferably deep learning. And for bonus points, require lots of data scientists.
2. The other stringent requirement is that you must never say you use signatures, under any circumstances. Signatures are the reason traditional security products have failed, paving the way for the charge of the AI and ML unicorn cavalry who will save the day, without a single signature in sight. Most believe that the “S” word must be banned from all forms of communication. If your product does use "content updates," you must get product marketing to invent a new, cool, next generation-sounding name that contains no reference to signatures.
If you fail to meet either requirement, the VCs might simply ride their BMXs around the corner to the next AI and ML cyber security startup.
The Flaw in AI & ML-based Tools
Despite the guidance from the book, the real world is quickly discovering that the latest AI and ML-based cyber security controls are far from a panacea for the malicious threat actor problem. Reports indicate that flaws in how the intelligence is built and how the learning process is structured are creating an environment where old school “signatures,” albeit disguised with a next generation name, are being used to fill the gaps in AI and ML protections.
Humans, at the end of the day, feed data into the AI/ML systems. They have to make decisions about the initial training and classification of the data. Often the human decides if the training data goes into the good corpus or the bad corpus for the machine to learn about.
A source of data for many wannabe unicorns is, amongst others, VirusTotal. In this example, the human may set a threshold: files flagged as malicious by 30 or more AV engines are treated as malware and added to the bad training corpus. In contrast, files that have been classified as known good become the opposing training corpus. These data sets can then be used to teach an AI/ML system to identify the characteristics that most strongly indicate good and bad files.
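The labelling step described above can be sketched in a few lines of Python. This is only an illustration, not any vendor's actual pipeline: the `Sample` fields, the `assign_corpus` function and the `known_good` flag are hypothetical names, and the only number taken from the text is the threshold of 30 detections.

```python
from dataclasses import dataclass
from typing import Optional

MALWARE_THRESHOLD = 30  # the "30+ detections" cut-off mentioned in the text


@dataclass
class Sample:
    sha256: str        # file identifier (placeholder values below)
    detections: int    # number of AV engines that flagged the file
    known_good: bool   # assumption: set by a human or a vetted allow-list


def assign_corpus(sample: Sample, threshold: int = MALWARE_THRESHOLD) -> Optional[str]:
    """Return 'bad', 'good', or None when the sample fits neither corpus."""
    if sample.detections >= threshold:
        return "bad"
    if sample.known_good:
        return "good"
    # Everything in between is left out of training in this sketch.
    return None


samples = [
    Sample("a" * 64, detections=45, known_good=False),
    Sample("b" * 64, detections=0, known_good=True),
    Sample("c" * 64, detections=4, known_good=False),
]
for s in samples:
    print(s.sha256[:8], s.detections, assign_corpus(s))
```

Note that the third sample, with only four detections, lands in neither corpus. What to do with such files is exactly the human decision the paragraph describes.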
One of the strong indicators in known-good datasets is the presence of a code signing certificate. Code signing is present in a high percentage of good files, and missing in a high percentage of files that have 30+ detections of maliciousness. The Artificial Intelligence learns that signed files are highly unlikely to be malicious.
This machine-learned bias was subverted in a recent report (http://signedmalware.org/), where a research team took known malware, gave it a certificate value, and managed to change the verdict from bad to good on many ML-based antivirus engines. This is an example of real world corruption in machine learning.
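To see why a single dominant feature is so easy to abuse, consider a deliberately tiny model that looks at nothing but the "is signed" flag. The corpus below is invented purely for illustration (the counts are not from the report or from any real dataset), and `p_bad` and `toy_verdict` are hypothetical helper names. Real engines use many more features, but the failure mode is the same in spirit.

```python
from collections import Counter
from typing import List, Tuple

Row = Tuple[bool, str]  # (is_signed, label)

# Hypothetical toy corpus: most good files are signed, most bad files are not.
TRAINING: List[Row] = (
    [(True, "good")] * 80 + [(False, "good")] * 20
    + [(True, "bad")] * 5 + [(False, "bad")] * 95
)


def p_bad(rows: List[Row], is_signed: bool) -> float:
    """Estimate P(bad | is_signed) by simple counting over the corpus."""
    counts = Counter(label for signed, label in rows if signed == is_signed)
    total = sum(counts.values())
    return counts["bad"] / total if total else 0.5


def toy_verdict(rows: List[Row], is_signed: bool) -> str:
    """Classify using only the learned probability for the signed flag."""
    return "bad" if p_bad(rows, is_signed) >= 0.5 else "good"


# The same malicious payload, before and after a certificate value is added.
print("unsigned malware ->", toy_verdict(TRAINING, is_signed=False))
print("signed malware   ->", toy_verdict(TRAINING, is_signed=True))
```

Nothing about the payload changed between the two calls. Only the feature the model had learned to trust did.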
Fixing Polluted AI/ML Tools Requires – yes, the “S” word and (gasp) Humans
This exploitation of AI/ML feature selection is not easily fixed, and the only timely solution, given the immediacy of malware, is the much-maligned signature update: a simple fix until the ML can be retrained. It appears that when data is biased, polluted or abused, AI/ML systems, ironically, are signature-dependent, too.
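A stopgap of this kind might look like the sketch below: a hash blocklist (a signature by any other name) is consulted before the model's score is trusted. The function name `verdict_with_signatures`, the 0.5 score threshold and the stand-in `fooled_model` are all assumptions made for the example. Only `hashlib.sha256` is a real library call.

```python
import hashlib
from typing import Callable, Set


def verdict_with_signatures(
    file_bytes: bytes,
    blocklist: Set[str],
    ml_score: Callable[[bytes], float],
    threshold: float = 0.5,
) -> str:
    """Check the signature blocklist first, then fall back to the ML score."""
    digest = hashlib.sha256(file_bytes).hexdigest()
    if digest in blocklist:
        return "bad"  # signature hit overrides whatever the model believes
    return "bad" if ml_score(file_bytes) >= threshold else "good"


def fooled_model(_data: bytes) -> float:
    """Stand-in for a model that has been tricked into a low malice score."""
    return 0.1


sample = b"known malware with a certificate value attached"
signature_update = {hashlib.sha256(sample).hexdigest()}

print(verdict_with_signatures(sample, set(), fooled_model))             # model alone
print(verdict_with_signatures(sample, signature_update, fooled_model))  # with the update
```

An exact-hash blocklist is the crudest possible signature and is trivially evaded by changing a byte. It is used here only because it keeps the example short.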
At the other end of the scale, you're not going to catch a lot of sophisticated malware in a timely manner if you're waiting for 30+ detections before you train your systems on it. So, what is the right number? Files that have single digit detections on VirusTotal often represent both heuristic detections of bad files and false positives of good files. The heuristic detection can be based on the identification of packers being used, rather than an actual feature of the contained file, further risking the polluting of either data set.
These two examples, a good file being classified as bad because of a heuristic packer detection, and malware with a code signed value being classified as good, mean that AI/ML systems are trained from day one on a polluted dataset, thereby eroding accuracy and confidence in the probability of verdicts from ML features.
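One way to keep the low-detection grey zone out of both corpora is to triage it rather than label it. The sketch below is an assumption-laden illustration: the detection names, the `PACKER_HINTS` substrings and the `triage` function are all hypothetical, and real AV detection names vary widely between vendors.

```python
from typing import List

# Hypothetical substrings suggesting a detection fired on the packer,
# not on the packed file's own behaviour.
PACKER_HINTS = ("packed", "packer", "upx")


def triage(detection_names: List[str], high: int = 30) -> str:
    """Bucket a sample by how many engines flagged it and why."""
    count = len(detection_names)
    if count >= high:
        return "bad"
    if count == 0:
        return "no-detections"  # not proof of good; handled elsewhere
    packer_only = all(
        any(hint in name.lower() for hint in PACKER_HINTS)
        for name in detection_names
    )
    # Low counts are ambiguous: hold them back from either training corpus.
    return "review:packer-only" if packer_only else "review:low-confidence"


print(triage(["Heur.Packed.UPX", "Generic.Packer.A"]))
print(triage(["Trojan.Agent.X", "Heur.Packed.UPX", "Backdoor.Y"]))
print(triage(["Trojan.Agent.X"] * 30))
```

Both "review" buckets are held back from training until someone has looked at them, so neither the good nor the bad corpus is polluted by a guess.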
How do we fix this?
Humans and signatures. Just don't tell anyone.
Artificial Intelligence and Machine Learning solutions are appealing because they potentially solve the problem of the shrinking pool of human talent in the security process. They are also attractive to companies because they reduce the cost of running a security operation. However, without better classification and learning processes, AI and ML-based security solutions are bound to share the fate of many previous silver bullets. They sound great, but when you try to use them, you find that the promise is eroded by false positives and false negatives.