Part one: Blacklists, clustering and The Matrix

How can you tell whether the computer you're about to connect to is malicious or under criminal control?

More importantly, how can you tell "pre-emptively" if this happens to be the first time you or your organization has ever even considered connecting to the computer?

These are the kinds of questions I get asked on a weekly basis as organizations try to fathom which of the techniques and technologies available to them are capable of thwarting today's evolving threats.

There are actually many different methods of "pre-emptively" detecting (and blocking) malicious computers — the most successful of which stem from blacklists or some kind of reputation system. Unfortunately, negative preconceptions within the industry often undermine their technological value.

To many people, the term "blacklist" is associated with a backward and outdated method of approaching a dynamic threat. Sales folks are loath to use the term with prospective customers, and CIOs, CSOs, CISOs, etc. are prone to regard technologies that rely on blacklists as dated and incapable of dealing with the pace of an evolving threat.

Blacklists, despite being widely deployed, conceptually struggle to keep pace with the agile nature of many of the more advanced threats — particularly drive-by malware installation and botnet command-and-control (C&C) channels — and, as a result, are often dismissed out of hand.

The tarnished blacklist name has meant that dynamic reputation systems have come to the fore as a critical pre-emptive technology touted by vendors and licensed by the score — even though, in many practical terms, they're so similar to blacklists that they can often be considered one and the same. I suppose it is a break from calling things a "high-fidelity dynamic blacklist."

At their core, reputation systems are a technology that mixes together multiple blacklists (each focused upon a particular malicious attribute, e.g. a list of known spam IP addresses) and other historical attributes (such as frequency of badness) to provide a prediction as to whether other computers in (network) proximity are also going to pose a lesser or greater threat.
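To make the idea concrete, here is a minimal sketch of how such a score might be put together. Everything in it is an assumption made for illustration: the two blacklists, the weights, the /24 notion of "proximity" and the cap of five bad neighbors are invented, and the historical attributes mentioned above are left out to keep it short. No real product is being described.

```python
import ipaddress

# Hypothetical blacklists, each focused on one malicious attribute.
# The addresses come from the RFC 5737 documentation ranges.
BLACKLISTS = {
    "spam": {"192.0.2.10", "192.0.2.11", "192.0.2.57"},
    "cnc": {"192.0.2.11", "198.51.100.7"},
}

# Illustrative weights (assumptions, not taken from any real system).
WEIGHTS = {"spam": 0.3, "cnc": 0.6}
NEIGHBOR_WEIGHT = 0.4
NEIGHBOR_CAP = 5  # this many bad neighbors counts as "fully bad" surroundings


def reputation(ip: str, prefix: int = 24) -> float:
    """Return a badness score in [0, 1] for an IP address.

    The score combines the address's own blacklist entries with the
    number of blacklisted addresses in the same /prefix network.
    """
    # Direct evidence: sum the weight of every list that contains the address.
    direct = sum(w for name, w in WEIGHTS.items() if ip in BLACKLISTS[name])

    # Proximity evidence: distinct blacklisted neighbors in the same network.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    bad_neighbors = {
        addr
        for entries in BLACKLISTS.values()
        for addr in entries
        if addr != ip and ipaddress.ip_address(addr) in network
    }
    proximity = min(1.0, len(bad_neighbors) / NEIGHBOR_CAP)

    return min(1.0, direct + NEIGHBOR_WEIGHT * proximity)


if __name__ == "__main__":
    for candidate in ("203.0.113.5", "192.0.2.200", "192.0.2.11"):
        print(candidate, round(reputation(candidate), 2))
```

The point of interest is the middle case: 192.0.2.200 appears on no list at all, yet it still receives a non-zero score because three of its network neighbors are listed. That is the "pre-emptive" part a plain blacklist lookup cannot provide.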

IP-based reputation systems have been an integral component of anti-spam technologies for close to two decades now. Consequently, reputation systems are often dismissed as old-hat.

A more recent problem for business executives and security professionals tasked with deploying pre-emptive technologies (capable of answering those first couple of questions at the beginning of this blog) lies with the confusion caused by vendors that feel the need to "jazz up" their reputation systems by throwing around all kinds of crazy mathematical (or roughly mathematical-sounding) terms.

But a question still comes to the fore: Does the industry really need dynamic domain name and/or IP reputation systems to keep track of evolving threats? The answer is absolutely, and the reasons are very simple.

Unfortunately, on a daily basis, you can only analyze so much malware (to get new domains for your “binary” domains blacklist) and you can only probe so many domains to keep a blacklist fresh.

But one thing is certain: You will never be sure whether you've identified all of the badness that could harm your enterprise network. A simple update to a botnet agent's configuration file or the purchase and subsequent activation of 20 additional C&C domain names will instantly (and negatively) affect blacklist and signature-based detection rates.
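The arithmetic behind that drop is easy to sketch. The example below assumes a purely hypothetical botnet with 30 C&C domains, all of them already blacklisted, whose operator then activates 20 fresh ones; the domain names and counts are made up for illustration.

```python
def detection_rate(active: set[str], blacklist: set[str]) -> float:
    """Fraction of the currently active C&C domains that the blacklist covers."""
    return len(active & blacklist) / len(active)


# Hypothetical starting point: every active C&C domain is already blacklisted.
blacklist = {f"cnc{i}.example" for i in range(30)}
active_before = set(blacklist)

# The operator buys and activates 20 new domains the blacklist has never seen.
new_domains = {f"fresh{i}.example" for i in range(20)}
active_after = active_before | new_domains

if __name__ == "__main__":
    print(detection_rate(active_before, blacklist))  # 1.0
    print(detection_rate(active_after, blacklist))   # 0.6
```

Nothing about the blacklist got worse; the threat simply moved, and coverage fell from 100 percent to 60 percent in a single step.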

Machine-learning approaches, like clustering, base a detection decision on a “behavioral analysis” of the potentially malicious domain and/or IP address. Basically, if a new domain name clusters together with 50 known C&Cs (and assuming that the statistical features characterizing domain names make sense), you don't actually need to execute the malware to arrive at a conclusion that it is probably malicious.

Clustering measures how mathematically close a new element sits to elements that are already known and categorized as fraudulent, and from that distance it is possible to infer its badness with a certain degree of confidence.

It is clear that clustering and other machine learning techniques provide the ability to move away from a Boolean (Yes/No) detection decision and toward a detection confidence score (e.g. expressed as percentages).
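Here is a rough sketch of what a confidence score of that kind can look like. To stay short it uses a k-nearest-neighbor vote as a simple stand-in for a full clustering algorithm, and it describes each domain with three lexical features (label length, digit ratio, character entropy). Both choices, along with every domain name in the sample data, are assumptions for illustration; a real system would likely lean on behavioral features and far more data.

```python
import math
from collections import Counter

# Hypothetical labeled history: True means previously categorized as C&C.
KNOWN = {
    "x7k2q9vz1p.example": True,
    "q8w3e5r7t1.example": True,
    "z9x8c7v6b5.example": True,
    "m4n5b6v7c8.example": True,
    "mail.example": False,
    "shop.example": False,
    "news.example": False,
    "portal.example": False,
}


def features(domain: str) -> tuple[float, float, float]:
    """Map a domain to (scaled length, digit ratio, scaled character entropy)."""
    label = domain.split(".")[0]
    counts = Counter(label)
    entropy = -sum(
        (n / len(label)) * math.log2(n / len(label)) for n in counts.values()
    )
    digit_ratio = sum(ch.isdigit() for ch in label) / len(label)
    # The divisors are rough scaling constants so no single feature dominates.
    return (len(label) / 20.0, digit_ratio, entropy / 4.0)


def confidence(domain: str, k: int = 3) -> float:
    """Share of the k nearest known domains that are known-bad (0.0 to 1.0)."""
    point = features(domain)
    nearest = sorted(KNOWN, key=lambda d: math.dist(point, features(d)))[:k]
    return sum(KNOWN[d] for d in nearest) / k


if __name__ == "__main__":
    print(confidence("a1s2d3f4g5.example"))  # 1.0
    print(confidence("blog.example"))        # 0.0
```

Neither test domain appears in the labeled history, and neither was probed or executed. The score comes entirely from which known elements each one sits closest to, and it is a number between 0 and 1 rather than a Yes/No answer.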

Breaking down the barrier of previously instilled nonsense sales terminology, while avoiding "blacklist" keyword mines, isn't a task I particularly relish when engaging with a prospective customer for the first time. That said, you can have a little fun with it, and in some ways it is reminiscent of the scene in The Matrix when Morpheus offers Neo the choice between the red pill and the blue pill.

Distilling these word-soup black box technologies down to more comprehensible techniques and truths tends to bring its own business rewards.

Without wasting time trying to explain the intricacies or technical nuances of "machine learning", "neural networks", "statistical pattern recognition", etc., I've found that a simple discussion of what clustering is (and how it is efficiently applied to the problem space) tends to get the job done, and undoes much of the confusion caused by nonsense terminology.

"Clustering," it would seem, is a key knowledge nugget in understanding how blacklist and reputation systems are effectively leveraged for pre-emptive threat protection.

While the strengths and merits of blacklist development are often described as an art, clustering is often perceived as magic.

When implemented in an intelligent way, clustering seems almost precognitive in what it can do, reminding me of one of Arthur C. Clarke's quotes: "Any sufficiently advanced technology is indistinguishable from magic."

In part two, we'll lift the shroud of clustering magic, expose the white rabbit, and come to understand just how the technology and mathematics make pre-emptive protection possible.