Gunter Ollmann
Following on from my previous discussion of the limitations of blacklists in dealing with rapidly evolving threats, and their symbiotic relationship with clustering technologies, let's lift the shroud on the clustering magic.

Clustering is one of those things that sounds more mysterious than it really is, but is much tougher to apply correctly than first assumed. It is a whole area of math in its own right, tasked with assigning a set of observations into subsets (clusters) so that observations in the same cluster are more similar to each other than to those in other clusters.

When implemented in an intelligent way, it seems almost precognitive in what it can do, reminding me of one of Arthur C. Clarke's quotes: "Any sufficiently advanced technology is indistinguishable from magic."

By way of explaining the role that clustering (and blacklists) plays in pre-emptive threat detection technologies, I'll throw in a few marbles here and there, because the concepts are easier to understand that way.

Consider a team of threat analysts who are tasked with maintaining a blacklist of drive-by-download websites. They're supplied with a never-ending but vetted list of candidate domain names, against which they conduct assorted tests (e.g. Is the website available? Where is it located?) and apply various logical filters to decide whether each domain name is really malicious and should be added to the blacklist.

But let's simplify things to marble terms. For example, consider a giant hopper full of marbles in which boxes of new marbles are added every day. Workers are automatically fed one marble at a time, and are tasked with picking out the flawed marbles.

The application of clustering techniques can speed up the task considerably. For example, if we introduce a machine that rapidly inspects the color of the marbles and places similarly colored marbles into the same jar for manual inspection, the worker can run their tests against a subset of closely related marbles.

If color is a significant component in the flaws being observed, the worker may be able to inspect a small handful of marbles from a single color jar, rather than every marble in the jar, and be able to conclude whether the entire jar is good or bad — thereby successfully "characterizing" objects within a cluster.
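The two steps above — sorting by a shared attribute, then extrapolating from a small sample to the whole jar — can be sketched in a few lines of Python. The marble data, jar sizes and sample size here are all invented for illustration:

```python
from collections import defaultdict
import random

# Hypothetical marbles as (color, is_flawed) pairs; values are made up.
marbles = [("red", True), ("red", True), ("red", True),
           ("blue", False), ("blue", False), ("green", False)]

# Cluster: the sorting machine drops same-colored marbles into one jar.
jars = defaultdict(list)
for color, flawed in marbles:
    jars[color].append(flawed)

def characterize(jar, sample_size=2):
    # A worker inspects only a small sample and extrapolates the verdict
    # to every marble in the jar.
    sample = random.sample(jar, min(sample_size, len(jar)))
    return "flawed" if any(sample) else "good"

for color, jar in sorted(jars.items()):
    print(color, characterize(jar))
```

The saving is obvious: the worker tests two marbles per jar instead of six marbles in total, and the saving grows with the jar size.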

Unfortunately there are a number of real-world problems likely to prevent the marble company from simply saying, "We don't need the workers anymore. All red marbles are flawed so we can automatically throw those away when we encounter them".

First, using coarse-grain clustering, the supply of marbles is continuous, but you'd expect the quality of the marbles to change over time. If all the red marbles are currently flawed, then it is almost certain that someone will eventually fix the manufacturing process so that future red marbles are of a higher quality.

Second, using finer-grain clustering, what constitutes the color "red?" There are a lot of different versions of "red" — strawberry red, ruby red, cherry red, crimson, etc. — and the machines that manufacture the marbles may, over time, alter the mix of the various ingredients in ways that change the resulting colors. And what about marbles that are not a single, solid color? If a marble is 60 percent red, 30 percent blue and 10 percent green, which bucket should it be placed in?
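One common answer to the mixed-color question is a nearest-centroid assignment: place the marble in the jar whose reference color it most closely resembles. This is a minimal sketch, with the reference colors expressed as made-up (red, green, blue) fractions:

```python
# Reference "jar" colors as (red, green, blue) fractions; illustrative values.
jar_centroids = {
    "red":   (1.0, 0.0, 0.0),
    "green": (0.0, 1.0, 0.0),
    "blue":  (0.0, 0.0, 1.0),
}

def nearest_jar(mix):
    # Assign a mixed-color marble to the jar whose reference color is
    # closest in squared Euclidean distance.
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(jar_centroids, key=lambda jar: dist2(mix, jar_centroids[jar]))

# A marble that is 60% red, 10% green and 30% blue lands in the red jar.
print(nearest_jar((0.6, 0.1, 0.3)))  # red
```

Note that the answer depends entirely on which reference colors you define: add a "purple" jar and the same marble may be assigned differently, which is exactly the rigor-of-definitions problem discussed below.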

These attributes of marble production influence the types and ways in which the marbles are clustered together into their respective jars. Since marble quality will inevitably change over time, a schedule for how frequently the jars must be manually re-inspected will need to be defined.

Since a specific color attribute can also be open to interpretation, decisions must be made about how rigorously color definitions will be enforced. Being more specific about the color content of a marble will likely result in the need for more, but smaller, jars, and will also increase the overall number of manual inspections a worker needs to perform to classify a jar's contents as flawed or not.

This jar clustering process can also be improved by taking a closer look at some of the additional tests that the workers perform in deciding whether a marble is flawed or not. Weight, size and glass clarity may all be easily measurable attributes, each with its own inherent thresholds, and separate machines could be added to the production line to perform these functions.
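One simple way to combine several measured attributes is to bin each one against its threshold and use the combination of bins as the jar. The attribute names, units and thresholds below are all invented for illustration:

```python
# Each machine measures one attribute and bins it against a threshold;
# the combination of bins decides which jar the marble lands in.
# Thresholds and units are hypothetical.
def jar_key(weight_g, diameter_mm, clarity):
    return (
        "heavy" if weight_g > 5.0 else "light",
        "large" if diameter_mm > 16.0 else "small",
        "clear" if clarity > 0.8 else "cloudy",
    )

print(jar_key(5.5, 14.0, 0.9))  # ('heavy', 'small', 'clear')
```

With three binary attributes there are at most eight jars; every extra attribute (or finer-grained bin) multiplies the jar count, which is the trade-off between precision and inspection workload described earlier.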

Clustering, in turn, allows you to automatically group together similarly attributed objects, and then decide whether or not to do something with them without having to perform tests against every member of the cluster (e.g. the jar's contents).

For example, domain name attributes — such as when it was registered, the location of the name servers supporting the domain, the ccTLD registration details, past IP addresses associated with the server the name is pointing to, etc. — are useful in automatic clustering terms.

However, by themselves, the clusters that materialize are of little value in practical terms unless you can successfully label them — or at least figure out how closely they are related to other, previously characterized objects within a cluster.

And this is where blacklists (and whitelists), for threats such as phishing domains, come into play, as they can help to automatically characterize objects within the clusters.
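A minimal sketch of that characterization step: if enough of a cluster's members already appear on a blacklist, the whole cluster — including names never seen before — inherits the malicious label. The cluster contents, domain names and threshold here are all made up:

```python
# Hypothetical clusters of domains (grouped earlier by shared attributes)
# and a small blacklist; every name below is invented for illustration.
clusters = {
    "cluster-1": ["bad-a.example", "bad-b.example", "unknown-c.example"],
    "cluster-2": ["shop.example", "news.example"],
}
blacklist = {"bad-a.example", "bad-b.example"}

def label_cluster(members, blacklist, threshold=0.5):
    # Characterize the whole cluster as malicious if the fraction of
    # already-blacklisted members meets the threshold.
    hits = sum(1 for d in members if d in blacklist)
    return "malicious" if hits / len(members) >= threshold else "unlabeled"

for name, members in clusters.items():
    print(name, label_cluster(members, blacklist))
```

In this sketch "unknown-c.example" is flagged purely by association, despite never having been tested directly — which is the pre-emptive detection the article is describing.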

In practical terms, though, most organizations probably don't want to undertake their own clustering and manual labeling processes. For one thing, you normally need tremendous volumes of historical data. For another, the process of clustering can be quite complex and hard to scale, so you need pretty beefy systems to support it. And finally, clustering at this kind of level isn't an instantaneous process.

Luckily there's a pretty simple way of being able to take all this clustering goodness and apply it to security technologies that can be deployed within enterprise networks — without requiring the heavy lifting of running dedicated clustering processes — to pre-emptively protect against evolving threats. The secret lies in the creation and distribution of a labeled "map."

Just as peaks and troughs on a topographical map indicate mountaintops and valleys, and the distance between contour lines indicates how steep and well defined a mountain or valley is, clustering maps provide the ability to take a previously unseen or unclassified domain name and rapidly figure out its malicious associations.
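Conceptually, consulting such a "map" can be as cheap as placing a new domain into a feature space and taking the label of the nearest characterized cluster. This is a toy sketch only: the two features (domain age in days, number of historical IPs), the centroids and the labels are all invented:

```python
import math

# A toy "map": pre-computed cluster centroids in a made-up feature space
# (domain age in days, number of historical IPs), each with a label
# assigned earlier via blacklist/whitelist characterization.
labeled_map = [
    ((2.0, 40.0), "malicious"),  # young domains cycling through many IPs
    ((900.0, 3.0), "benign"),    # long-lived domains on stable hosting
]

def classify(features):
    # A never-before-seen domain is placed on the map and takes the
    # label of the nearest characterized cluster.
    centroid, label = min(labeled_map,
                          key=lambda cl: math.dist(features, cl[0]))
    return label

print(classify((5.0, 35.0)))  # malicious
```

The heavy lifting (clustering huge historical datasets and labeling the clusters) happens offline; the deployed sensor only performs this fast nearest-cluster lookup, which is why the map can be distributed to enterprise networks.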

Just as with the marble example, where different suppliers and changes to manufacturing processes affect the jars into which the marbles are clustered and subsequently classified, the internet is similarly dynamic. That means clustering systems, and the reputation systems that depend upon them, must be fed continuous streams of data; the labeled objects within each cluster must be updated constantly; and the "maps" themselves are time dependent.

Armed with that insight into clustering and the relationship between blacklists and reputation systems, I've found that the folks I speak with end up with a much better understanding of how it is possible not only to keep up with, but to pre-emptively detect and protect against, an evolving threat — assuming you apply the knowledge wisely.

Such techniques can provide an early detection system for suspicious domain names and IPs, even without the presence of an infecting malware sample (or botnet agent) that can be directly correlated with these network elements.