Unsupervised anomaly detection giving security pros a leg up on bad online actors

Attackers in 2017 have it easy. Powerful tools for abusing online services are freely available online. Compromised Android devices give attackers a vast reservoir of high-reputation IP addresses to exploit.

At the same time, a more privacy-conscious public is adopting privacy-protecting devices and browsers, along with IP-masking services that hide the original IP address so that it can't be traced. That makes it harder to distinguish between attackers and legitimate users who are justifiably trying to protect their personal information. Meanwhile, today's sophisticated, high-volume attacks can permanently damage your brand in mere seconds.

As online data theft continues, banks, merchants and Internet-based businesses must collect and analyze massive amounts of event data. This analysis usually needs to occur in real time. That means methods for proactively detecting fraud or other unsavory behavior must also work faster and differently. Common approaches to risk scoring and fraud detection, such as supervised machine learning or a rules engine, need the help of new methods, because hackers are continually devising new ways to trick users and gain access to private data.

The state-of-the-art technology employed by defenders – supervised machine learning – is becoming less effective as attackers work faster, their tools get more complex and stolen IPs are easy and cheap to obtain. Supervised machine learning relies on labels to learn patterns of behavior that constitute fraud or abuse. For example, a supervised machine learning model may analyze properties of a login – the IP address, the device ID, the browser fingerprint, and its geographic location over time – to determine whether a user's account has been compromised. To train a model, you need examples of previous login events that are labeled as known fraudulent.

Getting these labels can be difficult, expensive or even impossible. How do you get examples of known fraudulent logins? You can have analysts manually label them (by “eyeballing” the audit trail after the login occurred), poll users, or use two-factor authentication to verify a legitimate login. Even if you choose to use these expensive techniques, you'll likely miss many fraudulent logins.

While these labels can feed into the machine learning system, the latency of those feeds is often too high to stop attacks in real time. In many anti-abuse applications, delays of a few seconds can be extraordinarily costly. For example, your account takeover system must identify a compromised account before an attacker can exfiltrate data (often within just a few seconds). Labeled data is also getting harder to acquire, because systems must analyze continually growing data sets quickly.

Enter a new method, labels not required

Unsupervised anomaly detection is a complementary technique that can augment a rules engine or supervised machine learning system. It examines features of a particular user's or device's behavior and compares them against the behavior of the entire population or against that entity's own historical behavior.

Without any labeled data, the system can identify anomalous behavior by how far its features deviate from the expected distribution. These anomalies can be used to flag risky logins in real time, serve as inputs to a supervised machine learning system, or reduce the cost of labeling data by identifying a small subset of logins that need to be labeled.
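One simple way to express "how far a feature deviates from the expected distribution" is a z-score: the distance from the historical mean, measured in standard deviations. The sketch below is illustrative only; the feature (logins per hour) and the sample numbers are assumptions, not data from any real system.

```python
from statistics import mean, pstdev

def anomaly_score(value, history):
    """Return how many standard deviations `value` lies from the mean of `history`."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Every historical value is identical, so any different value is maximally unusual.
        return 0.0 if value == mu else float("inf")
    return abs(value - mu) / sigma

# Hypothetical feature: logins per hour for one account over recent hours.
logins_per_hour = [2, 3, 1, 2, 4, 3, 2, 3]

print(anomaly_score(3, logins_per_hour))   # small score: typical activity
print(anomaly_score(40, logins_per_hour))  # large score: worth flagging
```

No labels are involved: the score depends only on the observed history. A real deployment would likely combine many features and might prefer statistics that are less sensitive to extreme values.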

For instance, a company can first run unsupervised detection on an individual user. The system looks at factors such as where an online interaction originated and what information was given, then assigns a score for how unusual the interaction is compared with that user's historical behavior. Next, the system can compare the same interaction against the entire population of interactions and assign a second score.

If both the user score and the population score indicate a high degree of anomaly, the anti-fraud system would block the interaction outright. Companies don't need a lot of data to get to that analysis, and they don't need labels. Companies can also cut down on false positives, a common problem that matters all the more today, when user behavior and hacker tactics are changing all the time.
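The two-score approach described above can be sketched as follows. This is a minimal illustration, not a production design: the rarity measure, the 0.95 threshold, the country-code feature and the intermediate "review" outcome are all assumptions added for the example. The only rule taken from the text is that an interaction is blocked when both scores are high.

```python
from collections import Counter

def rarity(value, observations):
    """Score in [0, 1]; 1.0 means `value` never appears in `observations`."""
    if not observations:
        return 1.0
    return 1.0 - Counter(observations)[value] / len(observations)

def decide(value, user_history, population, threshold=0.95):
    """Combine a per-user score and a population-wide score into one decision."""
    user_score = rarity(value, user_history)
    population_score = rarity(value, population)
    if user_score >= threshold and population_score >= threshold:
        return "block"   # unusual for this user AND for everyone else
    if user_score >= threshold or population_score >= threshold:
        return "review"  # assumption: a middle tier for single-score anomalies
    return "allow"

# Hypothetical login-country history for one user and for all users.
user_history = ["US"] * 20
population = ["US"] * 70 + ["DE"] * 29 + ["XX"] * 1

print(decide("US", user_history, population))  # allow
print(decide("DE", user_history, population))  # review
print(decide("XX", user_history, population))  # block
```

Requiring both scores to be high before blocking is what keeps false positives down: a user who travels to a country that is common across the population looks unusual only on the first score.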

Consider an analysis of the number of accounts created per IP address, broken down by another data point such as the ISP. A company could analyze this data to see if someone is creating fake accounts. Most ISPs will be in the middle of the bell curve for accounts created per IP address. So, by setting a threshold for the number of accounts per provider, companies can focus on the outliers. They don't know whether those IP addresses are good or bad, but they do know they are anomalous; further analysis may identify a bad actor.
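A rough sketch of that analysis is shown below. It counts accounts created per IP address, groups the counts by ISP, and flags IPs that sit far above what is typical for their provider. The threshold rule (median plus a multiple of the median absolute deviation, with a minimum margin) and the parameters `k` and `min_margin` are assumptions chosen for the example; the article does not prescribe a specific formula. The IP addresses and ISP names are placeholders.

```python
from collections import Counter, defaultdict
from statistics import median

def flag_outlier_ips(signups, k=5.0, min_margin=3):
    """signups: list of (ip, isp) pairs, one per account created.

    Returns {ip: account_count} for IPs above their ISP's threshold.
    """
    per_ip = Counter(ip for ip, _ in signups)
    isp_of = {ip: isp for ip, isp in signups}

    # Collect the per-IP account counts observed for each ISP.
    counts_by_isp = defaultdict(list)
    for ip, count in per_ip.items():
        counts_by_isp[isp_of[ip]].append(count)

    flagged = {}
    for ip, count in per_ip.items():
        counts = counts_by_isp[isp_of[ip]]
        mid = median(counts)
        mad = median([abs(c - mid) for c in counts])  # median absolute deviation
        threshold = mid + max(k * mad, min_margin)
        if count > threshold:
            flagged[ip] = count
    return flagged

# Hypothetical data: (ip, isp, number of accounts created).
signups = []
for ip, isp, n in [
    ("192.0.2.1", "isp-a", 1), ("192.0.2.2", "isp-a", 2),
    ("192.0.2.3", "isp-a", 1), ("192.0.2.4", "isp-a", 2),
    ("192.0.2.9", "isp-a", 40),
    ("198.51.100.1", "isp-b", 1), ("198.51.100.2", "isp-b", 2),
]:
    signups += [(ip, isp)] * n

print(flag_outlier_ips(signups))  # {'192.0.2.9': 40}
```

The flagged address is only known to be anomalous, not malicious; it is a candidate for the further analysis described above.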

A core benefit of unsupervised anomaly detection is that it can withstand change. People change their behavior frequently, and other factors change too, such as ISPs going out of business or hackers developing new ploys for deception. Through all of that, companies can still identify the outliers by reviewing data based on how abnormal it is compared to all other signals.
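One common way to keep an anomaly detector current as behavior shifts is to score each new observation against a sliding window of recent values only. The sketch below assumes that approach; the class name, window size and sample values are hypothetical, and the article does not say which mechanism a given product uses.

```python
from collections import deque
from statistics import mean, pstdev

class RollingBaseline:
    """Scores values against the most recent `window` observations only."""

    def __init__(self, window=100):
        self.values = deque(maxlen=window)  # oldest values drop off automatically

    def score_and_update(self, value):
        """Score `value` against the current window, then add it to the window."""
        if len(self.values) < 2:
            score = 0.0  # too little history to judge
        else:
            mu, sigma = mean(self.values), pstdev(self.values)
            if sigma > 0:
                score = abs(value - mu) / sigma
            else:
                score = 0.0 if value == mu else float("inf")
        self.values.append(value)
        return score

baseline = RollingBaseline(window=5)
for v in [10, 11, 9, 10, 10]:
    baseline.score_and_update(v)

print(baseline.score_and_update(50))  # high: 50 is far from the recent norm

for v in [51, 49, 50, 50]:
    baseline.score_and_update(v)

print(baseline.score_and_update(50))  # low: the norm has shifted to about 50
```

Because the window forgets old data, a lasting shift in normal behavior stops looking anomalous after a while. The trade-off is that a patient attacker can also shift the baseline gradually, so window size is a design decision rather than a fixed constant.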

As with all tactics for security and anti-abuse, there is no silver bullet. Unsupervised anomaly detection is a technique that helps fill the gaps in real-time security analysis of large, frequently changing data sets.