'It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.'
— Sherlock Holmes
Threat intelligence teams are increasingly integrating data science perspectives into their workflows. By providing these teams with new insight into the threat intelligence landscape, data science can help analysts make more informed decisions. While artificial intelligence (AI) cannot replace human analysts, these technologies will continue to be a crucial tool in the cyber intel battlefield.
Today's threat intelligence analysts are often tasked with sifting through countless datasets from numerous feeds. In many cases, analysts' expertise eventually enables them to derive useful insights from all of this data. They might be able to, for example, connect the dots between an actor on one feed and the actor's persona on another, or identify a singular data point revealing a malware campaign that has been targeting organizations across 10 different verticals. However, gleaning these insights by manually combing through massive amounts of data is neither efficient nor ideal.
Any domain where analyses, pattern detection, or prediction can be automated will likely benefit from employing a team of data scientists to get the job done. There are several steps in the threat intelligence data science workflow that set the stage for understanding the role data science plays in this landscape.
Threat actors, whether they like it or not, have to create data points while carrying out their campaigns. Consider the actor conducting a campaign and spreading the malware through a botnet. Analysts capable of setting up a honeypot can create a collection mechanism to track every C2 command the actor makes.
Another more pervasive data point is text. Most communication between threat actors happens over the internet; whichever platform an actor chooses to communicate through is another avenue for data collections. A data science team will typically have an entire group of individuals focused on collection along with another group focused on data engineering. These data engineers help move the collected data, for example, from the honeypot to the data store in a structured format before preparing it for the next stage of automated analysis.
Automatically analyzing the tremendous amount of available data removes the burden from analysts. This means that an analyst can stop manually parsing through chat logs and instead, for example, rely on a process to alert them when certain keywords are mentioned. It is typically a requirement for the data to be collected in real time so analysts can trust that the analysis is up to date. Timing is also crucial because data store and data structure matters tremendously when considering the technical limitations involving query times and explicit data types.
Developing automated analyses is typically dictated to data engineers or those who understand how to process the data to provide insightful analytics. Depending on the extent of the analysis, this process also requires a level of domain expertise. A moving average report can reveal indicators of the present moment, but producing predictive or classification reports requires machine learning.
There is a risk associated with letting a machine learn on its own, however the risk comes with how the results are interpreted. By definition, machine learning trains computers to interpret data without being explicitly told how to process the data. When applied to cybersecurity, these techniques can produce far greater insight than basic automated mathematics or keyword searches can.
As a result, data scientists developing machine learning models for threat intelligence aim to rely on the subject matter expertise of the analysts with whom they work. Accurate, precise, and specific models require an integrated collaboration among analysts and machine learning engineers because domain expertise is crucial when interpreting results. Teaching a machine to understand certain facets of the cyber threat landscape can aid analysts in understanding threat actors, how certain trends and threats are changing, or even when an event may occur.
This section explores techniques data scientists employ to produce automated intelligence or machine learning analyses in the cyber threat landscape. Depending on the problem, scientists may use a different set of mathematical tools to find a solution. Identifying and implementing the proper tactics and techniques for solving any given problem is a balancing act that requires the right experience and expertise.
Typically the simplest tools are the ones that solve the most problems. A simple moving average can be deployed in a data pipeline to understand which direction a time series of threat points—such as mentions of a certain malware in a forum—have moved in the past. Additional analysis can then help analysts to identify and understand the result of a cyber event.
An example of this is a cumulative probability distribution of botnet prices: a simple histogram for counts of the time of day that a certain platform experiences the most chat activity from threat actors. These kinds of analyses, although trivial mathematically, are crucial building blocks to master before moving on to more math-heavy machine learning techniques.
In a domain such as cybersecurity where a prediction can potentially mean a cyber attack, it is crucial that analysts be prepared to step in when a model decides a threat point exists. The level of involvement varies depending on what kind of machine learning technique is producing the prediction. In an "unsupervised" setting, the interpretations of the model are projected into a latent space and therefore there is not a "yes/no" answer, but rather a depiction that requires interpretation. In this process, analysts are required to interpret what the unsupervised labels, classes, or groups of data points are trying to say.
Once analysts determine these labels, unsupervised models can then be used to produce feature representations of data and even be fed into supervised techniques. Supervised techniques make predictions in a discrete target label space or probability distribution; therefore, the results do not require immense interpretation. Machine explainability is still crucial in this respect. In other words, it's important to understand mathematically which pieces of information influenced a given result. Predictions typically do not cause an immediate reaction without analyst insight. These predictions and classifications should be interpreted as constructive data points for analysts to further investigate; they aim to decipher the threat points from the abundant noise of the deep and dark web.
Natural language processing (NLP) techniques are seeing increased adoption among threat intelligence professionals because of the amount of text data produced daily. Cybercriminal enterprises can exist in separate corners of the world and still conduct lucrative operations through online communication. For instance, some of the communication applications preferred by threat actors facilitate the exchange of billions of messages every day. This avalanche of text and document data is another avenue for machine learning analysis in the form of NLP.
Data scientists employ several NLP techniques to support threat intelligence operations. Topic modeling is one such unsupervised technique that strives to cluster text based on its topical analysis, typically by using an algorithm called latent dirichlet allocation (LDA). After going through a labeling process like the one described previously, a data scientist can leverage this technique to investigate how topics are changing over time as the cyber threat landscape grows and evolves.
Knowledge graphs aim to visually represent an analyst's understanding of the cyber threat landscape. These graphs typically have edges and nodes similar to those we see in graph theory ; the nodes represent entities, events, or ideas, and the edges depict the relationship between them. Data scientists deploy knowledge graphs to represent relationships in an ecosystem. Information is stored in the form of "this topic relates to this other topic through this".
This technique allows data scientists and subject matter experts to ask questions like, "what kind country is this threat actor from?" or "what is the relationship between these two IP addresses?" Although complex, this graphical representation produces simple answers that can support and develop the factual representation of an ecosystem. The answers produced by a knowledge graph are not predictions, but connections representing the truth of a data source. It is a data scientist's job to feed these graphs with new data and to always refine and remodel node connections based on analyst insight. This process marks the beginning of a holistic AI understanding of dark web data sources.
There is no arguing that data science provides valuable insight into the cyber threat landscape. The tools that data scientists create, manage, and iterate on can lead to insight much faster than an analyst is capable of on their own. It is easy to envision a future where analysts utilize AI models to quickly investigate activity on millions of different potential dangers and predict with confidence what these observations mean to an enterprise or institution. The growth AI has seen is partly due to an increased amount of data produced every day, and cybercriminals are not exempt from that truth. The more activity we see from threat actors, the more information data scientists have to eventually stop them. Welcome to the age of AI and analyst symbiosis.
Lance will be teaching a workshop on Scalable Threat Intelligence Design at InfoSec World 2018. Attend this two-day class to learn processes of threat intelligence development.