Security Event Management (SEM) vendors often talk about scalability when describing how well their products can accommodate growth in a customer's network and security requirements.
When talking about scalability in this context, one must also discuss SEM credibility, specifically, the reproducibility of the results. For a product to be truly scalable, the specific SEM results must remain unchanged as the product "scales". While this sounds obvious, it may not always happen. This article explores the concepts of scalability and credibility, and why these are factors that need to be considered when evaluating an SEM product.
An SEM tool's "value add"
We first need to discuss what it is that makes an SEM tool valuable to a user or company. Without an SEM tool, the majority of a security analyst's job is to look at logs from security devices. This becomes extremely difficult, tedious, and error prone (not to mention nearly impossible) when someone has to traverse numerous lengthy security logs, 8 hours a day, 5 to 7 days a week.
Tools that aggregate or collect data from all of the security devices on one network are a good first step in helping analysts to review security data: by combining data from all the devices into a central location, these tools eliminate the need for security staff to look at multiple logs. The problem with aggregation tools is: where do you look for problems? You either have to do this manually (in which case you are pretty much back to "square one") or you have to query a database to figure out what's going on from a security perspective. SEM tools that add analysis and/or correlation to automatically pinpoint where to look for problems while aggregating data provide a considerable leap in productivity, in that they search through the aggregated logs and seek out indicators of "bad activity" for the security analyst.
These tools don't need a break; they work 24 hours a day, 7 days a week, 365 days a year. They maximize a security analyst's time because the analyst only has to focus on events that are worth looking at, rather than searching for a needle in a haystack.
Definition of scalability
While some vendors do a good job of explaining scalability in their product literature, others use the term without really defining it in the context of their product or architecture. According to Webster, we find the definition of scalable to read:
Scal.a.ble adj. that which can be set, regulated, or adjusted according to some measure or need.
For the purpose of this paper,
scalability is defined as the ability of an SEM product to adjust to increased input data while continuing to produce timely and correct output.
In other words, a scalable product can be adjusted to network growth or increased network traffic.
Vendors talk about accommodating wire speed, or, in some cases, report that they are "infinitely scalable". This means that the vendor is laying claim to an architecture that can accommodate a continuous maximum (or, in the extreme case, an infinite amount) of input data without adverse effects on the timeliness or correctness of the resulting event correlations.
An SEM tool that is scalable to twice your normal data capacity should be able to process all of that data in a reasonable amount of time without dropping any data received or incurring a significant delay in processing. Such a product can be adjusted, either through the addition of more hardware or software or through the 'tweaking' of its architecture, to handle the doubled load. A product that is infinitely scalable, by contrast, should be able to "grow" to accommodate an infinite increase in network size or network events. If that last part sounds like a tall order, that's because it is. Since the performance of all software and hardware has finite limits, an as-purchased SEM tool can only be asked to accommodate a finite degree of increased demand; buyers of products that claim infinite scalability should be aware that this is a promise the vendor probably can't keep. So how does an SEM product "grow" to accommodate increased input?
Unless you have a small and static network, you will eventually need your product to scale. Additional software processes will need to be added, and (here's the hard part) all of the new software processes will need to work together, exchanging information in a timely way. Where the vendor provides an appliance, scaling can be achieved either by replacing a single hardware device with a larger, more powerful device, or by introducing multiple hardware devices that work together.
This sounds simple enough, and in simple cases, it is. In all cases, however, it is important to note that a product that is truly scalable and accurate will provide true insight into the security health of your network. An SEM tool that isn't scalable will produce inaccurate results and will force the user into digging through logs, thus providing only the functionality of an aggregation tool.
Definition of credibility
To date, no SEM vendor has discussed how the accuracy of its product is affected when the limits of scalability are approached and/or exceeded. Let us turn again to Webster, who defines credible as
cred.i.ble adj. Offering reasonable grounds for being believed.
For the purpose of this paper
credibility is defined as a characteristic of producing the same SEM results, regardless of the number of processes or devices that are involved in the computation, analysis, or correlation. (It is only through this characteristic that we will have reasonable grounds for believing the results of event correlation.)
Setting the stage
For example, let's assume that you are given the problem of adding three numbers within five seconds, each of which is provided to you on a separate piece of paper by your friend Jane. Jane hands you the 1st number, 8, at time 0, then a 2nd number, 6, one second later, and finally the 3rd number, 13, two seconds later. You can now add the 3 numbers and get the result 27, and if you do so instantaneously, you will have two seconds to spare. The next time, your friend John joins the party and is given the 2nd piece of paper. Jane hands you the 1st number, 8, at time 0, then John tries to hand you the 2nd number, but drops it; in the meantime Jane hands you the 3rd number, 13, one second later. You have to add 3 numbers but you only have 2 - what do you do?
Time is running out... You add the numbers you have and get the result of 21, and at the very same moment, John finally hands over the 6, but you can't do anything with it because your time is up. In the following paragraphs we will show how a variation on this same problem may apply to SEM products.
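The toy exercise above can be sketched in a few lines of code (a hypothetical illustration, not any vendor's API): values that arrive after the deadline are simply lost to the computation, so the "correct" answer depends on delivery timing.

```python
def add_within_deadline(arrivals, deadline):
    """Sum only the numbers that arrive by the deadline.

    arrivals: list of (arrival_time, value) pairs.
    Returns (total, dropped), where dropped holds the late values.
    """
    total = sum(v for t, v in arrivals if t <= deadline)
    dropped = [v for t, v in arrivals if t > deadline]
    return total, dropped

# Round 1: Jane delivers 8, 6, and 13 well within the five seconds.
print(add_within_deadline([(0, 8), (1, 6), (2, 13)], deadline=5))  # (27, [])

# Round 2: John fumbles the 6 and it arrives after the deadline.
print(add_within_deadline([(0, 8), (1, 13), (6, 6)], deadline=5))  # (21, [6])
```

The same input values produce two different "results" purely because of when they were delivered, which is exactly the hazard the SEM discussion below turns on.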
Where the problem lies
Below, Figure 1 shows a simple configuration of an SEM tool, with one correlation engine and one SEM agent (the agent performs data collection and preprocessing for the engine). We show this SEM product, deployed on a simple though typical network, with security devices such as IDSs deployed throughout the potentially vulnerable parts of the network.
Figure 1: An example diagram of a typical network with Intrusion Detection Systems (IDS) monitoring Untrusted, Trusted, DMZ, and Database networks. The IDSs, firewall, routers and database server(s) are sending security log data to the SEM product. The SEM product in this diagram is configured with one input/collection agent accepting data from all security devices and one correlation engine.
Let's say that a company has been using the SEM configuration above and has been getting good results from its SEM tool. Why would this company need to worry about scalability? The reasons are numerous, for instance, they might in the near future do any of the following:
- Upgrade the IDS devices from 100 Mbit to 1 Gbit because of an infrastructure upgrade.
- See the IDS vendor triple the number of signatures that the devices respond to.
- Experience a quadrupling of network traffic.
- Upgrade the internet connection from 1.5 Mbit to 45 Mbit for a new web server application.
- Upgrade the firewall, which might then produce 5 times more log data.
- Increase the scope of the security policy, thus logging 10 times more events.
- Grow so fast that two additional networks, identical to the network shown in Figure 1, are added to the corporate infrastructure.
Using our previous definition from Webster, we generalize the above possibilities and say that the amount of input is "being adjusted according to some need". Per our scalability and credibility definitions, the output of the SEM device must correctly accommodate the increase in input data and must be able to sustain this increase permanently while producing accurate results. It sounds simple. Let's see if it is.
When encountering such a situation in a customer environment, the SEM vendor will typically suggest adding additional modules/processes/devices to the SEM solution in order to handle the increase in data volume. The solution to the problem, as shown in Figure 2, is basically to divide and conquer by adding additional data collection agents and/or correlation engines to help share the load. This approach is seemingly obvious and innocuous on the surface, yet it can lead to surprising results.
Figure 2: An example diagram where the SEM tool has been scaled to accept an increase in security event data. The SEM product is now configured with two input/collection agents, each accepting data from a subset of the security devices. Both agents send data to one correlation engine.
Let's look at one example of what might happen. Let's assume that the input has been divided among two SEM agents, as shown in Figure 2. The SEM devices reporting to Agent-2 are receiving normal levels of input. Agent-1 has received a deluge of data, and is taking longer to format and pass on the data from its part of the input stream. Related, simultaneous events in the two halves of the input stream have become separated by, let's say, five minutes, or thirty.
Either of these delays can be realistically envisioned, and the result is that the correlation engine is processing simultaneous events separately, and is thereby being denied all the relevant information. It is clear that such an occurrence can easily lead to the misdiagnosis of security threats.
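This failure mode can be sketched with a hypothetical example, using a simple gap-based grouping rule as a stand-in for a real correlation engine (a real engine is far more sophisticated, but any engine must bound how long it waits for related data):

```python
def correlate_by_arrival(events, window=10):
    """Group events whose arrival times at the engine are within
    `window` seconds of the previous event's arrival."""
    groups, current, last = [], [], None
    for arrival, source, signature in sorted(events):
        if last is not None and arrival - last > window:
            groups.append(current)  # window expired: start a new group
            current = []
        current.append((source, signature))
        last = arrival
    if current:
        groups.append(current)
    return groups

# Both agents deliver promptly: the two halves of a coordinated scan
# arrive together and can be recognized as a single incident.
prompt = [(0, "agent-2", "scan"), (2, "agent-1", "scan")]
print(len(correlate_by_arrival(prompt)))   # 1 group

# Agent-1 is overrun and delivers its half 300 seconds late: the same
# two events now fall into separate groups and the pattern is missed.
delayed = [(0, "agent-2", "scan"), (300, "agent-1", "scan")]
print(len(correlate_by_arrival(delayed)))  # 2 groups
```

Nothing about the events themselves changed between the two runs; only the delivery delay did, and that alone was enough to split one incident into two unrelated fragments.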
A house of cards?
In the previous example we saw what an overrun process that drops as little as one critical event can do to SEM credibility. There are other factors that affect the delicate balance that is required for an SEM tool to maintain its credibility.
Time is one such factor. If the input/collection agents that are accepting and time tagging the data aren't properly synchronized, the data cannot be credibly correlated. This is because time is in fact a major component in determining how related events are correlated. If the data is time-tagged properly, yet there are internal processing delays in the data path, the data may not get correlated by the time the relevant analysis occurs. If the data becomes stale or is aged out (because the engine is receiving so much of it), much like in the example of John and Jane dropping numbers, then that data will not get correlated when related events finally arrive at the analysis process.
Scalable SEM tools are also sensitive to network delays. These may cause data that is delayed in the network to be time-tagged too late, with the result that it never gets correlated with the events that are related to it (and were time-tagged properly). If there are many SEM tool appliances all cooperating in receiving, analyzing, and correlating data, this problem multiplies.
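The clock-synchronization hazard can be illustrated with a hypothetical pair of agents (the offsets and window are chosen for illustration only): an event observed at the same true instant by two agents receives two different time tags, pushing the pair outside any reasonable correlation window.

```python
CORRELATION_WINDOW = 10.0   # seconds; an illustrative engine setting

def time_tag(true_time, clock_offset):
    """An agent tags incoming data with its own (possibly skewed) clock."""
    return true_time + clock_offset

# Two agents observe related events at the same true instant, t = 1000 s.
tag_a = time_tag(1000.0, clock_offset=0.0)    # agent with a good clock
tag_b = time_tag(1000.0, clock_offset=120.0)  # agent running 2 minutes fast

skew = abs(tag_b - tag_a)
print(skew <= CORRELATION_WINDOW)   # False: the events appear unrelated
```

Two minutes of clock drift is unremarkable on an unsynchronized host, yet here it is an order of magnitude larger than the window, so truly simultaneous events can never be matched.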
Another important factor affecting the credibility of the correlated results is the effect of users interacting with the SEM product. SEM products may be bundled into appliances that do everything: collect data, correlate data, display data, and archive data. These processes and functions all contend for computing resources on the appliance. How will two or even three users interacting with the system simultaneously affect the performance of the appliance? What happens if two users, each dissecting a security-related problem, submit ad-hoc reports that traverse thousands of database records while the appliance is being bombarded with a flurry of data, or worse, being hit with a DoS (Denial of Service) attack?
These are some of the difficult issues that SEM vendors are not yet addressing. Until this happens, credible SEM results may continue to require a predictable environment that is not always attainable in large corporate networks.
Scalability at the correlation engines
Achieving scalability at the input is a relatively easy problem to solve, since the sole purpose of the input process is to accept and time tag the data and move it on to the analysis (correlation) path. For this reason, divide and conquer works relatively easily at the input. However, achieving similar scalability at the analysis process(es) or appliances is considerably more difficult: analysis that has been "divided" may or may not later be "reunited", depending on the various delays that are manifested in the network and in the various SEM processes. The problem is that neither you nor the SEM process can know whether or not the analysis could/should have been divided (and/or reunited) until after the analysis is completed and it becomes clear that a critical network intrusion has gone undetected.
In a scalable configuration where more than one correlation engine is used, time is a perishable resource with respect to the analysis of the data. Recall the dilemma of John and Jane. With any delay at one or more of the engines, the data may not get correlated properly. Independent of architecture, as the event rate increases, so does the risk of the SEM tool losing its credibility. Why? Because the engine has to process much more data, and it has to do so much more quickly. Any delay, however minute, of any form, in any of the data paths, will affect the accuracy of the results.
So how should you test your SEM product?
Determining whether an SEM tool is scalable and credible is not easy. The user must put the SEM tool through rigorous testing during the evaluation. Testing should be performed in two phases: verification at slow input rates, and scalability at high input rates.
The user should have a set of well-planned, well-thought-out test procedures to ensure the SEM tool produces the proper results. These procedures must supply specific input events along with the output the SEM tool is expected to generate. Once the SEM tool passes all of these tests, the user can then test scalability and credibility. To test scalability, the user must know how many events per second are currently being generated; the SEM tool must be able to tolerate an increase in this number by a factor of 10 or 100 for growth (or by a factor of 1000 for the truly cautious souls).
Scalability testing should include tests for both input scalability and correlation engine scalability. Input scalability testing involves scaling just the input processes while maintaining one correlation engine. During this stage, the user will want to increase the number of security events, both relevant and irrelevant, to test how the input processes handle these types of events and how the load affects performance while credibility is maintained. Once the SEM tool passes this stage, scaling of the correlation engines should be tested to ensure that credibility is maintained. Another set of test procedures should then be run in a production-type environment to simulate normal and abnormal traffic patterns. This is crucial because it will show how sensitive the SEM tool is to timing issues.
Another test that is critical to the normal operation of an SEM tool is how sensitive it is to DoS attacks. DoS testing helps determine whether the SEM tool's input processes are prone to timing problems, which in turn will ultimately affect credibility. It will also show how well the SEM tool's credibility holds up under extreme stress conditions.
If all of this sounds complex - that's because it is. If you do not have the in-house expertise to develop these test procedures, then the SEM vendor should provide them, along with the test data or the ability to generate it.
Scalability is one of the most important factors in making a purchasing decision when evaluating SEM tools. An SEM tool that is truly scalable will enable a company to maximize its use and allow the use of the tool to grow as security policy requirements, network traffic, and security events inevitably increase. If an SEM tool is not credible, you the user will get burned: it will produce results that are not accurate or, worse, your network will be compromised without so much as an alert. Credibility is sensitive to increases in events per second, so it is imperative that the user test credibility both at slower data rates and at speeds that exceed normal security event rates on the network. During evaluation, the user should project security events per second to grow tenfold to a hundredfold to ensure that the SEM tool will scale as the user's requirements grow.
SEM tools can be expensive - and scalable SEM tools are really expensive. The SEM tool you select will determine how much time and money you save if it works as advertised. Conversely, the wrong choice can hinder your ability to monitor and protect your network: if the SEM tool is not scalable, it will generate a false sense of security among its users until the day the inevitable missed intrusion proves otherwise.