Reducing “MAX-TTR” and why it matters to your network

There are typically two costs to consider for both security operations and network operations: the capital expense of purchasing tools and the operational expense of the employees required to drive those tools. Unfortunately, finance teams often fail to recognize that the number of incidents NetOps and SecOps teams are asked to deal with is increasing, while operational budgets remain constant, if not decreasing. This situation is at best troublesome, and at worst a train wreck waiting to happen. To put the situation into context, consider the investments organizations are making today and the impact those investments have on operational functions.

The tools that ops teams use can be broken down into two categories:

Tools that protect organizations by stopping bad things from getting into the network (such as firewalls, anti-spam, IPS, and access control systems) are good for removing harmful traffic without anyone having to manage them on a day-to-day basis. Their downfall is that they only know about known-knowns.
Tools that detect bad things in the network include IDS, SIEM, advanced malware detection and DDoS detection. These technologies are important because they start to bridge the gap between the known-knowns and the known-unknowns.

From an operational standpoint, there's a “gotcha” with the detect category. For every tool deployed, there's operational overhead required to manage the tool's output. Organizations can't continually add capability without expanding the operational footprint, which they typically do not want to do. Here's the risk of having the two out of balance: while the new tools detect really important events, there's nobody to deal with the alarms, because the analysts are busy trying to work out whether the last alarm generated by the last tool was real or not.

The answer is not to reduce the number of detection tools in use, but rather to improve the efficiency of the engineers and analysts responsible for dealing with alarms. If throughput per engineer-hour can be increased faster than the number of problems grows, then the status quo can be managed, but it requires a different capital investment profile than the one most organizations are using today.
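To make the throughput argument concrete, here is a minimal sketch of the arithmetic. Every name and number in it is a hypothetical assumption chosen for illustration, not a figure from any real operations team: it simply tracks how an unresolved backlog behaves when incident volume grows each week while analyst hours stay fixed.

```python
def backlog_over_time(incidents_per_week, weekly_growth,
                      analyst_hours_per_week, hours_per_incident, weeks):
    """Return the unresolved backlog at the end of each week.

    All parameters are illustrative assumptions:
    - incidents_per_week: incidents arriving in the first week
    - weekly_growth: fractional growth in arrivals per week (0.05 = 5%)
    - analyst_hours_per_week: fixed analyst capacity (the flat Opex budget)
    - hours_per_incident: average analyst time needed per incident
    """
    capacity = analyst_hours_per_week / hours_per_incident  # incidents closed per week
    arrivals = incidents_per_week
    backlog = 0.0
    history = []
    for _ in range(weeks):
        # Unresolved work carries over; backlog cannot go below zero.
        backlog = max(0.0, backlog + arrivals - capacity)
        history.append(backlog)
        arrivals *= 1 + weekly_growth
    return history


# Same incident growth and same headcount; only the time per incident differs.
slow = backlog_over_time(100, 0.05, 400, hours_per_incident=4, weeks=10)
fast = backlog_over_time(100, 0.05, 400, hours_per_incident=2, weeks=10)

print("backlog at 4 h per incident:", [round(b, 1) for b in slow])
print("backlog at 2 h per incident:", [round(b, 1) for b in fast])
```

Under these assumed numbers, the team that needs four hours per incident starts falling behind as soon as volume grows, while the team that needs two hours keeps up over the same ten weeks without adding heads.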

Analysts need a tool that does more with less, that dramatically shortens the time it takes to figure out whether the alarm they are looking at is serious and whether there's any action required. A new technology category is emerging to complement the protect-and-detect categories, designed specifically to help organizations improve analyst throughput by focusing on the response and root cause analysis workflow. Underpinning this new technology category is network recording or full-packet capture.

Currently, the IT investment profile for most organizations is 70 percent protection, 30 percent detection. In the future, that investment profile needs to shift to 60 percent protection, 20 percent detection and 20 percent response and root cause. The good news is that response and root cause infrastructure cost can be shared by NetOps and SecOps as the core functionality is agnostic.

The key to effective response and root cause analysis is accurate historical network visibility, which means network recording. If analysts can easily and quickly go back in time and see, at packet level, exactly what happened when an alarm was generated, they can determine quickly whether it's real, what happened and what to do about it. Time-to-resolution (TTR) is a function of historical network visibility. The more visibility analysts have, the more work they can get through in any given period of time. When the number of incidents increases and the number of available analyst hours decreases, getting a handle on TTR is important.

Most organizations focus on their mean-time-to-resolution (MTTR). But the reality is that there is very little organizations can do to move the needle on MTTR. Reducing the time it takes to fix the average problem from four hours to three hours 50 minutes is irrelevant. A more interesting metric to look at is MAX-TTR or the maximum time it takes to solve a problem. That's where a real impact can be made relatively easily.

For most organizations, TTR follows a roughly bell-shaped distribution curve in which the majority of incidents are clustered around the four-hour mark, but there's nearly always a long tail of events that take days or even weeks to deal with and consume large amounts of scarce operational resources. The ability to go back in time to the point at which a particular issue was reported or alarmed, and to identify exactly what happened, enables organizations to dramatically reduce the time to resolution on those long-tail issues.

Dropping MAX-TTR from 24 hours to four hours (or less) will undoubtedly have a profound impact on the amount of resources available to deal with problems. Also, by having true visibility, the quality of the remedies that engineers and analysts put in place can be dramatically improved. Root causes can be addressed, rather than just symptoms.
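The difference between chasing MTTR and chasing MAX-TTR can be sketched with a few lines of arithmetic. The incident durations below are invented for illustration (most clustered near four hours, one 24-hour outlier), and the function names and the eight-hour threshold are likewise assumptions, not part of any tool.

```python
from statistics import mean


def ttr_summary(ttr_hours):
    """Summarize a list of per-incident resolution times (in hours)."""
    return {
        "mean": mean(ttr_hours),   # MTTR
        "max": max(ttr_hours),     # MAX-TTR
        "total": sum(ttr_hours),   # analyst hours consumed
    }


def shorten_tail(ttr_hours, threshold, new_ttr):
    """Model long-tail incidents (TTR above `threshold`) being resolved
    in `new_ttr` hours instead; all other incidents are unchanged."""
    return [new_ttr if t > threshold else t for t in ttr_hours]


# Hypothetical sample: nine routine incidents and one long-tail incident.
ttr = [3.0, 4.0, 4.0, 4.0, 5.0, 4.0, 3.0, 5.0, 4.0, 24.0]

before = ttr_summary(ttr)
after = ttr_summary(shorten_tail(ttr, threshold=8.0, new_ttr=4.0))

# Option A: shave 10 minutes off every incident (the MTTR approach).
hours_saved_uniform = len(ttr) * 10 / 60
# Option B: bring only the long-tail incident down to four hours.
hours_saved_tail = before["total"] - after["total"]

print("before:", before)
print("after: ", after)
print("hours saved by shaving 10 min everywhere:", round(hours_saved_uniform, 2))
print("hours saved by fixing the long tail:     ", hours_saved_tail)
```

In this made-up sample, trimming ten minutes from every incident frees well under two analyst hours, while resolving the single outlier in four hours frees twenty. Real distributions will differ, which is why plotting your own fault-resolution times is the useful first step.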

As soon as Capex spending on tools and Opex spending on heads get out of sync, the operational model starts to break down and organizations are exposed to unacceptable levels of corporate risk. The next wave of technology investment cannot be detection-related; instead, organizations must invest in tools that make analyst response more efficient and so improve throughput per analyst-hour. Understanding your distribution of fault-resolution times, hour by hour, highlights the opportunity to unlock significant long-term savings that could enable existing resource levels to be maintained, at least for a little while.