Why incident post mortems?

November 14, 2006

There are several kinds of incident post mortems, ranging from quite casual to very formal. At the casual end of the spectrum, we find everything from hallway discussions to very brief meetings, none of which really address the underlying issues. At the formal end we find rigorous investigations and in-depth analysis that may be beyond the resources of many organizations. I like a middle approach that I refer to as "structured." Structured post mortems work well for most organizations. They allow adequate fact finding and lend themselves nicely to lessons learned and remediation. The simplified steps in a structured post mortem are: evidence collection; analysis of individual events; event normalization; event correlation; timeline analysis; chain of evidence construction; corroboration; and lessons learned and remediation.
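To make the normalization, correlation and timeline steps concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the two log formats, the Event fields, the function names and the 60-second correlation window are hypothetical, and real logs will need their own parsers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Event:
    when: datetime   # always stored as timezone-aware UTC
    source: str      # which log the event came from
    host: str
    message: str


def normalize_firewall(line: str) -> Event:
    # Hypothetical format: "2006-11-14T04:15:02-05:00 web01 DENY tcp 10.0.0.5"
    stamp, host, message = line.split(" ", 2)
    when = datetime.fromisoformat(stamp).astimezone(timezone.utc)
    return Event(when, "firewall", host, message)


def normalize_syslog(line: str) -> Event:
    # Hypothetical format: "1163495710 web01 sshd: failed login for root"
    epoch, host, message = line.split(" ", 2)
    when = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
    return Event(when, "syslog", host, message)


def build_timeline(events):
    """Timeline analysis starts with one list, ordered by normalized time."""
    return sorted(events, key=lambda e: e.when)


def correlate(timeline, window=timedelta(seconds=60)):
    """Group time-ordered events on the same host that occur within
    `window` of the previous event in that host's current group."""
    groups = []
    current = {}  # host -> the group currently being extended
    for event in timeline:
        group = current.get(event.host)
        if group and event.when - group[-1].when <= window:
            group.append(event)
        else:
            group = [event]
            groups.append(group)
            current[event.host] = group
    return groups
```

Normalizing every timestamp to UTC before sorting is the point of the exercise: a firewall that logs local time and a server that logs epoch seconds cannot be placed on one timeline until they agree on a clock. The grouping rule here (same host, close in time) is only one possible correlation key; user name or source address may suit other incidents better.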

This approach requires good logs, interviews to ascertain timelines of events, and input from team members who know what they're looking at. Both the timeline analysis and chain of evidence construction are very important to a good remediation plan.

Timelines tell us what happened and when. They also reveal how our teams responded and may expose opportunities for additional training, new procedures or improved monitoring tools. The chain of evidence tells us how the events led to one another, that is, the chain of events from the perspective of causation. This allows us to look at where we can break the chain, rendering the entire incident harmless. Breaking the chain is the simplest form of temporary remediation while a broader fix is implemented.
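One way to picture "breaking the chain" is to write the causal chain down as an ordered list and mark which links a temporary control could interrupt. The sketch below is purely illustrative: the Link type, the break_points helper and the example incident are all hypothetical, not drawn from a real post mortem.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Link:
    description: str
    # Temporary control that would have stopped this step, if one exists.
    stopgap: Optional[str] = None


def break_points(chain: List[Link]) -> List[Tuple[int, Link]]:
    """Return (position, link) pairs where a stopgap could interrupt the chain."""
    return [(i, link) for i, link in enumerate(chain) if link.stopgap]


# Hypothetical incident, ordered from first cause to final impact.
chain = [
    Link("Phishing mail delivered to a user"),
    Link("User opens the attachment", stopgap="Block the attachment type at the mail gateway"),
    Link("Malware contacts an external server", stopgap="Deny outbound traffic to that address"),
    Link("Credentials are exfiltrated"),
]

for position, link in break_points(chain):
    print(f"Link {position}: {link.description} -> {link.stopgap}")
```

Since every later link depends on the earlier ones, interrupting any single marked link stops the incident from completing. That is why a stopgap at one point can serve as temporary remediation while the broader fix is prepared.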

Post mortems are critical to the smooth operation of our networks. The only question is how deep you want the post mortem to go. That is a matter of resources vs. system criticality.