Risk management

The continuity lessons of the Facebook outage

In this photo illustration, the Facebook logo is displayed next to a screen showing that Facebook service was down on Oct. 4, 2021, in San Anselmo, Calif. (Photo Illustration by Justin Sullivan/Getty Images)

Midway through Facebook's prolonged outage on Monday, reports began to emerge that the problem may have been exacerbated by a series of circular internal dependencies. When services went dark, the tools needed to repair those services went dark with them. It was, by most reports, a business continuity ouroboros: a snake eating its own tail, deleting its own DNS records and locking its own recovery technicians out of the building.

Business continuity planning is critical to security, even if, as of this writing, the Facebook outage was not a security issue. One key takeaway for anyone involved in risk: be aware of those internal dependencies.

The Facebook outage was almost certainly not the result of a breach. The service appears to have faltered after a BGP misconfiguration snowballed out of control, and the snowball was exacerbated by single points of failure in Facebook's self-sufficient internal processes.

The New York Times' Sheera Frenkel was first to report that the teams sent to survey the damage were locked out after badges stopped working; the internal security system had been dependent on the Facebook network remaining operational. Facebook owns its own domain registrar, which the BGP glitch knocked offline, resulting in a brief moment where the Facebook domain name appeared to be available for purchase.
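The circular dependency described above, recovery tooling that depends on the very network it is meant to restore, is the kind of loop that can be caught mechanically before an incident. A minimal sketch, using hypothetical service names rather than Facebook's real topology, models internal dependencies as a directed graph and searches for a cycle:

```python
# Model internal services as a directed dependency graph and flag cycles.
# Service names and edges are illustrative assumptions, not a real topology.
DEPENDENCIES = {
    "badge_system": ["corporate_network"],
    "dns": ["corporate_network"],
    "corporate_network": ["bgp_routes"],
    "bgp_routes": ["recovery_tools"],    # tools push route fixes
    "recovery_tools": ["badge_system"],  # technicians must badge in first
}

def find_cycle(graph):
    """Return one dependency cycle as a list of services, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / in progress / done
    color = {node: WHITE for node in graph}
    stack = []

    def visit(node):
        color[node] = GRAY
        stack.append(node)
        for dep in graph.get(node, []):
            if color.get(dep, WHITE) == GRAY:  # back edge: found a cycle
                return stack[stack.index(dep):] + [dep]
            if color.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None

print(find_cycle(DEPENDENCIES))
```

Run against this toy graph, the search surfaces the loop from the badge system back to itself via the network, routing, and recovery tooling, exactly the shape of trap the reports describe.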

It is easy to make fun of Facebook (and people did just that). But these kinds of circular dependencies are not uncommon.

"The Facebook outage is probably one of the most spectacular examples we've seen. But there is a common pattern where an incident affects tools that are required to resolve it," said Courtney Nash, research analyst at chaos engineering firm Verica. "My favorite pattern on this one where the status page is contingent on whatever went down."

Verica released the early results of a new, comprehensive software incident database project on Tuesday. The project, called VOID, drills down on how interdependent failures are more complex than any single root cause can explain.

That holds true for the kinds of internal dependencies that can trip up companies in continuity planning. In the broadest sense, it can be very difficult for companies to understand all of the dependencies without testing to see what happens when problems are introduced. Chaos engineering is the process of flicking switches to see what happens as various internal components go dark.

"You can't know all of these dependencies," said Nash. "No one person can hold them in their head. It doesn't matter how many people you get in the room, you're not going to write them all down. And so first and foremost is accepting that reality and planning for the failure."

While some dependencies stay hidden simply because the systems are complex, problems can also be born of risk managers viewing internal dependencies through rose-colored glasses.

"Companies don't like to think of themselves as that lynchpin, as the cause of that disruption. We like to think about disruption happening from external sources or other systemic risks," said Alla Valente, senior analyst at Forrester.

Controlling a dependency in-house, rather than outsourcing it to a third party, can give companies a "false sense of security," blinding them to issues they would readily flag in a vendor, said Ron Brown, who heads the security solutions practice at GuidePoint Security. A company may be attuned to the risk of a vendor hosting a critical restoration tool in the same city as the company itself, but far less attuned to the same risk when the company hosts that tool itself.

The solution to being blind to internal threats, said Brown, is simple: regularly get an outside opinion on dependency risks.

In the end, the struggles Facebook may have faced can be emblematic of a core dilemma in designing for continuity and resiliency.

"People think they understand how these systems work better than they do, and yet simultaniously, they're incredibly good at keeping them up and running the vast majority of the time," said Nash. "That's the biggest paradox I think we're seeing from VOID."

"I mean it's kind of crazy that these things are able to work as often as they do," she said.
