The massive outage that took down Facebook, its associated services (Instagram, WhatsApp, Oculus, Messenger), its platform for businesses, and the company’s own internal network all started with routine maintenance.
According to infrastructure vice president Santosh Janardhan, a command issued during maintenance inadvertently caused a shutdown of the backbone that connects all of Facebook’s data centers, everywhere in the world.
That by itself is bad enough, but as we’ve already explained, the reason you couldn’t use Facebook is that the DNS and BGP routing information pointing to its servers suddenly disappeared. According to Janardhan, that problem was a secondary issue, as Facebook’s DNS servers noted the loss of connection to the backbone and stopped advertising the BGP routing information that helps every computer on the internet find its servers. The DNS servers were still working, but they were unreachable.
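To make that failure mode concrete, here is a minimal sketch of a DNS node that withdraws its route announcement when it decides the backbone is unreachable. It is an illustration only, not Facebook's implementation: the `DnsNode` class, the `reconcile` method, the `resolvable` helper, and the documentation-range prefixes are all hypothetical, and real BGP speakers are far more involved.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class DnsNode:
    """Hypothetical DNS server that announces its prefix only while healthy."""
    prefix: str
    advertised: Set[str] = field(default_factory=set)

    def reconcile(self, backbone_reachable: bool) -> None:
        # Assumption: the node treats loss of the backbone as a sign that it
        # is unhealthy and withdraws its own route announcement.
        if backbone_reachable:
            self.advertised.add(self.prefix)
        else:
            self.advertised.discard(self.prefix)


def resolvable(nodes: List[DnsNode]) -> bool:
    """The service can be found only if at least one node still announces a route."""
    return any(node.prefix in node.advertised for node in nodes)


# Documentation-only prefixes (RFC 5737), not real Facebook address space.
nodes = [DnsNode("192.0.2.0/24"), DnsNode("198.51.100.0/24")]

for node in nodes:
    node.reconcile(backbone_reachable=True)
healthy = resolvable(nodes)

# The backbone goes down everywhere at once, so every node withdraws its route.
for node in nodes:
    node.reconcile(backbone_reachable=False)
after_outage = resolvable(nodes)
```

In this toy model the nodes themselves never crash. Once every one of them withdraws its announcement, though, nothing on the internet has a path to them, which matches the "working but unreachable" state described above.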
Yesterday’s outage across our products was a bad one, so we’re sharing some more detail here on exactly what happened, how it happened, and what we’re learning from it: https://t.co/IXRt572h4c
— Mike Schroepfer (@schrep) October 5, 2021
The lack of network connections and loss of DNS cut off the servers from engineers trying to fix the issue and disabled many of the tools they normally use for repair and communication — just as we heard yesterday.
The blog post notes that the engineers faced additional hurdles because of the physical and system security around this crucial hardware. Once they did “activate the secure access protocols” (this is apparently not a code word for “cut open the server door with an angle grinder”), they were able to get the backbone online and restore services gradually, under slowly increasing loads. That’s part of the reason it took some people longer to get access back yesterday, as the power and computing demands of turning everything on at once might have caused more crashes.
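The idea of bringing load back in stages can be sketched in a few lines. This is a simplified illustration, not Facebook's actual procedure: the `ramp_schedule` function, the even step sizes, and the numbers are all assumptions made for the example.

```python
from typing import List


def ramp_schedule(full_load: float, steps: int) -> List[float]:
    """Return the target load for each stage of an evenly stepped ramp-up.

    Hypothetical helper: a real recovery would gate each stage on health
    signals rather than follow a fixed schedule.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [full_load * (stage + 1) / steps for stage in range(steps)]


# Restore traffic in four stages instead of switching everything on at once.
schedule = ramp_schedule(full_load=100.0, steps=4)
```

Each stage admits only a fraction of normal traffic, so the spike in power and computing demand stays bounded while systems warm back up.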
So that’s it. No conspiracy theories, and no techs taking axes to secure facilities to turn Mark Zuckerberg’s baby back on. Just a mistaken maintenance command that a buggy audit tool failed to stop, and for six hours, services that connect billions of people disappeared.