The folks at Cloudflare have published a fascinating look into the recent ~6-hour outage of the Facebook network, which took down not just the Facebook product itself but also WhatsApp, Instagram, FB’s internal tools, and a lot more. It’s a somewhat technical explanation, but Cloudflare’s Tom Strickx and Celso Martinho have made it very easy to understand.
Today at 1651 UTC, we opened an internal incident entitled “Facebook DNS lookup returning SERVFAIL” because we were worried that something was wrong with our DNS resolver 1.1.1.1. But as we were about to post on our public status page we realized something else more serious was going on.
Social media quickly burst into flames, reporting what our engineers rapidly confirmed too. Facebook and its affiliated services WhatsApp and Instagram were, in fact, all down. Their DNS names stopped resolving, and their infrastructure IPs were unreachable. It was as if someone had “pulled the cables” from their data centers all at once and disconnected them from the Internet.
How’s that even possible?
It’s really interesting to see how a (possibly) minor piece of code can take down large parts of the internet like this. Honestly, it would be a good thing for the internet overall if Facebook disappeared from it, but I feel for everyone at Facebook who had to deal with this issue. Major hugs to the people involved in bringing the network back up.
Then again, imagine messing up so bad that your boss ends up losing $6 billion.