To prevent a colossal outage in your own network, get serious about DNS protection
On October 4, Facebook was offline for about six hours due to human error. The company states that “configuration changes on our backbone routers” were the cause. In this post, I’ll explain what happened and walk through the takeaways for running your own business network.
Before I get into the details, you should first understand two important internet protocols: the Domain Name System (DNS) and Border Gateway Protocol (BGP). Both played important roles in the Facebook outage. DNS is the internet’s phone book, translating names such as facebook.com into the numeric IP addresses that identify its servers. You can think of BGP as the internet’s traffic cop: it tells networks which paths are available so that billions of packets of data can get from one place to another while avoiding non-working pathways.
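To make the phone book and traffic cop roles concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the records, the prefix, and the AS numbers come from ranges reserved for documentation, and the functions resolve and best_path are hypothetical names. Real BGP weighs many more attributes than path length, so treat this as a toy model rather than how routers actually decide.

```python
import ipaddress

# Hypothetical DNS records: name -> IP address (documentation address range).
DNS_RECORDS = {"example.com": "203.0.113.10"}

# Hypothetical BGP advertisements: prefix -> AS paths offered by neighbors.
ROUTES = {
    ipaddress.ip_network("203.0.113.0/24"): [
        [64500, 64510],
        [64501, 64502, 64510],
    ],
}


def resolve(name):
    """DNS step: translate a name into an IP address, or None if unknown."""
    return DNS_RECORDS.get(name)


def best_path(address):
    """Routing step: pick the shortest advertised AS path covering the address."""
    ip = ipaddress.ip_address(address)
    candidates = [
        path
        for prefix, paths in ROUTES.items()
        if ip in prefix
        for path in paths
    ]
    return min(candidates, key=len) if candidates else None


address = resolve("example.com")
print(address, best_path(address))
```

Both steps have to succeed before a visitor reaches a site: the name must resolve to an address, and some path to that address must be advertised.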
We’ve previously written about DNS and why it is important to secure it. BGP has several weaknesses that are well known, at least to security experts. Back in 1998, members of an elite hacking group called L0pht Heavy Industries testified before Congress:
They warned that computer networks were embarrassingly insecure, bragging that any one of them could take the entire internet down in just a few minutes thanks to weaknesses in BGP routing.
Unfortunately, this time around, Facebook suffered a self-inflicted wound when one of its network engineers issued a command that effectively took the company’s entire server fleet off the internet. The result closely resembled the scenario described in that 1998 testimony. Although Facebook’s engineers are intelligent folks, they admit that “Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command.” Oops.
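Facebook has not published how its audit tool works, so the following is only a hypothetical sketch of the general idea: check a proposed routing change before it is applied and object if it would withdraw everything. The function name audit_change, its parameters, and the prefixes are all assumptions for illustration.

```python
def audit_change(current_prefixes, proposed_prefixes, min_remaining=1):
    """Return a list of objections; an empty list means the change may proceed."""
    current = set(current_prefixes)
    proposed = set(proposed_prefixes)
    objections = []

    # Guard 1: never drop below a minimum number of advertised prefixes.
    if len(proposed) < min_remaining:
        objections.append(
            f"change leaves {len(proposed)} advertised prefixes; "
            f"at least {min_remaining} required"
        )

    # Guard 2: flag a change that withdraws every prefix advertised today.
    if current and not (current & proposed):
        objections.append("change withdraws every currently advertised prefix")

    return objections


# Hypothetical prefixes from the documentation address ranges.
current = ["203.0.113.0/24", "198.51.100.0/24"]
print(audit_change(current, []))                  # two objections
print(audit_change(current, ["203.0.113.0/24"]))  # no objections
```

A check like this is only as good as its own testing, which is the lesson of the Facebook incident: the guard itself had a bug.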
Steve Bagley, a professor of computer science at the University of Nottingham, has explained what happened, using a route mapping tool to illustrate how Facebook’s servers disappeared from the network until the company appeared completely offline. The important thing to note about this outage is that the servers were still running; they just weren’t reachable by anyone.
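The difference between a server that is running and one that is reachable can be shown with a small Python sketch. It is a toy model under stated assumptions: the Network class, its method names, and the addresses are hypothetical and do not describe Facebook’s actual infrastructure.

```python
import ipaddress


class Network:
    """Toy model: servers that are powered on, plus the routes advertised for them."""

    def __init__(self):
        self.running_servers = {"203.0.113.10"}
        self.advertised = {ipaddress.ip_network("203.0.113.0/24")}

    def withdraw_all(self):
        """Simulate a bad change that withdraws every advertised route."""
        self.advertised.clear()

    def is_running(self, address):
        return address in self.running_servers

    def is_reachable(self, address):
        """A server is reachable only if it is up AND some route covers it."""
        ip = ipaddress.ip_address(address)
        return self.is_running(address) and any(
            ip in prefix for prefix in self.advertised
        )


net = Network()
before = (net.is_running("203.0.113.10"), net.is_reachable("203.0.113.10"))
net.withdraw_all()
after = (net.is_running("203.0.113.10"), net.is_reachable("203.0.113.10"))
print(before, after)
```

After the withdrawal, the server is still reported as running, but nothing outside can find a path to it.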
The Facebook outage is certainly the most noteworthy recent problem related to BGP; however, hackers have been using BGP to complement their attacks for many years. BGP weaknesses were exploited in the MEWKit phishing attack back in May 2018 to hijack traffic meant for Amazon's servers and redirect it to the Russian hackers who were operating the malware.
In its post-mortem, Facebook mentions its “storm drills,” in which it stress-tests its computing infrastructure to ensure that system-wide failures like this one don’t happen or can be recovered from quickly. Alas, the company never simulated the loss of its entire network backbone, nor did it put together a scenario in which operator error brought down the network. Both should have been part of these drills, and you can bet that they will be in the near future.
To prevent a colossal outage in your own network, you can get serious about DNS protection by carrying out the following duties: