Facebook explains the main reason behind its global outage

The massive outage that took down the Facebook platform, its associated services (WhatsApp, Instagram, Messenger, and Oculus), its corporate platform, and the company’s intranet began during routine maintenance.

According to Vice President of Infrastructure Santosh Janardhan, a maintenance command inadvertently took down the backbone connecting all of the company’s data centers worldwide.

“This outage was caused by the system that manages the capacity of our global backbone network,” Janardhan said. The backbone is the network the company built to connect all of its computing facilities, consisting of tens of thousands of miles of fiber-optic cable that span the globe and link all of its data centers.

These data centers come in different forms. Some are huge buildings with millions of machines that store data and run the heavy computing loads that keep the platforms running; others are smaller facilities that connect the company’s core network to the wider Internet and to the people who use the platforms.

When you open one of the company’s apps and load your feed or messages, the app’s request for data travels from your device to the nearest facility, which then communicates directly over the company’s backbone with a larger data center. There, the information the app needs is retrieved, processed, and sent back over the network to your phone.

Data traffic between all these computing facilities is managed by routers, which determine where to send all incoming and outgoing data.

Facebook engineers often need to take part of the backbone offline to maintain this infrastructure. This was the source of the outage.

During one of these routine maintenance jobs, a command was issued to assess the availability of global backbone capacity. It inadvertently cut all connections in the company’s backbone network, effectively disconnecting Facebook’s data centers globally.

The company’s systems for auditing commands like this are designed to prevent such mistakes, but a bug in the audit tool prevented it from stopping the command. The change cut off all server communication between the data centers and the Internet, and this complete loss of connectivity caused a second problem involving DNS and BGP.
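
As a rough illustration of what such an audit step is meant to do, the sketch below rejects any command that would leave the backbone with no active links. It is not Facebook’s tooling: the names Command, audit, and run, and the idea of modeling the backbone as a set of link names, are assumptions made purely for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    """Hypothetical maintenance command: the backbone links it would take offline."""
    name: str
    links_to_disable: frozenset


def audit(command: Command, active_links: frozenset, min_remaining: int = 1) -> bool:
    """Return True only if at least min_remaining links stay up after the command."""
    remaining = active_links - command.links_to_disable
    return len(remaining) >= min_remaining


def run(command: Command, active_links: frozenset) -> frozenset:
    """Apply the command only if the audit passes; otherwise refuse to run it."""
    if not audit(command, active_links):
        raise PermissionError(
            f"audit rejected {command.name!r}: too few backbone links would remain"
        )
    return active_links - command.links_to_disable
```

In this toy model, a command that drains one link out of three is allowed, while a command that drains all of them raises an error instead of running. The outage described here happened because the real check failed to stop the command.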

The backbone failure was serious in itself, but the reason users could not reach Facebook is that the DNS and BGP routing information pointing to its servers suddenly disappeared.

According to Janardhan, this second problem arose because the company’s DNS servers detected the loss of connectivity to the backbone and stopped advertising the BGP routing information that helps every computer across the Internet find them. The DNS servers were still running, but they had become unreachable.
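
The behavior described above, where a healthy DNS server makes itself unreachable, can be sketched in a few lines. This is a simplified model rather than Facebook’s implementation: the class EdgeDnsSite, its methods, and the documentation-only prefix 192.0.2.0/24 are all assumptions for illustration.

```python
class EdgeDnsSite:
    """Hypothetical edge facility that answers DNS queries and announces a route to itself."""

    def __init__(self, prefix: str):
        self.prefix = prefix      # address block the site announces (illustrative only)
        self.advertised = True    # whether the BGP route is currently announced
        self.dns_running = True   # the DNS process itself stays up throughout

    def health_check(self, backbone_reachable: bool) -> None:
        """Announce the route only while the site can reach the data centers."""
        self.advertised = backbone_reachable

    def reachable_from_internet(self) -> bool:
        """Outside clients need both a running server and a route that leads to it."""
        return self.dns_running and self.advertised


site = EdgeDnsSite("192.0.2.0/24")
site.health_check(backbone_reachable=False)  # the backbone goes down
print(site.dns_running, site.reachable_from_internet())
```

After the failed health check the server is still running but no longer reachable, which matches the state the article describes.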

The lack of network connections and the loss of DNS cut the servers off from the engineers trying to fix the problem and disabled many of the tools they normally use for repair and communication.

However, engineers encountered additional obstacles because of the physical and system security around these critical devices. Once they had activated the secure access protocols, they were able to restore the backbone and bring services back slowly under gradually increasing loads.

This is part of the reason why some people took longer to regain access. Bringing everything back at once could have caused further failures because of the sudden power and computing demands.
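
One way to picture a gradual restoration is a schedule of load targets that starts small and grows until full load is reached. The sketch below is only an illustration of that idea: the function ramp_schedule, the 5% starting step, and the doubling factor are assumptions, not figures from Facebook.

```python
def ramp_schedule(total_load: float, first_step: float = 0.05, growth: float = 2.0):
    """Yield increasing load targets, ending exactly at total_load.

    first_step is the starting fraction of total load; growth is the
    multiplier applied at each stage. Both values are illustrative.
    """
    if not 0 < first_step <= 1 or growth <= 1:
        raise ValueError("first_step must be in (0, 1] and growth must exceed 1")
    fraction = first_step
    while fraction < 1.0:
        yield total_load * fraction
        fraction *= growth
    yield total_load  # final stage: full load


for target in ramp_schedule(100.0):
    print(f"admit up to {target:.0f}% of normal traffic")
```

Each stage would be held until systems look stable before moving to the next, so a problem shows up at a small fraction of load rather than at full load.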

