Behind the Facebook Outage: How AWS Peering Challenges Could Disrupt Global Connectivity

July 30, 2024

Facebook Outage: March 2024

On March 5, 2024, Facebook experienced a significant outage that affected users worldwide, disrupting access to the platform for several hours. The outage was attributed to a technical issue with the company's infrastructure, causing both the website and the mobile app to become inaccessible. This incident led to widespread frustration among users, businesses, and advertisers who rely on the platform for communication and marketing. The company quickly acknowledged the problem and worked diligently to resolve it, ultimately restoring services later in the day. The outage highlighted the challenges and vulnerabilities of managing a massive global network and underscored the importance of robust contingency planning for digital platforms.

Fig: 1 Ping response of the Facebook server during the outage

If this had been a BGP-related outage, no user would have been able to reach the Facebook, Instagram, or YouTube applications at all while it lasted. However, many reports indicated that subscribers who were already logged in could keep using the applications, while those who logged out could not get back in. This points to a server-side authentication error rather than a routing failure.

Let us look at how the authentication process works on Facebook's side:

Fig: 2 Facebook authorization process flow

The user clicks a button in the React app to connect with Facebook.
The user is sent to Facebook to log in and give permission for the app to access their data.
After granting permission, Facebook sends a special code back to the React app.
The React app sends this code to Facebook and asks for an access token.
Facebook sends back an access token, which the app can use to access the user’s data on Facebook.

This process ensures that the user’s data is accessed securely and only with their permission.
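The steps above follow the OAuth 2.0 authorization code pattern, and the request shapes can be sketched in a few lines of Python. This is only an illustration, not Facebook's implementation: the API version string, the app ID, the redirect URI, and the helper names are all assumptions made for the example, and no network call is made. In practice the code-for-token exchange is usually performed by a backend rather than the browser app, because it requires the app secret.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed API version; substitute whichever version your app targets.
GRAPH_VERSION = "v19.0"
AUTH_ENDPOINT = f"https://www.facebook.com/{GRAPH_VERSION}/dialog/oauth"
TOKEN_ENDPOINT = f"https://graph.facebook.com/{GRAPH_VERSION}/oauth/access_token"


def build_login_url(app_id: str, redirect_uri: str, state: str) -> str:
    """Steps 1-2: the URL the app sends the user to for login and consent."""
    params = {
        "client_id": app_id,
        "redirect_uri": redirect_uri,
        "state": state,            # anti-CSRF value, echoed back on redirect
        "response_type": "code",   # ask for an authorization code
        "scope": "public_profile",
    }
    return f"{AUTH_ENDPOINT}?{urlencode(params)}"


def extract_code(callback_url: str, expected_state: str) -> str:
    """Step 3: read the code from the redirect and verify the state value."""
    query = parse_qs(urlparse(callback_url).query)
    if query.get("state", [None])[0] != expected_state:
        raise ValueError("state mismatch: possible forged redirect")
    return query["code"][0]


def build_token_request(app_id: str, app_secret: str,
                        redirect_uri: str, code: str) -> str:
    """Step 4: the request that exchanges the code for an access token."""
    params = {
        "client_id": app_id,
        "client_secret": app_secret,
        "redirect_uri": redirect_uri,  # must match the one used in step 1
        "code": code,
    }
    return f"{TOKEN_ENDPOINT}?{urlencode(params)}"
```

Step 5 is the response to the token request: a JSON body carrying the access token. Every step that touches `AUTH_ENDPOINT` or `TOKEN_ENDPOINT` depends on the authentication backend being reachable, which is why a failure there blocks new logins.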

During the outage, however, users who were already logged in did not need to go through the authorization process again, while new users, and anyone who had logged out, had to go through it in order to log in.
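A toy model makes this asymmetry concrete. The sketch below assumes that a front-end server can validate an existing session token on its own, while issuing a new token requires the authentication backend. The class and method names are hypothetical and are not taken from Facebook's systems.

```python
import secrets
from dataclasses import dataclass, field


class AuthServiceDown(Exception):
    """Raised when a login needs the authentication backend and it is down."""


@dataclass
class Frontend:
    auth_available: bool = True
    # token -> (user, expiry time in seconds); validated without the backend
    sessions: dict = field(default_factory=dict)

    def login(self, user: str, now: float, ttl: float = 3600.0) -> str:
        """New logins must go through the authentication backend."""
        if not self.auth_available:
            raise AuthServiceDown(f"cannot authenticate {user}")
        token = secrets.token_hex(16)
        self.sessions[token] = (user, now + ttl)
        return token

    def request(self, token: str, now: float) -> bool:
        """Existing sessions are checked locally, so they survive the outage."""
        entry = self.sessions.get(token)
        return entry is not None and entry[1] > now

    def logout(self, token: str) -> None:
        """Logging out discards the token; getting a new one needs login()."""
        self.sessions.pop(token, None)
```

With `auth_available` set to `False`, `request()` still succeeds for a token issued earlier, but `login()` raises `AuthServiceDown`, and a user who calls `logout()` is locked out until the backend recovers. That matches the behavior described in the reports.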

It is possible that a problem with AWS virtual machine (VM) servers, specifically with the peering connections between different zones, could contribute to a service outage like the one Facebook experienced. Here is how such an issue might arise:

Fig: 3 Facebook application implementation on AWS

Network Peering and Connectivity Issues:

Peering connections allow different parts of a network, possibly in different geographic regions or availability zones, to communicate with each other. If there is a problem with these connections—such as configuration errors, authorization issues, or outages—it could disrupt communication between servers.

Authorization Errors:

AWS uses strict security and authorization mechanisms to manage network traffic. If there is an authorization error in the peering setup, it could prevent certain services or regions from communicating effectively, leading to partial or full service outages.

Impact on Services:

For a large-scale service like Facebook, which relies heavily on a distributed infrastructure to serve users globally, any interruption in internal communication can have widespread effects. For example, if a data center in one region cannot communicate with another due to a peering issue, it could prevent data synchronization or user authentication, leading to an outage.
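The scenario can be sketched as a small reachability check. The regions, the topology, and the placement of the authentication service below are hypothetical; only the status strings mirror values that AWS reports for VPC peering connections. The check looks for a direct link only, because VPC peering is not transitive: traffic cannot hop through an intermediate peered network.

```python
ACTIVE = "active"


def has_active_peering(peerings, a, b):
    """True if a direct peering link between a and b exists and is active."""
    return any(
        status == ACTIVE and {x, y} == {a, b}
        for x, y, status in peerings
    )


def regions_able_to_authenticate(regions, auth_region, peerings):
    """Regions whose users can still log in, given the peering state."""
    return [
        region
        for region in regions
        if region == auth_region
        or has_active_peering(peerings, region, auth_region)
    ]


# Hypothetical deployment: authentication runs only in us-east-1.
regions = ["us-east-1", "eu-west-1", "ap-south-1"]
peerings = [
    ("us-east-1", "eu-west-1", "active"),
    ("us-east-1", "ap-south-1", "pending-acceptance"),  # never accepted
    ("eu-west-1", "ap-south-1", "active"),
]
print(regions_able_to_authenticate(regions, "us-east-1", peerings))
```

In this example ap-south-1 has a healthy link to eu-west-1, yet it still cannot reach the authentication service, because its own peering with us-east-1 was never accepted. A single misconfigured or unauthorized link is enough to cut off logins for a whole region.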

However, without specific details from Facebook or AWS, it's speculative to pinpoint the exact cause of an outage. Large-scale internet services typically have complex infrastructures, and outages can result from a variety of factors, including hardware failures, software bugs, network issues, or even human error.
