Behind the Facebook Outage: How AWS Peering Challenges Could Disrupt Global Connectivity
Facebook Outage: March 2024
On March 5, 2024, Facebook experienced a significant outage that affected users worldwide, disrupting access to the platform for several hours. The outage was attributed to a technical issue with the company's infrastructure, causing both the website and the mobile app to become inaccessible. This incident led to widespread frustration among users, businesses, and advertisers who rely on the platform for communication and marketing. The company quickly acknowledged the problem and worked diligently to resolve it, ultimately restoring services later in the day. The outage highlighted the challenges and vulnerabilities of managing a massive global network and underscored the importance of robust contingency planning for digital platforms.
Fig. 1: Ping response of the Facebook server during the outage
If the outage had been BGP-related, no user would have been able to reach the Facebook, Instagram, and YouTube applications at all during the outage window. However, many reports indicated that subscribers who were already logged in could keep using the applications, while those who logged out could not get back in. This points to an authentication-related error on the server side.
Let us look at how the authentication process works on the Facebook side:
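The reasoning above can be sketched as a small triage routine: a routing-level failure tends to show up as failed name resolution or failed connections, whereas an authentication failure shows up only once the server is reachable. This is a minimal sketch; the `ProbeResult` fields, the category names, and the `login_ok` check are assumptions made for illustration, not a description of how Facebook diagnoses outages.

```python
import socket
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    dns_ok: bool                     # hostname resolved to an address
    tcp_ok: bool                     # TCP connection on port 443 succeeded
    login_ok: Optional[bool] = None  # hypothetical app-level login check; None if not run


def probe(host: str, port: int = 443, timeout: float = 3.0) -> ProbeResult:
    """Run the two network-level checks against a host (performs real network I/O)."""
    try:
        socket.getaddrinfo(host, port)
    except socket.gaierror:
        return ProbeResult(dns_ok=False, tcp_ok=False)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return ProbeResult(dns_ok=True, tcp_ok=True)
    except OSError:
        return ProbeResult(dns_ok=True, tcp_ok=False)


def classify(result: ProbeResult) -> str:
    """Map probe results to a coarse, illustrative failure category."""
    if not result.dns_ok:
        return "routing-or-dns"            # nobody can even find the service
    if not result.tcp_ok:
        return "network-path"              # name resolves, but packets do not arrive
    if result.login_ok is False:
        return "application-auth"          # server reachable, login itself fails
    if result.login_ok is None:
        return "reachable-login-untested"
    return "healthy"
```

Calling `classify(probe("www.facebook.com"))` would run the network checks; a reachable server combined with failing logins lands in the `application-auth` category, which matches the symptom described in the reports.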
Fig. 2: Facebook authorization process flow
The user clicks a button in the React app to connect with Facebook.
The user is sent to Facebook to log in and give permission for the app to access their data.
After granting permission, Facebook sends a special code back to the React app.
The React app sends this code to Facebook and asks for an access token.
Facebook sends back an access token, which the app can use to access the user’s data on Facebook.
This process ensures that the user’s data is accessed securely and only with their permission.
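The steps above follow the general shape of an OAuth 2.0 authorization code flow, which can be sketched as three small helpers. This is an illustration only: the endpoint URLs, the API version `v19.0`, the `public_profile` scope, and all function names are assumptions and should be checked against Facebook's current Login documentation. In practice the code-for-token exchange needs the app secret, so it is usually performed by a backend rather than by browser code.

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed endpoints and API version (verify against the official docs).
DIALOG_URL = "https://www.facebook.com/v19.0/dialog/oauth"
TOKEN_URL = "https://graph.facebook.com/v19.0/oauth/access_token"


def build_login_url(app_id, redirect_uri, state=None):
    """Steps 1-2: URL the user is sent to in order to log in and grant permission."""
    state = state or secrets.token_urlsafe(16)  # anti-CSRF value echoed back later
    params = {
        "client_id": app_id,
        "redirect_uri": redirect_uri,
        "state": state,
        "response_type": "code",
        "scope": "public_profile",
    }
    return f"{DIALOG_URL}?{urlencode(params)}", state


def extract_code(callback_url, expected_state):
    """Step 3: read the code from the redirect and verify the state value."""
    query = parse_qs(urlparse(callback_url).query)
    if query.get("state", [None])[0] != expected_state:
        raise ValueError("state mismatch")
    if "code" not in query:
        raise ValueError("no authorization code in callback")
    return query["code"][0]


def build_token_request_url(app_id, app_secret, redirect_uri, code):
    """Step 4: URL that exchanges the code for an access token (step 5 is the response)."""
    params = {
        "client_id": app_id,
        "client_secret": app_secret,
        "redirect_uri": redirect_uri,
        "code": code,
    }
    return f"{TOKEN_URL}?{urlencode(params)}"
```

The helpers only build and parse URLs, so no request is sent; an HTTP client of your choice would fetch the token URL and read the access token from the response.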
During the outage, however, users who were already logged in did not need to go through the authorization process again, whereas users attempting a new login did.
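A toy model makes this asymmetry concrete. It assumes that existing sessions are validated against a local session store while new logins need a separate authentication backend. The `AuthService` and `Frontend` classes are hypothetical and say nothing about Facebook's actual architecture.

```python
class AuthBackendDown(Exception):
    """Raised when the (hypothetical) login backend cannot be reached."""


class AuthService:
    """Stand-in for the login backend; `available` can be flipped to mimic an outage."""

    def __init__(self):
        self.available = True

    def issue_token(self, user):
        if not self.available:
            raise AuthBackendDown("login backend unreachable")
        return f"token-for-{user}"


class Frontend:
    """Serves requests; only new logins call the auth backend."""

    def __init__(self, auth):
        self.auth = auth
        self.sessions = {}  # token -> user, kept locally

    def login(self, user):
        token = self.auth.issue_token(user)  # depends on the backend
        self.sessions[token] = user
        return token

    def logout(self, token):
        self.sessions.pop(token, None)

    def request(self, token):
        # Existing sessions are checked locally, without touching the backend.
        user = self.sessions.get(token)
        if user is None:
            raise PermissionError("not logged in")
        return f"feed for {user}"
```

With `auth.available` set to `False`, a token obtained earlier still gets a response from `request`, while `login` raises `AuthBackendDown` for anyone who has logged out or never logged in.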
It is possible that an issue with AWS virtual machine (VM) servers, specifically with the peering connections between different zones, could contribute to a service outage like the one Facebook experienced. Here is how such an issue might arise:
Fig. 3: Facebook application implementation on AWS
- Network Peering and Connectivity Issues:
Peering connections allow different parts of a network, possibly in different geographic regions or availability zones, to communicate with each other. If there is a problem with these connections—such as configuration errors, authorization issues, or outages—it could disrupt communication between servers.
- Authorization Errors:
AWS controls network traffic through strict security and authorization mechanisms, such as IAM permissions and security group rules. If the peering setup contains an authorization error, for example a peering request that was never accepted or a rule that blocks the peer's traffic, certain services or regions may be unable to communicate, leading to a partial or full service outage.
- Impact on Services:
For a large-scale service like Facebook, which relies heavily on a distributed infrastructure to serve users globally, any interruption in internal communication can have widespread effects. For example, if a data center in one region cannot communicate with another due to a peering issue, it could prevent data synchronization or user authentication, leading to an outage.
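The effect of a failed peering connection can be illustrated with a small dependency check. The zone names and the `broken_dependencies` helper are hypothetical. The sketch requires a direct peering link between two zones, which mirrors the fact that AWS VPC peering is not transitive.

```python
def broken_dependencies(active_peerings, dependencies):
    """Return the (source, target) dependencies that lack a direct active peering link.

    active_peerings: iterable of (zone, zone) pairs that are currently up.
    dependencies:    iterable of (source_zone, target_zone) pairs, meaning that
                     services in source_zone must call services in target_zone.
    """
    active = {frozenset(pair) for pair in active_peerings}
    return [
        (src, dst)
        for src, dst in dependencies
        if src != dst and frozenset((src, dst)) not in active
    ]


# Hypothetical layout: authentication lives in zone-a, web front ends in zone-b and zone-c.
dependencies = [("zone-b", "zone-a"), ("zone-c", "zone-a")]

all_up = [("zone-a", "zone-b"), ("zone-a", "zone-c")]
one_down = [("zone-a", "zone-b")]  # the zone-a <-> zone-c peering has failed

healthy = broken_dependencies(all_up, dependencies)
degraded = broken_dependencies(one_down, dependencies)
```

Here `healthy` is empty, while `degraded` contains `("zone-c", "zone-a")`: front ends in `zone-c` can no longer reach the authentication service, so logins routed there would fail even though the servers themselves are running.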
However, without specific details from Facebook or AWS, it's speculative to pinpoint the exact cause of an outage. Large-scale internet services typically have complex infrastructures, and outages can result from a variety of factors, including hardware failures, software bugs, network issues, or even human error.