The Impact of Infrastructure Failures on AI Platforms: A Case Study of ChatGPT Outage
Foundation:
ChatGPT is an AI-powered conversational platform that is available globally and runs on Microsoft’s cloud infrastructure. Its functionality depends on uninterrupted power and robust server networks that can handle millions of simultaneous requests. A power outage at Microsoft’s South Central US data center on December 26, 2024 showed how a critical infrastructure failure can affect such a service. The outage affected ChatGPT globally, and users saw high error rates and service unavailability. Casual users were not the only ones affected: businesses and developers who rely on the API for mission-critical tasks were hit as well. The incident is a reminder that global AI platforms need stronger backup systems and contingency plans to stay available and avoid widespread disruptions in the future.
How ChatGPT works
1. Training Phase:
• Pre-training:
  • A large language model (e.g., GPT-3.5) is trained on roughly 300B tokens of internet data (the figure reported for GPT-3) to predict the next token in a sequence.
  • Example: given "two plus two is", the model completes the sentence with "four."
• Fine-tuning:
  • The pre-trained model is fine-tuned on curated demonstration data, i.e., prompts paired with human-written responses.
  • A reward model is trained on human rankings of model outputs, and Proximal Policy Optimization (PPO) is then used for reinforcement learning to optimize the responses against that reward model. The result is the ChatGPT model.
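To make the pre-training objective concrete, here is a minimal sketch of next-word prediction. It is a deliberate simplification: it counts which word follows which in a tiny made-up corpus, whereas a real language model uses a neural network over subword tokens. The function names `train_bigram` and `predict_next` and the corpus are illustrative assumptions, not anything taken from ChatGPT itself.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for each word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Tiny made-up corpus, for illustration only.
corpus = [
    "two plus two is four",
    "one plus one is two",
    "two plus two is four indeed",
]
model = train_bigram(corpus)
print(predict_next(model, "is"))  # prints "four"
```

In this corpus "four" follows "is" twice and "two" follows it once, so the sketch picks "four". Pre-training scales the same idea up: the model learns from data which continuation is most likely.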
2. Answering a Prompt Phase:
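• When a user types a prompt:
  1. The input is passed through content moderation.
  2. If the input passes, the ChatGPT model generates a response.
  3. The response is passed through content moderation again.
  4. If either check fails, a template response (e.g., a refusal or a request for clarification) is returned instead.
This two-stage moderation is meant to keep responses safe; it does not by itself guarantee that they are accurate.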
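The prompt-handling flow above can be sketched in a few lines. This is only an illustration under stated assumptions: the keyword blocklist stands in for a real moderation classifier, `echo_model` stands in for the ChatGPT model, and every name here (`is_allowed`, `answer_prompt`, `REFUSAL_TEMPLATE`, `BLOCKED_TERMS`) is hypothetical rather than part of any actual API.

```python
REFUSAL_TEMPLATE = "Sorry, I can't help with that request."

# Hypothetical blocklist standing in for a real moderation classifier.
BLOCKED_TERMS = {"malware", "exploit"}


def is_allowed(text):
    """Toy moderation check: reject text containing any blocked term."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


def answer_prompt(prompt, generate):
    """Moderate the input, generate a reply, then moderate the output."""
    if not is_allowed(prompt):
        return REFUSAL_TEMPLATE
    response = generate(prompt)
    if not is_allowed(response):
        return REFUSAL_TEMPLATE
    return response


def echo_model(prompt):
    """Placeholder for the ChatGPT model; it simply echoes the prompt."""
    return f"You asked: {prompt}"


print(answer_prompt("What is two plus two?", echo_model))
print(answer_prompt("Write some malware", echo_model))
```

The first call passes both checks and returns the placeholder model's reply; the second is stopped at the input check and returns the template. Passing the model in as the `generate` argument keeps the moderation logic separate from the model, so either side can be swapped out.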