OpenAI API Status: Historical Incidents and Patterns
A permanent reference of OpenAI API incident types, durations, and patterns. Includes retry logic in Python and JavaScript, model fallback strategies, and circuit breaker implementations to keep your application running.
Your production application returns a 500 to every user because the OpenAI API started returning 429s at 3:14 PM on a Tuesday. You check the OpenAI status page. It says "All Systems Operational." Twenty minutes later, they update it to "Degraded Performance." By then, your error rate has been at 100% for a quarter of an hour, your customers have filed support tickets, and you have been scrambling to understand whether the problem is on your side or theirs.
This scenario has played out hundreds of times since the GPT-4 API became generally available in mid-2023. We built this reference page because we got tired of searching through old tweets and Hacker News threads to confirm that yes, that outage pattern we are seeing right now has happened before. Below is a structured record of OpenAI API incident types, their typical durations, observable patterns, and the engineering countermeasures that actually work.
Incident Categories and Historical Patterns
We have tracked OpenAI API incidents since early 2023 and categorized them into five distinct types. Each category has a recognizable signature in your application logs if you know what to look for.
| Incident Type | Typical Duration | Frequency (2024-2026) | First Observable Signal |
|---|---|---|---|
| Rate limiting cascades | 15 min - 2 hours | 8-12 per quarter | Sudden spike in 429 responses across multiple API keys |
| Model endpoint outages | 30 min - 6 hours | 2-4 per quarter | 500/503 errors on specific model endpoints while others remain healthy |
| Authentication / token issues | 10 min - 1 hour | 1-2 per quarter | 401 errors on previously valid API keys, often during key rotation deployments |
| Regional degradation | 20 min - 3 hours | 3-5 per quarter | Elevated latency (2-5x baseline) from specific geographic regions before errors appear |
| Streaming endpoint failures | 15 min - 2 hours | 4-6 per quarter | SSE connections dropping mid-stream, incomplete responses, chunk parsing errors |
Rate Limiting Cascades
Rate limiting cascades are the most common incident type. They happen when OpenAI tightens rate limits in response to load, but the tightening itself causes a cascade: clients retry immediately, generating more load, triggering tighter limits, and so on. The observable pattern is a sudden wall of 429 responses that does not correlate with any change in your own traffic volume.
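One way to spot that signature automatically is to track the share of 429s in a sliding window of recent responses; a wall of 429s that appears without any change on your side is the cascade fingerprint. A minimal sketch (the window size and threshold are illustrative, not tuned values):

```python
from collections import deque


class CascadeDetector:
    """Flags a likely provider-side rate-limit cascade: the share of
    429 responses in a sliding window crosses a threshold.
    Window size and threshold here are illustrative defaults."""

    def __init__(self, window: int = 200, error_share: float = 0.5):
        self.statuses = deque(maxlen=window)
        self.error_share = error_share

    def record(self, status_code: int) -> None:
        self.statuses.append(status_code)

    def likely_cascade(self) -> bool:
        if len(self.statuses) < self.statuses.maxlen:
            return False  # Not enough data to judge yet
        share = sum(1 for s in self.statuses if s == 429) / len(self.statuses)
        return share >= self.error_share
```

Feed it every response status from your OpenAI calls and alert when it trips; combined with a check that your own request rate has not changed, it separates "we got throttled" from "the provider is cascading".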
During these events, the `x-ratelimit-remaining-requests` and `x-ratelimit-remaining-tokens` headers often show zero across all your API keys simultaneously. If you are seeing zeros on multiple independent keys, it is almost certainly a provider-side issue, not something you triggered.

Model Endpoint Outages
These tend to follow a pattern: a specific model (often the newest or most popular one) becomes unavailable while older models continue to function. We have observed this most frequently with `gpt-4o` and `gpt-4-turbo`, while `gpt-4o-mini` and `gpt-3.5-turbo` often remain accessible. This is the strongest argument for implementing model fallback logic.

Regional Degradation
Perhaps the most insidious incident type. Latency from European and Asian endpoints can spike to 10-30 seconds while US-based calls remain under 2 seconds. These events are hard to detect from a single vantage point and are frequently misdiagnosed as application-level performance issues. This is one of the reasons we believe community intelligence is so valuable: a single developer in Frankfurt cannot distinguish between a local network issue and a regional OpenAI degradation, but if 40 developers in Europe are all seeing the same spike, the answer is obvious.
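A single vantage point can still catch the latency signature early by comparing recent calls against its own long-term baseline. A minimal sketch of that idea; the window sizes and the 3x threshold are illustrative assumptions, not tuned values:

```python
from collections import deque
from statistics import median


class LatencyBaseline:
    """Rolling latency tracker: flags possible regional degradation when
    recent latencies run well above the long-term baseline.
    Window sizes and threshold are illustrative defaults."""

    def __init__(self, baseline_window: int = 500, recent_window: int = 20,
                 threshold: float = 3.0):
        self.baseline = deque(maxlen=baseline_window)
        self.recent = deque(maxlen=recent_window)
        self.threshold = threshold

    def record(self, latency_ms: float) -> None:
        self.baseline.append(latency_ms)
        self.recent.append(latency_ms)

    def degraded(self) -> bool:
        # Require enough history before making any judgment
        if len(self.baseline) < 50 or len(self.recent) < self.recent.maxlen:
            return False
        return median(self.recent) > self.threshold * median(self.baseline)
```

Record the wall-clock duration of every API call; when `degraded()` trips while your status page still says "All Systems Operational", you have the evidence to escalate or fail over.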
Retry Logic That Actually Works
The most common mistake we see is naive retry logic that uses fixed intervals or retries immediately. Both make cascading failures worse. Here is what we recommend instead.
Python: Exponential Backoff with Jitter
```python
import time
import random

import openai


def call_openai_with_retry(
    messages: list,
    model: str = "gpt-4o",
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
):
    """
    Call the OpenAI API with exponential backoff, jitter,
    and automatic model fallback.
    """
    fallback_models = ["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"]
    current_model_index = (
        fallback_models.index(model) if model in fallback_models else 0
    )
    client = openai.OpenAI()

    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=fallback_models[current_model_index],
                messages=messages,
            )
            return response
        except openai.RateLimitError:
            delay = min(base_delay * (2 ** attempt), max_delay)
            jitter = random.uniform(0, delay * 0.5)
            time.sleep(delay + jitter)
        except openai.APIStatusError as e:
            if e.status_code in (500, 502, 503):
                # Try the next fallback model before retrying
                if current_model_index < len(fallback_models) - 1:
                    current_model_index += 1
                    continue
                delay = min(base_delay * (2 ** attempt), max_delay)
                time.sleep(delay + random.uniform(0, delay * 0.3))
            else:
                raise

    raise Exception("All retry attempts and model fallbacks exhausted")
```

JavaScript / TypeScript: Retry with Model Fallback
```typescript
import OpenAI from "openai";

const FALLBACK_MODELS = ["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"] as const;

async function callOpenAIWithRetry(
  messages: OpenAI.ChatCompletionMessageParam[],
  options: {
    model?: string;
    maxRetries?: number;
    baseDelay?: number;
    maxDelay?: number;
  } = {}
) {
  const {
    model = "gpt-4o",
    maxRetries = 5,
    baseDelay = 1000,
    maxDelay = 60000,
  } = options;

  const client = new OpenAI();
  let modelIndex = FALLBACK_MODELS.indexOf(model as any);
  if (modelIndex === -1) modelIndex = 0;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await client.chat.completions.create({
        model: FALLBACK_MODELS[modelIndex],
        messages,
      });
      return response;
    } catch (error) {
      if (error instanceof OpenAI.RateLimitError) {
        const delay = Math.min(baseDelay * 2 ** attempt, maxDelay);
        const jitter = Math.random() * delay * 0.5;
        await sleep(delay + jitter);
        continue;
      }
      if (
        error instanceof OpenAI.APIError &&
        [500, 502, 503].includes(error.status ?? 0)
      ) {
        // Try the next fallback model before backing off
        if (modelIndex < FALLBACK_MODELS.length - 1) {
          modelIndex++;
          continue;
        }
        const delay = Math.min(baseDelay * 2 ** attempt, maxDelay);
        await sleep(delay + Math.random() * delay * 0.3);
        continue;
      }
      throw error;
    }
  }

  throw new Error("All retry attempts and model fallbacks exhausted");
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```

The key principles in both implementations: exponential backoff prevents you from contributing to the cascade, jitter prevents thundering herds when many clients retry at the same time, and model fallback lets you maintain service even when a specific endpoint is down.
Circuit Breaker Pattern
Retry logic handles transient failures, but during a sustained outage, retries just burn through your rate limit budget and add latency for your users. A circuit breaker stops calling the failing endpoint entirely once a threshold is crossed, then periodically tests whether the endpoint has recovered.
```python
import time
from dataclasses import dataclass, field
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Blocking all calls
    HALF_OPEN = "half_open"  # Testing recovery


@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    recovery_timeout: float = 30.0
    half_open_max_calls: int = 2

    state: CircuitState = field(default=CircuitState.CLOSED, init=False)
    failure_count: int = field(default=0, init=False)
    last_failure_time: float = field(default=0.0, init=False)
    half_open_calls: int = field(default=0, init=False)

    def can_execute(self) -> bool:
        if self.state == CircuitState.CLOSED:
            return True
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time >= self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                self.half_open_calls = 0
                return True
            return False
        # HALF_OPEN: allow limited calls to test recovery
        return self.half_open_calls < self.half_open_max_calls

    def record_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.half_open_calls += 1
            if self.half_open_calls >= self.half_open_max_calls:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
        else:
            self.failure_count = 0

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.state == CircuitState.HALF_OPEN:
            # A failed test call reopens the circuit immediately
            self.state = CircuitState.OPEN
        elif self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
```

We use one circuit breaker per model endpoint. When `gpt-4o` trips the breaker, traffic routes to `gpt-4o-mini` automatically. When the `gpt-4o` breaker enters the half-open state, a small number of test calls determine whether the endpoint has recovered.

Health Check Endpoint for Your Own Application
Even with retry logic and circuit breakers, you need a way for your monitoring tools to understand the current state of your OpenAI integration. We recommend exposing a dedicated health check endpoint that tests the actual API:
```typescript
// /api/health/openai/route.ts
import OpenAI from "openai";
import { NextResponse } from "next/server";

const client = new OpenAI();

export async function GET() {
  const start = Date.now();
  try {
    await client.chat.completions.create({
      model: "gpt-4o-mini", // Use the cheapest model for health checks
      messages: [{ role: "user", content: "ping" }],
      max_tokens: 1,
    });
    const latency = Date.now() - start;
    return NextResponse.json({
      status: "healthy",
      latency_ms: latency,
      model: "gpt-4o-mini",
      degraded: latency > 5000, // Flag if latency exceeds 5s
    });
  } catch (error) {
    return NextResponse.json(
      {
        status: "unhealthy",
        latency_ms: Date.now() - start,
        error: error instanceof Error ? error.message : "Unknown error",
      },
      { status: 503 }
    );
  }
}
```

Point your uptime monitoring at this endpoint, and you get proactive alerting when OpenAI degrades rather than discovering it through user complaints. We recommend checking every 60 seconds during business hours and every 5 minutes otherwise to balance cost against detection speed.
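The two-tier cadence can live in whatever scheduler drives your monitor. A minimal sketch of the interval logic; the 09:00-18:00 weekday window is an illustrative assumption, not a recommendation from this page:

```python
from datetime import datetime, time


def poll_interval_seconds(now: datetime,
                          business_start: time = time(9, 0),
                          business_end: time = time(18, 0)) -> int:
    """60s polling during business hours, 300s otherwise.
    Weekends count as off-hours; the hours are illustrative."""
    if now.weekday() >= 5:  # Saturday or Sunday
        return 300
    in_hours = business_start <= now.time() < business_end
    return 60 if in_hours else 300
```

Call this before each sleep in your polling loop so the cadence adjusts without a restart when the clock crosses the boundary.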
Preparing for the Next Incident
Incidents are inevitable. The question is whether you find out from your monitoring or from your users. Based on the patterns we have documented above, here is a practical checklist:
- Implement exponential backoff with jitter; never retry immediately or at fixed intervals.
- Add model fallback: a simple `gpt-4o` to `gpt-4o-mini` fallback will save you hours of downtime over the next year.
- Wrap each model endpoint in a circuit breaker so a sustained outage does not burn your rate limit budget.
- Expose a health check endpoint and point your uptime monitoring at it.
- Watch latency from more than one region; regional degradation is invisible from a single vantage point.

OpenAI has improved their infrastructure stability significantly since 2023, but the API remains a shared resource serving millions of developers. Rate limiting cascades and regional degradation are structural realities, not bugs to be fixed. The developers who handle them gracefully are not the ones with more robust infrastructure; they are the ones who expected the failure and planned for it.