OpenAI API Status: Historical Incidents and Patterns
A permanent reference of OpenAI API incident types, durations, and patterns. Includes retry logic in Python and JavaScript, model fallback strategies, and circuit breaker implementations to keep your application running.
Your production application returns a 500 to every user because the OpenAI API started returning 429s at 3:14 PM on a Tuesday. You check the OpenAI status page. It says "All Systems Operational." Twenty minutes later, they update it to "Degraded Performance." By then, your error rate has been at 100% for a quarter of an hour, your customers have filed support tickets, and you have been scrambling to understand whether the problem is on your side or theirs.
This scenario has played out hundreds of times since the GPT-4 API became generally available in mid-2023. We built this reference page because we got tired of searching through old tweets and Hacker News threads to confirm that yes, that outage pattern we are seeing right now has happened before. Below is a structured record of OpenAI API incident types, their typical durations, observable patterns, and the engineering countermeasures that actually work.
Incident Categories and Historical Patterns
We have tracked OpenAI API incidents since early 2023 and categorized them into five distinct types. Each category has a recognizable signature in your application logs if you know what to look for.
| Incident Type | Typical Duration | Frequency (2024-2026) | First Observable Signal |
|---|---|---|---|
| Rate limiting cascades | 15 min - 2 hours | 8-12 per quarter | Sudden spike in 429 responses across multiple API keys |
| Model endpoint outages | 30 min - 6 hours | 2-4 per quarter | 500/503 errors on specific model endpoints while others remain healthy |
| Authentication / token issues | 10 min - 1 hour | 1-2 per quarter | 401 errors on previously valid API keys, often during key rotation deployments |
| Regional degradation | 20 min - 3 hours | 3-5 per quarter | Elevated latency (2-5x baseline) from specific geographic regions before errors appear |
| Streaming endpoint failures | 15 min - 2 hours | 4-6 per quarter | SSE connections dropping mid-stream, incomplete responses, chunk parsing errors |
Rate Limiting Cascades
Rate limiting cascades are the most common incident type. They happen when OpenAI tightens rate limits in response to load, but the tightening itself causes a cascade: clients retry immediately, generating more load, triggering tighter limits, and so on. The observable pattern is a sudden wall of 429 responses that does not correlate with any change in your own traffic volume.
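One way to spot that signature automatically is to track the share of 429s in a sliding window of recent responses; a wall of 429s that appears without any change on your side is the cascade fingerprint. A minimal sketch (the window size and threshold are illustrative, not tuned values):

```python
from collections import deque


class CascadeDetector:
    """Flags a likely provider-side rate-limit cascade: the share of
    429 responses in a sliding window crosses a threshold.
    Window size and threshold here are illustrative defaults."""

    def __init__(self, window: int = 200, error_share: float = 0.5):
        self.statuses = deque(maxlen=window)
        self.error_share = error_share

    def record(self, status_code: int) -> None:
        self.statuses.append(status_code)

    def likely_cascade(self) -> bool:
        if len(self.statuses) < self.statuses.maxlen:
            return False  # Not enough data to judge yet
        share = sum(1 for s in self.statuses if s == 429) / len(self.statuses)
        return share >= self.error_share
```

Feed it every response status from your OpenAI calls and alert when it trips; combined with a check that your own request rate has not changed, it separates "we got throttled" from "the provider is cascading".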
During these events, the `x-ratelimit-remaining-requests` and `x-ratelimit-remaining-tokens` headers often show zero across all your API keys simultaneously. If you are seeing zeros on multiple independent keys, it is almost certainly a provider-side issue, not something you triggered.

Model Endpoint Outages
These tend to follow a pattern: a specific model (often the newest or most popular one) becomes unavailable while older models continue to function. We have observed this most frequently with `gpt-4o` and `gpt-4-turbo`, while `gpt-4o-mini` and `gpt-3.5-turbo` often remain accessible. This is the strongest argument for implementing model fallback logic.

Regional Degradation
Perhaps the most insidious incident type. Latency from European and Asian endpoints can spike to 10-30 seconds while US-based calls remain under 2 seconds. These events are hard to detect from a single vantage point and are frequently misdiagnosed as application-level performance issues. This is one of the reasons we believe community intelligence is so valuable: a single developer in Frankfurt cannot distinguish between a local network issue and a regional OpenAI degradation, but if 40 developers in Europe are all seeing the same spike, the answer is obvious.
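A single vantage point can still catch the latency signature early by comparing recent calls against its own long-term baseline. A minimal sketch of that idea; the window sizes and the 3x threshold are illustrative assumptions, not tuned values:

```python
from collections import deque
from statistics import median


class LatencyBaseline:
    """Rolling latency tracker: flags possible regional degradation when
    recent latencies run well above the long-term baseline.
    Window sizes and threshold are illustrative defaults."""

    def __init__(self, baseline_window: int = 500, recent_window: int = 20,
                 threshold: float = 3.0):
        self.baseline = deque(maxlen=baseline_window)
        self.recent = deque(maxlen=recent_window)
        self.threshold = threshold

    def record(self, latency_ms: float) -> None:
        self.baseline.append(latency_ms)
        self.recent.append(latency_ms)

    def degraded(self) -> bool:
        # Require enough history before making any judgment
        if len(self.baseline) < 50 or len(self.recent) < self.recent.maxlen:
            return False
        return median(self.recent) > self.threshold * median(self.baseline)
```

Record the wall-clock duration of every API call; when `degraded()` trips while your status page still says "All Systems Operational", you have the evidence to escalate or fail over.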
Retry Logic That Actually Works
The most common mistake we see is naive retry logic that uses fixed intervals or retries immediately. Both make cascading failures worse. Here is what we recommend instead.
Python: Exponential Backoff with Jitter
```python
import time
import random

import openai


def call_openai_with_retry(
    messages: list,
    model: str = "gpt-4o",
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
):
    """
    Call the OpenAI API with exponential backoff, jitter,
    and automatic model fallback.
    """
    fallback_models = ["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"]
    current_model_index = (
        fallback_models.index(model) if model in fallback_models else 0
    )
    client = openai.OpenAI()

    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=fallback_models[current_model_index],
                messages=messages,
            )
            return response
        except openai.RateLimitError:
            delay = min(base_delay * (2 ** attempt), max_delay)
            jitter = random.uniform(0, delay * 0.5)
            time.sleep(delay + jitter)
        except openai.APIStatusError as e:
            if e.status_code in (500, 502, 503):
                # Try the next fallback model before retrying
                if current_model_index < len(fallback_models) - 1:
                    current_model_index += 1
                    continue
                delay = min(base_delay * (2 ** attempt), max_delay)
                time.sleep(delay + random.uniform(0, delay * 0.3))
            else:
                raise

    raise Exception("All retry attempts and model fallbacks exhausted")
```

JavaScript / TypeScript: Retry with Model Fallback
```typescript
import OpenAI from "openai";

const FALLBACK_MODELS = ["gpt-4o", "gpt-4o-mini", "gpt-3.5-turbo"] as const;

async function callOpenAIWithRetry(
  messages: OpenAI.ChatCompletionMessageParam[],
  options: {
    model?: string;
    maxRetries?: number;
    baseDelay?: number;
    maxDelay?: number;
  } = {}
) {
  const {
    model = "gpt-4o",
    maxRetries = 5,
    baseDelay = 1000,
    maxDelay = 60000,
  } = options;

  const client = new OpenAI();
  let modelIndex = FALLBACK_MODELS.indexOf(model as any);
  if (modelIndex === -1) modelIndex = 0;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await client.chat.completions.create({
        model: FALLBACK_MODELS[modelIndex],
        messages,
      });
      return response;
    } catch (error) {
      if (error instanceof OpenAI.RateLimitError) {
        const delay = Math.min(baseDelay * 2 ** attempt, maxDelay);
        const jitter = Math.random() * delay * 0.5;
        await sleep(delay + jitter);
        continue;
      }
      if (
        error instanceof OpenAI.APIError &&
        [500, 502, 503].includes(error.status ?? 0)
      ) {
        // Try the next fallback model before backing off
        if (modelIndex < FALLBACK_MODELS.length - 1) {
          modelIndex++;
          continue;
        }
        const delay = Math.min(baseDelay * 2 ** attempt, maxDelay);
        await sleep(delay + Math.random() * delay * 0.3);
        continue;
      }
      throw error;
    }
  }

  throw new Error("All retry attempts and model fallbacks exhausted");
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```

The key principles in both implementations: exponential backoff prevents you from contributing to the cascade, jitter prevents thundering herds when many clients retry at the same time, and model fallback lets you maintain service even when a specific endpoint is down.
Circuit Breaker Pattern
Retry logic handles transient failures, but during a sustained outage, retries just burn through your rate limit budget and add latency for your users. A circuit breaker stops calling the failing endpoint entirely once a threshold is crossed, then periodically tests whether the endpoint has recovered.
```python
import time
from dataclasses import dataclass, field
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Blocking all calls
    HALF_OPEN = "half_open"  # Testing recovery


@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    recovery_timeout: float = 30.0
    half_open_max_calls: int = 2

    state: CircuitState = field(default=CircuitState.CLOSED, init=False)
    failure_count: int = field(default=0, init=False)
    last_failure_time: float = field(default=0.0, init=False)
    half_open_calls: int = field(default=0, init=False)

    def can_execute(self) -> bool:
        if self.state == CircuitState.CLOSED:
            return True
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time >= self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                self.half_open_calls = 0
                return True
            return False
        # HALF_OPEN: allow limited calls to test recovery
        return self.half_open_calls < self.half_open_max_calls

    def record_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.half_open_calls += 1
            if self.half_open_calls >= self.half_open_max_calls:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
        else:
            self.failure_count = 0

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.state == CircuitState.HALF_OPEN:
            # A failed test call reopens the circuit immediately
            self.state = CircuitState.OPEN
        elif self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
```

We use one circuit breaker per model endpoint. When `gpt-4o` trips the breaker, traffic routes to `gpt-4o-mini` automatically. When the `gpt-4o` breaker enters the half-open state, a small number of test calls determine whether the endpoint has recovered.

Health Check Endpoint for Your Own Application
Even with retry logic and circuit breakers, you need a way for your monitoring tools to understand the current state of your OpenAI integration. We recommend exposing a dedicated health check endpoint that tests the actual API:
```typescript
// /api/health/openai/route.ts
import OpenAI from "openai";
import { NextResponse } from "next/server";

const client = new OpenAI();

export async function GET() {
  const start = Date.now();
  try {
    await client.chat.completions.create({
      model: "gpt-4o-mini", // Use the cheapest model for health checks
      messages: [{ role: "user", content: "ping" }],
      max_tokens: 1,
    });
    const latency = Date.now() - start;
    return NextResponse.json({
      status: "healthy",
      latency_ms: latency,
      model: "gpt-4o-mini",
      degraded: latency > 5000, // Flag if latency exceeds 5s
    });
  } catch (error) {
    return NextResponse.json(
      {
        status: "unhealthy",
        latency_ms: Date.now() - start,
        error: error instanceof Error ? error.message : "Unknown error",
      },
      { status: 503 }
    );
  }
}
```

Point your uptime monitoring at this endpoint, and you get proactive alerting when OpenAI degrades rather than discovering it through user complaints. We recommend checking every 60 seconds during business hours and every 5 minutes otherwise to balance cost against detection speed.
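The two-tier cadence can live in whatever scheduler drives your monitor. A minimal sketch of the interval logic; the 09:00-18:00 weekday window is an illustrative assumption, not a recommendation from this page:

```python
from datetime import datetime, time


def poll_interval_seconds(now: datetime,
                          business_start: time = time(9, 0),
                          business_end: time = time(18, 0)) -> int:
    """60s polling during business hours, 300s otherwise.
    Weekends count as off-hours; the hours are illustrative."""
    if now.weekday() >= 5:  # Saturday or Sunday
        return 300
    in_hours = business_start <= now.time() < business_end
    return 60 if in_hours else 300
```

Call this before each sleep in your polling loop so the cadence adjusts without a restart when the clock crosses the boundary.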
Preparing for the Next Incident
Incidents are inevitable. The question is whether you find out from your monitoring or from your users. Based on the patterns we have documented above, here is a practical checklist:
- Implement exponential backoff with jitter; never retry immediately or at fixed intervals.
- Add model fallback: a simple `gpt-4o` to `gpt-4o-mini` fallback will save you hours of downtime over the next year.
- Wrap each model endpoint in a circuit breaker so a sustained outage does not burn your rate limit budget.
- Expose a health check endpoint and point your uptime monitoring at it.
- Watch latency from more than one region; regional degradation is invisible from a single vantage point.

OpenAI has improved their infrastructure stability significantly since 2023, but the API remains a shared resource serving millions of developers. Rate limiting cascades and regional degradation are structural realities, not bugs to be fixed. The developers who handle them gracefully are not the ones with more robust infrastructure; they are the ones who expected the failure and planned for it.