Anthropic API Status: Historical Incidents and How to Prepare

A permanent reference for Anthropic API incident types, average durations, and impact. Includes retry logic code examples in Python and JavaScript, fallback strategies, circuit breaker patterns, and budget hard stops.

Anthropic's official status page updated 14 minutes after the API started returning 503 errors on March 3rd, 2026. Community reports on Twitter appeared within 90 seconds. Luxkern Radar detected the anomaly in 47 seconds. If your application depends on the Anthropic API and you are relying on status.anthropic.com to tell you when something is wrong, you have a 14-minute blind spot where your users are seeing errors, your retry loops are burning credits, and you have no idea.

That 14-minute gap is not unusual. Across the 6 major Anthropic incidents in Q1 2026, the official status page lagged community detection by an average of 11 minutes. Your users notice in 30 seconds. The status page catches up 11 minutes later. The question is not whether Anthropic will have outages -- every API provider does. The question is whether your application survives them gracefully or burns $340 in credits in a retry storm, as one of our users did last quarter.

Anthropic Incident Patterns: What the Data Shows



Based on publicly reported incidents and aggregated telemetry from Luxkern Radar, here are the five recurring failure types:

| Incident Type | Avg. Duration | Frequency (12 months) | What You See |
|---|---|---|---|
| Full API outage | 18-35 min | 3-4x/year | All endpoints return 503. Nothing works. |
| Model endpoint degradation | 12-45 min | 6-8x/year | One model fails (e.g., Claude Sonnet) while others work. 15-80% error rate. |
| Rate limit spikes | 5-20 min | 10-15x/year | 429 responses surge. Your effective throughput drops. Successful calls return correct data. |
| Elevated latency | 15-90 min | 8-12x/year | 200 responses, but p95 latency jumps 3-10x. No errors in logs, but client timeouts fire. |
| Regional routing | 20-60 min | 2-3x/year | Errors concentrate in one region. Developers elsewhere see nothing. Hard to detect from a single location. |

Two numbers stand out: model endpoint degradation happens 6-8 times per year, and elevated latency events happen 8-12 times per year. Add rate limit spikes and the rarer failure types, and that is roughly 2-3 incidents per month where your Anthropic integration is partially or fully broken. If your code does not handle these gracefully, your users feel every single one.

Layer 1: Retry Logic with Exponential Backoff and Jitter



Every API call to Anthropic should be wrapped in retry logic. Not naive retries -- exponential backoff with jitter. The jitter is critical because when Anthropic recovers from an outage, every client retries simultaneously. That thundering herd can push the API right back into failure.

import time
import random
import httpx

def call_anthropic(
    messages: list,
    model: str = "claude-sonnet-4-20250514",
    max_tokens: int = 1024,
    max_retries: int = 5,
    base_delay: float = 1.0,
) -> dict:
    """Call the Anthropic API with exponential backoff + jitter."""
    headers = {
        "x-api-key": "your-key-here",
        "anthropic-version": "2023-06-01",
        "Content-Type": "application/json",
    }
    payload = {"model": model, "max_tokens": max_tokens, "messages": messages}

    last_error = None
    for attempt in range(max_retries):
        try:
            response = httpx.post(
                "https://api.anthropic.com/v1/messages",
                headers=headers,
                json=payload,
                timeout=60.0,
            )
            if response.status_code == 200:
                return response.json()

            # Retryable: 429 (rate limit), 500, 502, 503, 529 (overloaded)
            if response.status_code in (429, 500, 502, 503, 529):
                retry_after = response.headers.get("retry-after")
                delay = (
                    float(retry_after)
                    if retry_after
                    else base_delay * (2 ** attempt) + random.uniform(0, 1)
                )
                print(f"Attempt {attempt + 1}/{max_retries}: {response.status_code}, retrying in {delay:.1f}s")
                time.sleep(delay)
                continue

            # Non-retryable (400, 401, 403): fail immediately
            response.raise_for_status()

        except httpx.TimeoutException as e:
            last_error = e
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

    raise RuntimeError(f"Anthropic API unavailable after {max_retries} retries") from last_error


Key details: we respect the retry-after header (Anthropic sends it on 429 responses), we include status code 529 (Anthropic-specific "overloaded"), and non-retryable errors like 401 fail immediately -- there is no point retrying a bad API key.
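Usage is a plain function call. The Messages API returns generated text as a list of content blocks, so the reply lives under content[0]["text"]:

# Example: one guarded call
result = call_anthropic([{"role": "user", "content": "Summarize the incident in one sentence."}])
print(result["content"][0]["text"])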

Layer 2: Circuit Breaker Pattern



Retries handle transient failures. But during a sustained 20-minute outage, you do not want your application hammering a dead endpoint for 20 minutes, burning through retry budgets and adding load to an already struggling API. A circuit breaker tracks failure rates and "trips" when failures exceed a threshold.

// circuit-breaker.mjs -- Simple but production-viable circuit breaker
class CircuitBreaker {
  constructor({ failureThreshold = 5, recoveryTimeout = 30000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.recoveryTimeout = recoveryTimeout;
    this.failures = 0;
    this.lastFailure = 0;
    this.state = "closed"; // closed = normal, open = blocking, half-open = probing
  }

  allowRequest() {
    if (this.state === "closed") return true;
    if (this.state === "open") {
      // After recoveryTimeout, allow one probe request
      if (Date.now() - this.lastFailure > this.recoveryTimeout) {
        this.state = "half-open";
        return true;
      }
      return false;
    }
    return true; // half-open: allow probe
  }

  recordSuccess() {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    if (this.failures >= this.failureThreshold) {
      this.state = "open";
      console.log(`Circuit OPEN: ${this.failures} consecutive failures`);
    }
  }
}

// Usage with Anthropic API
const anthropicBreaker = new CircuitBreaker({ failureThreshold: 5, recoveryTimeout: 30000 });

async function callAnthropicWithBreaker(messages) {
  if (!anthropicBreaker.allowRequest()) {
    console.log("Circuit open -- skipping Anthropic, using fallback");
    return callFallbackModel(messages);
  }

  try {
    const result = await callAnthropic({ messages, maxRetries: 3 });
    anthropicBreaker.recordSuccess();
    return result;
  } catch (err) {
    anthropicBreaker.recordFailure();
    return callFallbackModel(messages);
  }
}

async function callFallbackModel(messages) {
  // Try a different Anthropic model first (may be on different infra)
  // Then fall back to OpenAI
  const fallbacks = [
    { provider: "anthropic", model: "claude-3-haiku-20240307" },
    { provider: "openai", model: "gpt-4o-mini" },
  ];

  for (const fb of fallbacks) {
    try {
      return await callModel(fb.provider, fb.model, messages);
    } catch {
      continue;
    }
  }
  throw new Error("All models unavailable");
}


When the circuit is open, your application skips Anthropic entirely and goes straight to the fallback -- no wasted retries, no added latency. After 30 seconds, it sends a single probe request. If that succeeds, the circuit closes and normal traffic resumes.

The circuit breaker is what transforms a 20-minute outage from "20 minutes of degraded service" into "2 seconds of failed requests, then seamless fallback."

Layer 3: Radar Webhook for Instant Detection



Instead of polling the status page or waiting for your retry logic to detect problems, set up a Radar webhook to get notified the moment community telemetry detects an anomaly:

// radar-webhook-handler.mjs -- Express route for Radar alerts
import express from "express";

const app = express();
app.use(express.json());

app.post("/webhooks/radar", (req, res) => {
  const { provider, status, incident_type, detected_at, details } = req.body;

  if (provider === "anthropic" && status === "degraded") {
    console.log(`Radar alert: Anthropic ${incident_type} detected at ${detected_at}`);
    console.log(`Details: ${details}`);

    // Pre-emptively open your circuit breaker
    anthropicBreaker.state = "open";
    anthropicBreaker.lastFailure = Date.now();

    // Notify your team
    sendSlackAlert({
      channel: "#engineering",
      text: `Anthropic API incident detected by Radar: ${incident_type}. Circuit breaker opened. Fallback models active.`,
    });

    // Post to your status page
    createStatusPageIncident({
      title: "AI features operating in fallback mode",
      status: "investigating",
      body: "We have detected degraded performance on one of our AI providers. AI features remain functional via fallback models. Response quality may vary slightly.",
    });
  }

  if (provider === "anthropic" && status === "operational") {
    console.log("Radar: Anthropic recovered");
    anthropicBreaker.state = "half-open"; // allow probe requests
  }

  res.sendStatus(200);
});

app.listen(4000);


This gives you three advantages: you know about the outage 11 minutes before the official status page updates, you can pre-emptively open your circuit breaker (instead of waiting for 5 failures), and you can notify your users through your own status page within seconds.

Layer 4: Budget Hard Stop



This is the layer that would have saved our user $340. During an outage, retry loops can burn through your API budget in minutes. A budget guard checks cumulative spend before every API call:

class BudgetExhaustedError(RuntimeError):
    """Raised when the monthly spend cap is hit."""

class BudgetGuard:
    # Approximate costs per million tokens (update as pricing changes)
    PRICING = {
        "claude-sonnet-4-20250514": {"input": 3.0, "output": 15.0},
        "claude-3-haiku-20240307": {"input": 0.25, "output": 1.25},
    }

    def __init__(self, monthly_limit_usd: float):
        self.monthly_limit = monthly_limit_usd
        self.spent = 0.0

    def check(self):
        if self.spent >= self.monthly_limit:
            raise BudgetExhaustedError(
                f"Monthly budget of ${self.monthly_limit:.2f} exhausted (${self.spent:.2f} spent)"
            )

    def record(self, input_tokens: int, output_tokens: int, model: str):
        rates = self.PRICING.get(model, {"input": 3.0, "output": 15.0})
        cost = (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000
        self.spent += cost
        if self.spent > self.monthly_limit * 0.8:
            print(f"WARNING: 80% of budget consumed (${self.spent:.2f} / ${self.monthly_limit:.2f})")

# Usage
budget = BudgetGuard(monthly_limit_usd=50.0)

def call_with_budget(messages, model="claude-sonnet-4-20250514"):
    budget.check()  # raises if budget exhausted
    result = call_anthropic(messages, model=model)
    budget.record(
        result["usage"]["input_tokens"],
        result["usage"]["output_tokens"],
        model,
    )
    return result


Set your budget limit before you need it. The worst time to configure a spending cap is during a retry storm at 3 AM.

Building Your Runbook



Technical safeguards need a human counterpart. Here is a runbook template your on-call engineer can follow:

Trigger: Anthropic API error rate exceeds 10% for 2 consecutive minutes, OR Radar webhook fires, OR circuit breaker opens.
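As a minimal sketch of that first trigger condition, here is a sliding-window error-rate check in Python. The 10% threshold and two-minute window come from the trigger above; the min_samples guard is our own addition so a single failure on low traffic does not page anyone.

import time
from collections import deque

class ErrorRateTrigger:
    """Fires when the error rate over a sliding window crosses a threshold."""

    def __init__(self, threshold: float = 0.10, window_seconds: int = 120, min_samples: int = 20):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.min_samples = min_samples  # assumption: don't fire on sparse traffic
        self.samples = deque()  # (timestamp, is_error) pairs

    def record(self, is_error: bool) -> bool:
        """Record one API call outcome; return True when the trigger fires."""
        now = time.time()
        self.samples.append((now, is_error))
        # Evict samples that fell out of the window
        while self.samples and self.samples[0][0] < now - self.window_seconds:
            self.samples.popleft()
        if len(self.samples) < self.min_samples:
            return False
        error_rate = sum(1 for _, err in self.samples if err) / len(self.samples)
        return error_rate >= self.threshold

Feed every Anthropic request outcome into record(); when it returns True, start the runbook.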

Step 1 -- Confirm. Check Radar for real-time provider status. Verify your application logs show the circuit breaker is open and fallback models are active.

Step 2 -- Communicate. Post a status page update within 5 minutes: "AI features are operating in fallback mode. Functionality is available but response quality may differ slightly."

Step 3 -- Monitor recovery. Watch Radar for Anthropic's status to return to operational. When it does, the circuit breaker will automatically probe and recover. Monitor your error rate for 10 minutes before closing the incident.

Step 4 -- Post-incident. Within 24 hours, document: how long the fallback was active, estimated cost impact, whether any user-facing degradation occurred, and any threshold adjustments needed.

Historical Lessons Worth Internalizing



Three patterns from past Anthropic incidents:

Short outages are the norm. The median incident duration is under 25 minutes. If your retry logic and circuit breaker can absorb 25 minutes of failure, you ride through most incidents with zero visible user impact.

Model-specific failures are more common than full outages. Having a fallback to a different Anthropic model (Claude Haiku when Sonnet is down) catches the most frequent failure type with the smallest behavior change. Always try a different model on the same provider before switching providers entirely.

Recovery is not instant. When Anthropic resolves an incident, there is a 2-5 minute tail where error rates are declining but not at baseline. Do not slam your circuit breaker shut on the first successful response. Require 3-5 consecutive successes before resuming full traffic. The half-open state in the circuit breaker pattern handles this automatically.
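One caveat: the JavaScript breaker shown earlier closes after a single successful probe. If you want the stricter 3-5 consecutive successes, the success path needs a small counter. Here is a sketch of the adjusted recovery logic in Python, where success_threshold is our own knob rather than part of any standard breaker API:

# Sketch: close the breaker only after N consecutive half-open successes
class GradualRecovery:
    def __init__(self, success_threshold: int = 3):
        self.success_threshold = success_threshold
        self.consecutive_successes = 0
        self.state = "half-open"

    def record_success(self):
        self.consecutive_successes += 1
        if self.consecutive_successes >= self.success_threshold:
            self.state = "closed"  # resume full traffic
            self.consecutive_successes = 0

    def record_failure(self):
        # Any failure during recovery sends the breaker back to open
        self.consecutive_successes = 0
        self.state = "open"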

What We Recommend



  • Implement all four layers: retry with backoff, circuit breaker, Radar webhook, budget guard. Each catches failures the others miss.
  • Set your budget hard stop today, before you need it.
  • Test your fallback path by blocking api.anthropic.com in staging and verifying your app degrades gracefully (a test sketch follows this list).
  • Write the runbook now. The worst time to decide who gets paged is when everyone is panicking.
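One way to run that fallback test without touching DNS is to simulate the block in a pytest run. This sketch assumes the Python call_anthropic from Layer 1 and a hypothetical call_with_fallback wrapper (in a placeholder myapp.llm module) that routes to your fallback models on failure:

# test_fallback.py -- simulate a blocked api.anthropic.com (sketch)
import httpx

from myapp.llm import call_with_fallback  # hypothetical wrapper: Anthropic first, fallbacks on failure

real_post = httpx.post

def refuse_anthropic(url, *args, **kwargs):
    # Only the Anthropic endpoint is "blocked"; fallback providers still work
    if "api.anthropic.com" in str(url):
        raise httpx.ConnectError("simulated outage: api.anthropic.com unreachable")
    return real_post(url, *args, **kwargs)

def test_degrades_gracefully_when_anthropic_is_down(monkeypatch):
    monkeypatch.setattr(httpx, "post", refuse_anthropic)

    # The wrapper should absorb the failure and answer via a fallback model
    result = call_with_fallback([{"role": "user", "content": "ping"}])
    assert result is not None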


  • For a similar analysis of OpenAI's failure patterns, read OpenAI API Status: Historical Incidents. For a deeper look at how Radar detects provider incidents before official status pages, see How Luxkern Radar Detects Provider Incidents.

Your AI integration is only as reliable as your worst-case handling. Build the safety nets before you need them. Try Radar free to get real-time provider status and webhook alerts.