
How Luxkern Radar Detects Provider Incidents Before Official Status Pages

Full technical transparency on how Luxkern Radar detects provider incidents using anonymized signals, aggregation pipelines, and threshold calibration. Includes pseudocode for the detection algorithm and GDPR compliance details.

luxkern-radar, incident-detection, architecture, privacy, GDPR, monitoring, aggregation




A developer in Berlin notices their OpenAI API calls are timing out. They check the status page: all green. They spend 12 minutes investigating their own infrastructure before discovering -- through a Slack message from a colleague in Amsterdam -- that half of Europe is experiencing the same thing. The status page updates 18 minutes later.

We built Luxkern Radar to eliminate those 18 minutes. This article explains exactly how it works, with full technical transparency. We are publishing our detection architecture, our privacy mechanisms, and pseudocode for the core algorithm because we believe developers should understand the tools they depend on.

The Core Problem: Single-Point Observation vs. Collective Observation



Every traditional monitoring tool operates from a single vantage point: yours. Your uptime monitor checks your endpoint from a handful of geographic locations. Your APM collects traces from your application. Your log aggregator indexes your logs. When the problem is in your infrastructure, these tools work perfectly. When the problem is in a shared upstream provider, they show you symptoms without context.

The difference between single-point and collective observation is the difference between one thermometer and a weather station network. One thermometer tells you the temperature in your room. A network of thermometers tells you that a cold front is moving across the region. Luxkern Radar is a weather station network for developer infrastructure.

Architecture Overview



The system has four layers: signal collection, anonymization, aggregation, and detection. Each layer is designed to be independently auditable.

[Developer's Environment]
         |
    (1) Signal Collection
    - HTTP status codes
    - Response latency (ms)
    - Error categories
    - Provider + endpoint identifier
    - Timestamp (UTC, rounded to 30s)
         |
    (2) Anonymization (client-side)
    - SHA-256(user_id + monthly_salt)
    - No request/response bodies
    - No API keys
    - No application identifiers
         |
    (3) Aggregation Pipeline
    - Time-windowed bucketing (60s windows)
    - Provider + region + endpoint grouping
    - Statistical aggregation (p50, p95, p99, error rate)
    - Minimum threshold enforcement (15 unique contributors)
         |
    (4) Detection Engine
    - Baseline comparison (rolling 7-day baseline)
    - Weighted anomaly scoring
    - Incident state machine (normal -> warning -> incident -> recovering)
    - Alert dispatch


Layer 1: Signal Collection



Radar collects structured signals from participating developers' environments. These signals are intentionally narrow in scope. We collect five fields per observation:

| Field | Example | Purpose |
|---|---|---|
| provider | openai | Identify which provider the signal relates to |
| endpoint | chat/completions | Distinguish between provider sub-services |
| status_code | 429 | Classify response outcome |
| latency_ms | 3847 | Measure response time |
| timestamp | 2026-09-06T14:30:00Z | Temporal correlation (rounded to 30s) |

We deliberately do not collect request parameters, response bodies, model names used in prompts, token counts, or anything that could reveal what the developer's application is doing. The signal is about the provider's behavior, not the developer's.

Timestamps are rounded to 30-second boundaries before leaving the client. This prevents timing-based correlation attacks while maintaining sufficient temporal resolution for incident detection.
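
To make this concrete, here is a minimal sketch of what a single observation could look like on the client side. The Signal dataclass and the round_to_30s helper are illustrative names introduced for this article, not Radar's actual client code; the five fields mirror the table above.

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Signal:
    # The five fields from the table above; nothing about the request itself
    provider: str      # e.g. "openai"
    endpoint: str      # e.g. "chat/completions"
    status_code: int   # e.g. 429
    latency_ms: int    # e.g. 3847
    timestamp: str     # UTC, rounded to a 30-second boundary

def round_to_30s(ts: datetime) -> str:
    """Round a timestamp down to the nearest 30-second boundary (UTC)."""
    ts = ts.astimezone(timezone.utc).replace(microsecond=0)
    ts = ts.replace(second=(ts.second // 30) * 30)
    return ts.strftime("%Y-%m-%dT%H:%M:%SZ")

# Example: a rate-limited call observed at 14:30:17 is reported as 14:30:00
signal = Signal(
    provider="openai",
    endpoint="chat/completions",
    status_code=429,
    latency_ms=3847,
    timestamp=round_to_30s(datetime(2026, 9, 6, 14, 30, 17, tzinfo=timezone.utc)),
)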

Layer 2: Client-Side Anonymization



Anonymization happens on the client, before any data leaves the developer's environment. This is a critical architectural decision. It means that even if our aggregation service were compromised, the attacker would not receive identifiable data because identifiable data never reaches the service.

import hashlib
import datetime

def anonymize_contributor_id(user_id: str) -> str:
    """
    Generate a non-reversible, non-linkable contributor hash.
    The monthly salt ensures that contributor hashes rotate,
    preventing long-term behavioral tracking.
    """
    # Salt rotates on the first of each month
    monthly_salt = datetime.date.today().strftime("%Y-%m")
    raw = f"{user_id}:{monthly_salt}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


The monthly salt rotation is essential. Without it, a SHA-256 hash of a user ID is effectively a permanent pseudonym. An attacker who compromised the aggregation database could track a single contributor's patterns over months or years, even without knowing their real identity. The monthly rotation means that contributor a3f8... in September is a completely different identifier in October. There is no way to link them.
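
To illustrate the rotation, here is a small variant that takes the salt month as an explicit parameter (the real function above derives it from the current date). The helper name is introduced only for this example.

import hashlib

def contributor_hash_for_month(user_id: str, salt_month: str) -> str:
    """Same construction as anonymize_contributor_id, with the month passed in."""
    return hashlib.sha256(f"{user_id}:{salt_month}".encode("utf-8")).hexdigest()

september = contributor_hash_for_month("user-123", "2026-09")
october = contributor_hash_for_month("user-123", "2026-10")

# The two hashes differ, and without the raw user_id there is no practical
# way to recompute one from the other, so the identifiers cannot be correlated.
assert september != october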

We considered daily rotation but found it reduced the system's ability to deduplicate signals from the same contributor within a billing cycle. Monthly rotation is the balance point: short enough to prevent meaningful long-term tracking, long enough to allow accurate contributor counting within a reporting period.

Layer 3: Aggregation Pipeline



Raw signals are bucketed into 60-second time windows and grouped by provider, endpoint, and inferred region (derived from the aggregation node that received the signal, not from any client-reported location).

Within each bucket, we compute:

  • p50, p95, p99 latency: The median, 95th percentile, and 99th percentile response times.
  • Error rate: The percentage of signals reporting a 4xx or 5xx status code.
  • 429 rate: The percentage of signals reporting a 429 specifically (separated because rate limiting has different operational implications than server errors).
  • Timeout rate: The percentage of signals where latency exceeded the client's configured timeout.
  • Contributor count: The number of unique (anonymized) contributors in this bucket.


The contributor count is critical for the next layer. We enforce a strict minimum of 15 unique contributors before any bucket's data is used for detection or displayed to users. This threshold serves two purposes: it prevents false positives from small sample sizes, and it prevents any individual contributor's signal from being isolable in the aggregate.
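
As a sketch of what this looks like in code, the function below collapses one 60-second window of already-grouped signals into the aggregates listed above. The signal dictionary shape, the fixed 30-second timeout default, and the nearest-rank percentile are simplifying assumptions for this article; the output keys, however, match what the detection pseudocode in the next section expects.

def aggregate_bucket(signals: list[dict], timeout_ms: int = 30_000) -> dict:
    """
    Collapse one 60-second window of signals (already grouped by
    provider + endpoint + region) into per-bucket aggregates.
    """
    latencies = sorted(s["latency_ms"] for s in signals)
    n = len(latencies)

    def percentile(p: float) -> int:
        # Nearest-rank percentile: simple, but good enough for a sketch
        return latencies[min(n - 1, int(p * n))]

    total = len(signals)
    errors = sum(1 for s in signals if s["status_code"] >= 400)
    rate_limited = sum(1 for s in signals if s["status_code"] == 429)
    timeouts = sum(1 for s in signals if s["latency_ms"] >= timeout_ms)

    return {
        "latency_p50": percentile(0.50),
        "latency_p95": percentile(0.95),
        "latency_p99": percentile(0.99),
        "error_rate": errors / total,
        "rate_429": rate_limited / total,
        "timeout_rate": timeouts / total,
        "contributor_count": len({s["contributor_hash"] for s in signals}),
    }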

Layer 4: Detection Engine



The detection engine compares current aggregated signals against a rolling 7-day baseline for the same provider, endpoint, and time-of-day. Time-of-day normalization matters because many providers show predictable latency patterns (higher during US business hours, lower on weekends). Without it, a normal Tuesday afternoon spike could trigger a false positive.
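
One way to picture the time-of-day normalization: key the rolling baseline on hour-of-day and day-of-week alongside provider and endpoint, then average the matching buckets from the previous week. The keying and averaging below are an illustrative sketch, not the precise production scheme.

from datetime import datetime, timezone
from statistics import mean

def baseline_key(provider: str, endpoint: str, ts: datetime) -> tuple:
    """Group baselines by provider, endpoint, weekday, and hour (UTC)."""
    ts = ts.astimezone(timezone.utc)
    return (provider, endpoint, ts.weekday(), ts.hour)

def rolling_baseline(matching_buckets: list[dict]) -> dict:
    """
    Average the buckets from the last 7 days that share a baseline key.
    Each bucket carries the aggregates computed in Layer 3.
    """
    return {
        metric: mean(bucket[metric] for bucket in matching_buckets)
        for metric in ("latency_p95", "error_rate", "timeout_rate")
    }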

Here is the pseudocode for the core detection algorithm:

from dataclasses import dataclass
from enum import Enum

class IncidentState(Enum):
    NORMAL = "normal"
    WARNING = "warning"
    INCIDENT = "incident"
    RECOVERING = "recovering"

@dataclass
class DetectionConfig:
    min_contributors: int = 15
    warning_threshold: float = 2.0   # 2x baseline = warning
    incident_threshold: float = 4.0  # 4x baseline = incident
    recovery_windows: int = 3        # 3 consecutive normal windows = recovered
    latency_weight: float = 0.3
    error_rate_weight: float = 0.5
    timeout_weight: float = 0.2

def compute_anomaly_score(
    current_bucket: dict,
    baseline: dict,
    config: DetectionConfig,
) -> float:
    """
    Compute a weighted anomaly score comparing current signals
    to the rolling baseline. Returns a multiplier where 1.0 = normal.
    """
    if current_bucket["contributor_count"] < config.min_contributors:
        return 1.0  # Insufficient data, assume normal

    # Latency anomaly: how far is p95 from baseline p95?
    baseline_p95 = max(baseline["latency_p95"], 1)  # Avoid division by zero
    latency_ratio = current_bucket["latency_p95"] / baseline_p95

    # Error rate anomaly: absolute difference matters more than ratio
    # because baseline error rate may be near zero
    error_delta = current_bucket["error_rate"] - baseline["error_rate"]
    error_score = 1.0 + (error_delta * 20)  # 5% increase -> 2.0 score

    # Timeout anomaly: similar to error rate
    timeout_delta = current_bucket["timeout_rate"] - baseline["timeout_rate"]
    timeout_score = 1.0 + (timeout_delta * 20)

    # Weighted combination
    anomaly_score = (
        config.latency_weight * latency_ratio
        + config.error_rate_weight * max(error_score, 1.0)
        + config.timeout_weight * max(timeout_score, 1.0)
    )

    return anomaly_score

def update_incident_state(
    current_state: IncidentState,
    anomaly_score: float,
    consecutive_normal_windows: int,
    config: DetectionConfig,
) -> tuple[IncidentState, int]:
    """
    State machine for incident lifecycle.
    Returns (new_state, updated_consecutive_normal_count).
    """
    if anomaly_score >= config.incident_threshold:
        return IncidentState.INCIDENT, 0

    if anomaly_score >= config.warning_threshold:
        if current_state == IncidentState.INCIDENT:
            return IncidentState.RECOVERING, 0
        return IncidentState.WARNING, 0

    # Score is below warning threshold
    if current_state in (IncidentState.INCIDENT, IncidentState.RECOVERING):
        consecutive_normal_windows += 1
        if consecutive_normal_windows >= config.recovery_windows:
            return IncidentState.NORMAL, 0
        return IncidentState.RECOVERING, consecutive_normal_windows

    if current_state == IncidentState.WARNING:
        return IncidentState.NORMAL, 0

    return IncidentState.NORMAL, 0


Several design decisions in this algorithm are worth explaining.

Error rate spikes are weighted more heavily than latency spikes. A 2x latency increase is annoying but often workable. A 5% error rate increase means requests are failing. We weight error rate at 0.5 (50% of the score) versus 0.3 for latency because errors have a more immediate impact on application availability. This aligns with what we have observed in historical incident data, which we document in our OpenAI API incident reference.

The 15-contributor minimum is non-negotiable. We tested thresholds of 5, 10, 15, and 25 during our beta period. At 5, we saw false positives roughly once per week, typically caused by a small group of users on the same ISP experiencing a local network issue. At 15, false positives dropped to approximately one per quarter. At 25, we started missing legitimate but geographically concentrated incidents. Fifteen is the sweet spot.

The state machine prevents flapping. Without the RECOVERING state and the 3-window recovery requirement, the system would oscillate between INCIDENT and NORMAL during the tail end of an outage when error rates are declining but still elevated. The recovery buffer ensures that we do not declare "all clear" until the provider is genuinely stable.
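
Tying the pieces together, a per-window evaluation loop might look like the sketch below: each new bucket is scored against its baseline, the state machine advances, and an alert fires only on the transition into INCIDENT. The dispatch_alert hook and evaluate_window wrapper are hypothetical; only compute_anomaly_score and update_incident_state come from the pseudocode above.

def dispatch_alert(score: float) -> None:
    # Placeholder for real alert delivery (webhook, email, dashboard push)
    print(f"ALERT: provider degradation detected (anomaly score {score:.2f})")

def evaluate_window(
    bucket: dict,
    baseline: dict,
    state: IncidentState,
    normal_windows: int,
    config: DetectionConfig,
) -> tuple[IncidentState, int]:
    """Score one 60-second bucket and advance the incident state machine."""
    score = compute_anomaly_score(bucket, baseline, config)
    new_state, normal_windows = update_incident_state(
        state, score, normal_windows, config
    )
    # Fire exactly once, on the transition into INCIDENT
    if new_state == IncidentState.INCIDENT and state != IncidentState.INCIDENT:
        dispatch_alert(score)
    return new_state, normal_windows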

GDPR Compliance



We designed Radar to be compliant with GDPR by default, not as an afterthought. Here is how each GDPR requirement maps to our architecture:

| GDPR Requirement | How Radar Complies |
|---|---|
| Lawful basis | Legitimate interest (aggregated, anonymized infrastructure signals) + explicit consent (opt-in) |
| Data minimization | Five fields per signal, no PII, no request/response content |
| Purpose limitation | Signals used exclusively for provider health detection |
| Right to erasure | Monthly salt rotation means all contributor hashes expire automatically; no persistent identity to erase |
| Data protection by design | Client-side anonymization; PII never reaches the aggregation service |
| Right to object | One-click opt-out; signal contribution is entirely voluntary |

We consulted with a GDPR-specialized law firm during development. Their assessment was that the anonymized, aggregated signals we process do not constitute personal data under GDPR Article 4(1) because they cannot be attributed to an identified or identifiable natural person, even in combination with other data we hold. We maintain this assessment documentation and update it annually.

Why This Detects Incidents Before Status Pages



The timing advantage comes from two structural differences between Radar's detection and a provider's self-reported status:

No human in the loop for detection. Provider status page updates require a human to acknowledge the incident, assess its scope, draft a communication, and publish it. Even with excellent incident management processes, this takes 10-30 minutes. Radar's detection is fully automated: when the anomaly score crosses the threshold, the alert fires. Typical detection latency is 2-4 minutes from the onset of a detectable degradation.

External observation vs. internal monitoring. Providers monitor their own systems from the inside. They see CPU utilization, queue depths, and internal error rates. But some failure modes are only visible from the outside -- for example, a DNS issue that prevents clients in certain regions from resolving the provider's API endpoints. The provider's internal health checks pass because they do not traverse the same network path. Radar sees what developers see, because Radar's signals come from developers.

If you want to understand how to use this detection capability in your own incident response workflow, we wrote a practical guide on monitoring your API endpoints that covers integration patterns.

What We Do Not Do



Transparency requires stating what we do not do as clearly as what we do.

We do not sell data. Aggregated signals are used for detection and displayed to Radar users. They are not sold to providers, investors, or anyone else.

We do not fingerprint users. Beyond the rotating contributor hash, we collect no device identifiers, IP addresses, browser fingerprints, or any other identifier that could be used to track a user across sessions or services.

We do not surveil providers to gain competitive intelligence. Radar detects incidents, not business metrics. We cannot and do not want to infer how many users a provider has, what their revenue is, or what their capacity utilization looks like.

We do not claim perfection. The 15-contributor threshold means we cannot detect incidents that affect fewer than 15 users. Very narrow outages (affecting a single API key, for example) are invisible to community intelligence and always will be. For those, you need your own uptime monitoring.

The Bigger Picture



Radar is one implementation of a broader idea: that developer infrastructure should include a collective awareness layer. We wrote about this concept more broadly in our article on community intelligence as the missing layer in developer infrastructure. The detection algorithm described in this article is the mechanism; the network of developers contributing anonymized signals is the resource that makes it work.

Every contributor makes the system more accurate. Every detection makes every contributor's incident response faster. This is the kind of positive-sum system that the developer ecosystem needs more of.