Automatic Root Cause Analysis for Production Incidents
Manual RCA takes hours and often misses the real cause. Automatic root cause analysis using signal correlation and AI diagnosis does it in 90 seconds. Here's how it works.
Your checkout API went down for 22 minutes last Tuesday. The CEO wants to know why. The product manager wants to know if it will happen again. Your team lead wants a postmortem document by Friday. You are now spending half a day reconstructing what happened, cross-referencing logs from three services, deployment timestamps, and infrastructure metrics, trying to answer a question that boils down to: "What actually broke, and why?"
This is root cause analysis, and for most engineering teams it is still an entirely manual process that takes hours, produces inconsistent results, and often lands on "we think it was X but we're not 100% sure." Meanwhile, the signals needed to answer the question were all available within 90 seconds of the incident starting. They were just scattered across five different tools, and no human could correlate them that fast at 3am.
Automatic root cause analysis changes this equation fundamentally. Not by replacing human judgment, but by doing the tedious correlation and timeline reconstruction work that consumes 80% of RCA time, and presenting engineers with a coherent picture they can verify and act on.
What Root Cause Analysis Actually Is
Root cause analysis is the process of identifying why an incident happened, not just what happened. The distinction matters. "The API returned 502 errors" is what happened. "A database migration locked the users table for 4 minutes during peak traffic, causing connection timeouts that cascaded to the API layer" is why it happened.
Good RCA answers three questions: what happened (the observable symptoms and their impact), why it happened (the causal chain from trigger to symptom), and what will prevent it from happening again.
The output of RCA is not just an explanation -- it is a set of action items that prevent recurrence. "Add a lock timeout to all database migrations" is a good RCA action item. "Be more careful with migrations" is not.
Traditional RCA Methods and Why They Struggle
The 5 Whys
The 5 Whys technique, originally from Toyota's manufacturing process, asks "why?" iteratively until you reach a fundamental cause.
This technique works for linear cause-and-effect chains. It fails for distributed systems because incidents rarely have a single causal chain. They have multiple contributing factors that interact. The database migration locked the table, but the API had no circuit breaker, and the connection pool had no timeout, and the health check did not verify database connectivity. The 5 Whys forces you to pick one chain, which means you miss the other contributing factors.
Fishbone Diagrams (Ishikawa)
Fishbone diagrams categorize potential causes into groups (people, process, technology, environment) and map out contributing factors visually. They are excellent for brainstorming sessions where the team explores all possible causes. They are terrible for 3am incidents where you need an answer in minutes, not hours. Drawing a fishbone diagram requires a group of people, a whiteboard (or virtual equivalent), and time to discuss. This is a postmortem tool, not an incident response tool.
Timeline Reconstruction
The most common practical RCA method is timeline reconstruction: gather timestamps from every relevant system, arrange them chronologically, and trace the causal chain forward from the first anomaly. This is effective but brutally time-consuming.
For a typical incident involving three services, a database, and a deployment, the timeline reconstruction requires something like:

1. Pull alert and uptime-check timestamps from the monitoring tool
2. Export logs from each affected service around the incident window
3. Check the deployment history for recent releases
4. Pull infrastructure and database metrics for the same window
5. Normalize all timestamps to a single timezone
6. Merge everything into one chronological timeline
7. Trace the causal chain forward from the first anomaly and write up the findings
Steps 1-6 take 30-90 minutes depending on how many tools are involved and how accessible the data is. Step 7 takes another 30 minutes to write up. The entire process easily consumes half a day when you include the time to write the postmortem document.
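The merge itself is mechanically trivial; the expense is in extracting and normalizing data from each tool. A minimal sketch of that merge step, with an illustrative event shape (not any particular tool's export format):

```typescript
// One row of a reconstructed timeline. `at` is an ISO 8601 UTC timestamp.
interface TimelineEvent {
  at: string;
  tool: string;
  detail: string;
}

// Merge per-tool event lists into one chronological timeline.
function mergeTimeline(...perTool: TimelineEvent[][]): TimelineEvent[] {
  return perTool
    .flat()
    .sort((x, y) => Date.parse(x.at) - Date.parse(y.at));
}
```

Everything before this step, which is steps 1 through 5, is the part that consumes the 30-90 minutes.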
Why Manual RCA Fails for Distributed Systems
Distributed systems break in distributed ways. A single root cause produces symptoms across multiple services, multiple monitoring tools, and multiple infrastructure layers. The signal-to-noise ratio is low because there are always background anomalies unrelated to the incident.
Consider a real-world scenario: an AWS availability zone has elevated network latency for 90 seconds. During those 90 seconds:

- Health checks for two services time out intermittently
- Log error rates spike across every service in the affected zone
- A cron job misses its heartbeat because its database query timed out
- Connection pool exhaustion alerts fire on the API layer
- An unrelated deploy completes and appears in the deployment feed
A human performing manual RCA has to determine that all of these symptoms trace back to one root cause (AZ network latency), the unrelated deploy is coincidental, and the cron job failure is a downstream effect rather than a separate issue. This requires correlating timestamps across 6+ data sources, understanding the dependency graph between services, and filtering out the coincidental deploy.
An experienced SRE can do this in 30-45 minutes. A developer who is not an SRE specialist might take 2 hours. An AI system with access to all the data and a dependency graph can do it in 90 seconds.
How Automatic RCA Works
Automatic RCA is not a single algorithm. It is a pipeline that combines data aggregation, temporal correlation, dependency-aware analysis, and AI-powered narrative generation. Here is how the pipeline works in Luxkern Sentinel.
Stage 1: Signal Collection (0-60 seconds)
When the first alert fires, Sentinel starts a correlation window (default: 5 minutes). During this window, it collects every signal from every connected source:

- Uptime and health-check results (PingCheck)
- Cron job heartbeats and missed executions (CronSafe)
- Log error patterns and spikes (LogDrain)
- Third-party webhooks: error trackers, CI/CD deploy events, cloud provider alarms
Each signal includes a timestamp, source, severity, affected service, and raw data payload.
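Concretely, a collected signal might look like the following sketch. The `Signal` field names are illustrative, not Sentinel's actual schema:

```typescript
// Hypothetical shape of a collected signal (illustrative field names).
interface Signal {
  timestamp: string; // ISO 8601, normalized to UTC
  source: string; // e.g. "pingcheck", "logdrain/api-main"
  severity: "info" | "warning" | "critical";
  service: string; // affected service, as named in the dependency graph
  payload: Record<string, unknown>; // raw data from the source
}

const signal: Signal = {
  timestamp: "2026-08-19T14:31:12Z",
  source: "logdrain/api-main",
  severity: "critical",
  service: "api-main",
  payload: { message: "Redis connection refused (redis-01.internal:6379)" }
};
```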
Stage 2: Temporal Clustering (automatic)
Signals are grouped by time proximity. If an uptime check fails at 14:32:01 and a log error spike starts at 14:31:45 and a cron job misses its heartbeat at 14:32:30, these are temporally clustered as potentially related. The clustering algorithm uses a configurable window (default 5 minutes) with weighted edges -- signals closer in time are more strongly linked.
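A minimal sketch of the grouping, assuming signals carry epoch-second timestamps. The real clustering uses weighted edges rather than this simple single-pass grouping, so the `linkWeight` helper is only an illustration of "closer in time = stronger":

```typescript
interface TimedSignal {
  timestamp: number; // epoch seconds
  source: string;
}

// Single-pass grouping: a signal joins the previous cluster if it falls
// within the window of that cluster's latest signal.
function clusterByTime(signals: TimedSignal[], windowSec = 300): TimedSignal[][] {
  const sorted = [...signals].sort((a, b) => a.timestamp - b.timestamp);
  const clusters: TimedSignal[][] = [];
  for (const sig of sorted) {
    const last = clusters[clusters.length - 1];
    if (last && sig.timestamp - last[last.length - 1].timestamp <= windowSec) {
      last.push(sig);
    } else {
      clusters.push([sig]);
    }
  }
  return clusters;
}

// Illustrative edge weight: signals closer in time are more strongly linked.
function linkWeight(dtSec: number, windowSec = 300): number {
  return dtSec > windowSec ? 0 : 1 - dtSec / windowSec;
}
```

With the timestamps from the example above (14:31:45, 14:32:01, 14:32:30), all three signals land in a single cluster.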
Stage 3: Dependency Graph Overlay
The temporal clusters are then cross-referenced with the service dependency graph. If api-main depends on prod-db-01, and both show anomalies in the same temporal cluster, the dependency relationship strengthens the correlation. If two services show anomalies at the same time but have no dependency relationship, the correlation is weaker (they might both be affected by a shared infrastructure issue, or it might be coincidence).

```typescript
// Example dependency graph that powers correlation
const dependencyGraph = {
  "api-main": {
    upstream: ["cdn", "load-balancer"],
    downstream: ["prod-db-01", "redis-01", "stripe-api"],
    cron_jobs: ["cleanup-sessions", "refresh-cache"]
  },
  "worker-billing": {
    upstream: ["queue-sqs"],
    downstream: ["prod-db-01", "stripe-api"],
    cron_jobs: ["process-payments", "generate-invoices"]
  },
  "prod-db-01": {
    type: "postgresql",
    dependents: ["api-main", "worker-billing", "analytics-service"],
    metrics: ["connection_count", "replication_lag", "disk_usage"]
  }
};
```

Stage 4: AI Analysis (60-90 seconds)
With the correlated signals and dependency context assembled, the AI (Claude Sonnet) receives a structured prompt containing:

- The chronological list of correlated signals with their raw payloads
- The relevant slice of the service dependency graph
- Recent deployment events inside the correlation window
- Metadata about each affected service (type, dependencies, monitored metrics)
The AI produces a structured analysis that includes the identified root cause, confidence level, causal chain, affected services, timeline, and suggested remediation steps.
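A hypothetical rendering of that structured output as a typed object (the field names are illustrative, not Sentinel's actual API):

```typescript
// Illustrative shape of the AI's structured analysis output.
interface RcaAnalysis {
  rootCause: string;
  confidence: "LOW" | "MEDIUM" | "HIGH";
  causalChain: string[]; // ordered from trigger to downstream effects
  affectedServices: { service: string; impact: string; severity: string }[];
  excludedSignals: string[]; // signals judged coincidental, with reasons
  remediation: string[]; // suggested next steps, to be verified by a human
}
```

Structured output like this is what makes the later stages possible: it can be rendered for the on-call engineer, diffed against their corrections, and turned directly into action items.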
Stage 5: Human Verification
This is the critical step that distinguishes responsible AI-powered RCA from reckless automation. The AI analysis is presented to the on-call engineer alongside all raw signals. The engineer can:

- Confirm the analysis as correct
- Correct the identified root cause or causal chain
- Flag contributing factors the AI missed
- Dismiss the analysis entirely and investigate manually
This feedback loop improves the system over time for your specific infrastructure.
A Concrete Example: Automated RCA Walkthrough
Let us walk through a real scenario from beginning to end.
The Incident
At 14:31 UTC on a Tuesday, users start reporting that checkout is failing on an e-commerce application. The engineering team has the following monitoring in place:

- PingCheck on the api.store.com/health endpoint
- CronSafe heartbeat monitoring on the process-payments job that runs every 5 minutes
- LogDrain collecting logs from api-main and worker-billing
- A Sentry webhook for error spikes
- A GitHub Actions webhook for deploy events

Signals Received (14:31 - 14:35 UTC)
```
14:31:12 [LogDrain/api-main]       ERROR: Redis connection refused (redis-01.internal:6379)
14:31:15 [LogDrain/api-main]       ERROR: Failed to read session data - ECONNREFUSED
14:31:30 [PingCheck]               WARNING: api.store.com/health response time 4200ms (threshold: 2000ms)
14:31:45 [Sentry/webhook]          Error spike: SessionReadError (42 events in 60s)
14:32:00 [PingCheck]               CRITICAL: api.store.com/health returned 503
14:32:15 [LogDrain/worker-billing] ERROR: Cannot acquire lock for payment processing - Redis unavailable
14:33:00 [CronSafe]                MISSED: process-payments heartbeat (expected 14:30, not received)
14:34:01 [GitHub Actions/webhook]  Deploy completed: worker-billing v2.14.3 (commit: af9c21)
```

Sentinel Correlation Output
After the 5-minute correlation window, Sentinel produces the following:
## Automated Root Cause Analysis
Incident ID: INC-2026-0819-001
Duration so far: 4 minutes
Status: Active
---
Root Cause (HIGH confidence)
Redis instance redis-01.internal became unreachable at
approximately 14:31:10 UTC.
Causal Chain
1. redis-01.internal stopped accepting connections (trigger)
2. api-main failed to read session data from Redis (direct impact)
3. Health endpoint response time degraded, then returned 503 (cascade - session middleware blocks request pipeline)
4. worker-billing could not acquire distributed lock for payment processing (parallel impact - same Redis dependency)
5. process-payments cron job timed out waiting for lock, missed heartbeat (downstream of worker-billing)
Excluded Signal
The GitHub Actions deploy of worker-billing v2.14.3 at 14:34:01
completed AFTER the incident started (14:31:10). This deploy is
NOT the cause. However, deploying during an active incident may
complicate recovery. Consider rolling back this deploy as a
precaution.
Affected Services
| Service | Impact | Severity |
|---|---|---|
| api-main | 503 errors, sessions unavailable | CRITICAL |
| worker-billing | Payment processing blocked | HIGH |
| cron: process-payments | Missed execution | MEDIUM |
Suggested Remediation
Immediate: Check Redis instance status
- Is the process running? `redis-cli -h redis-01.internal ping`
- Check memory: `redis-cli -h redis-01.internal info memory`
- Check connectivity from the API server: `nc -zv redis-01.internal 6379`

If Redis is down: Restart the instance or fail over to the replica (if configured).
If Redis is reachable but refusing connections: Check the maxclients setting and current connection count.
Confidence Notes
- HIGH confidence that Redis is the root cause (3 independent services show Redis connection failures starting at the same time)
- The deploy at 14:34 is almost certainly unrelated (started after the incident began) but cannot be 100% excluded
- Unknown: WHY Redis went down. Once the instance is back, check Redis logs for OOM killer, disk full, or crash signals
---
*This analysis was generated automatically. Raw signals are available below. Always verify before taking action.*

What Took 90 Seconds vs. What Would Have Taken 45 Minutes
Let us compare what the automated process did versus what a human would have done manually.
Automated (90 seconds):

- Collected all eight signals as they arrived, across five sources
- Clustered them into a single temporal group within the 5-minute window
- Overlaid the dependency graph to trace every symptom back to redis-01
- Excluded the 14:34 deploy as occurring after the incident started
- Generated the structured analysis with remediation steps

Manual (estimated 30-45 minutes):

- Notice the alert, open the uptime dashboard, confirm the 503s
- Search api-main logs, find the Redis connection errors
- Search worker-billing logs, find the lock acquisition failures
- Check the cron monitoring tool for the missed heartbeat
- Check the deploy history, investigate (and eventually rule out) v2.14.3
- Build a mental timeline, identify Redis as the common dependency
- Write it all up
The automated process is not smarter than the human. It just has access to all the data simultaneously and can correlate it in parallel, whereas a human has to sequentially open tools, read outputs, and build a mental model.
Setting Up Automatic RCA
Getting automatic RCA working requires three things: connected data sources, a service dependency graph, and a feedback loop.
Connect Your Data Sources
The more signal sources Sentinel has access to, the more accurate the correlation. At minimum, you need two of the following. For comprehensive RCA, connect all available sources.
```typescript
// sentinel-sources.config.ts
export const sources = {
  // Luxkern native -- auto-connected
  pingcheck: { enabled: true },
  cronsafe: { enabled: true },
  logdrain: { enabled: true },

  // Third-party via webhooks
  webhooks: [
    {
      name: "sentry",
      parser: "sentry-v4",
      // Filter to production only
      filter: { environment: "production" }
    },
    {
      name: "github-actions",
      parser: "github-actions",
      // Only deployment events
      filter: { event: "deployment" }
    },
    {
      name: "aws-cloudwatch",
      parser: "cloudwatch-alarm",
      filter: { severity: ["ALARM"] }
    }
  ]
};
```

Define Your Dependency Graph
This is the single most impactful thing you can do for RCA accuracy. Without a dependency graph, the AI can only use temporal correlation (things that happen at the same time might be related). With a dependency graph, it can use causal correlation (service A depends on service B, and service B broke, so service A's failure is caused by service B). We covered the importance of understanding service dependencies in our guide on how to calculate SLA uptime and downtime -- your SLA is only as good as your weakest dependency.
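To sketch why the graph matters, a correlation score can weight a temporal match by whether a dependency edge exists between the two services. The weights and shape here are illustrative, not Sentinel's actual algorithm:

```typescript
// service -> services it depends on (illustrative graph shape)
type Graph = Record<string, string[]>;

interface ScoredSignal {
  service: string;
  timestamp: number; // epoch seconds
}

// Combine temporal proximity with dependency awareness: a pair of
// anomalies linked by a dependency edge scores higher than a pair
// that is merely simultaneous.
function correlationScore(
  a: ScoredSignal,
  b: ScoredSignal,
  graph: Graph,
  windowSec = 300
): number {
  const dt = Math.abs(a.timestamp - b.timestamp);
  if (dt > windowSec) return 0;
  const temporal = 1 - dt / windowSec; // closer in time = stronger
  const dependent =
    (graph[a.service] ?? []).includes(b.service) ||
    (graph[b.service] ?? []).includes(a.service);
  return dependent ? temporal : temporal * 0.4; // unrelated pairs score lower
}
```

Under this sketch, api-main failing 30 seconds after its dependency prod-db-01 scores much higher than api-main failing alongside an unrelated service at the same moment.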
Provide Feedback
After every incident, take 30 seconds to confirm or correct the AI analysis. This is not optional busywork -- it directly improves future accuracy. After 10-15 incidents with feedback, the system has learned the specific patterns of your infrastructure (e.g., "when api-main shows ECONNREFUSED and worker-billing shows lock timeout, it is always Redis, never Postgres").
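A feedback entry does not need to be elaborate; something like the following hypothetical record (illustrative field names, not Sentinel's API) is enough to confirm, correct, or reject an analysis:

```typescript
// Hypothetical feedback record for a completed analysis.
interface RcaFeedback {
  incidentId: string;
  verdict: "confirmed" | "corrected" | "rejected";
  correctedRootCause?: string; // filled in when verdict is "corrected"
  notes?: string;
}

const feedback: RcaFeedback = {
  incidentId: "INC-2026-0819-001",
  verdict: "confirmed",
  notes: "Root cause verified against Redis logs after recovery."
};
```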
Limitations and Honest Caveats
Automatic RCA is powerful but not infallible. Here are the limitations we have observed.
Novel failure modes. If the root cause is something the AI has never seen in any context (a cosmic ray bit flip, a kernel bug in a specific patch version, a misconfigured firewall rule that only triggers under specific traffic patterns), the AI will correctly correlate the symptoms but may suggest a wrong root cause. It will still save time on correlation, but the diagnosis step may require human expertise.
Missing signals. If the actual root cause does not produce any signal in any connected monitoring tool, the AI cannot identify it. For example, if a DNS change causes failures but you do not monitor DNS, the AI will see the symptoms but not the cause. The fix is to connect more signal sources, but you cannot monitor what you do not know to monitor.
Complex multi-cause incidents. Sometimes an incident has two or more independent root causes that happen to overlap. The AI may attribute all symptoms to one cause and miss the second. Showing raw signals alongside the AI analysis helps engineers spot when the AI has missed something.
Confidence calibration. The AI's confidence levels are probabilistic, not certain. "HIGH confidence" means the causal chain is well-supported by multiple correlated signals and dependency relationships. It does not mean the analysis is guaranteed correct. Always verify. We surface the raw logs and metrics alongside the RCA summary for exactly this reason -- if you want to explore deeper, check out our guide on centralizing logs to make this data accessible.
From Incident to Prevention
The ultimate goal of RCA is not explaining what happened -- it is preventing it from happening again. Automatic RCA accelerates the path to prevention because the structured output (root cause, causal chain, affected services) maps directly to action items.
From the Redis example above, the automatic action items would be:

- Configure a Redis replica and automatic failover (the incident had no failover path)
- Add a connection timeout and graceful fallback to api-main's session middleware so a Redis outage degrades service instead of returning 503s
- Add a lock-acquisition timeout to worker-billing so payment processing fails fast with a clear error
- Alert on Redis memory and connection count to catch the next failure before it cascades
- Pause deploys while an incident is active, so recovery is not complicated by mid-incident releases
These action items come directly from the automated analysis, not from a 2-hour brainstorming session two weeks after the incident. They are specific, actionable, and directly tied to the failure mode that was observed.
Conclusion
Manual root cause analysis is a necessary process trapped in an outdated methodology. It takes hours, produces inconsistent results, and often concludes with "we think this is what happened." Automatic RCA does not eliminate the need for human judgment, but it eliminates the tedious correlation work that consumes most of the time. Ninety seconds of automated signal collection, dependency-aware correlation, and AI-powered analysis replaces 30-60 minutes of a human opening tabs, reading logs, and building a mental timeline.
The investment is connecting your monitoring sources and defining your dependency graph. The payoff is getting a structured root cause analysis in under two minutes instead of scheduling a postmortem meeting three days later where everyone tries to remember what happened. Your future 3am self will thank you.