Automatic Root Cause Analysis for Production Incidents
Manual RCA takes hours and often misses the real cause. Automatic root cause analysis using signal correlation and AI diagnosis does it in 90 seconds. Here's how it works.
Your checkout API went down for 22 minutes last Tuesday. The CEO wants to know why. The product manager wants to know if it will happen again. Your team lead wants a postmortem document by Friday. You are now spending half a day reconstructing what happened, cross-referencing logs from three services, deployment timestamps, and infrastructure metrics, trying to answer a question that boils down to: "What actually broke, and why?"
This is root cause analysis, and for most engineering teams it is still an entirely manual process that takes hours, produces inconsistent results, and often lands on "we think it was X but we're not 100% sure." Meanwhile, the signals needed to answer the question were all available within 90 seconds of the incident starting. They were just scattered across five different tools, and no human could correlate them that fast at 3am.
Automatic root cause analysis changes this equation fundamentally. Not by replacing human judgment, but by doing the tedious correlation and timeline reconstruction work that consumes 80% of RCA time, and presenting engineers with a coherent picture they can verify and act on.
What Root Cause Analysis Actually Is
Root cause analysis is the process of identifying why an incident happened, not just what happened. The distinction matters. "The API returned 502 errors" is what happened. "A database migration locked the users table for 4 minutes during peak traffic, causing connection timeouts that cascaded to the API layer" is why it happened.
Good RCA answers three questions: what happened (the observable symptoms and their impact), why it happened (the causal chain from trigger to symptom), and what will prevent it from happening again.
The output of RCA is not just an explanation -- it is a set of action items that prevent recurrence. "Add a lock timeout to all database migrations" is a good RCA action item. "Be more careful with migrations" is not.
Traditional RCA Methods and Why They Struggle
The 5 Whys
The 5 Whys technique, originally from Toyota's manufacturing process, asks "why?" iteratively until you reach a fundamental cause.
This technique works for linear cause-and-effect chains. It fails for distributed systems because incidents rarely have a single causal chain. They have multiple contributing factors that interact. The database migration locked the table, but the API had no circuit breaker, and the connection pool had no timeout, and the health check did not verify database connectivity. The 5 Whys forces you to pick one chain, which means you miss the other contributing factors.
Fishbone Diagrams (Ishikawa)
Fishbone diagrams categorize potential causes into groups (people, process, technology, environment) and map out contributing factors visually. They are excellent for brainstorming sessions where the team explores all possible causes. They are terrible for 3am incidents where you need an answer in minutes, not hours. Drawing a fishbone diagram requires a group of people, a whiteboard (or virtual equivalent), and time to discuss. This is a postmortem tool, not an incident response tool.
Timeline Reconstruction
The most common practical RCA method is timeline reconstruction: gather timestamps from every relevant system, arrange them chronologically, and trace the causal chain forward from the first anomaly. This is effective but brutally time-consuming.
For a typical incident involving three services, a database, and a deployment, the timeline reconstruction requires something like:

1. Pull alert and uptime-check timestamps from the monitoring tool
2. Export logs from each affected service around the incident window
3. Check the deployment history for recent releases
4. Pull infrastructure and database metrics for the same window
5. Normalize all timestamps to a single timezone
6. Merge everything into one chronological timeline
7. Trace the causal chain forward from the first anomaly and write up the findings
Steps 1-6 take 30-90 minutes depending on how many tools are involved and how accessible the data is. Step 7 takes another 30 minutes to write up. The entire process easily consumes half a day when you include the time to write the postmortem document.
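The merge itself is mechanically trivial; the expense is in extracting and normalizing data from each tool. A minimal sketch of that merge step, with an illustrative event shape (not any particular tool's export format):

```typescript
// One row of a reconstructed timeline. `at` is an ISO 8601 UTC timestamp.
interface TimelineEvent {
  at: string;
  tool: string;
  detail: string;
}

// Merge per-tool event lists into one chronological timeline.
function mergeTimeline(...perTool: TimelineEvent[][]): TimelineEvent[] {
  return perTool
    .flat()
    .sort((x, y) => Date.parse(x.at) - Date.parse(y.at));
}
```

Everything before this step, which is steps 1 through 5, is the part that consumes the 30-90 minutes.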
Why Manual RCA Fails for Distributed Systems
Distributed systems break in distributed ways. A single root cause produces symptoms across multiple services, multiple monitoring tools, and multiple infrastructure layers. The signal-to-noise ratio is low because there are always background anomalies unrelated to the incident.
Consider a real-world scenario: an AWS availability zone has elevated network latency for 90 seconds. During those 90 seconds:

- Health checks for two services time out intermittently
- Log error rates spike across every service in the affected zone
- A cron job misses its heartbeat because its database query timed out
- Connection pool exhaustion alerts fire on the API layer
- An unrelated deploy completes and appears in the deployment feed
A human performing manual RCA has to determine that all of these symptoms trace back to one root cause (AZ network latency), the unrelated deploy is coincidental, and the cron job failure is a downstream effect rather than a separate issue. This requires correlating timestamps across 6+ data sources, understanding the dependency graph between services, and filtering out the coincidental deploy.
An experienced SRE can do this in 30-45 minutes. A developer who is not an SRE specialist might take 2 hours. An AI system with access to all the data and a dependency graph can do it in 90 seconds.
How Automatic RCA Works
Automatic RCA is not a single algorithm. It is a pipeline that combines data aggregation, temporal correlation, dependency-aware analysis, and AI-powered narrative generation. Here is how the pipeline works in Luxkern Sentinel.
Stage 1: Signal Collection (0-60 seconds)
When the first alert fires, Sentinel starts a correlation window (default: 5 minutes). During this window, it collects every signal from every connected source:

- Uptime and health-check results (PingCheck)
- Cron job heartbeats and missed executions (CronSafe)
- Log error patterns and spikes (LogDrain)
- Third-party webhooks: error trackers, CI/CD deploy events, cloud provider alarms
Each signal includes a timestamp, source, severity, affected service, and raw data payload.
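Concretely, a collected signal might look like the following sketch. The `Signal` field names are illustrative, not Sentinel's actual schema:

```typescript
// Hypothetical shape of a collected signal (illustrative field names).
interface Signal {
  timestamp: string; // ISO 8601, normalized to UTC
  source: string; // e.g. "pingcheck", "logdrain/api-main"
  severity: "info" | "warning" | "critical";
  service: string; // affected service, as named in the dependency graph
  payload: Record<string, unknown>; // raw data from the source
}

const signal: Signal = {
  timestamp: "2026-08-19T14:31:12Z",
  source: "logdrain/api-main",
  severity: "critical",
  service: "api-main",
  payload: { message: "Redis connection refused (redis-01.internal:6379)" }
};
```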
Stage 2: Temporal Clustering (automatic)
Signals are grouped by time proximity. If an uptime check fails at 14:32:01 and a log error spike starts at 14:31:45 and a cron job misses its heartbeat at 14:32:30, these are temporally clustered as potentially related. The clustering algorithm uses a configurable window (default 5 minutes) with weighted edges -- signals closer in time are more strongly linked.
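A minimal sketch of the grouping, assuming signals carry epoch-second timestamps. The real clustering uses weighted edges rather than this simple single-pass grouping, so the `linkWeight` helper is only an illustration of "closer in time = stronger":

```typescript
interface TimedSignal {
  timestamp: number; // epoch seconds
  source: string;
}

// Single-pass grouping: a signal joins the previous cluster if it falls
// within the window of that cluster's latest signal.
function clusterByTime(signals: TimedSignal[], windowSec = 300): TimedSignal[][] {
  const sorted = [...signals].sort((a, b) => a.timestamp - b.timestamp);
  const clusters: TimedSignal[][] = [];
  for (const sig of sorted) {
    const last = clusters[clusters.length - 1];
    if (last && sig.timestamp - last[last.length - 1].timestamp <= windowSec) {
      last.push(sig);
    } else {
      clusters.push([sig]);
    }
  }
  return clusters;
}

// Illustrative edge weight: signals closer in time are more strongly linked.
function linkWeight(dtSec: number, windowSec = 300): number {
  return dtSec > windowSec ? 0 : 1 - dtSec / windowSec;
}
```

With the timestamps from the example above (14:31:45, 14:32:01, 14:32:30), all three signals land in a single cluster.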
Stage 3: Dependency Graph Overlay
The temporal clusters are then cross-referenced with the service dependency graph. If api-main depends on prod-db-01, and both show anomalies in the same temporal cluster, the dependency relationship strengthens the correlation. If two services show anomalies at the same time but have no dependency relationship, the correlation is weaker (they might both be affected by a shared infrastructure issue, or it might be coincidence).

```typescript
// Example dependency graph that powers correlation
const dependencyGraph = {
  "api-main": {
    upstream: ["cdn", "load-balancer"],
    downstream: ["prod-db-01", "redis-01", "stripe-api"],
    cron_jobs: ["cleanup-sessions", "refresh-cache"]
  },
  "worker-billing": {
    upstream: ["queue-sqs"],
    downstream: ["prod-db-01", "stripe-api"],
    cron_jobs: ["process-payments", "generate-invoices"]
  },
  "prod-db-01": {
    type: "postgresql",
    dependents: ["api-main", "worker-billing", "analytics-service"],
    metrics: ["connection_count", "replication_lag", "disk_usage"]
  }
};
```

Stage 4: AI Analysis (60-90 seconds)
With the correlated signals and dependency context assembled, the AI (Claude Sonnet) receives a structured prompt containing:

- The chronological list of correlated signals with their raw payloads
- The relevant slice of the service dependency graph
- Recent deployment events inside the correlation window
- Metadata about each affected service (type, dependencies, monitored metrics)
The AI produces a structured analysis that includes the identified root cause, confidence level, causal chain, affected services, timeline, and suggested remediation steps.
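A hypothetical rendering of that structured output as a typed object (the field names are illustrative, not Sentinel's actual API):

```typescript
// Illustrative shape of the AI's structured analysis output.
interface RcaAnalysis {
  rootCause: string;
  confidence: "LOW" | "MEDIUM" | "HIGH";
  causalChain: string[]; // ordered from trigger to downstream effects
  affectedServices: { service: string; impact: string; severity: string }[];
  excludedSignals: string[]; // signals judged coincidental, with reasons
  remediation: string[]; // suggested next steps, to be verified by a human
}
```

Structured output like this is what makes the later stages possible: it can be rendered for the on-call engineer, diffed against their corrections, and turned directly into action items.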
Stage 5: Human Verification
This is the critical step that distinguishes responsible AI-powered RCA from reckless automation. The AI analysis is presented to the on-call engineer alongside all raw signals. The engineer can:

- Confirm the analysis as correct
- Correct the identified root cause or causal chain
- Flag contributing factors the AI missed
- Dismiss the analysis entirely and investigate manually
This feedback loop improves the system over time for your specific infrastructure.
A Concrete Example: Automated RCA Walkthrough
Let us walk through a real scenario from beginning to end.
The Incident
At 14:31 UTC on a Tuesday, users start reporting that checkout is failing on an e-commerce application. The engineering team has the following monitoring in place:

- PingCheck on the api.store.com/health endpoint
- CronSafe heartbeat monitoring on the process-payments job that runs every 5 minutes
- LogDrain collecting logs from api-main and worker-billing
- A Sentry webhook for error spikes
- A GitHub Actions webhook for deploy events

Signals Received (14:31 - 14:35 UTC)
```
14:31:12 [LogDrain/api-main]       ERROR: Redis connection refused (redis-01.internal:6379)
14:31:15 [LogDrain/api-main]       ERROR: Failed to read session data - ECONNREFUSED
14:31:30 [PingCheck]               WARNING: api.store.com/health response time 4200ms (threshold: 2000ms)
14:31:45 [Sentry/webhook]          Error spike: SessionReadError (42 events in 60s)
14:32:00 [PingCheck]               CRITICAL: api.store.com/health returned 503
14:32:15 [LogDrain/worker-billing] ERROR: Cannot acquire lock for payment processing - Redis unavailable
14:33:00 [CronSafe]                MISSED: process-payments heartbeat (expected 14:30, not received)
14:34:01 [GitHub Actions/webhook]  Deploy completed: worker-billing v2.14.3 (commit: af9c21)
```

Sentinel Correlation Output
After the 5-minute correlation window, Sentinel produces the following:
## Automated Root Cause Analysis
Incident ID: INC-2026-0819-001
Duration so far: 4 minutes
Status: Active
---
Root Cause (HIGH confidence)
Redis instance redis-01.internal became unreachable at
approximately 14:31:10 UTC.
Causal Chain
1. redis-01.internal stopped accepting connections (trigger)
2. api-main failed to read session data from Redis (direct impact)
3. Health endpoint response time degraded, then returned 503 (cascade - session middleware blocks request pipeline)
4. worker-billing could not acquire distributed lock for payment processing (parallel impact - same Redis dependency)
5. process-payments cron job timed out waiting for lock, missed heartbeat (downstream of worker-billing)
Excluded Signal
The GitHub Actions deploy of worker-billing v2.14.3 at 14:34:01
completed AFTER the incident started (14:31:10). This deploy is
NOT the cause. However, deploying during an active incident may
complicate recovery. Consider rolling back this deploy as a
precaution.
Affected Services
| Service | Impact | Severity |
|---|---|---|
| api-main | 503 errors, sessions unavailable | CRITICAL |
| worker-billing | Payment processing blocked | HIGH |
| cron: process-payments | Missed execution | MEDIUM |
Suggested Remediation
Immediate: Check Redis instance status
- Is the process running? `redis-cli -h redis-01.internal ping`
- Check memory: `redis-cli -h redis-01.internal info memory`
- Check connectivity from the API server: `nc -zv redis-01.internal 6379`

If Redis is down: Restart the instance or fail over to the replica (if configured).
If Redis is reachable but refusing connections: Check the maxclients setting and current connection count.
Confidence Notes
- HIGH confidence that Redis is the root cause (3 independent services show Redis connection failures starting at the same time)
- The deploy at 14:34 is almost certainly unrelated (started after the incident began) but cannot be 100% excluded
- Unknown: WHY Redis went down. Once the instance is back, check Redis logs for OOM killer, disk full, or crash signals
---
*This analysis was generated automatically. Raw signals are available below. Always verify before taking action.*

What Took 90 Seconds vs. What Would Have Taken 45 Minutes
Let us compare what the automated process did versus what a human would have done manually.
Automated (90 seconds):

- Collected all eight signals as they arrived, across five sources
- Clustered them into a single temporal group within the 5-minute window
- Overlaid the dependency graph to trace every symptom back to redis-01
- Excluded the 14:34 deploy as occurring after the incident started
- Generated the structured analysis with remediation steps

Manual (estimated 30-45 minutes):

- Notice the alert, open the uptime dashboard, confirm the 503s
- Search api-main logs, find the Redis connection errors
- Search worker-billing logs, find the lock acquisition failures
- Check the cron monitoring tool for the missed heartbeat
- Check the deploy history, investigate (and eventually rule out) v2.14.3
- Build a mental timeline, identify Redis as the common dependency
- Write it all up
The automated process is not smarter than the human. It just has access to all the data simultaneously and can correlate it in parallel, whereas a human has to sequentially open tools, read outputs, and build a mental model.
Setting Up Automatic RCA
Getting automatic RCA working requires three things: connected data sources, a service dependency graph, and a feedback loop.
Connect Your Data Sources
The more signal sources Sentinel has access to, the more accurate the correlation. At minimum, you need two of the following. For comprehensive RCA, connect all available sources.
```typescript
// sentinel-sources.config.ts
export const sources = {
  // Luxkern native -- auto-connected
  pingcheck: { enabled: true },
  cronsafe: { enabled: true },
  logdrain: { enabled: true },

  // Third-party via webhooks
  webhooks: [
    {
      name: "sentry",
      parser: "sentry-v4",
      // Filter to production only
      filter: { environment: "production" }
    },
    {
      name: "github-actions",
      parser: "github-actions",
      // Only deployment events
      filter: { event: "deployment" }
    },
    {
      name: "aws-cloudwatch",
      parser: "cloudwatch-alarm",
      filter: { severity: ["ALARM"] }
    }
  ]
};
```

Define Your Dependency Graph
This is the single most impactful thing you can do for RCA accuracy. Without a dependency graph, the AI can only use temporal correlation (things that happen at the same time might be related). With a dependency graph, it can use causal correlation (service A depends on service B, and service B broke, so service A's failure is caused by service B). We covered the importance of understanding service dependencies in our guide on how to calculate SLA uptime and downtime -- your SLA is only as good as your weakest dependency.
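To sketch why the graph matters, a correlation score can weight a temporal match by whether a dependency edge exists between the two services. The weights and shape here are illustrative, not Sentinel's actual algorithm:

```typescript
// service -> services it depends on (illustrative graph shape)
type Graph = Record<string, string[]>;

interface ScoredSignal {
  service: string;
  timestamp: number; // epoch seconds
}

// Combine temporal proximity with dependency awareness: a pair of
// anomalies linked by a dependency edge scores higher than a pair
// that is merely simultaneous.
function correlationScore(
  a: ScoredSignal,
  b: ScoredSignal,
  graph: Graph,
  windowSec = 300
): number {
  const dt = Math.abs(a.timestamp - b.timestamp);
  if (dt > windowSec) return 0;
  const temporal = 1 - dt / windowSec; // closer in time = stronger
  const dependent =
    (graph[a.service] ?? []).includes(b.service) ||
    (graph[b.service] ?? []).includes(a.service);
  return dependent ? temporal : temporal * 0.4; // unrelated pairs score lower
}
```

Under this sketch, api-main failing 30 seconds after its dependency prod-db-01 scores much higher than api-main failing alongside an unrelated service at the same moment.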
Provide Feedback
After every incident, take 30 seconds to confirm or correct the AI analysis. This is not optional busywork -- it directly improves future accuracy. After 10-15 incidents with feedback, the system has learned the specific patterns of your infrastructure (e.g., "when api-main shows ECONNREFUSED and worker-billing shows lock timeout, it is always Redis, never Postgres").
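A feedback entry does not need to be elaborate; something like the following hypothetical record (illustrative field names, not Sentinel's API) is enough to confirm, correct, or reject an analysis:

```typescript
// Hypothetical feedback record for a completed analysis.
interface RcaFeedback {
  incidentId: string;
  verdict: "confirmed" | "corrected" | "rejected";
  correctedRootCause?: string; // filled in when verdict is "corrected"
  notes?: string;
}

const feedback: RcaFeedback = {
  incidentId: "INC-2026-0819-001",
  verdict: "confirmed",
  notes: "Root cause verified against Redis logs after recovery."
};
```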
Limitations and Honest Caveats
Automatic RCA is powerful but not infallible. Here are the limitations we have observed.
Novel failure modes. If the root cause is something the AI has never seen in any context (a cosmic ray bit flip, a kernel bug in a specific patch version, a misconfigured firewall rule that only triggers under specific traffic patterns), the AI will correctly correlate the symptoms but may suggest a wrong root cause. It will still save time on correlation, but the diagnosis step may require human expertise.
Missing signals. If the actual root cause does not produce any signal in any connected monitoring tool, the AI cannot identify it. For example, if a DNS change causes failures but you do not monitor DNS, the AI will see the symptoms but not the cause. The fix is to connect more signal sources, but you cannot monitor what you do not know to monitor.
Complex multi-cause incidents. Sometimes an incident has two or more independent root causes that happen to overlap. The AI may attribute all symptoms to one cause and miss the second. Showing raw signals alongside the AI analysis helps engineers spot when the AI has missed something.
Confidence calibration. The AI's confidence levels are probabilistic, not certain. "HIGH confidence" means the causal chain is well-supported by multiple correlated signals and dependency relationships. It does not mean the analysis is guaranteed correct. Always verify. We surface the raw logs and metrics alongside the RCA summary for exactly this reason -- if you want to explore deeper, check out our guide on centralizing logs to make this data accessible.
From Incident to Prevention
The ultimate goal of RCA is not explaining what happened -- it is preventing it from happening again. Automatic RCA accelerates the path to prevention because the structured output (root cause, causal chain, affected services) maps directly to action items.
From the Redis example above, the automatic action items would be:

- Configure a Redis replica and automatic failover (the incident had no failover path)
- Add a connection timeout and graceful fallback to api-main's session middleware so a Redis outage degrades service instead of returning 503s
- Add a lock-acquisition timeout to worker-billing so payment processing fails fast with a clear error
- Alert on Redis memory and connection count to catch the next failure before it cascades
- Pause deploys while an incident is active, so recovery is not complicated by mid-incident releases
These action items come directly from the automated analysis, not from a 2-hour brainstorming session two weeks after the incident. They are specific, actionable, and directly tied to the failure mode that was observed.
Conclusion
Manual root cause analysis is a necessary process trapped in an outdated methodology. It takes hours, produces inconsistent results, and often concludes with "we think this is what happened." Automatic RCA does not eliminate the need for human judgment, but it eliminates the tedious correlation work that consumes most of the time. Ninety seconds of automated signal collection, dependency-aware correlation, and AI-powered analysis replaces 30-60 minutes of a human opening tabs, reading logs, and building a mental timeline.
The investment is connecting your monitoring sources and defining your dependency graph. The payoff is getting a structured root cause analysis in under two minutes instead of scheduling a postmortem meeting three days later where everyone tries to remember what happened. Your future 3am self will thank you.