Cron Job Failure Alerts
Set up cron job failure alerts using dead man's switches and monitoring. Includes bash, curl, and CronSafe integration examples with alert routing.
Monday morning, 9:14 AM. You open Slack to a message from the finance team: "The invoice export hasn't updated since Thursday." You SSH into the production server, check the cron log, and find a Postgres connection timeout from Friday at 1:02 AM. The job tried to run, failed, wrote one line to stderr, and cron moved on. No alert. No notification. The weekend passed. Three days of invoices are missing from the partner portal, and your largest client has already opened a support ticket.

This scenario happens because cron treats alert routing as someone else's problem. It runs your command, captures zero context about the outcome, and offers a single notification mechanism: local mail delivery to a user mailbox that nobody has checked since 2003. If you want to know your cron job failed -- and know within minutes, not days -- you need to build the alerting layer yourself or use a service that handles it. This guide covers both approaches, from raw bash scripts to production-grade multi-channel routing with CronSafe.
Why Cron's Built-In Notification Is Useless
Cron has one notification feature: the MAILTO variable. When set, cron emails the job's stdout and stderr to the specified address after every execution. Here is why that fails in practice: most modern servers have no mail transfer agent configured, so the message is silently dropped; when delivery does work, it lands in a mailbox nobody monitors; cron mails output from every run, success and failure alike, so real failures drown in routine noise; and there is no retry, acknowledgment, or escalation if the first email is missed.

A 2025 survey by Cronitor found that 61% of teams using cron discovered at least one silent failure per quarter. The median time-to-detection for a failed cron job without monitoring was 3.2 days. That is 3.2 days of compounding damage before anyone notices.
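For reference, this is the entirety of cron's native alerting -- a MAILTO line in the crontab. The address and script path here are placeholders:

```
MAILTO="ops@example.com"
# cron emails stdout/stderr to ops@example.com after every run --
# success or failure alike, and only if a local MTA is configured
0 2 * * * /opt/scripts/backup.sh
```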
Approach 1: Exit Code Alerting in Bash
The simplest approach that actually works is wrapping your cron command in a bash script that checks exit codes and explicitly sends failure notifications.
```bash
#!/bin/bash
# /opt/scripts/backup-monitored.sh
# Wraps the database backup with explicit failure alerting.

set -euo pipefail

JOB_NAME="prod-db-backup"
SLACK_WEBHOOK="https://hooks.slack.com/services/T00000/B00000/XXXX"
LOG="/var/log/cron/${JOB_NAME}.log"
TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
HOSTNAME=$(hostname -f)
START_TIME=$(date +%s)

# Log with a fresh timestamp per line, not the one captured at startup
log() { echo "[$(date -u +"%Y-%m-%dT%H:%M:%SZ")] $1" >> "${LOG}"; }

send_failure_alert() {
  local exit_code=$1
  local error_msg=$2
  local duration=$3

  # Truncate and strip characters that would break the JSON payload
  error_msg=$(printf '%s' "${error_msg}" | head -c 300 | tr '\n"' " '")

  # Slack alert with structured fields
  curl -s -X POST "${SLACK_WEBHOOK}" \
    -H "Content-Type: application/json" \
    -d "{
      \"text\": \":rotating_light: Cron job '${JOB_NAME}' FAILED on ${HOSTNAME}\",
      \"blocks\": [{
        \"type\": \"section\",
        \"text\": {
          \"type\": \"mrkdwn\",
          \"text\": \"*Cron Job Failure*\n*Job:* ${JOB_NAME}\n*Server:* ${HOSTNAME}\n*Exit Code:* ${exit_code}\n*Duration:* ${duration}s\n*Time:* ${TIMESTAMP}\n*Error:* ${error_msg}\"
        }
      }]
    }" > /dev/null 2>&1

  log "ALERT SENT: exit=${exit_code} duration=${duration}s"
}

log "Starting ${JOB_NAME}..."

# Capture stderr while letting the command run
ERROR_OUTPUT=$(pg_dump -Fc production \
  -f "/backups/production_${TIMESTAMP}.dump" 2>&1) || {
  EXIT_CODE=$?
  DURATION=$(( $(date +%s) - START_TIME ))
  send_failure_alert "${EXIT_CODE}" "${ERROR_OUTPUT}" "${DURATION}"
  log "FAILED: exit=${EXIT_CODE}"
  exit ${EXIT_CODE}
}

# Upload to S3
aws s3 cp "/backups/production_${TIMESTAMP}.dump" \
  s3://company-backups/postgres/ 2>&1 || {
  EXIT_CODE=$?
  DURATION=$(( $(date +%s) - START_TIME ))
  send_failure_alert "${EXIT_CODE}" "S3 upload failed" "${DURATION}"
  exit ${EXIT_CODE}
}

DURATION=$(( $(date +%s) - START_TIME ))
log "SUCCESS in ${DURATION}s"
```

This works. But it has three structural problems:
1. **It only alerts on failure, not on absence.** If the server is down and the job never starts, this script never runs, and no alert fires. You only learn about the problem when the server comes back up (if it does).
2. **You maintain alert logic in every script.** With 15 cron jobs across 4 servers, you now have 15 copies of the alerting boilerplate. Change the Slack webhook? Edit 15 files on 4 servers. (A shared wrapper, sketched after this list, reduces the duplication.)
3. **There is no escalation.** A single Slack message at 1 AM on Saturday is easily missed. You need the ability to escalate: Slack first, then email after 30 minutes, then SMS after 60 minutes if nobody acknowledges.
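The second problem can be softened before reaching for a hosted service. Here is a minimal sketch of that shared wrapper -- the script name cron-wrap.sh, its path, and the SLACK_WEBHOOK environment variable are illustrative, not an established tool:

```bash
#!/bin/bash
# /opt/scripts/cron-wrap.sh -- hypothetical shared wrapper.
# Usage: cron-wrap.sh <job-name> <command...>
set -uo pipefail

JOB_NAME="$1"; shift
# Webhook read from the environment so it is configured in one place
SLACK_WEBHOOK="${SLACK_WEBHOOK:?set SLACK_WEBHOOK in the crontab or environment}"

"$@"                 # run the real job
EXIT_CODE=$?

if [ "${EXIT_CODE}" -ne 0 ]; then
  # One copy of the alert logic for every job on the box
  curl -s -X POST "${SLACK_WEBHOOK}" \
    -H "Content-Type: application/json" \
    -d "{\"text\":\"Cron job '${JOB_NAME}' failed (exit ${EXIT_CODE}) on $(hostname -f)\"}" \
    > /dev/null 2>&1
fi
exit "${EXIT_CODE}"
```

Crontab entries then become one-liners like `*/10 * * * * /opt/scripts/cron-wrap.sh sync-reports /opt/scripts/sync-reports.sh`, and the webhook changes in exactly one place per server. It still does nothing about problems 1 and 3.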
Approach 2: Dead Man's Switch with Explicit Fail Pings
The dead man's switch pattern solves the "absence" problem. Your job pings a monitoring endpoint on success. If the ping does not arrive within the expected window, an alert fires. But you can go further by also sending an explicit /fail ping that triggers an immediate alert without waiting for the grace period.

```bash
#!/bin/bash
# /opt/scripts/backup-cronsafe.sh
# Database backup with CronSafe dead man's switch + explicit fail.

set -euo pipefail

PING="https://ping.cronsafe.luxkern.com/m/abc123"
DUMP="/backups/prod_$(date +%Y%m%d).dump"

# Signal start
curl -fsS "${PING}/start" --max-time 10 > /dev/null 2>&1 || true
START_TIME=$(date +%s)

# Run backup
if pg_dump -Fc production -f "${DUMP}"; then
  # Upload to S3 (the specific dump, not a glob that would re-send old ones)
  if aws s3 cp "${DUMP}" s3://company-backups/; then
    # Clean up local backups older than 7 days
    find /backups -name "*.dump" -mtime +7 -delete
    DURATION=$(( $(date +%s) - START_TIME ))
    # Signal success with metadata
    curl -fsS "${PING}?duration=${DURATION}&msg=backup_ok" \
      --max-time 10 > /dev/null 2>&1 || true
  else
    DURATION=$(( $(date +%s) - START_TIME ))
    # Explicit fail -- immediate alert, no grace period wait
    curl -fsS "${PING}/fail?duration=${DURATION}&msg=s3_upload_failed" \
      --max-time 10 > /dev/null 2>&1 || true
    exit 1
  fi
else
  DURATION=$(( $(date +%s) - START_TIME ))
  curl -fsS "${PING}/fail?duration=${DURATION}&msg=pg_dump_failed" \
    --max-time 10 > /dev/null 2>&1 || true
  exit 1
fi
```

The /fail endpoint is the key difference from a basic dead man's switch. Without it, CronSafe waits for the grace period to expire before alerting -- which might be 10 or 30 minutes depending on your configuration. With /fail, the alert fires within seconds. For jobs where every minute of delay matters (billing, security scans, compliance exports), this is the difference between a 2-minute response and a 30-minute response.
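If you would rather not write an explicit else branch for every failure path, bash's ERR trap can send the /fail ping automatically. A minimal sketch of the same backup, reusing the monitor URL from above (note that the trap cannot attach per-step messages the way explicit branches can):

```bash
#!/bin/bash
set -euo pipefail

PING="https://ping.cronsafe.luxkern.com/m/abc123"
DUMP="/backups/prod_$(date +%Y%m%d).dump"

# Under `set -e`, any failing command fires the ERR trap, which sends
# the explicit /fail ping before the script exits
trap 'curl -fsS "${PING}/fail?msg=trap_err" --max-time 10 > /dev/null 2>&1 || true' ERR

curl -fsS "${PING}/start" --max-time 10 > /dev/null 2>&1 || true
pg_dump -Fc production -f "${DUMP}"
aws s3 cp "${DUMP}" s3://company-backups/
curl -fsS "${PING}" --max-time 10 > /dev/null 2>&1 || true
```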
The /start ping at the top gives CronSafe another dimension of visibility. If it receives a /start but never gets a success or fail ping, it knows the job is hanging -- a failure mode that is invisible to exit-code checks.

Configuring Multi-Channel Alert Routing
Alert routing is the difference between a notification that gets seen and one that drowns in a noisy channel. The right approach uses escalation tiers: start quiet, get louder if nobody responds.
Here is a complete CronSafe alert configuration using the API:
```javascript
// configure-alerts.js
// Sets up tiered escalation for a CronSafe monitor.
// Requires Node 18+ for the built-in fetch API.

const API = "https://api.cronsafe.luxkern.com/v1";
const API_KEY = process.env.CRONSAFE_API_KEY;

async function configureAlerts(monitorId) {
  const response = await fetch(`${API}/monitors/${monitorId}`, {
    method: "PATCH",
    headers: {
      "Authorization": `Bearer ${API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      alertChannels: [
        {
          type: "slack",
          webhookUrl: "https://hooks.slack.com/services/T.../B.../xxx",
          config: {
            channel: "#cron-alerts",
            mentionOnFailure: "@oncall-infra",
          },
        },
        {
          type: "email",
          address: "oncall@company.com",
        },
      ],
      escalationPolicy: {
        tiers: [
          {
            // Tier 1: first missed ping or explicit /fail.
            // Fires within seconds of a /fail ping,
            // or after the grace period for a missed ping.
            afterMissedPings: 1,
            channels: ["slack"],
          },
          {
            // Tier 2: still no recovery after 2 consecutive misses
            afterMissedPings: 2,
            channels: ["slack", "email"],
          },
          {
            // Tier 3: 3 consecutive misses -- wake someone up
            afterMissedPings: 3,
            channels: ["slack", "email", "sms"],
            smsNumber: "+33612345678",
          },
        ],
        autoResolve: true, // clear alert when next success ping arrives
        cooldownMinutes: 30, // no re-alert for 30 min after resolution
      },
    }),
  });

  if (!response.ok) {
    throw new Error(`Config failed: ${response.status}`);
  }
  console.log(`Escalation configured for monitor ${monitorId}`);
}

// Apply to your monitors
configureAlerts("mon_abc123").catch(console.error);
configureAlerts("mon_def456").catch(console.error);
```
This escalation policy means:

- **Tier 1** -- the first missed ping or an explicit /fail posts to #cron-alerts and mentions @oncall-infra. If someone is watching Slack, they see it.
- **Tier 2** -- after two consecutive misses, the alert repeats in Slack and also emails oncall@company.com, creating a record outside the chat scroll.
- **Tier 3** -- after three consecutive misses, an SMS goes out on top of Slack and email. At this point the failure is persistent and someone needs to act.

For a team with 50 cron jobs, this policy ensures that a one-off transient failure (network blip, temporary disk issue) does not page anyone at 3 AM, while a persistent failure gets progressively louder until it is handled.
Alert Channel Comparison
Different channels suit different urgency levels and team workflows:
| Channel | Latency | Best For | Limitations |
|---|---|---|---|
| Slack | < 5 seconds | Team awareness, first-tier alerts | Easy to miss in noisy channels |
| Email | 10-60 seconds | On-call individual, audit trail | Inbox fatigue, potential spam filtering |
| SMS | 5-15 seconds | High-urgency escalation | Cost per message, character limits |
| PagerDuty | < 10 seconds | Incident management, rotation | Requires PagerDuty subscription |
| Webhook | < 5 seconds | Custom integrations, automation | You build the receiver |
| Discord | < 5 seconds | Developer teams, open-source | Not standard in enterprise |
CronSafe includes Slack, email, Discord, and webhook alerts on the free tier (20 monitors). SMS and advanced escalation require the Pro plan at EUR 9/month. Compare that to services where Slack alerts alone require a $20+/month paid plan -- the CronSafe vs Cronitor comparison breaks down pricing in detail.
Patterns for Different Job Types
Not every job needs the same alerting strategy. Here are four patterns matched to common scenarios.
Short-running jobs (under 60 seconds). Cache purges, heartbeat pings, lightweight API calls. Use a simple success ping with a tight grace period (2-3 minutes). No start ping needed because the job finishes before a start signal would add value.
```
# Cache purge -- runs every 5 min, completes in seconds
# (one line: crontab does not support backslash line continuation)
*/5 * * * * /opt/scripts/purge-cache.sh && curl -fsS --max-time 10 "https://ping.cronsafe.luxkern.com/m/cache01" > /dev/null 2>&1
```

Long-running jobs (5+ minutes). ETL pipelines, large backups, data migrations. Use a start ping, a success ping, and an explicit fail ping, as in the backup-cronsafe.sh example above. Set the grace period to 2x the worst-case duration.
Jobs that must not overlap. Billing reconciliation, sequential data processing. Add a lock file check at the top of your script. If the previous run is still active, skip this execution and optionally ping a /skip endpoint so CronSafe knows the skip was intentional, not a failure -- see the sketch below.
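A minimal sketch using flock(1) from util-linux. The lock path and the billing01 monitor slug are illustrative; the /skip ping follows the same URL pattern as /start and /fail above:

```bash
#!/bin/bash
set -euo pipefail

PING="https://ping.cronsafe.luxkern.com/m/billing01"
LOCKFILE="/var/lock/billing-reconcile.lock"

# Hold the lock on file descriptor 200 for the lifetime of the script
exec 200> "${LOCKFILE}"
if ! flock -n 200; then
  # Previous run still holds the lock: skip intentionally
  curl -fsS "${PING}/skip?msg=previous_run_active" --max-time 10 > /dev/null 2>&1 || true
  exit 0
fi

curl -fsS "${PING}/start" --max-time 10 > /dev/null 2>&1 || true
/opt/scripts/reconcile-billing.sh
curl -fsS "${PING}" --max-time 10 > /dev/null 2>&1 || true
```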
Jobs with partial success. Email digest sends where 950 out of 1,000 succeed. Send a success ping with metadata indicating the partial failure. Configure CronSafe to flag runs where the failed count exceeds a threshold.

```bash
#!/bin/bash
set -euo pipefail

PING="https://ping.cronsafe.luxkern.com/m/digest01"

curl -fsS "${PING}/start" --max-time 10 > /dev/null 2>&1 || true

# If the send script crashes outright, `set -e` exits here and the
# dead man's switch catches the missing ping
RESULT=$(python3 /opt/scripts/send_digests.py 2>&1)

# Expects output containing "sent=<n> failed=<n>"; default to 0 if absent
SENT=$(echo "${RESULT}" | grep -oP 'sent=\K\d+' || echo 0)
FAILED=$(echo "${RESULT}" | grep -oP 'failed=\K\d+' || echo 0)

if [ "${FAILED}" -gt 0 ]; then
  # Partial success -- ping with warning metadata
  curl -fsS "${PING}" -X POST \
    -H "Content-Type: application/json" \
    -d "{\"status\":\"warning\",\"sent\":${SENT},\"failed\":${FAILED}}" \
    --max-time 10 > /dev/null 2>&1 || true
else
  curl -fsS "${PING}" --max-time 10 > /dev/null 2>&1 || true
fi
```

Avoiding Alert Fatigue
Alert fatigue is the fastest way to make your monitoring useless. If your team gets 40 cron alerts per day and 38 of them are false positives, nobody reads any of them. Here are four rules to keep your signal clean.
Set grace periods based on observed data, not guesses. If your backup takes between 3 and 12 minutes, set the grace period to 25 minutes. Do not set it to 5 minutes and then wonder why you get false alarms on the nights when the database has more data.
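If you log durations the way the backup-monitored.sh wrapper above does ("SUCCESS in <n>s" lines), the observed worst case is one pipeline away (GNU grep assumed for -oP):

```bash
# Longest recorded successful run, in seconds
grep -oP 'SUCCESS in \K[0-9]+(?=s)' /var/log/cron/prod-db-backup.log | sort -n | tail -1
```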
Use tiered escalation, not broadcast. Not every failure needs to page the on-call. A cache purge that misses one cycle is a Slack message. A database backup that misses three cycles is an SMS. Match the channel to the severity.
Mute during maintenance windows. If you are doing a planned server migration on Saturday, mute your monitors for that window. CronSafe supports per-monitor maintenance windows so you do not have to silence everything.
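As a sketch of what muting could look like via the API -- this assumes a maintenanceWindows field on the same PATCH /monitors endpoint used in the escalation example; the field name and shape are illustrative, not confirmed API:

```bash
# Hypothetical: mute mon_abc123 during a Saturday migration window (UTC)
curl -X PATCH "https://api.cronsafe.luxkern.com/v1/monitors/mon_abc123" \
  -H "Authorization: Bearer ${CRONSAFE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"maintenanceWindows": [{"start": "2025-06-14T02:00:00Z", "end": "2025-06-14T08:00:00Z"}]}'
```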
Review your monitor list monthly. Jobs get decommissioned but their monitors stay active, firing false alerts for jobs that no longer exist. A monthly audit of your CronSafe dashboard takes 10 minutes and prevents phantom alerts from eroding trust in the system.
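To make that audit scriptable -- assuming a conventional list endpoint at GET /v1/monitors, which this guide does not document, so treat it as a sketch:

```bash
# Hypothetical: dump all monitors to spot ones whose jobs no longer exist
curl -s "https://api.cronsafe.luxkern.com/v1/monitors" \
  -H "Authorization: Bearer ${CRONSAFE_API_KEY}" | python3 -m json.tool
```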
The Cost of Slow Alerting
To quantify the impact: if a nightly backup fails and you detect it 3 days later (the industry median for unmonitored jobs), you lose 3 days of data recoverability. If that backup runs at 2 AM and your monitoring alerts within 5 minutes, you have nearly a full day -- some 22 hours before the next scheduled run -- to fix it, usually enough to re-run the backup manually and suffer zero data loss.
The difference between "alerted in 5 minutes" and "discovered in 3 days" is the difference between a non-event and a customer-facing incident. At CronSafe's pricing (free for 20 monitors, EUR 9/month for unlimited), the cost of monitoring is negligible compared to the cost of a single undetected failure.
For a complete walkthrough of the dead man's switch pattern including architecture diagrams, code examples in bash, Python, and GitHub Actions, read the companion guide on how to monitor cron jobs.
Set Up Your First Alert in 2 Minutes
Every cron job without alerting is a ticking clock. The question is not whether it will fail silently -- it is when, and how long the silence lasts.
Create a free account at CronSafe, add a monitor matching your job's schedule, paste one curl line into your script, and configure your alert channels. Twenty monitors. Zero cost. No credit card.
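That single line, in crontab form (the monitor ID abc123 is a placeholder carried over from the earlier examples):

```
0 2 * * * /opt/scripts/backup.sh && curl -fsS --max-time 10 "https://ping.cronsafe.luxkern.com/m/abc123" > /dev/null 2>&1
```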