Cron Job Failure Alerts
Set up cron job failure alerts using dead man's switches and monitoring. Includes bash, curl, and CronSafe integration examples with alert routing.
Monday morning, 9:14 AM. You open Slack to a message from the finance team: "The invoice export hasn't updated since Thursday." You SSH into the production server, check the cron log, and find a Postgres connection timeout from Friday at 1:02 AM. The job tried to run, failed, wrote one line to stderr, and cron moved on. No alert. No notification. The weekend passed. Three days of invoices are missing from the partner portal, and your largest client has already opened a support ticket.

This scenario happens because cron treats alert routing as someone else's problem. It runs your command, captures zero context about the outcome, and offers a single notification mechanism: local mail delivery to a user mailbox that nobody has checked since 2003. If you want to know your cron job failed -- and know within minutes, not days -- you need to build the alerting layer yourself or use a service that handles it. This guide covers both approaches, from raw bash scripts to production-grade multi-channel routing with CronSafe.
Why Cron's Built-In Notification Is Useless
Cron has one notification feature: the MAILTO variable. When set, cron emails the job's stdout and stderr to the specified address after every execution. Here is why that fails in practice: most modern servers have no mail transfer agent configured, so the message is silently dropped; when delivery does work, it lands in a mailbox nobody monitors; cron mails output from every run, success and failure alike, so real failures drown in routine noise; and there is no retry, acknowledgment, or escalation if the first email is missed.

A 2025 survey by Cronitor found that 61% of teams using cron discovered at least one silent failure per quarter. The median time-to-detection for a failed cron job without monitoring was 3.2 days. That is 3.2 days of compounding damage before anyone notices.
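For reference, this is the entirety of cron's native alerting -- a MAILTO line in the crontab. The address and script path here are placeholders:

```
MAILTO="ops@example.com"
# cron emails stdout/stderr to ops@example.com after every run --
# success or failure alike, and only if a local MTA is configured
0 2 * * * /opt/scripts/backup.sh
```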
Approach 1: Exit Code Alerting in Bash
The simplest approach that actually works is wrapping your cron command in a bash script that checks exit codes and explicitly sends failure notifications.
```bash
#!/bin/bash
# /opt/scripts/backup-monitored.sh
# Wraps the database backup with explicit failure alerting.

set -euo pipefail

JOB_NAME="prod-db-backup"
SLACK_WEBHOOK="https://hooks.slack.com/services/T00000/B00000/XXXX"
LOG="/var/log/cron/${JOB_NAME}.log"
TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
HOSTNAME=$(hostname -f)
START_TIME=$(date +%s)

# Log with a fresh timestamp per line, not the one captured at startup
log() { echo "[$(date -u +"%Y-%m-%dT%H:%M:%SZ")] $1" >> "${LOG}"; }

send_failure_alert() {
  local exit_code=$1
  local error_msg=$2
  local duration=$3

  # Truncate and strip characters that would break the JSON payload
  error_msg=$(printf '%s' "${error_msg}" | head -c 300 | tr '\n"' " '")

  # Slack alert with structured fields
  curl -s -X POST "${SLACK_WEBHOOK}" \
    -H "Content-Type: application/json" \
    -d "{
      \"text\": \":rotating_light: Cron job '${JOB_NAME}' FAILED on ${HOSTNAME}\",
      \"blocks\": [{
        \"type\": \"section\",
        \"text\": {
          \"type\": \"mrkdwn\",
          \"text\": \"*Cron Job Failure*\n*Job:* ${JOB_NAME}\n*Server:* ${HOSTNAME}\n*Exit Code:* ${exit_code}\n*Duration:* ${duration}s\n*Time:* ${TIMESTAMP}\n*Error:* ${error_msg}\"
        }
      }]
    }" > /dev/null 2>&1

  log "ALERT SENT: exit=${exit_code} duration=${duration}s"
}

log "Starting ${JOB_NAME}..."

# Capture stderr while letting the command run
ERROR_OUTPUT=$(pg_dump -Fc production \
  -f "/backups/production_${TIMESTAMP}.dump" 2>&1) || {
  EXIT_CODE=$?
  DURATION=$(( $(date +%s) - START_TIME ))
  send_failure_alert "${EXIT_CODE}" "${ERROR_OUTPUT}" "${DURATION}"
  log "FAILED: exit=${EXIT_CODE}"
  exit ${EXIT_CODE}
}

# Upload to S3
aws s3 cp "/backups/production_${TIMESTAMP}.dump" \
  s3://company-backups/postgres/ 2>&1 || {
  EXIT_CODE=$?
  DURATION=$(( $(date +%s) - START_TIME ))
  send_failure_alert "${EXIT_CODE}" "S3 upload failed" "${DURATION}"
  exit ${EXIT_CODE}
}

DURATION=$(( $(date +%s) - START_TIME ))
log "SUCCESS in ${DURATION}s"
```

This works. But it has three structural problems:
1. **It only alerts on failure, not on absence.** If the server is down and the job never starts, this script never runs, and no alert fires. You only learn about the problem when the server comes back up (if it does).
2. **You maintain alert logic in every script.** With 15 cron jobs across 4 servers, you now have 15 copies of the alerting boilerplate. Change the Slack webhook? Edit 15 files on 4 servers. (A shared wrapper, sketched after this list, reduces the duplication.)
3. **There is no escalation.** A single Slack message at 1 AM on Saturday is easily missed. You need the ability to escalate: Slack first, then email after 30 minutes, then SMS after 60 minutes if nobody acknowledges.
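The second problem can be softened before reaching for a hosted service. Here is a minimal sketch of that shared wrapper -- the script name cron-wrap.sh, its path, and the SLACK_WEBHOOK environment variable are illustrative, not an established tool:

```bash
#!/bin/bash
# /opt/scripts/cron-wrap.sh -- hypothetical shared wrapper.
# Usage: cron-wrap.sh <job-name> <command...>
set -uo pipefail

JOB_NAME="$1"; shift
# Webhook read from the environment so it is configured in one place
SLACK_WEBHOOK="${SLACK_WEBHOOK:?set SLACK_WEBHOOK in the crontab or environment}"

"$@"                 # run the real job
EXIT_CODE=$?

if [ "${EXIT_CODE}" -ne 0 ]; then
  # One copy of the alert logic for every job on the box
  curl -s -X POST "${SLACK_WEBHOOK}" \
    -H "Content-Type: application/json" \
    -d "{\"text\":\"Cron job '${JOB_NAME}' failed (exit ${EXIT_CODE}) on $(hostname -f)\"}" \
    > /dev/null 2>&1
fi
exit "${EXIT_CODE}"
```

Crontab entries then become one-liners like `*/10 * * * * /opt/scripts/cron-wrap.sh sync-reports /opt/scripts/sync-reports.sh`, and the webhook changes in exactly one place per server. It still does nothing about problems 1 and 3.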
Approach 2: Dead Man's Switch with Explicit Fail Pings
The dead man's switch pattern solves the "absence" problem. Your job pings a monitoring endpoint on success. If the ping does not arrive within the expected window, an alert fires. But you can go further by also sending an explicit /fail ping that triggers an immediate alert without waiting for the grace period.

```bash
#!/bin/bash
# /opt/scripts/backup-cronsafe.sh
# Database backup with CronSafe dead man's switch + explicit fail.

set -euo pipefail

PING="https://ping.cronsafe.luxkern.com/m/abc123"
DUMP="/backups/prod_$(date +%Y%m%d).dump"

# Signal start
curl -fsS "${PING}/start" --max-time 10 > /dev/null 2>&1 || true
START_TIME=$(date +%s)

# Run backup
if pg_dump -Fc production -f "${DUMP}"; then
  # Upload to S3 (the specific dump, not a glob that would re-send old ones)
  if aws s3 cp "${DUMP}" s3://company-backups/; then
    # Clean up local backups older than 7 days
    find /backups -name "*.dump" -mtime +7 -delete
    DURATION=$(( $(date +%s) - START_TIME ))
    # Signal success with metadata
    curl -fsS "${PING}?duration=${DURATION}&msg=backup_ok" \
      --max-time 10 > /dev/null 2>&1 || true
  else
    DURATION=$(( $(date +%s) - START_TIME ))
    # Explicit fail -- immediate alert, no grace period wait
    curl -fsS "${PING}/fail?duration=${DURATION}&msg=s3_upload_failed" \
      --max-time 10 > /dev/null 2>&1 || true
    exit 1
  fi
else
  DURATION=$(( $(date +%s) - START_TIME ))
  curl -fsS "${PING}/fail?duration=${DURATION}&msg=pg_dump_failed" \
    --max-time 10 > /dev/null 2>&1 || true
  exit 1
fi
```

The /fail endpoint is the key difference from a basic dead man's switch. Without it, CronSafe waits for the grace period to expire before alerting -- which might be 10 or 30 minutes depending on your configuration. With /fail, the alert fires within seconds. For jobs where every minute of delay matters (billing, security scans, compliance exports), this is the difference between a 2-minute response and a 30-minute response.
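If you would rather not write an explicit else branch for every failure path, bash's ERR trap can send the /fail ping automatically. A minimal sketch of the same backup, reusing the monitor URL from above (note that the trap cannot attach per-step messages the way explicit branches can):

```bash
#!/bin/bash
set -euo pipefail

PING="https://ping.cronsafe.luxkern.com/m/abc123"
DUMP="/backups/prod_$(date +%Y%m%d).dump"

# Under `set -e`, any failing command fires the ERR trap, which sends
# the explicit /fail ping before the script exits
trap 'curl -fsS "${PING}/fail?msg=trap_err" --max-time 10 > /dev/null 2>&1 || true' ERR

curl -fsS "${PING}/start" --max-time 10 > /dev/null 2>&1 || true
pg_dump -Fc production -f "${DUMP}"
aws s3 cp "${DUMP}" s3://company-backups/
curl -fsS "${PING}" --max-time 10 > /dev/null 2>&1 || true
```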
The /start ping at the top gives CronSafe another dimension of visibility. If it receives a /start but never gets a success or fail ping, it knows the job is hanging -- a failure mode that is invisible to exit-code checks.

Configuring Multi-Channel Alert Routing
Alert routing is the difference between a notification that gets seen and one that drowns in a noisy channel. The right approach uses escalation tiers: start quiet, get louder if nobody responds.
Here is a complete CronSafe alert configuration using the API:
```javascript
// configure-alerts.js
// Sets up tiered escalation for a CronSafe monitor.
// Requires Node 18+ for the built-in fetch API.

const API = "https://api.cronsafe.luxkern.com/v1";
const API_KEY = process.env.CRONSAFE_API_KEY;

async function configureAlerts(monitorId) {
  const response = await fetch(`${API}/monitors/${monitorId}`, {
    method: "PATCH",
    headers: {
      "Authorization": `Bearer ${API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      alertChannels: [
        {
          type: "slack",
          webhookUrl: "https://hooks.slack.com/services/T.../B.../xxx",
          config: {
            channel: "#cron-alerts",
            mentionOnFailure: "@oncall-infra",
          },
        },
        {
          type: "email",
          address: "oncall@company.com",
        },
      ],
      escalationPolicy: {
        tiers: [
          {
            // Tier 1: first missed ping or explicit /fail.
            // Fires within seconds of a /fail ping,
            // or after the grace period for a missed ping.
            afterMissedPings: 1,
            channels: ["slack"],
          },
          {
            // Tier 2: still no recovery after 2 consecutive misses
            afterMissedPings: 2,
            channels: ["slack", "email"],
          },
          {
            // Tier 3: 3 consecutive misses -- wake someone up
            afterMissedPings: 3,
            channels: ["slack", "email", "sms"],
            smsNumber: "+33612345678",
          },
        ],
        autoResolve: true, // clear alert when next success ping arrives
        cooldownMinutes: 30, // no re-alert for 30 min after resolution
      },
    }),
  });

  if (!response.ok) {
    throw new Error(`Config failed: ${response.status}`);
  }
  console.log(`Escalation configured for monitor ${monitorId}`);
}

// Apply to your monitors
configureAlerts("mon_abc123").catch(console.error);
configureAlerts("mon_def456").catch(console.error);
```
This escalation policy means:

- **Tier 1** -- the first missed ping or an explicit /fail posts to #cron-alerts and mentions @oncall-infra. If someone is watching Slack, they see it.
- **Tier 2** -- after two consecutive misses, the alert repeats in Slack and also emails oncall@company.com, creating a record outside the chat scroll.
- **Tier 3** -- after three consecutive misses, an SMS goes out on top of Slack and email. At this point the failure is persistent and someone needs to act.

For a team with 50 cron jobs, this policy ensures that a one-off transient failure (network blip, temporary disk issue) does not page anyone at 3 AM, while a persistent failure gets progressively louder until it is handled.
Alert Channel Comparison
Different channels suit different urgency levels and team workflows:
| Channel | Latency | Best For | Limitations |
|---|---|---|---|
| Slack | < 5 seconds | Team awareness, first-tier alerts | Easy to miss in noisy channels |
| Email | 10-60 seconds | On-call individual, audit trail | Inbox fatigue, potential spam filtering |
| SMS | 5-15 seconds | High-urgency escalation | Cost per message, character limits |
| PagerDuty | < 10 seconds | Incident management, rotation | Requires PagerDuty subscription |
| Webhook | < 5 seconds | Custom integrations, automation | You build the receiver |
| Discord | < 5 seconds | Developer teams, open-source | Not standard in enterprise |
CronSafe includes Slack, email, Discord, and webhook alerts on the free tier (20 monitors). SMS and advanced escalation require the Pro plan at EUR 9/month. Compare that to services where Slack alerts alone require a $20+/month paid plan -- the CronSafe vs Cronitor comparison breaks down pricing in detail.
Patterns for Different Job Types
Not every job needs the same alerting strategy. Here are four patterns matched to common scenarios.
Short-running jobs (under 60 seconds). Cache purges, heartbeat pings, lightweight API calls. Use a simple success ping with a tight grace period (2-3 minutes). No start ping needed because the job finishes before a start signal would add value.
```
# Cache purge -- runs every 5 min, completes in seconds
# (one line: crontab does not support backslash line continuation)
*/5 * * * * /opt/scripts/purge-cache.sh && curl -fsS --max-time 10 "https://ping.cronsafe.luxkern.com/m/cache01" > /dev/null 2>&1
```

Long-running jobs (5+ minutes). ETL pipelines, large backups, data migrations. Use a start ping, a success ping, and an explicit fail ping, as in the backup-cronsafe.sh example above. Set the grace period to 2x the worst-case duration.
Jobs that must not overlap. Billing reconciliation, sequential data processing. Add a lock file check at the top of your script. If the previous run is still active, skip this execution and optionally ping a /skip endpoint so CronSafe knows the skip was intentional, not a failure -- see the sketch below.
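A minimal sketch using flock(1) from util-linux. The lock path and the billing01 monitor slug are illustrative; the /skip ping follows the same URL pattern as /start and /fail above:

```bash
#!/bin/bash
set -euo pipefail

PING="https://ping.cronsafe.luxkern.com/m/billing01"
LOCKFILE="/var/lock/billing-reconcile.lock"

# Hold the lock on file descriptor 200 for the lifetime of the script
exec 200> "${LOCKFILE}"
if ! flock -n 200; then
  # Previous run still holds the lock: skip intentionally
  curl -fsS "${PING}/skip?msg=previous_run_active" --max-time 10 > /dev/null 2>&1 || true
  exit 0
fi

curl -fsS "${PING}/start" --max-time 10 > /dev/null 2>&1 || true
/opt/scripts/reconcile-billing.sh
curl -fsS "${PING}" --max-time 10 > /dev/null 2>&1 || true
```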
Jobs with partial success. Email digest sends where 950 out of 1,000 succeed. Send a success ping with metadata indicating the partial failure. Configure CronSafe to flag runs where the failed count exceeds a threshold.

```bash
#!/bin/bash
set -euo pipefail

PING="https://ping.cronsafe.luxkern.com/m/digest01"

curl -fsS "${PING}/start" --max-time 10 > /dev/null 2>&1 || true

# If the send script crashes outright, `set -e` exits here and the
# dead man's switch catches the missing ping
RESULT=$(python3 /opt/scripts/send_digests.py 2>&1)

# Expects output containing "sent=<n> failed=<n>"; default to 0 if absent
SENT=$(echo "${RESULT}" | grep -oP 'sent=\K\d+' || echo 0)
FAILED=$(echo "${RESULT}" | grep -oP 'failed=\K\d+' || echo 0)

if [ "${FAILED}" -gt 0 ]; then
  # Partial success -- ping with warning metadata
  curl -fsS "${PING}" -X POST \
    -H "Content-Type: application/json" \
    -d "{\"status\":\"warning\",\"sent\":${SENT},\"failed\":${FAILED}}" \
    --max-time 10 > /dev/null 2>&1 || true
else
  curl -fsS "${PING}" --max-time 10 > /dev/null 2>&1 || true
fi
```

Avoiding Alert Fatigue
Alert fatigue is the fastest way to make your monitoring useless. If your team gets 40 cron alerts per day and 38 of them are false positives, nobody reads any of them. Here are four rules to keep your signal clean.
Set grace periods based on observed data, not guesses. If your backup takes between 3 and 12 minutes, set the grace period to 25 minutes. Do not set it to 5 minutes and then wonder why you get false alarms on the nights when the database has more data.
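If you log durations the way the backup-monitored.sh wrapper above does ("SUCCESS in <n>s" lines), the observed worst case is one pipeline away (GNU grep assumed for -oP):

```bash
# Longest recorded successful run, in seconds
grep -oP 'SUCCESS in \K[0-9]+(?=s)' /var/log/cron/prod-db-backup.log | sort -n | tail -1
```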
Use tiered escalation, not broadcast. Not every failure needs to page the on-call. A cache purge that misses one cycle is a Slack message. A database backup that misses three cycles is an SMS. Match the channel to the severity.
Mute during maintenance windows. If you are doing a planned server migration on Saturday, mute your monitors for that window. CronSafe supports per-monitor maintenance windows so you do not have to silence everything.
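As a sketch of what muting could look like via the API -- this assumes a maintenanceWindows field on the same PATCH /monitors endpoint used in the escalation example; the field name and shape are illustrative, not confirmed API:

```bash
# Hypothetical: mute mon_abc123 during a Saturday migration window (UTC)
curl -X PATCH "https://api.cronsafe.luxkern.com/v1/monitors/mon_abc123" \
  -H "Authorization: Bearer ${CRONSAFE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"maintenanceWindows": [{"start": "2025-06-14T02:00:00Z", "end": "2025-06-14T08:00:00Z"}]}'
```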
Review your monitor list monthly. Jobs get decommissioned but their monitors stay active, firing false alerts for jobs that no longer exist. A monthly audit of your CronSafe dashboard takes 10 minutes and prevents phantom alerts from eroding trust in the system.
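To make that audit scriptable -- assuming a conventional list endpoint at GET /v1/monitors, which this guide does not document, so treat it as a sketch:

```bash
# Hypothetical: dump all monitors to spot ones whose jobs no longer exist
curl -s "https://api.cronsafe.luxkern.com/v1/monitors" \
  -H "Authorization: Bearer ${CRONSAFE_API_KEY}" | python3 -m json.tool
```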
The Cost of Slow Alerting
To quantify the impact: if a nightly backup fails and you detect it 3 days later (the industry median for unmonitored jobs), you lose 3 days of data recoverability. If that backup runs at 2 AM and your monitoring alerts within 5 minutes, you have nearly a full day -- some 22 hours before the next scheduled run -- to fix it, usually enough to re-run the backup manually and suffer zero data loss.
The difference between "alerted in 5 minutes" and "discovered in 3 days" is the difference between a non-event and a customer-facing incident. At CronSafe's pricing (free for 20 monitors, EUR 9/month for unlimited), the cost of monitoring is negligible compared to the cost of a single undetected failure.
For a complete walkthrough of the dead man's switch pattern including architecture diagrams, code examples in bash, Python, and GitHub Actions, read the companion guide on how to monitor cron jobs.
Set Up Your First Alert in 2 Minutes
Every cron job without alerting is a ticking clock. The question is not whether it will fail silently -- it is when, and how long the silence lasts.
Create a free account at CronSafe, add a monitor matching your job's schedule, paste one curl line into your script, and configure your alert channels. Twenty monitors. Zero cost. No credit card.
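That single line, in crontab form (the monitor ID abc123 is a placeholder carried over from the earlier examples):

```
0 2 * * * /opt/scripts/backup.sh && curl -fsS --max-time 10 "https://ping.cronsafe.luxkern.com/m/abc123" > /dev/null 2>&1
```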