
How to Monitor Cron Jobs: The Complete Dead Man's Switch Guide

Learn how to monitor cron jobs using the dead man's switch pattern with bash, Python, and GitHub Actions. Catch silent failures before they cost you.

cron monitoring, dead man's switch, cron jobs, devops

How to Monitor Cron Jobs



Last Thursday at 2:31 AM, your nightly Postgres backup script threw a disk-full error, exited with code 1, and went quiet. Cron dutifully moved on to the next scheduled minute. No Slack message. No email. No page. You discovered the problem five days later when a customer asked you to restore a deleted record and the freshest dump was from Tuesday. Five days of data, gone. This story repeats across thousands of teams every month because cron has zero built-in failure detection. It fires your command and forgets about it. If you want to know that a job succeeded, you need an external watchdog -- and the proven pattern for that is the dead man's switch.

What a Dead Man's Switch Actually Does



The name comes from train engineering. A spring-loaded lever requires constant pressure from the operator. Release the lever -- because you fell asleep or worse -- and the brakes engage automatically. The system reacts to the *absence* of a signal, not the presence of one.

In cron monitoring, the lever is an HTTP request. After your job finishes successfully, it sends a GET or POST to a unique URL hosted by a monitoring service. The service starts a countdown timer. If the next expected ping does not arrive before the timer runs out, an alert fires.

This is fundamentally different from log scraping or exit-code checking. Those approaches only catch failures that produce output. A dead man's switch catches six categories of failure at once:

  • The job crashes with an unhandled exception.
  • The job hangs indefinitely and never exits.
  • The cron daemon itself is not running.
  • The server is powered off or unreachable.
  • Someone deleted the crontab entry by accident.
  • The job runs but produces corrupt output and exits 0.


Category 6 requires you to add a validation step before pinging, but the mechanism is the same: you only ping when the outcome is verified correct.

According to a 2025 Datadog infrastructure report, 14% of scheduled tasks across their customer base experienced at least one silent failure per quarter. A Percona survey found that 32% of database teams discovered backup failures only when they needed to restore. Both numbers point to the same gap: jobs that fail without telling anyone.

How the Architecture Works

The dead man's switch pattern has three actors:

  • Your cron job -- the script doing actual work (backup, ETL, cache warm, invoice send).
  • A ping endpoint -- a unique URL you receive when you create a monitor.
  • The monitoring service -- watches incoming pings, resets a countdown timer on each one, and alerts when the timer expires.


The flow looks like this:

```text
[Cron fires] --> [Script runs] --> [Script succeeds] --> [curl hits ping URL]
                                                              |
                                                        [Timer resets]

If timer expires without a ping: [Monitor] --> [Alert via Slack / Email / SMS / Webhook]
```

Every successful ping resets the timer. A missed ping means something went wrong. That is the entire protocol. No agents to install. No log parsers. No daemon running alongside your job. One HTTP call.
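For intuition, the timer-reset logic on the service side can be sketched in a few lines of Python. This is a toy model, not CronSafe's actual implementation; the `Watchdog` class and its names are purely illustrative:

```python
import time

class Watchdog:
    """Toy dead man's switch: one deadline per monitor, reset on each ping."""

    def __init__(self, grace_seconds: float):
        self.grace = grace_seconds
        self.deadlines: dict[str, float] = {}

    def ping(self, monitor_id: str) -> None:
        # Each ping pushes the deadline 'grace' seconds into the future.
        self.deadlines[monitor_id] = time.monotonic() + self.grace

    def overdue(self) -> list[str]:
        # Monitors whose deadline has passed are the ones to alert on.
        now = time.monotonic()
        return [m for m, d in self.deadlines.items() if now > d]

w = Watchdog(grace_seconds=0.05)
w.ping("db_backup")
print(w.overdue())  # fresh ping -> no monitors overdue
time.sleep(0.1)
print(w.overdue())  # grace period elapsed with no ping -> alert on "db_backup"
```

The point of the sketch is that the service stores only a deadline per monitor; all the intelligence lives in the absence of a ping.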

Bash: The 3-Line Integration with curl

Most cron jobs are shell scripts invoked from crontab. The integration is a single curl call appended with &&.

```bash
# Before monitoring -- silent failures
0 2 * * * root /opt/scripts/backup-db.sh

# After monitoring -- CronSafe alerts if the ping is missing
0 2 * * * root /opt/scripts/backup-db.sh && curl -fsS --retry 3 --max-time 10 https://ping.cronsafe.luxkern.com/m/abc123 > /dev/null 2>&1
```


The && is the critical operator. It means "run curl only if the previous command exited with status 0." If your backup script fails, curl never executes, the ping never arrives, and CronSafe alerts you after the grace period.

Here is a more complete version with start, success, and fail signals:

```bash
#!/bin/bash
# /opt/scripts/backup-db-monitored.sh

set -euo pipefail

MONITOR="https://ping.cronsafe.luxkern.com/m/abc123"

# Tell CronSafe the job started
curl -fsS "${MONITOR}/start" --max-time 10 > /dev/null 2>&1 || true

START_TIME=$(date +%s)

# Run the actual backup
if pg_dump -h localhost -U app production | gzip > "/backups/db_$(date +%Y%m%d).sql.gz"; then
    DURATION=$(( $(date +%s) - START_TIME ))
    # Signal success with duration metadata
    curl -fsS "${MONITOR}?duration=${DURATION}" --max-time 10 > /dev/null 2>&1 || true
    echo "Backup completed in ${DURATION}s"
else
    DURATION=$(( $(date +%s) - START_TIME ))
    # Signal explicit failure -- triggers immediate alert, no waiting
    curl -fsS "${MONITOR}/fail?duration=${DURATION}" --max-time 10 > /dev/null 2>&1 || true
    echo "Backup FAILED after ${DURATION}s" >&2
    exit 1
fi
```


The || true after each curl is a safety net. If CronSafe itself is unreachable (network blip, DNS timeout), you do not want your backup to fail because of a monitoring side-effect. The set -e at the top would otherwise abort the script on a non-zero curl exit.

The flags break down as: -f fails silently on HTTP errors, -s suppresses the progress bar, -S still shows error messages, --retry 3 retries transient failures, and --max-time 10 caps the request at 10 seconds so a slow network never blocks your job.

Python: httpx Ping with Structured Error Handling

If your scheduled tasks run inside a Python process -- whether via APScheduler, Celery Beat, or a plain while True loop -- you can wrap any function with monitoring in under 20 lines.

    """
    monitored_tasks.py
    Dead man's switch wrapper using httpx for async-friendly pinging.
    """
    import httpx
    import time
    import logging
    from datetime import datetime

    logger = logging.getLogger(__name__)

    MONITORS = { "db_backup": "https://ping.cronsafe.luxkern.com/m/abc123", "invoice_send": "https://ping.cronsafe.luxkern.com/m/def456", "cache_warm": "https://ping.cronsafe.luxkern.com/m/ghi789", }

    def ping(monitor_name: str, endpoint: str = "", duration: float = None): """Send a heartbeat to CronSafe. Never raises.""" url = f"{MONITORS[monitor_name]}{endpoint}" params = {"duration": str(int(duration))} if duration else {} try: httpx.get(url, params=params, timeout=10) except httpx.HTTPError as exc: logger.warning(f"CronSafe ping failed for {monitor_name}: {exc}")

    def monitored(monitor_name: str): """Decorator that wraps any function with start/success/fail pings.""" def decorator(func): def wrapper(*args, **kwargs): ping(monitor_name, "/start") t0 = time.monotonic() try: result = func(*args, **kwargs) elapsed = time.monotonic() - t0 ping(monitor_name, duration=elapsed) return result except Exception as exc: elapsed = time.monotonic() - t0 ping(monitor_name, "/fail", duration=elapsed) logger.error(f"Job {monitor_name} failed after {elapsed:.1f}s: {exc}") raise return wrapper return decorator

    --- Usage ---



    @monitored("db_backup") def run_backup(): """Dump production DB and upload to S3.""" import subprocess subprocess.run( ["pg_dump", "-Fc", "production", "-f", "/backups/prod.dump"], check=True, ) subprocess.run( ["aws", "s3", "cp", "/backups/prod.dump", "s3://backups/prod.dump"], check=True, ) logger.info("Backup uploaded to S3")

    @monitored("cache_warm") def warm_cache(): """Pre-compute expensive queries and store results.""" resp = httpx.get("https://api.internal/admin/cache/warm", timeout=120) resp.raise_for_status() logger.info(f"Cache warmed: {resp.json()['entries']} entries")

    if __name__ == "__main__": run_backup()


The @monitored decorator handles start, success, and fail pings with duration tracking. You write your job logic as a normal function and never think about the monitoring plumbing again.

Why httpx instead of requests? httpx supports async natively, has a cleaner timeout API, and drops the dependency on urllib3. For synchronous scripts like this one, either library works, but httpx is the more modern choice and weighs in at roughly 85 KB installed vs. 128 KB for requests + urllib3.

GitHub Actions: Monitoring Scheduled Workflows

GitHub Actions supports schedule triggers using cron syntax. What most teams do not realize is that GitHub offers no guarantee these workflows will execute on time -- or at all. Under heavy load, GitHub throttles or silently skips scheduled runs. On repositories with no recent activity, scheduled workflows are disabled automatically after 60 days.

A dead man's switch catches both failure modes: a workflow that runs and fails, and a workflow that never starts.

```yaml
# .github/workflows/daily-sync.yml
name: Daily Data Sync

on:
  schedule:
    - cron: "0 6 * * *"  # 6 AM UTC daily
  workflow_dispatch:      # manual trigger for testing

env:
  CRONSAFE_URL: ${{ secrets.CRONSAFE_SYNC_URL }}

jobs:
  sync:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - name: Signal start
        run: curl -fsS "${CRONSAFE_URL}/start" --max-time 10 || true

      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Node 20
        uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: "npm"

      - name: Install deps
        run: npm ci

      - name: Run sync
        run: node scripts/sync-data.js
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}

      - name: Signal success
        if: success()
        run: curl -fsS "${CRONSAFE_URL}" --max-time 10 || true

      - name: Signal failure
        if: failure()
        run: curl -fsS "${CRONSAFE_URL}/fail" --max-time 10 || true
```


The if: success() and if: failure() conditions route the correct signal based on the outcome of the sync step. Store the CronSafe URL in GitHub Secrets -- never hardcode monitoring URLs in your repository.

Add workflow_dispatch alongside schedule so you can trigger a test run from the GitHub UI and verify the CronSafe integration works before relying on it in production.

This pattern is especially valuable for teams that use GitHub Actions as a lightweight job scheduler. At the time of writing, GitHub's own status page shows that scheduled Actions experienced 7 partial-availability incidents in Q1 2026 alone. Without external monitoring, you would not know your workflow was skipped.

Choosing the Right Grace Period



The grace period is the buffer between when a ping is expected and when an alert fires. Too tight and you get false alarms from slow runs or clock drift. Too loose and a genuine failure sits undetected for hours.

The rule of thumb: set the grace period to 2x the maximum observed execution time, plus a buffer for scheduling jitter.

| Job Frequency | Typical Duration | Recommended Grace Period |
|---|---|---|
| Every minute | < 10 seconds | 2 minutes |
| Every 5 minutes | < 1 minute | 5 minutes |
| Every hour | < 10 minutes | 20 minutes |
| Daily | < 1 hour | 2 hours |
| Weekly | < 4 hours | 8 hours |

If your job has variable execution times -- an ETL that processes 500 rows on Monday and 50,000 on month-end -- set the grace period for the worst case and add a note in CronSafe explaining why.
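The rule of thumb is simple enough to encode. This hypothetical helper (the function name and the one-minute jitter default are assumptions, not part of any CronSafe API) derives a grace period from a list of observed run durations:

```python
def recommended_grace(durations_seconds: list[float], jitter_seconds: float = 60) -> float:
    """Grace period = 2x the worst observed run, plus a scheduling-jitter buffer."""
    return 2 * max(durations_seconds) + jitter_seconds

# A daily backup whose recent runs took 18, 25, and 42 minutes:
runs = [18 * 60, 25 * 60, 42 * 60]
print(recommended_grace(runs) / 60)  # -> 85.0 (minutes): 2 * 42 + 1 minute of jitter
```

Feed it a rolling window of recent durations rather than all-time history so the recommendation tracks your job as it grows.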

For a deeper understanding of cron expressions and scheduling syntax, including the five-field format, special characters, and 20+ annotated examples, check the dedicated guide.

Four Mistakes That Undermine Your Monitoring



Pinging before the job finishes. A semicolon (;) runs both commands regardless of exit code. Use && or embed the curl inside your script after the success path. The difference is one character, and it is the difference between monitoring and theater.

```bash
# Wrong -- ping fires even if backup fails
0 2 * * * /opt/scripts/backup.sh ; curl https://ping.cronsafe.luxkern.com/m/abc123

# Right -- ping fires only on success
0 2 * * * /opt/scripts/backup.sh && curl https://ping.cronsafe.luxkern.com/m/abc123
```


Letting monitoring break the job. If CronSafe is temporarily unreachable and your curl call does not have || true, the monitoring side-effect kills your actual job. Your monitoring system should never become a single point of failure for the work it watches.

Sharing one monitor across multiple jobs. If two jobs ping the same monitor, a success from one masks a failure in the other. Each job gets its own monitor, its own URL, its own schedule.

Ignoring "unknown" monitors. Check your CronSafe dashboard weekly. Monitors stuck in "unknown" state mean a job was set up but never pinged, or a crontab entry was removed but the monitor was not. Both are signals worth investigating.

When to Look Beyond Dead Man's Switches



The dead man's switch pattern handles 90% of cron monitoring needs. But there are scenarios where you need more:

If your job runs but produces incorrect output (a backup that exports zero rows, a report with stale data), you need validation logic before the ping. Add assertions in your script that verify the output before signaling success.
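As a sketch, a Python job might gate the success ping on a sanity check of its output. The size threshold, path, and helper name here are illustrative assumptions, not a prescribed API:

```python
import os

MIN_BACKUP_BYTES = 1024  # an empty or truncated dump is almost certainly wrong

def backup_is_valid(path: str) -> bool:
    """Signal success only if the output file exists and is plausibly complete."""
    return os.path.exists(path) and os.path.getsize(path) >= MIN_BACKUP_BYTES

# In the job: send the success ping only on verified output, /fail otherwise.
# if backup_is_valid("/backups/prod.dump"):
#     ping("db_backup")
# else:
#     ping("db_backup", "/fail")
```

Row counts, checksums, or a test-restore into a scratch database all work the same way: verify first, ping second.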

If you need to correlate cron job outcomes with application metrics (did the ETL affect API latency?), pair CronSafe with an observability platform and send structured logs alongside your pings.

If you need alerting on cron job *duration trends* -- a backup that took 4 minutes last month now takes 40 -- CronSafe's duration tracking catches this automatically when you send the duration query parameter with your pings.
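If you keep your own duration history, spotting that 4-minutes-to-40-minutes drift takes only a few lines. The 3x threshold and 20-run window in this sketch are illustrative assumptions you would tune:

```python
from statistics import median

def duration_regressed(history: list[float], latest: float, factor: float = 3.0) -> bool:
    """Flag the latest run if it is 'factor' times slower than the recent median."""
    if len(history) < 5:
        return False  # not enough data to judge
    return latest > factor * median(history[-20:])

past = [240, 250, 245, 260, 238]              # ~4-minute backups, in seconds
print(duration_regressed(past, 255))          # normal run -> False
print(duration_regressed(past, 2400))         # 40-minute run -> True
```

A median over a bounded window resists one-off spikes better than a mean over all history.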

For detailed guidance on setting up failure alerts with escalation policies and multi-channel routing, the companion guide walks through Slack, email, PagerDuty, and SMS configurations.

Start Monitoring in 60 Seconds



Every unmonitored cron job is a silent failure waiting to happen. The dead man's switch pattern costs you one HTTP call per job run and catches every failure mode -- including the ones where nothing happens at all.

Sign up at CronSafe, create a monitor, paste a curl line into your script, and stop learning about failures from your customers. The free tier covers 20 monitors. No credit card required.