01
Thundering herd at :00:00
A cron expression like 0 * * * * means "the start of every hour." When a fleet of servers all run the same cron, every server fires at exactly :00:00 of the hour. Not approximately — exactly. Clocks are NTP-synced, and cron's resolution is one minute.
What goes wrong: every server simultaneously hits the same Redis connection, the same database, the same downstream API. Pools saturate. Connections back up. A 5 ms ping becomes a 5-second timeout. If the job touches a shared cache or queue, the contention multiplies.
The signature in monitoring is unmistakable — a sharp spike at :00 of every hour, then a flat valley. Or worse, a */5 cron creating spikes every five minutes.
The fix has two flavors. On a single server, jitter the start: sleep $((RANDOM % 60)); your-job. On a build farm or fleet, use a scheduler that's aware of this — Jenkins's H/5 * * * * syntax hashes the job name to spread out firing times. Kubernetes CronJobs don't have a built-in equivalent, but a randomized offset inside the script does the job.
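A minimal sketch of the single-server jitter fix, written as a wrapper so the crontab entry stays readable. It uses /dev/urandom instead of bash's RANDOM so it works under plain /bin/sh; the job path in the comment is a placeholder, and JITTER_MAX is an assumption added here to make the wrapper testable.

```shell
#!/bin/sh
# Jitter wrapper: sleep a random number of seconds (0..59 by default)
# before running the real job, so NTP-synced servers sharing the same
# crontab entry spread out across the minute instead of firing at :00:00.
jitter_then_run() {
    max=${JITTER_MAX:-60}                               # override for testing
    # two random bytes from /dev/urandom, reduced mod max
    delay=$(( $(od -An -N2 -tu2 /dev/urandom) % max ))
    sleep "$delay"
    "$@"                                                # the actual job
}

# crontab entry (path is a placeholder):
# 0 * * * * . /usr/local/bin/jitter.sh; jitter_then_run /usr/local/bin/hourly-job
```

Note the trade-off: jitter spreads load but makes the exact start time fuzzy, which matters only if something downstream assumes second-level precision.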
02
DST transitions
A nightly job at 0 2 * * * runs at 2 AM local time. In the United States, twice a year, "2 AM local time" becomes a problem.
On the November fall-back, clocks roll from 2 AM back to 1 AM, then forward through 2 AM a second time. Some cron implementations fire the 2 AM job twice. Others fire it once. The behavior depends on the cron daemon and the OS — Vixie cron, fcron, and systemd timers all differ.
On the March spring-forward, clocks jump from 2 AM directly to 3 AM. The 2 AM execution window doesn't exist that day, and most cron implementations skip it entirely.
For batch jobs, double-running can corrupt data. Skipped runs leave gaps. Neither is acceptable for billing, reporting, or sync jobs. The fix is to schedule outside the DST window — anything before 1 AM or after 4 AM is safe — or to run cron in UTC. Most cloud schedulers (Kubernetes CronJob, AWS EventBridge, GitHub Actions) default to UTC for exactly this reason. "2 AM UTC" is a meaningless time to humans, but humans aren't reading cron logs at 2 AM anyway.
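For classic crontabs, one way to get the UTC behavior without re-clocking the whole host is the CRON_TZ variable — a sketch assuming cronie (it is not part of every daemon; check your system's crontab(5)). The script path is a placeholder.

```shell
# Run the nightly batch on a UTC clock regardless of the host's local zone.
# CRON_TZ is supported by cronie; other daemons may ignore it.
CRON_TZ=UTC
# UTC has no DST, so this fires exactly once per day, every day.
0 7 * * * /usr/local/bin/nightly-report.sh
```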
03
Overlap with no protection
A job at */5 * * * * runs every five minutes. On a normal day the job takes three minutes — fine. Then traffic spikes, the job slows down, and one execution takes seven minutes. The next scheduled run starts while the previous one is still going. Two copies of the same job, running concurrently, against the same data.
The failure modes depend on what the job touches. If it processes a queue, items get processed twice. If it updates rows in a database, you race on UPDATE WHERE status='pending'. If it writes to a shared file, you corrupt it.
The fix is concurrency control, but the right primitive depends on where the job runs. On a single server, flock -n /tmp/job.lock command exits immediately if another instance holds the lock. In Kubernetes, the CronJob spec's concurrencyPolicy: Forbid tells the controller to skip new runs while the previous one is still active. In an application-controlled cron, an advisory database lock (pg_advisory_lock in Postgres) is portable across replicas. Whatever the mechanism, the rule is: assume the job will eventually run longer than its interval. If it can't safely overlap with itself, design that in from the start.
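The single-server version can be packaged as a small wrapper around flock(1) (util-linux, Linux-specific); the lock-file and job paths in the comment are placeholders.

```shell
#!/bin/sh
# Overlap guard: run CMD only if no other instance holds LOCKFILE.
# flock -n is non-blocking, so a run that finds the lock taken exits
# immediately instead of queueing up behind the slow one.
run_exclusive() {
    lock=$1; shift
    flock -n "$lock" "$@"
}

# crontab entry (paths are placeholders):
# */5 * * * * . /usr/local/bin/lock.sh; run_exclusive /var/lock/myjob.lock /usr/local/bin/myjob
```

The skipped run is deliberate: for most periodic jobs, "run again in five minutes" is a better outcome than "run twice at once."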
04
Silent failure
Standard Vixie cron mails job output to the crontab's owner. If MAILTO is set to a real address and the system has a working MTA, you get an email when something errors. If neither is configured — and on most modern Linux installs, neither is — the output is simply discarded. /var/log/cron and journalctl -u cron record that the job fired, not what it printed, and nobody looks there anyway.
The result is a job that fails silently for weeks. The schedule keeps firing, the script keeps exiting 1, and the next run starts anyway. The downstream system that depends on the job's output shows increasingly stale data. Eventually someone notices that the latest report is six weeks old.
The fix is layered. At minimum, redirect stderr to a known log file: command >> /var/log/myjob.log 2>&1. Better: wrap the command with an alerting layer that captures exit codes and posts to PagerDuty or Slack on failure. Best: use a deadman switch like Healthchecks.io. The cron job pings a URL on success; the URL pings you if it stops hearing from the job. The absence of success becomes the alert. For Kubernetes CronJobs, the equivalent is monitoring kube_cronjob_status_last_successful_time and alerting when the gap exceeds the expected interval.
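The deadman-switch pattern is a few lines of shell around the job. This sketch follows the Healthchecks.io convention of pinging the check's URL on success and URL/fail on failure; the UUID is a placeholder, and the log path matches the redirect example above.

```shell
#!/bin/sh
# Dead-man-switch wrapper: run the job, log its output, then report the
# exit code to a ping service. Placeholder UUID; use your check's URL.
PING_URL="https://hc-ping.com/your-uuid-here"

run_and_report() {
    "$@" >> /var/log/myjob.log 2>&1
    rc=$?
    if [ "$rc" -eq 0 ]; then
        # success ping; if these stop arriving, the service alerts you
        curl -fsS -m 10 --retry 3 "$PING_URL" > /dev/null
    else
        # explicit failure ping, so failures alert faster than silence
        curl -fsS -m 10 --retry 3 "$PING_URL/fail" > /dev/null
    fi
    return "$rc"
}
```

The key property is that the alert fires on the *absence* of the success ping, so a host that dies entirely — cron and all — still gets noticed.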
05
Distributed cron without coordination
A cron rule that works on one server becomes a multiplication problem on N servers. Without coordination, every replica fires its own copy of every job. Five web servers running a 0 * * * * log-rotation cron means five log rotations, five concurrent writes, five copies of yesterday's archive.
Kubernetes CronJobs solve part of this. The controller is responsible for scheduling, and it creates a single Job per scheduled time. But the CronJob spec has caveats. concurrencyPolicy: Allow (the default) lets overlapping jobs run, which collides with pitfall 3. Forbid skips new runs while old ones are active. Replace kills the old run when a new one is due. Pick deliberately.
For non-Kubernetes deployments, the options are a dedicated cron host (one server runs scheduled jobs; others don't), leader election (the daemon runs on every replica but checks a distributed lock — etcd, ZooKeeper, Redis SETNX — before firing), or application-level scheduling (Sidekiq Cron, Celery Beat, Quartz with JDBC store all handle leader election internally). The wrong choice is letting every replica run its own cron and hoping the jobs are idempotent. They aren't. Eventually one of them won't be.
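The Redis-lock flavor of leader election fits in a few lines. A sketch assuming redis-cli and the SET ... NX EX semantics (Redis 2.6.12+); the key name, TTL, and job path are placeholders, and the REDIS_CLI indirection is an assumption added here so the wrapper can be tested without a live server.

```shell
#!/bin/sh
# Fleet-wide guard: every replica runs this from cron, but only the
# replica that wins a short-lived Redis lock executes the job.
REDIS_CLI=${REDIS_CLI:-redis-cli}

acquire_and_run() {
    key=$1; ttl=$2; shift 2
    # SET key value NX EX ttl returns "OK" to exactly one caller per
    # TTL window; everyone else gets nil and skips the run.
    got=$($REDIS_CLI SET "$key" "$(hostname)" NX EX "$ttl")
    if [ "$got" = "OK" ]; then
        "$@"
    fi
}

# identical crontab on every replica (paths are placeholders):
# 0 * * * * . /usr/local/bin/leader-lock.sh; acquire_and_run cron:hourly 300 /usr/local/bin/hourly-job
```

Set the TTL longer than the clock skew between replicas but shorter than the interval, so a crashed leader's lock expires before the next scheduled run.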
06
DOM vs DOW is OR, not AND
In standard cron syntax, the day-of-month and day-of-week fields are evaluated with an OR, not an AND. 0 9 1 * 1 does not mean "9 AM on the 1st of the month, if that day is a Monday." It means "9 AM on the 1st of any month, OR 9 AM on every Monday."
This is Vixie cron behavior preserved for backwards compatibility. The original Unix cron used OR; rewriting it would have broken every existing crontab. Most cron implementations preserve the convention.
The implication: never set both DOM and DOW with the intent of restricting to dates that satisfy both. Either set only one and let the other be *, or do the AND check in application code. The exception is Quartz, which adds a leading seconds field and uses ? in one of the two day fields to make the other authoritative: 0 0 9 1 * ? is unambiguously 9 AM on the 1st, and 0 0 9 ? * MON is unambiguously 9 AM every Monday. AWS EventBridge inherited the ? convention.
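On a standard cron, the AND has to happen in the script. A sketch of "9 AM on the 1st, only if it's a Monday": let cron fire on every 1st and check the weekday inside the job (the job path is a placeholder).

```shell
#!/bin/sh
# Standard-cron workaround for DOM-AND-DOW: cron restricts to the 1st
# (0 9 1 * *), and this wrapper adds the day-of-week condition.
# date +%u prints the ISO day of week, where 1 = Monday.
run_if_monday() {
    if [ "$(date +%u)" = "1" ]; then
        "$@"
    fi
}

# crontab: 0 9 1 * * . /usr/local/bin/dow-check.sh; run_if_monday /usr/local/bin/monthly-job
```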
07
GitHub Actions cron drift
GitHub Actions supports cron schedules in workflows. The syntax is standard 5-field cron, in UTC. What's not advertised is that execution time is not guaranteed.
GitHub's own documentation states that scheduled workflows can be delayed "during periods of high loads of GitHub Actions workflow runs." In practice, that delay ranges from tens of minutes to several hours. There are documented cases of scheduled workflows skipping entire runs when the queue backs up.
The implication: GitHub Actions cron is fine for jobs where the schedule is a hint, not a contract. Generating a weekly report, syncing a slow-moving dataset, refreshing a cache — all fine. Anything where "fires at exactly 9 AM" matters (transaction cutoffs, regulatory deadlines, customer-facing communications) needs a real scheduler. The fix isn't really a fix; it's a deployment decision. For time-critical schedules, use a Kubernetes CronJob, an AWS EventBridge rule, or a dedicated scheduling system. Don't use a CI runner.
In summary
Cron is unbeatable for what it does well: simple, repeating, well-understood jobs on a known schedule. The skill isn't avoiding cron — it's knowing when the simple model breaks down and what to do about it.
Pitfalls 1, 3, and 4 are the ones almost every team eventually hits. 2 and 7 are situational but catastrophic when they do hit. 5 and 6 are domain-specific. If a scheduled job touches data anyone cares about, design for these failure modes from the start. Adding them after an incident is much more expensive.