Scheduled Jobs and Cron at Scale: Reliable Recurring Tasks in Symfony and Node

Sandor Farkas - Founder & Lead Developer at Wolf-Tech


Every team eventually discovers the same thing about cron: it works well until the moment it matters most. A nightly billing summary runs on a single application server, that server is replaced during a rolling deploy, the job silently does not run, and no one notices until a customer calls to ask why their invoice is wrong. Or a heavy report job takes longer than its schedule interval, a second instance fires while the first is still running, and the two instances corrupt shared state in the database.

Cron at scale is not about running more jobs. It is about building a scheduling layer that treats every recurring task as something that can and will fail - and designing the system so that failure is visible, recoverable, and never leads to duplicate or missing work. This post covers how to do that in both Symfony and Node, using patterns that hold up past a single server.

Why Single-Server Cron Breaks

The standard approach - a crontab entry calling a console command or a Node script - has two fundamental problems in any environment with more than one application instance.

The first is duplicate execution. If your autoscaling group runs three identical servers and each has the same crontab, every scheduled job fires three times at once. Database writes that assume exclusivity now race each other. Emails go out in triplicate. Reports contain partial or interleaved results because several instances read the same tables and write to the same output at the same time.

The second is invisibility. A cron job that fails typically does nothing more than mail its output to a local mailbox or leave a line in syslog, and no one reads either. There is no dead-letter queue, no retry, no alert. A critical weekly cleanup job can stop working on the day a deployment changes the server's environment variables, and you will not know until weeks later when someone notices the database has not been cleaned.

Both problems have well-understood solutions, but they require stepping away from raw cron and treating scheduled work as a proper engineering concern.

Symfony: Symfony Scheduler and Distributed Locking

Symfony 6.3 shipped a Scheduler component that integrates directly with Messenger. Instead of a crontab entry, you define a recurring message and its schedule in PHP, dispatch it through Messenger, and let workers pick it up. This gives you all the retry, dead-letter, and observability tooling that Messenger already provides for queued jobs.

A basic recurring task looks like this:

// src/Scheduler/ReportSchedule.php
use Symfony\Component\Scheduler\Attribute\AsSchedule;
use Symfony\Component\Scheduler\RecurringMessage;
use Symfony\Component\Scheduler\Schedule;
use Symfony\Component\Scheduler\ScheduleProviderInterface;

#[AsSchedule('reports')]
final class ReportSchedule implements ScheduleProviderInterface
{
    public function getSchedule(): Schedule
    {
        return (new Schedule())->add(
            RecurringMessage::cron('0 3 * * *', new GenerateNightlyReport()),
        );
    }
}

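Running messenger:consume scheduler_reports on a worker picks up the schedule. Be careful when running more than one such worker: the schedule is evaluated inside each consumer, so by default every worker consuming scheduler_reports generates its own copy of each message. To scale out safely, attach a lock to the schedule with Schedule::lock() so only one worker generates messages at a time, and consider Schedule::stateful() so the schedule's last-run state survives worker restarts.

Even with that in place, delivery is at-least-once. If a job takes longer than expected or a worker crashes mid-run, the same task can be delivered again, sometimes while the first run is still in progress. Handlers should therefore be idempotent, and any handler that touches shared state should acquire a distributed lock at the start so that two runs for the same unit of work never overlap: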

use Symfony\Component\Lock\LockFactory;
use Symfony\Component\Messenger\Attribute\AsMessageHandler;

#[AsMessageHandler]
final class GenerateNightlyReportHandler
{
    public function __construct(
        private LockFactory $lockFactory,
        private ReportGenerator $generator,
    ) {}

    public function __invoke(GenerateNightlyReport $message): void
    {
        $lock = $this->lockFactory->createLock(
            'nightly-report-' . $message->date->format('Y-m-d'),
            ttl: 3600,
        );

        if (!$lock->acquire()) {
            return; // Another worker already has this
        }

        try {
            $this->generator->generate($message->date);
        } finally {
            $lock->release();
        }
    }
}

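The lock key includes the date, so runs for different dates are independent while two runs for the same date and report can never overlap. The TTL ensures a crashed worker does not hold the lock forever; set it comfortably above the job's worst-case runtime, or call $lock->refresh() during long runs, because a lock that expires mid-run lets a second worker in. Because the lock is released when the run finishes, it prevents concurrent runs but not a later redelivery, so the generator itself should still be idempotent for a given date. Also make sure $message->date reflects the run being processed: a RecurringMessage built from a single message object redispatches that same object on every trigger.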
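The locking semantics described here (a per-date key, a non-blocking acquire, and a TTL as a safety net) can be sketched without any infrastructure. The TypeScript below is an illustrative in-memory stand-in, not Symfony's Lock component: TtlLockStore, lockKeyFor, and the injectable clock are hypothetical names, and a real store would live in Redis or a database so that every worker sees the same locks.

```typescript
// In-memory stand-in for a distributed lock store (hypothetical sketch).
class TtlLockStore {
  private readonly expiries = new Map<string, number>();
  private readonly now: () => number;

  // The clock is injectable so TTL expiry can be exercised without waiting.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Non-blocking acquire: returns false while a live lock exists for the key.
  acquire(key: string, ttlMs: number): boolean {
    const expiresAt = this.expiries.get(key);
    if (expiresAt !== undefined && expiresAt > this.now()) {
      return false;
    }
    this.expiries.set(key, this.now() + ttlMs);
    return true;
  }

  // Simplified release: a real store also checks an owner token so a worker
  // cannot release a lock it already lost through expiry.
  release(key: string): void {
    this.expiries.delete(key);
  }
}

// One key per report and calendar day (UTC), mirroring the handler above.
function lockKeyFor(report: string, date: Date): string {
  return `${report}-${date.toISOString().slice(0, 10)}`;
}
```

Two workers asking for the same report and date get one success and one refusal, while a different date yields an independent lock. Once the TTL has passed, the key can be acquired again even if release was never called, which is exactly why the TTL has to exceed the longest realistic run.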

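For the lock store, Redis with the RedisStore adapter works well for most teams. A single Redis node is a single point of failure for locks: if it restarts without persistence, held locks disappear. If that matters, use the CombinedStore with a ConsensusStrategy across at least three independent Redis nodes (with only two, losing either node blocks all locking), or fall back to a database-backed store such as DoctrineDbalStore, which keeps locks in a table and relies on the database's uniqueness guarantees.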

Node.js: node-cron, BullMQ, and the Scheduler Pattern

In Node, the equivalent path runs through BullMQ. Raw node-cron has the same single-server problem as crontab: every instance schedules its own timers, and you are back to triple-firing on a three-node cluster.

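BullMQ's repeatable jobs solve this by storing the schedule state in Redis. You register a job once, and BullMQ enqueues one job per interval regardless of how many processes registered it or how many workers are running. Processing is still at-least-once in failure cases (a stalled job can be picked up again), so handlers should be idempotent: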

import { Queue, Worker } from 'bullmq';
import IORedis from 'ioredis';

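// BullMQ workers require maxRetriesPerRequest: null on the ioredis connection
const connection = new IORedis(process.env.REDIS_URL!, {
  maxRetriesPerRequest: null,
});

const reportQueue = new Queue('reports', { connection });

// Register the repeatable job once at startup.
// Re-adding with identical repeat options does not create a duplicate, but
// changing the pattern registers a second schedule alongside the old one,
// so remove the old repeatable job first when you change it (newer BullMQ
// releases offer upsertJobScheduler for this).
await reportQueue.add(
  'nightly-report',
  { type: 'nightly' },
  {
    repeat: { pattern: '0 3 * * *' }, // older BullMQ releases call this option `cron`
    jobId: 'nightly-report-singleton', // folded into the schedule's key for repeatable jobs
  }
);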

const worker = new Worker(
  'reports',
  async (job) => {
    await generateNightlyReport(job.data);
  },
  { connection }
);

BullMQ keeps the repeatable job's metadata, including its next execution time, in a Redis sorted set keyed by the job's repeat configuration. Each occurrence is an ordinary delayed job; when one is picked up, BullMQ schedules the next occurrence. Workers claim jobs through atomic Redis operations, so when several workers compete for the same job exactly one gets it.
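The "exactly one wins" idea can be illustrated without Redis. The sketch below is not BullMQ's implementation; it only shows the general technique of giving each scheduled occurrence a deterministic ID and inserting it only if absent, so that several instances trying to enqueue the same slot collapse into one job. InMemoryQueue, occurrenceId, and enqueueOccurrence are hypothetical names, and the Map stands in for an atomic insert-if-absent in a shared store.

```typescript
type ScheduledJob = { id: string; name: string; runAt: number };

// Stand-in for a shared store with an atomic "insert if absent" operation.
class InMemoryQueue {
  private readonly jobs = new Map<string, ScheduledJob>();

  // Returns true if the job was added, false if the ID was already present.
  addIfAbsent(job: ScheduledJob): boolean {
    if (this.jobs.has(job.id)) {
      return false;
    }
    this.jobs.set(job.id, job);
    return true;
  }

  size(): number {
    return this.jobs.size;
  }
}

// Hypothetical ID scheme: the schedule name plus the slot's timestamp.
function occurrenceId(name: string, runAt: number): string {
  return `repeat:${name}:${runAt}`;
}

// Every instance may call this for the same slot; only the first call adds a job.
function enqueueOccurrence(queue: InMemoryQueue, name: string, runAt: number): boolean {
  return queue.addIfAbsent({ id: occurrenceId(name, runAt), name, runAt });
}
```

If three instances call enqueueOccurrence for the same 03:00 slot, the queue ends up with a single job, and the next day's slot gets its own ID and its own job.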

For tasks where you need stronger guarantees than Redis provides - where losing an in-flight job due to a Redis restart is unacceptable - the pattern is to use BullMQ for scheduling but persist the intent to the database before the job runs:

const worker = new Worker(
  'reports',
  async (job) => {
    // Write intent to DB first - if worker dies, you have a record
    const run = await db.scheduledRun.upsert({
      where: { jobKey: job.id! },
      create: { jobKey: job.id!, status: 'running', startedAt: new Date() },
      update: { status: 'running', startedAt: new Date() },
    });

    try {
      await generateNightlyReport(job.data);
      await db.scheduledRun.update({
        where: { id: run.id },
        data: { status: 'completed', completedAt: new Date() },
      });
    } catch (error) {
      await db.scheduledRun.update({
        where: { id: run.id },
        data: { status: 'failed', error: String(error) },
      });
      throw error; // Let BullMQ handle retry
    }
  },
  { connection }
);

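This gives you a queryable audit trail of every scheduled job run that started, its outcome, and its timing - independent of whatever BullMQ stores in Redis. The record does not re-run anything by itself: a row stuck in running long after its start time, or an expected run with no row at all, is the signal that a job was lost and needs to be re-enqueued, so pair the table with a periodic check.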
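A periodic check over that audit table could look roughly like the following. This is a minimal sketch under assumptions: the ScheduledRun shape mirrors the fields used in the worker above (jobKey, status, startedAt), while findStuckRuns and the maxRuntimeMs threshold are hypothetical and would need tuning per job.

```typescript
type RunStatus = 'running' | 'completed' | 'failed';

interface ScheduledRun {
  jobKey: string;
  status: RunStatus;
  startedAt: Date;
  completedAt?: Date;
}

// Returns runs that are still marked as running after exceeding the allowed
// runtime. These are candidates for an alert or a re-enqueue, since the
// worker that owned them has probably died.
function findStuckRuns(
  runs: ScheduledRun[],
  now: Date,
  maxRuntimeMs: number,
): ScheduledRun[] {
  return runs.filter(
    (run) =>
      run.status === 'running' &&
      now.getTime() - run.startedAt.getTime() > maxRuntimeMs,
  );
}
```

In practice the rows would come from a database query rather than an array, and the check itself could run as one more scheduled job, with its own heartbeat.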

Observability: The Missing Layer

The most common gap in scheduled job infrastructure is observability. Teams set up a job, confirm it ran once in development, deploy, and move on. Three months later they discover it has been failing silently for six weeks.

The minimum viable observability setup has three parts.

Job heartbeats tell your monitoring system that a job ran successfully. A simple approach is to use a service like Healthchecks.io or a custom endpoint: at the end of every successful job run, make an HTTP GET to a URL unique to that job. If the URL goes uncalled for longer than the expected interval plus a grace period, you get an alert. This catches both job failures and jobs that stop scheduling entirely.
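As a rough sketch of the heartbeat pattern, the wrapper below runs a job and pings a URL only when the job succeeds. The names runWithHeartbeat and Ping are hypothetical, the heartbeat URL is whatever your monitoring service issues per job, and the HTTP call is injected so the example does not assume a particular client (in Node 18+ you could pass a function that calls the built-in fetch).

```typescript
// Function that performs the HTTP GET; injected to keep the sketch
// independent of any specific HTTP client.
type Ping = (url: string) => Promise<unknown>;

async function runWithHeartbeat(
  job: () => Promise<void>,
  heartbeatUrl: string, // hypothetical per-job URL issued by your monitor
  ping: Ping,
): Promise<void> {
  // If the job throws, the error propagates and no heartbeat is sent,
  // so the monitor raises an alert once the grace period elapses.
  await job();

  try {
    await ping(heartbeatUrl);
  } catch {
    // A failed ping should not fail a job that succeeded; the missing
    // heartbeat will surface as an alert on its own.
  }
}
```

Keeping the ping after the job, rather than before it, is the important design choice: a ping at the start would only prove the job was scheduled, not that it finished.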

Structured logging with duration gives you the data to understand how job performance changes over time:

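// Symfony example; $startTime = hrtime(true) is captured when the handler starts,
// and $count is the number of records the job processed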
$this->logger->info('scheduled_job.completed', [
    'job' => 'nightly-report',
    'date' => $message->date->format('Y-m-d'),
    'duration_ms' => (int) ((hrtime(true) - $startTime) / 1e6),
    'records_processed' => $count,
]);

When a job that used to run in 12 minutes starts taking 45, you want to know before it exceeds its scheduling interval and starts overlapping with the next run.
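One way to act on logged durations is to compare each run against the schedule interval and warn before the two meet. The function below is a small sketch of that idea; classifyDuration, the OverlapRisk labels, and the default 70% warning threshold are assumptions chosen for illustration, not values from any library.

```typescript
type OverlapRisk = 'ok' | 'warning' | 'overlap';

// Classifies a run by how much of its scheduling interval it consumed.
// warnRatio is the fraction of the interval at which to start warning.
function classifyDuration(
  durationMs: number,
  intervalMs: number,
  warnRatio: number = 0.7,
): OverlapRisk {
  if (durationMs >= intervalMs) {
    return 'overlap'; // the next run is due before this one finished
  }
  if (durationMs >= intervalMs * warnRatio) {
    return 'warning'; // still fits, but the margin is shrinking
  }
  return 'ok';
}
```

Assuming an hourly job, a 12-minute run is classified as ok, a 45-minute run as warning, and anything at or beyond 60 minutes as overlap. Feeding the duration_ms value from the structured log into a check like this turns a slow drift into an alert well before runs begin to collide.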

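Dead-letter queue monitoring catches handlers that exhaust retries. In Symfony Messenger, configure a failure_transport and alert when messages land in it. In BullMQ, listen to the worker's failed event and write failed jobs to a monitoring destination. The event fires for each failed attempt, so compare attemptsMade with the job's configured attempts if you only want to page on final failures: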

worker.on('failed', (job, error) => {
  logger.error('scheduled_job.failed', {
    jobName: job?.name,
    jobId: job?.id,
    attemptsMade: job?.attemptsMade,
    error: error.message,
    stack: error.stack,
  });
  metrics.increment('scheduled_job.failure', { job: job?.name });
});

Handling Long-Running Jobs Without Disruption

Scheduled jobs that approach or exceed their scheduling interval need careful handling to avoid overlap. The distributed lock approach above is the right defense, but you also need a strategy for the job itself to handle being stopped mid-run - especially during deployments.

In Symfony, workers receive a SIGTERM during graceful shutdown. With the pcntl extension installed, the Messenger worker respects this by default: it finishes the current message and then stops. Make sure your process manager's stop timeout is longer than your longest message, otherwise the worker is killed outright. For jobs that process large batches, break the work into checkpointed chunks and commit progress to the database between chunks. If the worker is killed and the job is requeued, it can resume from the last committed checkpoint rather than starting over.

In Node, BullMQ workers also handle graceful shutdown, but you need to explicitly close the worker and wait for in-flight jobs:

process.on('SIGTERM', async () => {
  await worker.close();
  await connection.quit();
  process.exit(0);
});

For jobs that genuinely need to run longer than a deployment window allows, the pattern is to make the work idempotent with checkpointing and rely on the distributed lock to ensure the new worker instance takes over cleanly when the old one exits.
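The checkpointing pattern can be sketched as a loop that processes a chunk, saves its position, and checks for a stop request between chunks. Everything here is illustrative: CheckpointStore, InMemoryCheckpointStore, and processInChunks are hypothetical names, the in-memory store stands in for a database row, and shouldStop represents whatever shutdown signal your worker exposes.

```typescript
interface CheckpointStore {
  load(jobKey: string): number; // index of the next unprocessed item
  save(jobKey: string, nextIndex: number): void;
}

// Stand-in for a durable store such as a database table.
class InMemoryCheckpointStore implements CheckpointStore {
  private readonly offsets = new Map<string, number>();

  load(jobKey: string): number {
    return this.offsets.get(jobKey) ?? 0;
  }

  save(jobKey: string, nextIndex: number): void {
    this.offsets.set(jobKey, nextIndex);
  }
}

// Processes items in chunks, committing progress after each chunk.
// Returns true when all items are done, false if it stopped early.
function processInChunks<T>(
  jobKey: string,
  items: T[],
  chunkSize: number,
  store: CheckpointStore,
  handle: (item: T) => void,
  shouldStop: () => boolean = () => false,
): boolean {
  let index = store.load(jobKey);
  while (index < items.length) {
    if (shouldStop()) {
      return false; // resume later from the saved checkpoint
    }
    const chunk = items.slice(index, index + chunkSize);
    // A crash inside a chunk replays that chunk on resume, so handle()
    // must be safe to call twice for the same item.
    chunk.forEach(handle);
    index += chunk.length;
    store.save(jobKey, index);
  }
  return true;
}
```

A run interrupted after two chunks of three leaves the checkpoint at six, and the next run picks up at the seventh item. Smaller chunks mean less replayed work after a crash at the cost of more checkpoint writes.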

Moving Past Single-Server Thinking

The mental shift required is to stop thinking about scheduled jobs as "scripts that run on a server" and start thinking about them as "messages that are produced on a schedule and consumed by workers." That reframe immediately makes the right patterns visible: you need a durable message store, idempotent handlers, distributed coordination for exclusivity, and observability on every execution.

If your application is at the stage where this complexity feels premature, a pragmatic middle ground is to designate a single "scheduler" instance in your infrastructure - a small always-on server or container whose only job is to fire scheduled tasks. This sidesteps the multi-instance duplication problem without requiring full distributed architecture, and it is a common choice for teams that are not yet running at the scale where the full solution is warranted.

When you do need the full solution, the tools in Symfony Messenger with the Scheduler component and in BullMQ with repeatable jobs cover the majority of production use cases. The investment is not large, and the alternative - silent failures in work that your customers depend on - is a cost that compounds over time.

If you are evaluating whether your current job infrastructure is production-ready, or designing a new system that needs to handle recurring work reliably, reach out at hello@wolf-tech.io or visit wolf-tech.io. The patterns above are the kind of thing we assess in a code quality review and implement as part of custom software development engagements.

FAQ

Can I keep using cron if I only have one server? Yes - cron on a single designated server avoids the duplication problem. The risk is that your cron server is also the single point of failure for all scheduled work. If it restarts during maintenance, jobs can be missed. The heartbeat monitoring pattern applies regardless of how many servers you have.

What is the overhead of adding distributed locking to every handler? For most jobs, a Redis lock acquisition is a single network round-trip, typically well under a millisecond to a few milliseconds within the same network. The overhead is negligible compared to the work most jobs do. For very high-frequency jobs running thousands of times per minute, evaluate whether scheduling is the right pattern at all or whether event-driven triggering is a better fit.

Should Symfony Scheduler replace all my existing crontab entries? Not necessarily all at once. Migrating critical jobs - billing, reports, data exports - to Scheduler-backed Messenger tasks gives you the most benefit. Administrative cleanup tasks that are low-risk if they occasionally miss a run can stay in crontab until it makes sense to migrate them.