Background Job Architecture: Queues, Workers, and Failure Handling
Every production incident I have been called into that involves "silent failures" traces back to the same place: a background job system that looked fine on the dashboard but was quietly dropping work for weeks. The pattern is almost identical each time. A worker crashes mid-task, the message is redelivered, the handler is not idempotent, and a customer ends up with two invoices—or none. A retry storm against a fragile third-party API saturates an outbound IP and trips rate limiting for the entire application. A scheduled job produces 200,000 queued items in five minutes, and the default FIFO queue means the urgent password reset emails are stuck behind a mass marketing blast.
A solid background job architecture is not about picking a queue library. It is about designing for the failure modes you will encounter in production: crashes, timeouts, poison messages, consumer saturation, and operator error. This post walks through the decisions that matter, with concrete Symfony Messenger examples for teams running PHP in production.
Why Background Job Architecture Is Load-Bearing Infrastructure
Any meaningful web application eventually needs to do work outside the request-response cycle. Sending transactional email, generating PDF invoices, syncing with CRM systems, processing image uploads, running scheduled exports—all of these belong in background workers rather than blocking an HTTP response.
The temptation is to treat this as a solved problem: install a job library, point it at Redis, call dispatch(), and move on. That works for a prototype. It breaks in production the moment one of the following happens—and eventually, all of them do. A consumer runs out of memory and is killed by the kernel while holding a reservation on a message. A Redis restart loses in-flight messages that were not persisted. A third-party webhook receiver returns 503 for an hour, and naive retries amplify the outage into your own system. A deployment ships a broken handler that throws on every message, and within minutes the queue has millions of retries scheduled.
The teams that avoid these incidents did not get lucky. They made deliberate architectural choices that are straightforward to adopt if you know what to look for.
Choosing the Right Broker for Your Load Profile
The first decision is which transport to run jobs through. The three practical options for most European SaaS teams are Redis, RabbitMQ, and a database-backed queue using the Doctrine transport. Each has a different failure profile, and the right choice depends on your durability and throughput requirements.
Redis is fast, simple, and the right choice for high-volume, low-stakes work—image resizing, cache warming, analytics pipeline feeding. What Redis does not give you without careful configuration is durability. A crashed Redis node with appendonly no loses every message in memory. Even with AOF persistence, a message can be lost between enqueue and fsync. For work you cannot afford to drop, Redis alone is insufficient.
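If you do run Redis as a job transport for some workloads, the persistence directives below narrow (but do not eliminate) the loss window. This is a sketch of the relevant redis.conf settings; the right appendfsync trade-off depends on your write volume and disk:

```
# redis.conf — persistence settings for a Redis node carrying job traffic
appendonly yes          # enable the append-only file; "no" loses everything on crash
appendfsync everysec    # fsync once per second: at most ~1s of acknowledged writes at risk
```

With appendfsync everysec you accept up to a second of potential loss in exchange for throughput; appendfsync always closes that gap at a significant latency cost.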
RabbitMQ is the pragmatic default for business-critical async work. Persistent queues survive broker restarts, publisher confirms give you at-least-once delivery semantics, and the management UI is useful for operators inspecting queue depth and consumer health. The operational overhead is higher than Redis—you are running an Erlang application that needs its own monitoring and capacity planning—but the durability guarantees justify it for most production use cases.
The Doctrine transport, which writes messages to a table in your main database, is an underrated option for small-to-medium teams. Throughput is limited, but you get transactional enqueue for free: a message dispatched in the same database transaction as the business logic either commits with it or rolls back with it. No separate broker means no separate failure mode to monitor. For teams processing thousands rather than millions of jobs per day, this simplicity is often the right trade-off.
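A minimal Messenger configuration for the Doctrine transport might look like the following. The transport name async_db is illustrative; doctrine://default points Messenger at the application's default Doctrine connection:

```yaml
framework:
    messenger:
        transports:
            async_db:
                dsn: 'doctrine://default'   # reuse the app's default DB connection
                options:
                    queue_name: default     # rows land in the messenger_messages table
```

Because enqueue is just an INSERT on the same connection, the transactional all-or-nothing behavior described above comes without any extra machinery.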
If your application is mixing all three load profiles—some jobs are fire-and-forget analytics, some are financial transactions that must never be lost—the right answer is not to pick one transport but to route different message classes to different transports based on their durability requirements.
Designing Handlers That Survive Retries
The single most important property of a background job handler is idempotency: running it twice must produce the same observable outcome as running it once. Idempotency is non-negotiable because at-least-once delivery is the best your broker can offer. A worker that processes a message, writes the result to the database, and then crashes before acknowledging to the broker will see that message again on restart. If your handler sends an email every time it runs, the customer gets two. If your handler charges a credit card every time it runs, you have a much bigger problem.
Implementing idempotency usually means one of two patterns. The first is using a natural idempotency key from the business domain—an order ID, a webhook event ID, a user action UUID—and checking whether the outcome already exists before acting:
#[AsMessageHandler]
final class ProcessPaymentHandler
{
    public function __construct(
        private PaymentRepository $payments,
        private StripeClient $stripe,
    ) {}

    public function __invoke(ProcessPayment $message): void
    {
        // Idempotency: skip if we already have a charge for this order
        $existing = $this->payments->findByOrderId($message->orderId);
        if ($existing !== null && $existing->isCaptured()) {
            return;
        }

        $charge = $this->stripe->charges->create(
            [
                'amount' => $message->amountCents,
                'currency' => $message->currency,
            ],
            // Stripe de-duplicates requests sharing an idempotency key,
            // which is passed as a request option, not a charge parameter
            ['idempotency_key' => sprintf('order-%d', $message->orderId)],
        );

        $this->payments->recordCharge($message->orderId, $charge->id);
    }
}
The second pattern is storing a processed-message log with unique constraints, so inserting the same message ID twice fails at the database level and the handler can treat it as a no-op. Both approaches work; the key is that the decision is explicit, not accidental.
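The second pattern can be sketched in a few lines. This is a minimal illustration using an in-memory SQLite table to stand in for the production database; the table name, the handleOnce helper, and the message IDs are all illustrative, not part of any library API:

```php
<?php
// Processed-message log: a UNIQUE/PRIMARY KEY constraint on the message ID
// makes the "have we seen this?" check atomic at the database level.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY, processed_at TEXT NOT NULL)');

function handleOnce(PDO $db, string $messageId, callable $work): bool
{
    try {
        // The INSERT claims the message; a duplicate ID violates the
        // primary key and throws, signalling "already processed".
        $stmt = $db->prepare('INSERT INTO processed_messages (message_id, processed_at) VALUES (?, ?)');
        $stmt->execute([$messageId, date('c')]);
    } catch (PDOException $e) {
        return false; // duplicate delivery: treat as a no-op
    }
    $work();
    return true;
}

$ran = 0;
$first  = handleOnce($db, 'msg-42', function () use (&$ran) { $ran++; });
$second = handleOnce($db, 'msg-42', function () use (&$ran) { $ran++; }); // redelivery
echo ($first ? '1' : '0') . ($second ? '1' : '0') . $ran; // prints "101"
```

In production the INSERT and the handler's business writes belong in the same transaction, so a crash after the claim but before the work rolls both back together.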
The second property that matters is bounded execution time. A handler that can hang indefinitely on a slow HTTP call will eventually exhaust your worker pool. Every external call in a handler should have an explicit timeout set at the client level, and handlers should generally target completing within tens of seconds. Jobs that genuinely need to run for minutes should be broken into smaller steps that each complete quickly.
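With Symfony's HTTP client, for example, timeouts can be set framework-wide so that no handler can hang indefinitely on a slow remote. The values below are illustrative defaults, not recommendations for every API:

```yaml
framework:
    http_client:
        default_options:
            timeout: 5         # seconds of idle time before the transfer errors
            max_duration: 15   # hard ceiling on the total request duration
```

Per-client overrides still make sense for known-slow endpoints; the point is that the default is bounded, so forgetting a timeout fails safe.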
Retry Strategy: Exponential Backoff, Not Infinite Loops
A failed job should be retried—but not immediately, and not forever. Symfony Messenger exposes a per-transport retry strategy that, once configured, handles both concerns:
framework:
    messenger:
        transports:
            async_critical:
                dsn: '%env(MESSENGER_TRANSPORT_DSN)%'
                retry_strategy:
                    max_retries: 5
                    delay: 2000
                    multiplier: 3
                    max_delay: 600000
        failure_transport: failed
This configuration retries up to five times, starting with a 2-second delay that triples each attempt (2s, 6s, 18s, 54s, 162s), capped at 10 minutes. The exponential backoff gives transient failures time to recover without amplifying the problem, and the retry ceiling ensures a truly broken message eventually escapes the retry loop into the dead-letter queue rather than burning CPU forever.
A common mistake is treating all exceptions the same. A NetworkException from an API call should probably be retried—the remote service might be briefly unavailable. A ValidationException because the message contains data that was deleted from the database should probably not—no amount of waiting will make the referenced record reappear. Messenger supports this distinction by letting handlers throw UnrecoverableMessageHandlingException to send a message directly to the dead-letter queue without retrying:
if ($order === null) {
    throw new UnrecoverableMessageHandlingException(
        sprintf('Order %d no longer exists', $message->orderId)
    );
}
Classifying exceptions this way prevents retry storms on messages that will never succeed and keeps the dead-letter queue focused on genuinely actionable failures.
Dead-Letter Queues Are Not Optional
A dead-letter queue (DLQ) is where messages go when they have exhausted their retry budget or been classified as unrecoverable. Operating one is the difference between silent data loss and a diagnosable, recoverable failure.
In practice, three things need to be in place. First, every transport must have a failure_transport configured so exhausted messages are captured rather than discarded. Second, the DLQ must be monitored—a growing dead-letter queue is a production alert, not a dashboard curiosity. Third, operators need tooling to inspect and reprocess failed messages:
# Inspect failed messages
bin/console messenger:failed:show
# Show a specific failure in detail
bin/console messenger:failed:show 42 -vv
# Retry a failed message after fixing the underlying issue
bin/console messenger:failed:retry 42
# Remove a message that cannot be reprocessed
bin/console messenger:failed:remove 42
A DLQ without a human process to review it is worse than no DLQ at all—it builds false confidence while data silently accumulates in a queue nobody reads. Establishing a weekly or daily DLQ review as part of the engineering rotation is as important as the infrastructure itself.
Prioritization: Don't Let Marketing Blasts Starve Password Resets
The default assumption of a single queue processed FIFO breaks down the moment a low-priority job produces high volume. Running NewsletterSender against 200,000 subscribers should not delay the password reset email that a user is waiting for right now.
The fix is priority queues: route critical messages to a transport that has dedicated consumer capacity and shorter queue depth. In Symfony Messenger, this is a routing configuration:
framework:
    messenger:
        transports:
            critical: '%env(MESSENGER_CRITICAL_DSN)%'
            default: '%env(MESSENGER_DEFAULT_DSN)%'
            bulk: '%env(MESSENGER_BULK_DSN)%'
        routing:
            'App\Message\PasswordResetEmail': critical
            'App\Message\OrderConfirmation': critical
            'App\Message\SyncCrmContact': default
            'App\Message\MarketingBlastSegment': bulk
Run dedicated worker processes per transport so a flood of bulk work cannot block critical consumers:
bin/console messenger:consume critical --limit=1000 --memory-limit=256M
bin/console messenger:consume default --limit=1000 --memory-limit=256M
bin/console messenger:consume bulk --limit=1000 --memory-limit=256M
This separation is cheap to set up and dramatically improves both average and tail latency for the work users actually notice.
Worker Lifecycle: Memory, Supervision, and Graceful Shutdown
Workers are long-running PHP processes, and PHP's memory behavior over long lifetimes is not great. A worker should restart periodically to reclaim memory—the --limit and --memory-limit flags on messenger:consume handle this by exiting the process cleanly after N messages or M megabytes. A process supervisor—systemd, supervisord, or Kubernetes—then restarts the exited worker within seconds.
Graceful shutdown matters just as much as restart. When a deployment replaces a worker, the running process should finish its current message and acknowledge it to the broker before exiting, not drop it mid-flight. Messenger handles SIGTERM correctly out of the box, but your orchestrator needs a shutdown grace period long enough for the longest-running handler to complete—typically 30 to 60 seconds is safe.
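Under supervisord, the restart-and-grace-period policy sketched above looks roughly like this; the program name and paths are placeholders:

```ini
[program:messenger-critical]
command=php /srv/app/bin/console messenger:consume critical --limit=1000 --memory-limit=256M
numprocs=2
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true                 ; restart after --limit / --memory-limit exits the worker
stopsignal=TERM                  ; Messenger finishes the in-flight message on SIGTERM
stopwaitsecs=60                  ; grace period: longer than your slowest handler
```

The same policy translates directly to systemd (TimeoutStopSec) or Kubernetes (terminationGracePeriodSeconds).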
Monitoring the worker pool itself is the other half of the equation. Queue depth, consumer count, and consumer lag are the three metrics that tell you whether the system is healthy. A consumer count that drops to zero during business hours is a pager. A queue depth that grows unbounded is a pager. Consumer lag over ten minutes on a critical queue is a pager. Everything else is a dashboard. Teams that run regular code quality reviews typically have these alerts wired into their incident response before the job system goes into production, not after the first silent-failure postmortem.
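The pager rules above translate directly into alert definitions. A Prometheus sketch follows; the metric names (messenger_queue_consumers, messenger_queue_depth) depend entirely on which exporter you run and are assumptions here:

```yaml
groups:
  - name: background-jobs
    rules:
      - alert: CriticalConsumersDown
        expr: messenger_queue_consumers{queue="critical"} == 0
        for: 2m          # tolerate brief restarts, page on sustained absence
        labels:
          severity: page
      - alert: CriticalQueueBacklog
        expr: messenger_queue_depth{queue="critical"} > 1000
        for: 10m         # sustained growth, not a momentary spike
        labels:
          severity: page
```

The thresholds are starting points; tune them against your normal queue depth before trusting them on-call.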
Putting It All Together
A production-ready background job system has five load-bearing pieces. Transports are chosen deliberately per durability profile, with critical work on RabbitMQ or equivalent rather than Redis alone. Handlers are idempotent by design and throw unrecoverable exceptions for messages that cannot succeed. Retries use exponential backoff with a finite ceiling, not infinite retry loops. Every transport has a monitored dead-letter queue with a human review process. And priority queues with separate worker pools prevent bulk work from starving user-facing jobs.
None of this requires exotic infrastructure. Symfony Messenger on a properly configured RabbitMQ cluster, operated with supervisord and monitored with Prometheus and Grafana, runs comfortably at tens of millions of messages per day on modest hardware. The investment is in the design, not the tooling.
If your team is seeing silent failures, growing dead-letter backlogs, or worker restarts that never quite keep up with the queue depth, Wolf-Tech offers architecture reviews focused specifically on async processing reliability. We have debugged the full range of failure modes described above on real production systems, often as part of broader legacy code optimization engagements. Reach us at hello@wolf-tech.io or visit wolf-tech.io for a free initial consultation.