Building Reliable Webhooks: Lessons from Production Systems
You add a webhook endpoint. It works in development. You ship it. Three weeks later, a customer reports that their integration missed 40 events over the weekend, and you have no idea which ones or why. Welcome to the reality of webhook reliability in production.
Webhooks are deceptively simple: an HTTP POST from one system to another when something happens. But the simplicity is a trap. Unlike message queues or event streams, webhooks push data over the open internet, where network failures, timeouts, server restarts, and misconfigured firewalls are the norm rather than the exception. The sender does not know if the receiver processed the event, crashed mid-request, or silently dropped the payload. The receiver does not know if a missing event means nothing happened or the delivery failed.
Most teams discover these problems only after they have caused real business impact—missed orders, duplicate charges, broken sync workflows. The patterns for building webhooks that actually work in production are well-established, but they require deliberate design decisions that go far beyond "send an HTTP POST."
Why Webhooks Fail Silently (and Why You Don't Notice)
The core problem with webhooks is that failure is the default state of the internet. HTTP requests fail for dozens of reasons: DNS resolution timeouts, TLS handshake errors, connection resets, 502 responses from overloaded load balancers, request timeouts, and receiving endpoints that return 200 but silently discard the payload because of an internal error.
Most webhook implementations handle none of these cases. They fire an HTTP request and move on. If it fails, the event is gone. If you are lucky, there is a log entry somewhere. If you are not, the failure is invisible until a customer complains.
The second failure mode is more subtle: duplicate delivery. A network timeout does not mean the request was not received. The receiver may have processed the event, started sending its 200 response, and the connection dropped before the sender received the acknowledgment. The sender retries, and now the receiver processes the same event twice. In a payment system, that means a double charge. In an inventory system, that means phantom stock adjustments.
The third failure mode is ordering violations. Webhook events arrive in whatever order the network and the sender's retry logic produce. An order.updated event can arrive before order.created if the creation delivery was delayed by a retry. Your receiver needs to handle this, and most do not.
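One way to tolerate out-of-order delivery is a last-write-wins guard. The sketch below assumes each event carries a per-resource, monotonically increasing sequence number and simply ignores anything older than what has already been applied:

```typescript
// Sketch: last-write-wins guard using a per-resource sequence number.
// Assumes the sender includes a monotonically increasing `sequence` field.
interface OrderEvent {
  orderId: string;
  sequence: number;
  status: string;
}

// In-memory map for illustration; in production the highest applied
// sequence lives alongside the resource row in the database.
const lastSeen = new Map<string, number>();

function applyIfNewer(event: OrderEvent): boolean {
  const current = lastSeen.get(event.orderId) ?? -1;
  if (event.sequence <= current) {
    return false; // stale or duplicate: ignore
  }
  lastSeen.set(event.orderId, event.sequence);
  // ... apply the update to your own data model here ...
  return true;
}
```

Storing the sequence next to the resource lets the comparison and the update happen in one transaction, which closes the race between two concurrent deliveries for the same resource.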
Idempotency: The Non-Negotiable Foundation
Every reliable webhook system starts with idempotency—the guarantee that processing the same event multiple times produces the same result as processing it once. This is not optional. Network conditions guarantee that duplicates will arrive eventually.
Implementing Idempotency on the Receiver Side
The pattern is straightforward: every webhook event carries a unique identifier, and the receiver tracks which identifiers it has already processed.
In a PHP/Symfony application, this looks like a middleware that checks an idempotency store before handing off to the event handler:
class WebhookIdempotencyMiddleware
{
    public function __construct(
        private readonly Connection $db,
        private readonly LoggerInterface $logger,
    ) {}

    public function process(WebhookEvent $event): bool
    {
        $eventId = $event->getId();

        // Atomic check-and-insert using database constraint
        try {
            $this->db->insert('webhook_processed_events', [
                'event_id' => $eventId,
                'source' => $event->getSource(),
                'processed_at' => new \DateTimeImmutable(),
            ]);
        } catch (UniqueConstraintViolationException) {
            $this->logger->info('Duplicate webhook skipped', [
                'event_id' => $eventId,
            ]);
            return true; // Already processed — return success
        }

        return false; // Not a duplicate — proceed with processing
    }
}
The critical detail is the atomic check-and-insert. Do not query first, then insert. Under concurrent delivery (which happens during retry storms), a SELECT followed by an INSERT creates a race condition where two threads both see "not processed" and both proceed. A unique constraint on the event_id column eliminates this race entirely.
Idempotency on the Sender Side
If you are building the sending side of a webhook system, include a unique event_id (a UUID v4/v7 or a ULID, as in the example below) in every payload. This gives receivers a stable key for deduplication. Also include an event_type and a monotonically increasing sequence_number or a created_at timestamp so receivers can detect ordering issues.
{
  "event_id": "evt_01HZ3K5M7N8P9Q0R1S2T3U4V5W",
  "event_type": "order.completed",
  "created_at": "2026-04-13T12:34:56Z",
  "data": {
    "order_id": "ord_98765",
    "total": 149.99,
    "currency": "EUR"
  }
}
Retry Logic That Does Not Cause More Problems Than It Solves
Retries are essential—without them, any transient network failure permanently loses the event. But naive retry logic (retry immediately, retry forever, retry at fixed intervals) creates its own category of production incidents.
Exponential Backoff with Jitter
The standard pattern is exponential backoff with randomized jitter. The backoff gives the receiver time to recover from whatever caused the failure. The jitter prevents the thundering herd problem, where hundreds of failed webhooks all retry at exactly the same second and overwhelm the receiver again.
A practical retry schedule looks like this: 30 seconds, 2 minutes, 10 minutes, 1 hour, 4 hours, 12 hours, 24 hours. That gives you seven attempts over roughly 41 hours. After the final attempt, the event moves to a dead-letter queue (more on this below).
// The retry schedule above (30s … 24h), plus up to 20% random jitter.
// A pure exponential formula (base * 2^(n-1), capped) works too; the
// hand-tuned table spaces early retries tighter and late ones wider.
const RETRY_SCHEDULE_MS = [
  30_000,      // 30 seconds
  120_000,     // 2 minutes
  600_000,     // 10 minutes
  3_600_000,   // 1 hour
  14_400_000,  // 4 hours
  43_200_000,  // 12 hours
  86_400_000,  // 24 hours
];

function calculateRetryDelay(attemptNumber: number): number {
  const index = Math.min(attemptNumber - 1, RETRY_SCHEDULE_MS.length - 1);
  const baseDelay = RETRY_SCHEDULE_MS[index];
  // Add up to 20% jitter to prevent the thundering herd
  const jitter = baseDelay * 0.2 * Math.random();
  return baseDelay + jitter;
}
Circuit Breaker for Persistently Failing Endpoints
If a receiver endpoint has been returning errors for hours, continuing to retry every webhook to that endpoint wastes resources and can mask the real problem. Implement a circuit breaker: after N consecutive failures to the same endpoint URL, pause all deliveries to that endpoint and alert the integration owner.
This is particularly important for multi-tenant SaaS platforms where one customer's broken endpoint should not consume your entire webhook delivery capacity.
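A minimal per-endpoint breaker can be sketched as follows. The threshold and cooldown values are illustrative, and a production version would persist its state and emit an alert to the integration owner when a circuit opens:

```typescript
// Sketch of a per-endpoint circuit breaker. All names and defaults
// (failureThreshold, cooldownMs) are illustrative, not from a library.
class EndpointCircuitBreaker {
  private failures = new Map<string, number>();
  private openedAt = new Map<string, number>();

  constructor(
    private readonly failureThreshold = 10,
    private readonly cooldownMs = 15 * 60_000, // 15 minutes
  ) {}

  canDeliver(endpoint: string, now = Date.now()): boolean {
    const opened = this.openedAt.get(endpoint);
    if (opened === undefined) return true;
    if (now - opened >= this.cooldownMs) {
      // Cooldown elapsed: close the circuit and count failures afresh
      this.openedAt.delete(endpoint);
      this.failures.set(endpoint, 0);
      return true;
    }
    return false; // circuit open: skip delivery, keep the event queued
  }

  recordSuccess(endpoint: string): void {
    this.failures.set(endpoint, 0);
  }

  recordFailure(endpoint: string, now = Date.now()): void {
    const count = (this.failures.get(endpoint) ?? 0) + 1;
    this.failures.set(endpoint, count);
    if (count >= this.failureThreshold) {
      this.openedAt.set(endpoint, now); // open: pause deliveries, alert owner
    }
  }
}
```

Events skipped while the circuit is open should stay queued, not be discarded, so they can be delivered once the endpoint recovers.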
Signature Verification: Proving the Payload Is Authentic
Without payload verification, a webhook endpoint is an unauthenticated HTTP endpoint that accepts arbitrary JSON from anyone on the internet. This is a security vulnerability, not a feature.
The standard approach is HMAC-SHA256 signing. The sender computes a hash of the raw request body using a shared secret, includes the hash in a header, and the receiver recomputes the hash and compares.
class WebhookSignatureVerifier
{
    public function verify(
        string $payload,
        string $signatureHeader,
        string $secret,
    ): bool {
        $expectedSignature = hash_hmac('sha256', $payload, $secret);

        // Constant-time comparison prevents timing attacks
        return hash_equals($expectedSignature, $signatureHeader);
    }
}
Three details matter here. First, always use hash_equals() (or its equivalent in your language) for the comparison. A regular string comparison leaks timing information that can be used to forge signatures byte by byte. Second, compute the HMAC on the raw request body, not on a parsed-and-reserialized version. JSON serialization is not deterministic—key ordering, whitespace, and Unicode escaping can differ between serializer implementations, and any difference breaks the signature. Third, include a timestamp in the signed payload and reject events older than a few minutes. This prevents replay attacks where a captured valid webhook is resent later.
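The PHP verifier above covers the first two points. The third, timestamp checking, can be sketched like this in TypeScript; the `t=...,v1=...` header format is an assumption modeled on common provider conventions, not a standard:

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Sketch of timestamp-bound verification. The timestamp is part of the
// signed string, so an attacker cannot change it without breaking the
// signature; events outside the tolerance window are rejected as replays.
const TOLERANCE_SECONDS = 300;

function verifySignedWebhook(
  rawBody: string,
  header: string, // e.g. "t=1712997296,v1=<hex hmac>"
  secret: string,
  nowSeconds = Math.floor(Date.now() / 1000),
): boolean {
  const parts = new Map(
    header.split(',').map((p) => p.split('=', 2) as [string, string]),
  );
  const timestamp = Number(parts.get('t'));
  const signature = parts.get('v1') ?? '';
  if (!Number.isFinite(timestamp)) return false;
  if (Math.abs(nowSeconds - timestamp) > TOLERANCE_SECONDS) return false; // replay window

  const expected = createHmac('sha256', secret)
    .update(`${timestamp}.${rawBody}`)
    .digest('hex');
  const a = Buffer.from(expected, 'hex');
  const b = Buffer.from(signature, 'hex');
  // timingSafeEqual throws on length mismatch, so guard first
  return a.length === b.length && timingSafeEqual(a, b);
}
```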
Rotating Secrets Without Downtime
At some point, you need to rotate the signing secret—after a team member leaves, after a suspected compromise, or simply as routine hygiene. The pattern is to support two active secrets simultaneously during the rotation window: sign outgoing webhooks with the new secret, but accept signatures computed with either the old or new secret on the receiving side. After a grace period (24–48 hours), deactivate the old secret.
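The receiving side of this rotation reduces to checking the signature against every currently active secret. A minimal sketch, reusing a single-secret HMAC check:

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Single-secret check (hex signature over the raw body).
function verifyHmac(payload: string, signatureHex: string, secret: string): boolean {
  const expected = createHmac('sha256', secret).update(payload).digest();
  const given = Buffer.from(signatureHex, 'hex');
  return expected.length === given.length && timingSafeEqual(expected, given);
}

// During the grace window activeSecrets is [newSecret, oldSecret];
// after rotation completes it shrinks back to one entry.
function verifyWithRotation(
  payload: string,
  signatureHex: string,
  activeSecrets: string[],
): boolean {
  return activeSecrets.some((secret) => verifyHmac(payload, signatureHex, secret));
}
```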
Dead-Letter Queues: Where Failed Events Go to Be Investigated
After exhausting all retry attempts, the event needs to go somewhere visible and recoverable. A dead-letter queue (DLQ) stores failed webhook deliveries with full context: the original payload, the target URL, every attempt timestamp, and every error response.
This serves two purposes. First, it prevents permanent data loss. An operations team or automated process can inspect the DLQ and replay events after the underlying issue is fixed. Second, it provides diagnostic information. A DLQ full of 403 responses tells you the receiver rotated their credentials. A DLQ full of timeouts tells you the receiver's infrastructure is undersized.
In a Symfony application using Messenger, you can configure a dead-letter transport directly:
# config/packages/messenger.yaml
framework:
    messenger:
        transports:
            webhook_delivery:
                dsn: '%env(RABBITMQ_DSN)%'
                retry_strategy:
                    max_retries: 7
                    delay: 30000
                    multiplier: 3
                    max_delay: 86400000

            webhook_dead_letter:
                dsn: '%env(RABBITMQ_DSN)%'
                options:
                    queues:
                        webhook_dlq:
                            binding_keys: ['webhook.failed']

        failure_transport: webhook_dead_letter
Build a simple admin interface or CLI command that lists DLQ entries, lets operators inspect the payload and error history, and provides a one-click replay. This turns a production incident from "we lost 200 events and need to reconstruct them manually" into "we replayed 200 events after fixing the endpoint."
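The replay operation itself is small. The sketch below assumes a hypothetical DLQ store interface and a delivery function; it only shows the shape of the command:

```typescript
// Hypothetical DLQ entry: original payload plus full attempt history.
interface DeadLetter {
  eventId: string;
  targetUrl: string;
  payload: string;
  attempts: { at: string; status: number | null; error?: string }[];
}

// Replay every entry; successes are removed from the DLQ, failures
// stay in place with their history for further investigation.
async function replayDeadLetters(
  dlq: { list(): Promise<DeadLetter[]>; remove(eventId: string): Promise<void> },
  deliver: (letter: DeadLetter) => Promise<boolean>,
): Promise<{ replayed: number; stillFailing: number }> {
  let replayed = 0;
  let stillFailing = 0;
  for (const letter of await dlq.list()) {
    if (await deliver(letter)) {
      await dlq.remove(letter.eventId);
      replayed++;
    } else {
      stillFailing++;
    }
  }
  return { replayed, stillFailing };
}
```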
Payload Validation: Trust Nothing from the Network
Even with signature verification, validate the payload structure before processing. Webhook payloads evolve over time—senders add fields, change types, deprecate properties. A receiver that blindly trusts the payload shape will break when the sender ships their next API version.
Validate aggressively at the boundary:
import { z } from 'zod';

const OrderCompletedPayload = z.object({
  event_id: z.string().min(1),
  event_type: z.literal('order.completed'),
  created_at: z.string().datetime(),
  data: z.object({
    order_id: z.string().min(1),
    total: z.number().positive(),
    currency: z.string().length(3),
  }),
});

export function handleWebhook(rawBody: string): void {
  const parsed = JSON.parse(rawBody);
  const result = OrderCompletedPayload.safeParse(parsed);

  if (!result.success) {
    // Log the validation error with full context, return 400
    logger.warn('Webhook payload validation failed', {
      errors: result.error.issues,
      event_type: parsed?.event_type,
    });
    throw new WebhookValidationError(result.error);
  }

  // Process the validated, type-safe payload
  processOrderCompleted(result.data);
}
Return a 400 status code for validation failures, not a 200. A 200 tells the sender "I processed this successfully," which suppresses retries. A 400 tells the sender "your payload is malformed," which is useful diagnostic information. Reserve 500 for genuine server errors that warrant a retry.
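That policy is worth centralizing in one mapping. A sketch, using illustrative error classes (WebhookValidationError mirrors the one thrown in the validation example above):

```typescript
// Illustrative error types; only the mapping below is the point.
class WebhookValidationError extends Error {}
class DuplicateEventError extends Error {}

function statusForError(err: unknown): number {
  if (err instanceof DuplicateEventError) return 200;  // already processed: suppress retries
  if (err instanceof WebhookValidationError) return 400; // malformed payload: retrying won't help
  return 500; // genuine server error: the sender should retry
}
```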
Observability: Knowing What Your Webhooks Are Doing
Webhook systems need dedicated monitoring because they operate across network boundaries where standard application monitoring has blind spots. At minimum, track these metrics:
Delivery success rate — the percentage of webhook deliveries that succeed on the first attempt, broken down by endpoint and event type. A healthy system runs above 95% first-attempt success. Below 90% signals infrastructure or integration problems.
Delivery latency (p50, p95, p99) — how long deliveries take from event creation to successful acknowledgment. Latency spikes often precede delivery failures as endpoints become overloaded.
Retry rate — how many events require at least one retry. A rising retry rate is an early warning signal, even if eventual delivery success remains high.
DLQ depth — how many events are sitting in the dead-letter queue. This should be zero or near-zero during normal operations. Any non-trivial depth warrants investigation.
Endpoint health — per-endpoint success rates and response times. This lets you proactively notify integration partners when their endpoint is degrading, before events start piling up in the DLQ.
Structured logging is the minimum viable approach. Every delivery attempt should produce a log entry with the event ID, endpoint URL, attempt number, HTTP status code, and response time. This makes debugging specific delivery failures a matter of filtering logs rather than guessing.
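A delivery-attempt log entry might look like this; the field names are suggestions, not a standard schema:

```typescript
// One structured log entry per delivery attempt.
interface DeliveryAttemptLog {
  event_id: string;
  endpoint_url: string;
  attempt: number;
  http_status: number | null; // null when the request never completed
  response_time_ms: number;
}

const entry: DeliveryAttemptLog = {
  event_id: 'evt_01HZ3K5M7N8P9Q0R1S2T3U4V5W',
  endpoint_url: 'https://example.com/webhooks',
  attempt: 3,
  http_status: 502,
  response_time_ms: 1240,
};
console.log(JSON.stringify(entry));
```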
Designing for the Receiver: Accept Fast, Process Later
A common mistake on the receiving side is doing all the work inside the HTTP request handler. If processing takes more than a few seconds, the sender's HTTP client times out and retries, creating duplicate processing and unnecessary load.
The correct pattern is accept fast, process asynchronously. The webhook endpoint validates the signature, checks idempotency, persists the raw event to a durable store, and returns 200 immediately. A background worker picks up the persisted event and handles the actual business logic at its own pace.
This decouples your processing speed from the sender's timeout expectations and gives you a natural replay mechanism—if processing fails, the raw event is already stored and can be retried locally without depending on the sender to redeliver.
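The accept-fast handler reduces to a few steps. The store and queue interfaces below are assumptions standing in for whatever durable storage and job queue you use:

```typescript
// Durable event store; save() returns false for an already-seen event_id.
interface RawEventStore {
  save(eventId: string, rawBody: string): Promise<boolean>;
}
interface JobQueue {
  enqueue(eventId: string): Promise<void>;
}

// Verify, dedupe, persist, acknowledge. Business logic runs in a worker.
async function handleWebhookRequest(
  rawBody: string,
  signatureValid: boolean,
  eventId: string,
  store: RawEventStore,
  queue: JobQueue,
): Promise<number> {
  if (!signatureValid) return 401;                    // reject before touching storage
  const isNew = await store.save(eventId, rawBody);   // durable write + dedupe
  if (isNew) await queue.enqueue(eventId);            // worker processes later
  return 200;                                         // acknowledge immediately either way
}
```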
Putting It All Together
Reliable webhook systems are not complicated, but they are deliberate. The patterns—idempotency, exponential backoff with jitter, HMAC signature verification, dead-letter queues, payload validation, and async processing—are well-established and individually straightforward. The challenge is implementing all of them together, because skipping any one creates a failure mode that will eventually surface in production.
If your current webhook implementation is a simple HTTP POST with no retries, no idempotency, and no monitoring, you are not running a webhook system—you are running a best-effort notification that silently drops events under real-world conditions. The gap between "works in development" and "reliable in production" is exactly these patterns.
At Wolf-Tech, we have built and audited webhook systems handling millions of daily events across payment platforms, SaaS integrations, and real-time data pipelines. If your integration layer is unreliable or you are building a new webhook system and want to get the architecture right from the start, we can help. Reach out at hello@wolf-tech.io or visit wolf-tech.io for a free consultation.

