Webhooks You Send: Designing a Reliable Outbound Webhook System for Your API
Most webhook tutorials focus on receiving: how to verify a signature, how to return 200 quickly, how to process asynchronously. That is all useful. But if your SaaS product fires events at customer endpoints, you are now on the other side of that contract. You are the one writing the delivery code. And building a reliable outbound webhook system is a genuinely different problem.
Customers will build automation on top of your events. They will write code that assumes your webhooks arrive in order, that retries eventually stop, and that a 200 response means you will not send the same event again. When your system breaks those assumptions, you break their systems. This post is about designing an outbound webhook pipeline with delivery guarantees that customers can actually rely on.
Why Outbound Webhook Delivery Is a Distributed Systems Problem
Sending an HTTP request to a customer URL sounds simple. It is not, because you do not control the other end.
The receiving server might be slow, returning 500s, behind a firewall that just changed, or simply down. Your delivery attempt might time out on your side while the customer's handler actually ran and succeeded. You might receive a 200 that was sent before the handler finished processing. Any of these scenarios can leave your system in an ambiguous state: did the customer receive the event or not?
This ambiguity is what makes outbound webhook delivery a distributed systems problem. You need to design for it explicitly rather than hoping HTTP will take care of it.
The two properties customers actually need from your system are: at-least-once delivery (every event eventually reaches them, even if it takes multiple attempts) and idempotent payloads (sending the same event twice does not corrupt their data). Getting both right requires deliberate design on your end.
The Core Architecture: Decouple Event Creation from Delivery
The first mistake most teams make is trying to deliver webhook events inline - calling the customer URL synchronously during the request that triggered the event. This creates a hard dependency between your API's response time and the latency of an external HTTP request you do not control. It also means that if delivery fails, you have no clean way to retry.
The correct pattern separates event creation from delivery:
- When something happens (a payment succeeds, a user is created, a subscription changes), write an event record to your database in the same transaction. Do not attempt delivery yet.
- A background worker picks up pending events and attempts delivery.
- The worker updates the event record based on the outcome and schedules retries if needed.
This architecture means your API responses are never blocked by customer endpoint latency. It also gives you a durable record of every event and every delivery attempt - which you will need for debugging and for customer-facing delivery logs.
The event record should include at minimum: a stable event ID, the event type, the full payload as JSON, the target URL at the time of creation, the current delivery status, the number of attempts made, and the timestamp of the next scheduled attempt.
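As a rough sketch of the pattern described above, the following uses SQLite to write a business row and its event record in one transaction, and shows the query a background worker might poll. The table names, column names, the `payment.succeeded` event type, and the `evt_` prefix are all illustrative assumptions, not a prescribed schema.

```python
import json
import sqlite3
import time
import uuid

# Hypothetical schema: one business table plus the event record fields listed above.
SCHEMA = """
CREATE TABLE payments (
    id TEXT PRIMARY KEY,
    amount_cents INTEGER NOT NULL
);
CREATE TABLE webhook_events (
    id TEXT PRIMARY KEY,                     -- stable event ID, reused on every retry
    event_type TEXT NOT NULL,
    payload TEXT NOT NULL,                   -- full payload as JSON
    target_url TEXT NOT NULL,                -- URL at the time of creation
    status TEXT NOT NULL DEFAULT 'pending',
    attempts INTEGER NOT NULL DEFAULT 0,
    next_attempt_at REAL NOT NULL            -- Unix timestamp of the next attempt
);
"""


def record_payment(conn, payment_id, amount_cents, target_url):
    """Store the payment and its webhook event together; no delivery happens here."""
    event_id = f"evt_{uuid.uuid4().hex}"
    now = time.time()
    payload = json.dumps({
        "id": event_id,
        "type": "payment.succeeded",
        "created_at": now,
        "data": {"payment_id": payment_id, "amount_cents": amount_cents},
    })
    with conn:  # commits both inserts together, or rolls both back on error
        conn.execute(
            "INSERT INTO payments (id, amount_cents) VALUES (?, ?)",
            (payment_id, amount_cents),
        )
        conn.execute(
            "INSERT INTO webhook_events "
            "(id, event_type, payload, target_url, next_attempt_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (event_id, "payment.succeeded", payload, target_url, now),
        )
    return event_id


def due_events(conn, now=None):
    """Return pending events whose next attempt time has passed."""
    now = time.time() if now is None else now
    return conn.execute(
        "SELECT id, target_url, payload, attempts FROM webhook_events "
        "WHERE status = 'pending' AND next_attempt_at <= ? "
        "ORDER BY next_attempt_at",
        (now,),
    ).fetchall()
```

A production worker would also need to claim rows so that two workers do not deliver the same event concurrently; that part is left out of this sketch.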
Retry Logic: Exponential Backoff with a Ceiling
When a delivery attempt fails, you need to retry. The question is when, and for how long.
Exponential backoff is the right starting point. The interval between attempts grows with each failure: 30 seconds, 5 minutes, 30 minutes, 2 hours, 12 hours, 24 hours. This gives transient failures - a brief outage, a deployment restart, a rate limit hit - time to resolve without hammering an already-struggling endpoint. The exact schedule matters less than having a schedule at all.
Set a ceiling on the number of attempts. Somewhere between 10 and 20 is typical. After the final attempt fails, move the event to a dead-letter state and stop retrying. Continuing indefinitely is not a feature - it creates unbounded system load and confuses customers who have already investigated and resolved the issue.
Also set a ceiling on the delivery window. Many teams limit retries to 72 hours from the original event. If something has not delivered after three days, it is usually no longer actionable - the customer's system has moved on, and delivering a stale event might cause more problems than skipping it.
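One way to express the schedule and both ceilings is a single function that returns the next attempt time, or nothing when the event should be dead-lettered. This is a sketch under assumptions: the delays are the example schedule above, `MAX_ATTEMPTS = 10` is picked from the typical range, and the function name and signature are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Example schedule from the text; the last delay repeats for later attempts.
RETRY_DELAYS = [
    timedelta(seconds=30),
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(hours=12),
    timedelta(hours=24),
]
MAX_ATTEMPTS = 10                      # assumed ceiling on attempts
DELIVERY_WINDOW = timedelta(hours=72)  # ceiling on time since event creation


def next_attempt_at(failed_attempts: int, created_at: datetime,
                    now: datetime) -> Optional[datetime]:
    """Return when to retry, or None if the event should be dead-lettered."""
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    if failed_attempts >= MAX_ATTEMPTS:
        return None
    delay = RETRY_DELAYS[min(failed_attempts - 1, len(RETRY_DELAYS) - 1)]
    retry_time = now + delay
    if retry_time - created_at > DELIVERY_WINDOW:
        return None  # the retry would land outside the delivery window
    return retry_time
```

With these particular numbers the 72-hour window is reached before the tenth attempt, so whichever ceiling is hit first ends the retries.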
Keep your HTTP timeout tight. 10 seconds is usually enough. A customer endpoint that takes longer than that is either overloaded or broken, and waiting longer does not help anyone. If the connection times out, treat it as a failure and schedule a retry.
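A single delivery attempt with a tight timeout might look like the sketch below, which uses only the Python standard library. The function name and the shape of the return value are assumptions for illustration. Note that the `timeout` argument of `urlopen` bounds each blocking socket operation rather than the request as a whole, and that redirect handling is left at the library default here.

```python
import http.client
import urllib.error
import urllib.request
from typing import Optional, Tuple

TIMEOUT_SECONDS = 10  # bounds each blocking socket operation, not the whole request


def attempt_delivery(url: str, body: bytes,
                     headers: dict) -> Tuple[bool, Optional[int], Optional[str]]:
    """POST the event once and return (delivered, status_code, error_reason)."""
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", **headers},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=TIMEOUT_SECONDS) as response:
            # urlopen raises HTTPError for error statuses such as 4xx and 5xx,
            # so reaching this line means the endpoint answered successfully.
            return True, response.status, None
    except urllib.error.HTTPError as exc:
        return False, exc.code, f"HTTP {exc.code}"
    except (OSError, http.client.HTTPException) as exc:
        # Timeouts, refused connections, DNS failures, malformed responses.
        return False, None, f"{type(exc).__name__}: {exc}"
```

Every `False` result is treated the same way by the caller: record the attempt and schedule a retry.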
Payload Signing: Let Customers Verify You Are Actually You
When your webhook arrives at a customer endpoint, they have no way to know it came from you unless you sign it. Anyone who knows a customer's webhook URL could post a fake payload.
The standard approach is HMAC-SHA256. For each delivery attempt, compute a signature using a secret key that you and the customer share, and include it in a request header (X-Webhook-Signature or similar). Also include a timestamp (X-Webhook-Timestamp) so customers can reject replays of old requests. Compute the signature over the timestamp together with the raw request body, not the body alone: a timestamp that is not covered by the signature can be rewritten by whoever replays the request.
The signing key should be per-webhook-endpoint, not per-customer. This lets customers rotate keys without affecting all their endpoints, and lets you scope what a compromised key can do. Give customers a self-service key rotation flow in your dashboard.
Document the exact signing algorithm with working code examples in multiple languages. Customers will copy and paste this code directly. If your documentation is wrong or ambiguous, you will get support tickets from customers with broken integrations that are actually your documentation's fault.
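The kind of example worth putting in your documentation could look like the following sketch. It signs the timestamp together with the raw body so the timestamp cannot be altered on replay. The `timestamp.body` layout, the hex encoding, the 300-second tolerance, and the function names are assumptions for illustration; the header names are the ones mentioned above.

```python
import hashlib
import hmac
import time


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Return the hex HMAC-SHA256 over '<timestamp>.<raw body>'."""
    signed_content = str(timestamp).encode("ascii") + b"." + body
    return hmac.new(secret, signed_content, hashlib.sha256).hexdigest()


def signature_headers(secret: bytes, body: bytes, now=None) -> dict:
    """Sender side: build the headers for one delivery attempt."""
    timestamp = int(time.time() if now is None else now)
    return {
        "X-Webhook-Timestamp": str(timestamp),
        "X-Webhook-Signature": sign(secret, timestamp, body),
    }


def verify(secret: bytes, headers: dict, body: bytes,
           tolerance_seconds: int = 300, now=None) -> bool:
    """Receiver side: check freshness first, then the signature."""
    try:
        timestamp = int(headers["X-Webhook-Timestamp"])
        received = headers["X-Webhook-Signature"]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    if abs(current - timestamp) > tolerance_seconds:
        return False  # too old (or too far in the future): treat as a replay
    expected = sign(secret, timestamp, body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), received.encode())
```

The receiver must verify against the raw bytes it received, before any JSON parsing or re-serialization, since re-encoding can change the bytes and invalidate the signature.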
Event Ordering: Be Honest About Your Guarantees
Customers will assume your events arrive in order unless you tell them otherwise. They are usually wrong to assume this, and you need to decide what to actually guarantee.
True ordering guarantees are expensive. They require per-customer delivery queues, serial processing within each queue, and careful handling of failures that does not let a stuck event block everything behind it. This is achievable, but it is a real system with real operational overhead.
A simpler and often better approach: deliver events as quickly as possible (which means mostly in order), include a sequence number or created_at timestamp in every payload, and document that customers should use this field to handle out-of-order delivery. Tell them explicitly that retries may cause earlier events to arrive after later ones.
If you do offer ordering guarantees, scope them carefully. Ordering within a specific resource (all events for order ID 12345 arrive in order) is more achievable than global ordering across all events. Be specific in your documentation about what your guarantee covers.
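To make the sequence-number advice concrete for customers, documentation can show a receiver that keeps only the newest state per resource. The sketch below assumes each event carries hypothetical `resource_id`, `sequence`, and `data` fields, and that `data` is a full snapshot of the resource rather than a delta; none of these names come from a particular API.

```python
from typing import Dict, Optional


class LatestStateStore:
    """Keeps the newest known state per resource and ignores stale events."""

    def __init__(self) -> None:
        self._sequence: Dict[str, int] = {}
        self._state: Dict[str, dict] = {}

    def apply(self, event: dict) -> bool:
        """Apply the event if it is newer than what is stored; return whether it was applied."""
        resource_id = event["resource_id"]
        sequence = event["sequence"]
        if sequence <= self._sequence.get(resource_id, -1):
            return False  # an older or duplicate event arrived late, e.g. after a retry
        self._sequence[resource_id] = sequence
        self._state[resource_id] = event["data"]
        return True

    def get(self, resource_id: str) -> Optional[dict]:
        return self._state.get(resource_id)
```

This only works cleanly when payloads are snapshots. If events describe deltas, the receiver has to buffer and reorder instead of discarding, which is one reason to document which kind of payload you send.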
Idempotency: Make the Event ID Do Real Work
Every event you send should have a stable, unique ID that you include in the payload and in a dedicated request header (X-Event-ID or Webhook-Event-ID).
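Customers should use this ID to deduplicate events on their end. They will sometimes receive the same event more than once: at-least-once delivery with retries makes duplicates unavoidable, for example when their handler succeeded but your request timed out before the 200 arrived. Without a stable ID they can use as an idempotency key, they cannot safely process retries.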
Do not generate a new ID for each delivery attempt. The ID identifies the event, not the attempt. The attempt number is separate metadata. If you send event evt_abc123 three times before the customer's endpoint comes back online, all three requests should carry evt_abc123, and the customer should process the first successful one and ignore the rest.
Make the event IDs UUIDs or similarly collision-resistant identifiers. Sequential integers leak information about your event volume and create contention in high-throughput systems.
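On the receiving side, deduplication by event ID can be as small as the sketch below, which is also a reasonable shape for a test consumer. It records the ID and applies the side effect in one SQLite transaction, so a crash between the two cannot mark an event as processed without its effect. The `processed_events` table and the `handle_once` helper are hypothetical names.

```python
import sqlite3
from typing import Callable

# Hypothetical table that remembers which event IDs were already handled.
DEDUP_SCHEMA = "CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)"


def handle_once(conn: sqlite3.Connection, event_id: str,
                apply_side_effect: Callable[[sqlite3.Connection], None]) -> bool:
    """Run the side effect only the first time this event ID is seen."""
    with conn:  # one transaction: the dedup marker and the side effect commit together
        cursor = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
        if cursor.rowcount == 0:
            return False  # already processed: acknowledge and do nothing
        apply_side_effect(conn)
    return True
```

If the side effect raises, the transaction rolls back and the ID is not recorded, so the next retry of the same event gets processed normally.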
Delivery Logs: Give Customers Visibility
When a customer's integration breaks, the first thing they will do is log into your dashboard and look for information about what happened. If you do not have delivery logs, they will open a support ticket. If you do have delivery logs, they might be able to self-serve.
Keep logs of every delivery attempt: the event ID, the attempt number, the timestamp, the response status code, the response body (truncated if needed), and the error reason if the request failed before getting a response. Expose these in your dashboard and, ideally, via your API.
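A delivery log entry with the fields listed above could be modeled as in the sketch below. The class name, the 2048-character truncation limit, and the truncation marker are assumptions chosen for illustration.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional

MAX_BODY_CHARS = 2048  # assumed truncation limit for stored response bodies


@dataclass(frozen=True)
class DeliveryAttempt:
    event_id: str
    attempt_number: int
    attempted_at: str                # ISO 8601 timestamp in UTC
    status_code: Optional[int]       # None if no response was received
    response_body: Optional[str]     # truncated to MAX_BODY_CHARS
    error_reason: Optional[str]      # set when the request failed before a response


def build_attempt(event_id: str, attempt_number: int,
                  status_code: Optional[int] = None,
                  response_body: Optional[str] = None,
                  error_reason: Optional[str] = None,
                  now: Optional[datetime] = None) -> DeliveryAttempt:
    """Create one log entry, truncating an oversized response body."""
    if response_body is not None and len(response_body) > MAX_BODY_CHARS:
        response_body = response_body[:MAX_BODY_CHARS] + "...[truncated]"
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return DeliveryAttempt(event_id, attempt_number, timestamp,
                           status_code, response_body, error_reason)


# asdict() gives a plain dict that is easy to store or return from an API.
entry = build_attempt("evt_abc123", 2, error_reason="timed out")
print(asdict(entry))
```

Keeping the event ID and attempt number as separate fields makes it easy to show a customer every attempt for one event in order.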
This data is also useful to your own team. When customers report missing events, delivery logs are the first place to look. Without them, debugging webhook delivery issues is mostly guesswork.
Retention policy matters. Keeping delivery logs for 30 days is usually sufficient. Longer than that and the storage costs start to matter; shorter and you cannot investigate older reports.
Testing Your Own Webhook System
The hardest part of testing outbound webhook delivery is simulating a bad receiver. You need to be able to test slow responses, timeouts, 500 errors, and the case where the receiver returns 200 but crashes before processing.
Services like webhook.site and Svix Playground are useful for manual inspection, but you want automated tests that can simulate failure modes. A simple test receiver that you can configure to return arbitrary status codes and delays will cover most scenarios.
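Such a configurable receiver fits in a few dozen lines of standard-library Python. The sketch below is one possible shape; the `FlakyReceiver` class and its attributes are hypothetical names. It listens on a random local port, records what it receives, and lets a test change the status code and delay between requests.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class FlakyReceiver:
    """Local webhook receiver whose status code and delay can be changed by tests."""

    def __init__(self) -> None:
        self.status = 200          # status code returned for every POST
        self.delay_seconds = 0.0   # sleep before responding, to simulate slowness
        self.received = []         # (headers, body) for every request seen
        receiver = self

        class Handler(BaseHTTPRequestHandler):
            def do_POST(self):
                length = int(self.headers.get("Content-Length", 0))
                body = self.rfile.read(length)
                receiver.received.append((dict(self.headers), body))
                time.sleep(receiver.delay_seconds)
                self.send_response(receiver.status)
                self.send_header("Content-Length", "0")
                self.end_headers()

            def log_message(self, *args):
                pass  # keep test output quiet

        self._server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
        self.port = self._server.server_port
        self.url = f"http://127.0.0.1:{self.port}/"
        self._thread = threading.Thread(target=self._server.serve_forever, daemon=True)

    def start(self) -> "FlakyReceiver":
        self._thread.start()
        return self

    def stop(self) -> None:
        self._server.shutdown()
        self._server.server_close()
```

Setting `delay_seconds` above your delivery timeout simulates the ambiguous case: the receiver has recorded the event in `received`, but the sender sees a timeout and retries.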
Test your retry backoff logic explicitly. Verify that after a configurable number of failures, events reach dead-letter state rather than retrying indefinitely. Test that signing is applied correctly by verifying the signature with the same algorithm you document for customers. Test that duplicate delivery is safe - that sending the same event twice does not create inconsistent state in a test consumer.
Operational Concerns: Monitoring and Customer Communication
Once your webhook system is in production, you need to know when it is struggling. Alert on dead-letter queue depth growing faster than usual. Alert when average delivery latency for successful events starts creeping up. Alert if a specific customer endpoint has been failing continuously for more than an hour - they may not know their integration is broken.
Decide in advance how you will handle customers with consistently failing endpoints. Many teams auto-disable webhook delivery after a sustained run of failures (say, 5 days with no successful delivery) and send the customer an email. This prevents your retry workers from spending resources on endpoints that are clearly broken, and it prompts customers to investigate rather than silently losing events.
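The alerting and auto-disable thresholds can be captured in one small decision function. The sketch below uses the one-hour and five-day figures from the text; the function name, the `failing_since` parameter (the time of the first failure since the last successful delivery), and the returned action strings are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

ALERT_AFTER = timedelta(hours=1)   # continuous failure before alerting
DISABLE_AFTER = timedelta(days=5)  # continuous failure before auto-disabling


def endpoint_action(failing_since: Optional[datetime], now: datetime) -> str:
    """Decide what to do with an endpoint based on how long it has been failing.

    failing_since is None when the most recent delivery succeeded.
    """
    if failing_since is None:
        return "ok"
    failing_for = now - failing_since
    if failing_for >= DISABLE_AFTER:
        return "disable_and_email"
    if failing_for > ALERT_AFTER:
        return "alert"
    return "ok"
```

Resetting `failing_since` to `None` on any successful delivery keeps a single lucky response from being required for days: the clock only runs while every attempt fails.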
Communicate major incidents proactively. If your webhook delivery system had an outage and events are delayed or lost, tell customers before they tell you. A brief status update - what happened, which events were affected, what you are doing about it - is significantly better for trust than customers discovering the problem themselves.
When to Use Svix, Hookdeck, or a Managed Service
Building this infrastructure from scratch is not always the right call. Services like Svix and Hookdeck provide outbound webhook infrastructure as a service: managed delivery, retry logic, signing, delivery logs, customer-facing dashboards, and SDKs. If webhooks are not your core product and you do not have strong reasons to build the plumbing yourself, using a managed service is often the faster and more reliable path.
The tradeoff is cost, vendor dependency, and the limits of what the service supports. For most early-stage SaaS products, the managed service wins. For a product where webhook delivery is itself a differentiator, or where you have unusual requirements around ordering or payload size, building your own gives you more control.
Getting the Fundamentals Right
A reliable outbound webhook system comes down to a handful of decisions made deliberately early:
- Decouple event creation from delivery using a persistent queue.
- Use exponential backoff with a ceiling on both attempts and time.
- Sign every payload and document the signing algorithm with working examples.
- Be explicit about your ordering guarantees.
- Give every event a stable ID that customers can use for deduplication.
- Keep delivery logs and expose them to customers.
None of these are complex in isolation. The challenge is doing all of them consistently, and keeping them working as your event volume scales and your customer base grows.
If you are building an outbound webhook system and want a technical review of the design before you commit to an implementation, reach out at hello@wolf-tech.io. We work with SaaS teams on API architecture and custom software development where the details of distributed systems - delivery guarantees, retry semantics, failure modes - are the product. You can find more about how we approach this kind of work at wolf-tech.io.

