Debug Webhook Timeouts and 500 Errors: A Systematic Guide

Published June 10, 2026 · 12 min read

Your provider's dashboard says “webhook delivery failed.” Maybe it's a timeout, maybe a 500, maybe it's intermittent and only happens at 2 AM. Meanwhile Stripe is quietly counting failures against your endpoint, and if it keeps failing long enough, deliveries stop entirely.

This guide is a systematic walkthrough: what providers actually do when your endpoint misbehaves, a triage tree to find the real cause, the one architectural change that fixes most timeout problems, and how to use a webhook debugger to see both the request and the response side of every delivery.

What Providers Do When Your Endpoint Fails

Before debugging anything, it helps to know what game you're playing. Webhook providers generally apply some combination of three policies to your endpoint: a timeout budget, a retry schedule, and (the one that hurts) an auto-disable rule for endpoints that keep failing.

Timeout budgets

Most providers give you somewhere between 5 and 30 seconds to return a response, and some are much stricter. Slack's Events API expects an acknowledgment within about 3 seconds. GitHub documents a 10-second limit for webhook deliveries. Stripe doesn't prominently publish a single number, but timeouts in the ballpark of 10–20 seconds are commonly observed. Twilio defaults to around 15 seconds. The exact numbers shift over time, so treat your provider's current docs as the source of truth — but the design rule is universal: if your handler can't respond in a couple of seconds, you have a problem waiting to happen.

Retries with backoff

A timeout or a 5xx response usually triggers a retry schedule with increasing delays. Stripe retries failed live deliveries with exponential backoff for up to about three days. Shopify retries over a multi-hour window. GitHub, on the other hand, does not automatically retry failed deliveries — you have to redeliver manually or via the API. Retries are a safety net, but they also mean failures arrive in waves: one broken deploy can produce a burst of duplicate-looking traffic hours later. Our webhook retries guide covers how to design for this.

Auto-disabling endpoints

This is the failure mode that turns a bug into an incident. Stripe will notify you by email when an endpoint keeps failing and can disable it after multiple days of sustained failures. GitHub deactivates hooks that fail repeatedly over an extended period. Shopify removes webhook subscriptions that keep failing. Once that happens, events stop being delivered at all — no retries, nothing to replay from your logs, just a gap in your data that you discover later. If your webhooks have been failing for days, fixing the handler is only half the job; check whether the endpoint is still enabled on the provider side.

Here's a rough reference for four common providers. These policies change, so verify against current docs before relying on a specific number — the point is the shape of the behavior, not the exact values.

Provider | Timeout (approx.) | Retries | Sustained failures
Stripe | ~10–20s observed | Exponential backoff, up to ~3 days (live mode) | Email warnings, then endpoint can be disabled
GitHub | 10s (documented) | No automatic retries; manual/API redelivery | Hooks failing repeatedly can be deactivated
Shopify | Single-digit seconds | Multiple retries over a multi-hour window | Subscription removed after continued failures
Slack (Events API) | ~3s | A few retries in quick succession | Event deliveries can be turned off for the app

The takeaway: a webhook endpoint that is “mostly fine” is on a clock. Intermittent 500s and slow responses accumulate into disabled endpoints. The sooner you can answer “why exactly is this failing?”, the better.

The Triage Tree: Three Questions, In Order

Almost every “my webhooks are failing” situation resolves into one of three categories. Work through them in order — each one is cheaper to check than the next.

Question 1: Is the request even arriving?

Don't assume. DNS changes, expired TLS certificates, load balancer misconfigurations, firewall rules, and WAF/bot-protection products all silently eat webhook traffic before it reaches your application. If your handler logs show nothing while the provider reports failures, the request is dying in transit.

Quick checks: does the provider's delivery log show a connection error, a TLS error, or an HTTP status? A connection or TLS error means infrastructure. An HTTP status means something answered — even if it wasn't your code (a 502/503 from a load balancer or reverse proxy counts).
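As a rough sketch, the quick check above can be written down as a small classifier. The `attempt` shape (`error`, `status`) is hypothetical; each provider's delivery log exposes this information in its own format, so the mapping would need adapting.

```javascript
// Hypothetical input shape: { error?: "connection" | "tls", status?: number }
function classifyDelivery(attempt) {
  // Nothing answered at all: look at DNS, firewalls, certificates.
  if (attempt.error === "connection" || attempt.error === "tls") {
    return "infrastructure";
  }
  if (typeof attempt.status === "number") {
    // 502/503 often come from a proxy or load balancer, not the handler.
    if (attempt.status === 502 || attempt.status === 503) {
      return "answered-by-proxy";
    }
    return "answered";
  }
  return "unknown";
}
```

Only the "answered" bucket is worth chasing in application logs; the other two point at the layers in front of your code.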

The fastest way to isolate this layer is to take your application out of the equation entirely. Create a free webhook tester endpoint, point the provider at it, and trigger a test event. If the request shows up there with a 200, the provider side is healthy and the problem is in your infrastructure or your handler.

Question 2: Is your handler responding too slowly?

Timeouts are sneakier than 500s because they pass in testing and fail in production. Your handler takes 800ms with one test event; under real load, with a cold database connection pool, a slow third-party API call, and a burst of 50 events at once, it takes 12 seconds — and the provider hung up at 10.

The signature of a timeout problem: failures correlate with traffic spikes or specific times of day, the provider log says “timed out” rather than showing a status code, and your own logs show the handler completing — eventually. Your code “works”; it just doesn't work fast enough.

Measure your endpoint's response time from outside your own network:

curl -o /dev/null -s -X POST https://api.your-app.com/webhooks/stripe \
  -H "Content-Type: application/json" \
  -d '{"type": "ping"}' \
  -w "status: %{http_code}\ntotal:  %{time_total}s\nconnect: %{time_connect}s\nTTFB:   %{time_starttransfer}s\n"

If total time is regularly above 2–3 seconds, skip ahead to the acknowledge-fast section below. That's your fix.

Question 3: Is it 500ing on specific payloads?

If some deliveries succeed and others fail with a 500, you have a payload-dependent bug. The tell: the same event fails consistently on every retry (the provider resends the same payload, your handler throws the same exception), while other events sail through. This is the “webhook failing intermittently” case that isn't actually intermittent: it's deterministic on payload shape, and the payloads vary.

To debug this you need the exact failing payload — not a reconstruction from your logs, which probably truncated or re-serialized it. Capture it (more on that below), find the field your code chokes on, fix the handler, and replay the same payload to verify.

The #1 Fix: Acknowledge Fast, Process Async

Most webhook timeout problems — and a surprising share of 500s — come from one architectural mistake: doing the actual work inside the HTTP request handler. The provider doesn't need you to finish processing the event. It needs you to confirm receipt. Those are different jobs, and coupling them means every slow database write and flaky downstream API call becomes a delivery failure.

// Before: everything inline. Works in dev, times out in production.
app.post("/webhooks/stripe", async (req, res) => {
  const event = verifySignature(req);     // fast
  await syncInvoiceToDatabase(event);     // slow under load
  await sendReceiptEmail(event);          // depends on a third-party API
  await updateAnalytics(event);           // why is this even here
  res.sendStatus(200);                    // Stripe gave up seconds ago
});

// After: acknowledge fast, process async.
app.post("/webhooks/stripe", async (req, res) => {
  const event = verifySignature(req);     // still synchronous -- always
                                          // verify before accepting
  await queue.add("stripe-event", event, {
    jobId: event.id,                      // dedupe: retries reuse the
  });                                     // same event ID
  res.sendStatus(200);                    // returns in milliseconds
});

The queue can be anything durable: BullMQ on Redis, SQS, a Postgres jobs table, Cloud Tasks. What matters is that the HTTP handler does exactly three things — verify the signature, persist the event, return 200 — and nothing else. Three consequences fall out of this:

Timeouts disappear

Enqueueing takes milliseconds regardless of load. The provider's timeout budget stops being your problem.

500s stop reaching the provider

A bug in your processing logic fails a background job you can retry on your own terms — not a delivery the provider counts against your endpoint.

Retries become harmless

Keying jobs on the provider's event ID makes duplicate deliveries idempotent by construction.
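To make the last two consequences concrete, here is a minimal in-memory sketch of the queue side. `DedupingQueue` is an illustrative stand-in, not a real library; a production setup would use one of the durable queues mentioned above, and this version loses everything on restart.

```javascript
// In-memory stand-in for a durable queue. All names are illustrative.
class DedupingQueue {
  constructor() {
    this.jobs = new Map(); // jobId -> { payload, attempts, done }
  }

  // Returns true if enqueued, false if a job with that ID already exists.
  add(jobId, payload) {
    if (this.jobs.has(jobId)) return false;
    this.jobs.set(jobId, { payload, attempts: 0, done: false });
    return true;
  }

  // Runs every unfinished job once. A job whose handler throws stays queued
  // for a later pass instead of surfacing as a failed delivery.
  async drain(handler) {
    for (const job of this.jobs.values()) {
      if (job.done) continue;
      job.attempts += 1;
      try {
        await handler(job.payload);
        job.done = true;
      } catch (err) {
        job.lastError = String(err);
      }
    }
  }
}
```

A second delivery of the same event ID is dropped by `add`, and a handler exception only marks the job for another attempt. The provider saw a 200 either way.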

One honest caveat: some integrations genuinely need a computed response body (Twilio's TwiML replies, some Slack interactivity payloads). Those can't be fully deferred. For everything that just needs an acknowledgment — Stripe, GitHub, Shopify, and most SaaS webhooks — acknowledge fast and process async. For a deeper look at delivery guarantees, see real-time webhooks reliability.

The Payloads That Break Handlers

When a handler 500s on some payloads but not others, the same few culprits show up again and again. If you're staring at a stack trace, check these first:

Unexpected event types

You subscribed to invoice.* and wrote handlers for the three event types you saw in testing. Then the provider sends invoice.marked_uncollectible and your switch statement falls into a code path that assumes fields that aren't there. Unknown event types should be logged and acknowledged with a 200, never treated as errors.
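One way to get that behavior is an explicit handler table with a default branch. This is a sketch: the handler bodies and the `{ id, type }` event shape are placeholders, not any provider's schema.

```javascript
// Handlers for the event types this integration actually processes.
const handlers = new Map([
  ["invoice.paid", (event) => ({ action: "mark-paid", id: event.id })],
  ["invoice.payment_failed", (event) => ({ action: "notify", id: event.id })],
]);

// Returns the HTTP status to send plus something to log.
function dispatch(event) {
  const handler = handlers.get(event.type);
  if (!handler) {
    // Unknown type: log it and acknowledge, never fall through to code
    // that assumes fields which may not exist.
    return { status: 200, note: `ignored unknown event type: ${event.type}` };
  }
  return { status: 200, result: handler(event) };
}
```

Using a `Map` rather than a plain object also avoids accidentally matching inherited keys such as `constructor`.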

Nulls where you assumed values

Optional fields are null until the one customer with no shipping address, no tax ID, and a deleted payment method triggers an event. payload.customer.email.toLowerCase() works for months, then throws on the first null email. Validate the payload against a schema at the edge and treat every nested field as potentially absent.
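A minimal sketch of both halves of that advice, using optional chaining for the nested access and a hand-rolled check at the edge. The field names follow the example above; in practice a schema library would likely replace `validateEvent`.

```javascript
// Returns a normalized email, or null when the field is missing at any level.
function customerEmail(payload) {
  const email = payload?.customer?.email;
  return typeof email === "string" ? email.toLowerCase() : null;
}

// Returns a list of problems; an empty list means the envelope looks usable.
function validateEvent(payload) {
  const problems = [];
  if (typeof payload?.id !== "string") problems.push("id must be a string");
  if (typeof payload?.type !== "string") problems.push("type must be a string");
  return problems;
}
```

The caller decides what a non-empty problem list means, for example logging the payload and acknowledging it rather than throwing.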

Unicode and encoding surprises

Emoji in a product name, right-to-left text in a customer field, lone surrogates from a truncated string upstream. These break naive byte-length checks, databases with the wrong column encoding (MySQL's utf8 vs utf8mb4 is a classic), and any code that slices strings by byte offset. They also break signature verification if a middleware re-encodes the body before you compute the HMAC, so always verify against the raw bytes.
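The raw-bytes point is easy to demonstrate with Node's built-in `crypto` module. The sketch below uses a generic HMAC-SHA256 over the body; real providers differ in header format and in exactly what they sign, so treat `verifyRawBody` and the placeholder secret as illustrations rather than any provider's scheme.

```javascript
const crypto = require("node:crypto");

// Generic HMAC-SHA256 check over the raw request bytes.
function verifyRawBody(rawBody, signatureHex, secret) {
  const expected = crypto.createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  return received.length === expected.length &&
    crypto.timingSafeEqual(received, expected);
}

// The sender signed these exact bytes, including the escaped "\u00e9".
const raw = Buffer.from('{"name":"Caf\\u00e9"}', "utf8");
const secret = "example-secret"; // placeholder
const signature = crypto.createHmac("sha256", secret).update(raw).digest("hex");

// Parsing and re-serializing produces a literal "é": same data, other bytes.
const reencoded = Buffer.from(
  JSON.stringify(JSON.parse(raw.toString("utf8"))),
  "utf8"
);

console.log(verifyRawBody(raw, signature, secret));       // true
console.log(verifyRawBody(reencoded, signature, secret)); // false
```

In Express, one way to get at the untouched bytes is to mount `express.raw({ type: "application/json" })` on the webhook route so that `req.body` is a `Buffer`.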

Large arrays and oversized bodies

A bulk operation upstream produces a webhook whose line-item array has 5,000 entries. Your JSON body-parser limit (often a default of 100 KB or 1 MB) rejects it with a 413 or 500 before your code runs, or the handler processes the array inline and blows the timeout budget. Check your framework's body-size limits and treat payload size as unbounded.
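Two small helpers illustrate the idea, assuming you handle the size check yourself and hand large arrays to background jobs. Both function names are made up for this sketch, and the right limits depend on your framework and provider.

```javascript
// Size check on bytes, not characters: multi-byte text counts in full.
function bodyWithinLimit(rawBody, maxBytes) {
  return Buffer.byteLength(rawBody) <= maxBytes;
}

// Split a large array into fixed-size batches so each background job
// stays small, instead of looping over everything inside the request.
function toBatches(items, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

With a batch size of 500, a 5,000-entry array becomes ten jobs that can be processed and retried independently.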

The common thread: your handler was tested against the payloads you imagined, and production sends the payloads that exist. The fix is rarely clever — it's seeing the exact failing payload and adding the missing guard.

Seeing Both Sides of the Exchange with Hooklistener

The hard part of webhook debugging is that the evidence is split across two systems. The provider's dashboard shows you what it sent and a status code; your logs show whatever you remembered to log. Neither shows the full HTTP exchange. Hooklistener sits in the middle and records both directions: point the failing provider at a Hooklistener endpoint directly, or put one in front of your real handler via forwarding (or the CLI tunnel if you're reproducing locally).

Response capture

For every captured webhook, Hooklistener records the HTTP response it sent back — status code, headers, body (up to 10 KB), and where that response came from (the endpoint's default response or a matched mock response rule). Each request has a Response tab showing exactly what the provider saw, and the request list has a color-coded status column: green for 2xx, blue for 3xx, amber for 4xx, red for 5xx. Scanning for the red rows in a list of deliveries beats grepping two sets of logs.

Replay the exact failing payload

Once you've captured the payload that 500s, you can replay it against your handler as many times as you need while you fix the bug — no waiting for the provider to retry, no triggering fake events from a dashboard. Replays can also be sent with a modified body or headers, which is how you answer questions like “does it still crash if that field is null but the array is small?” in seconds instead of deploys.

Simulate your own failures

Endpoints can be configured to return custom status codes and bodies. Want to observe a provider's real retry schedule instead of trusting the docs? Set the endpoint to return a 500, trigger an event, and watch the retries arrive with their actual timing and backoff. The same trick works for testing how your own systems handle a slow or failing downstream — see mock webhook server in 2 minutes.

AI comparison for the “what's different?” question

The hardest intermittent failures come down to one question: what is different between a delivery that worked and one that didn't? On paid plans you can select two to five captured requests and get a structured AI analysis: shared patterns, differences, anomalies, timing analysis, payload evolution, header issues, and suggested debugging actions. Diffing a working request against a failing one this way routinely surfaces the null field or encoding quirk that eyeballing two JSON blobs misses. See pricing for plan details — and if you work in an AI coding assistant, the MCP server lets it inspect captured requests directly.

A concrete workflow for the payload-dependent 500 from the triage tree: point the provider at a Hooklistener endpoint, wait for (or trigger) the failing event, open the capture to see the exact request and the response status side by side, fix your handler, then replay the captured payload against the fixed handler until it returns 200. Total feedback loop: seconds.

Prevention: Know Before the Provider Does

The worst version of this problem is the one you find out about from a customer — or from Stripe's “we've disabled your webhook endpoint” email. By then you've been failing for days. Two habits prevent that:

Monitor the endpoint itself

Your webhook URL is production infrastructure; monitor it like one. Hooklistener's uptime monitors run configurable checks against any URL and alert you via Slack, Discord, or email when it stops responding or starts returning errors — plus a shareable public status page if other teams depend on your integration. A monitor that pings your webhook route catches the expired certificate or the broken deploy hours before the provider's failure counter does.

Alert on your own error rate

Inside your application, track two numbers per webhook route: response time and non-2xx rate. Alert when p95 latency creeps toward half your provider's timeout budget, and on any sustained non-2xx rate above zero. With the acknowledge-fast pattern in place, both numbers should be boring — which is exactly why a change in them is signal.
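A sketch of those two checks over one window of samples. The `{ ms, status }` record shape and the function names are assumptions, and the percentile uses the simple nearest-rank method; a metrics system would normally compute this for you.

```javascript
// Nearest-rank percentile over a window of numeric samples.
function percentile(samples, p) {
  if (samples.length === 0) return null;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// window: array of { ms, status } for one webhook route (hypothetical shape).
function checkRoute(window, providerTimeoutMs) {
  const p95 = percentile(window.map((r) => r.ms), 95);
  const failures = window.filter((r) => r.status < 200 || r.status >= 300).length;
  const alerts = [];
  // Latency alert: p95 has reached half of the provider's timeout budget.
  if (p95 !== null && p95 >= providerTimeoutMs / 2) alerts.push("latency");
  // Error alert: any non-2xx response in this window.
  if (failures > 0) alerts.push("errors");
  const non2xxRate = window.length ? failures / window.length : 0;
  return { p95, non2xxRate, alerts };
}
```

This evaluates a single window. To alert only on a sustained error rate, you would require the "errors" alert across several consecutive windows before paging anyone.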

And periodically check your provider dashboards for endpoints in a warning or disabled state. It takes a minute and it's the only place some failure modes are visible at all.

Debug Your Failing Webhooks Now

Point your failing provider at a Hooklistener endpoint and see the exact request, the exact response, and the color-coded status of every delivery — then replay the failing payload until your fix holds. Free tier, no credit card.

