Mar 29, 2026 · cloud
I Stopped Building Webhook Retry Logic. Here's What I Use Instead.
Exponential backoff, jitter, dead letter queues, idempotency keys, HMAC verification - all to deliver one message reliably. There are better options now.
Every backend team eventually builds the same thing: reliable message delivery between services. And every team builds it wrong at least once.
The Webhook Retry Stack
Here's what "just use webhooks" actually means in production:
# Receiver: build an HTTP endpoint
@app.post("/webhooks/orders")
async def receive_order(req):
    # Verify HMAC signature (or get spoofed)
    signature = req.headers.get("x-webhook-signature")
    if not verify_hmac(signature, req.body, WEBHOOK_SECRET):
        return {"error": "invalid signature"}, 401

    # Idempotency check (webhooks arrive twice, sometimes three times)
    idempotency_key = req.headers.get("x-idempotency-key")
    if db.exists("processed_webhooks", idempotency_key):
        return {"status": "already processed"}, 200

    process_order(req.json())
    db.insert("processed_webhooks", idempotency_key)
    return {"status": "ok"}, 200
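# Sender: retry with backoff
import asyncio
import random

import requests
from requests.exceptions import ConnectionError, Timeout

async def send_with_retry(url, payload, max_retries=5):
    for attempt in range(max_retries):
        try:
            # requests is blocking, so run it in a worker thread
            resp = await asyncio.to_thread(
                requests.post, url, json=payload, headers=sign(payload), timeout=10
            )
            if resp.ok:
                return resp
            if resp.status_code < 500:
                break  # 4xx: retrying the same request will not help
        except (ConnectionError, Timeout):
            pass  # network failure: fall through to the backoff below
        # Exponential backoff with jitter, capped at 5 minutes
        delay = min(2 ** attempt + random.uniform(0, 1), 300)
        await asyncio.sleep(delay)
    dlq.send(payload)  # dead letter queue
    alert(f"Webhook delivery failed after {max_retries} attempts")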
And this is the simplified version. Production adds:
- DLQ consumer that retries or alerts
- Monitoring for delivery success rate
- Alerting on DLQ depth
- Cleanup cron for the idempotency table
- Secret rotation for HMAC keys
- Circuit breaker when receiver is down
- Thundering herd protection when receiver comes back up
That's 200+ lines of infrastructure code. For every pair of services that need to talk.
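To give a sense of what one item on that list involves, here is a rough sketch of a circuit breaker the sender could consult before each delivery attempt. The class name, thresholds, and method names are all hypothetical; a production version would also need per-receiver state and thread safety.

```python
import time


class CircuitBreaker:
    """Stop calling a receiver after repeated failures, then probe again later."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial request
        self.clock = clock              # injectable so tests can control time
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: once the cool-down has passed, let a trial request through.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Open (or re-open) the circuit and restart the cool-down.
            self.opened_at = self.clock()
```

The sender would check `allow_request()` before posting and call `record_success()` or `record_failure()` afterwards, sending the payload to the dead letter queue while the circuit is open.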
The Alternative: Let the Platform Deliver
import os

from axme import AxmeClient, AxmeClientConfig

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))

intent_id = client.send_intent({
    "intent_type": "intent.order.process.v1",
    "to_agent": "agent://myorg/production/order-processor",
    "payload": {
        "order_id": "ORD-2026-00142",
        "customer": "acme-corp",
        "total": 4999.50,
    },
})

result = client.wait_for(intent_id)
No webhook endpoint on the receiver. No HMAC. No idempotency table. No retry logic. No DLQ. No monitoring for delivery failures.
The platform handles at-least-once delivery on all channels.
Five Ways to Receive (Not Just Webhooks)
The receiver picks the delivery mode that fits their architecture:
| Mode | Transport | Best For |
|---|---|---|
| stream | SSE (server-sent events) | Real-time agents, always-on services |
| poll | GET request | Serverless functions, cron jobs |
| http | Webhook POST | Traditional services (but platform handles retry) |
| inbox | Human queue | Approvals, reviews, manual tasks |
| internal | Platform-handled | Reminders, escalations, notifications |
The sender doesn't care which mode the receiver uses. send_intent() is the same regardless.
This is the key difference from webhooks: the receiver chooses how to get messages, not the sender. The sender doesn't need to know if the receiver is a Lambda function, a Kubernetes pod, or a human with an email inbox.
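The idea of receiver-chosen delivery can be illustrated with a toy in-memory broker. This is not the AXME SDK or its API; `ToyBroker`, its methods, and the `agent://demo/...` addresses are invented for this sketch, and it models only a push-style and a poll-style receiver.

```python
from collections import deque


class ToyBroker:
    """In-memory stand-in for a delivery platform (illustration only)."""

    def __init__(self):
        self.queues = {}   # address -> deque of pending payloads (poll mode)
        self.pushers = {}  # address -> callback (push mode)

    def register_poll(self, address):
        self.queues[address] = deque()

    def register_push(self, address, callback):
        self.pushers[address] = callback

    def send(self, address, payload):
        # The sender's call is the same whichever mode the receiver registered.
        if address in self.pushers:
            self.pushers[address](payload)
        else:
            self.queues[address].append(payload)

    def poll(self, address):
        # Poll-mode receivers fetch the next message when they are ready.
        queue = self.queues[address]
        return queue.popleft() if queue else None


broker = ToyBroker()
received = []
broker.register_push("agent://demo/stream-worker", received.append)
broker.register_poll("agent://demo/cron-worker")

for address in ("agent://demo/stream-worker", "agent://demo/cron-worker"):
    broker.send(address, {"order_id": "ORD-1"})
```

The push receiver gets the payload immediately, while the poll receiver's copy waits in its queue until `broker.poll("agent://demo/cron-worker")` is called. The sending loop never branches on the mode.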
What You Stop Building
| Component | With Webhooks | With AXME |
|---|---|---|
| HTTP endpoint on receiver | You build it | Not needed (for stream/poll/inbox) |
| HMAC verification | You build it | Platform handles |
| Idempotency table | You build it | Built into intent lifecycle |
| Retry with backoff | You build it | Platform handles (configurable) |
| Dead letter queue | You build it | Platform handles |
| Delivery monitoring | You build it | Built-in lifecycle events |
| Secret rotation | You manage | Platform manages |
| Thundering herd protection | You build it | Platform handles |
When Webhooks Are Still Fine
Webhooks work well when:
- The receiver is always up (99.9%+ uptime)
- Occasional message loss is acceptable
- You only have 2-3 service pairs communicating
- You already have the retry infrastructure built
Webhooks break down when:
- You have 10+ services that need reliable delivery
- Receivers go down for minutes/hours (deploys, incidents)
- You need delivery guarantees (financial transactions, compliance)
- You need human approval gates in the delivery chain
- You're tired of debugging "why didn't the webhook arrive"
Try It
Working example - sender submits an order, receiver processes it via SSE stream, no webhook endpoint needed:
github.com/AxmeAI/reliable-delivery-without-webhooks
Python, TypeScript, and Go implementations included.
Built with AXME - 5 delivery bindings with at-least-once guarantees. Alpha - feedback welcome.