Mar 29, 2026 · cloud
A2A Tells Agents How to Talk. It Doesn't Tell Them What Happens When Things Break.
Google's A2A protocol handles agent communication. But crash recovery, retries, timeouts, and human approval gates? That's still on you. Unless you add a lifecycle layer.
Google's Agent-to-Agent protocol is a clean spec. Agent A sends a task to Agent B over HTTP. Agent B responds. Done.
Until Agent B crashes mid-task. Or goes down for 20 minutes during a deploy. Or needs a human to approve something before it can continue.
What A2A Gives You
A2A defines how agents find each other (Agent Cards), how they exchange messages (Tasks), and how they stream responses (SSE). It's a communication protocol. A good one.
What it doesn't define:
- What happens when the receiving agent crashes mid-task
- How to retry delivery if the first attempt fails
- How to enforce a timeout if the agent takes too long
- How to add a human approval step between agents
- How to track whether a task actually completed
These aren't edge cases. They're the entire second half of building a production multi-agent system.
The Code You End Up Writing
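# Sender: deliver a task to Agent B, retrying with exponential backoff
import asyncio
import random

import requests

class RetryableError(Exception):
    """A 5xx response: the request may succeed if sent again."""

async def send_task_with_retry(agent_url, task, max_retries=3):
    for attempt in range(max_retries):
        try:
            # requests is blocking, so run it off the event loop
            resp = await asyncio.to_thread(
                requests.post, f"{agent_url}/tasks", json=task, timeout=30)
            if resp.status_code == 200:
                return resp.json()
            if resp.status_code >= 500:
                raise RetryableError()
            break  # other statuses (e.g. 4xx): resending the same request won't help
        except (requests.ConnectionError, requests.Timeout, RetryableError):
            if attempt < max_retries - 1:
                # exponential backoff with jitter, capped at 60 seconds
                delay = min(2 ** attempt + random.uniform(0, 1), 60)
                await asyncio.sleep(delay)
    # Delivery failed - now what?
    # db and alert are placeholders for your own storage and paging code
    db.insert("failed_tasks", task=task, attempts=attempt + 1)
    alert(f"Task delivery failed after {attempt + 1} attempts")
    return None

# Plus: timeout watcher, health check loop, DLQ consumer,
# delivery confirmation handler, state reconciliation cron...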
You're not writing agent logic anymore. You're writing infrastructure.
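To make one item from that "plus" list concrete, here is a minimal sketch of the timeout piece, assuming an asyncio-based sender. The names `run_with_deadline`, `TaskTimedOut`, and `slow_agent_call` are made up for illustration; they are not part of A2A or any library.

```python
import asyncio


class TaskTimedOut(Exception):
    """Raised when the remote agent does not finish before the deadline."""


async def run_with_deadline(coro, deadline_seconds):
    """Await coro, but give up once deadline_seconds have elapsed."""
    try:
        return await asyncio.wait_for(coro, timeout=deadline_seconds)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the inner coroutine at this point
        raise TaskTimedOut(f"no result within {deadline_seconds}s") from None


async def slow_agent_call():
    # Hypothetical stand-in for a call to Agent B that takes too long
    await asyncio.sleep(0.2)
    return {"status": "done"}


def demo():
    try:
        asyncio.run(run_with_deadline(slow_agent_call(), deadline_seconds=0.05))
    except TaskTimedOut:
        return "timed out"
    return "completed"
```

Note what this does not cover: it only bounds how long the sender waits. Agent B may keep working after the deadline, and if the sender process itself dies, the deadline dies with it. That gap is why the watcher tends to grow into its own service.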
What A2A + Lifecycle Looks Like
import os

from axme import AxmeClient, AxmeClientConfig

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))

# Wrap the A2A task in an intent addressed to Agent B
intent_id = client.send_intent({
    "intent_type": "intent.a2a.task_execution.v1",
    "to_agent": "agent://myorg/production/data-enrichment",
    "payload": {
        "task": "enrich_customer_data",
        "records": 1500,
        "source": "crm_export_q1",
    },
})

# Wait for the result of the intent
result = client.wait_for(intent_id)
Agent B crashes at record 800? Platform retries delivery (up to 3 attempts). Agent B takes too long? 120-second deadline enforced. Need a human to approve the enrichment first? Add an approval gate to the scenario. The A2A communication still works - it just has a lifecycle around it now.
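One way to picture "a lifecycle around it" is a small state machine that records every state a task passes through. The sketch below is generic and is not AXME's implementation: the state names, the allowed transitions, and the `TaskLifecycle` class are all assumptions chosen for illustration.

```python
from enum import Enum


class State(Enum):
    SUBMITTED = "submitted"
    WAITING_APPROVAL = "waiting_approval"
    DELIVERED = "delivered"
    COMPLETED = "completed"
    FAILED = "failed"
    TIMED_OUT = "timed_out"


# Allowed transitions (assumed); terminal states have no outgoing edges
TRANSITIONS = {
    State.SUBMITTED: {State.WAITING_APPROVAL, State.DELIVERED, State.FAILED},
    State.WAITING_APPROVAL: {State.DELIVERED, State.FAILED, State.TIMED_OUT},
    State.DELIVERED: {State.DELIVERED, State.COMPLETED, State.FAILED, State.TIMED_OUT},
    State.COMPLETED: set(),
    State.FAILED: set(),
    State.TIMED_OUT: set(),
}


class TaskLifecycle:
    def __init__(self, task_id):
        self.task_id = task_id
        self.state = State.SUBMITTED
        self.history = [State.SUBMITTED]  # audit trail of every state entered

    def transition(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"{self.state.name} -> {new_state.name} not allowed")
        self.state = new_state
        self.history.append(new_state)


task = TaskLifecycle("enrich-q1")
task.transition(State.WAITING_APPROVAL)  # human gate before Agent B acts
task.transition(State.DELIVERED)         # first delivery attempt
task.transition(State.DELIVERED)         # redelivery after a crash
task.transition(State.COMPLETED)
```

The point of the explicit transition table is that "did it complete?" becomes a lookup instead of a guess, and illegal jumps (say, completing a task that already timed out) fail loudly.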
What Changes, What Stays
| | Raw A2A | A2A + AXME lifecycle |
|---|---|---|
| Agent discovery | Agent Cards | Agent Cards |
| Message format | A2A Tasks | A2A Tasks (wrapped in intent) |
| Delivery | HTTP POST (you retry) | At-least-once (platform retries) |
| Crash recovery | Task lost | Automatic redelivery |
| Timeout | You build it | Configurable deadline |
| Human approval | Not in spec | Built-in gate |
| Observability | You instrument | Lifecycle events (SSE) |
| Audit trail | You build it | Every state transition logged |
A2A stays as the communication layer. AXME adds the operational layer on top.
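One consequence of at-least-once delivery is worth spelling out: Agent B can receive the same task more than once, for instance when it finished the work but crashed before acknowledging it. Receivers usually cope by deduplicating on a task ID. A minimal sketch, assuming each task carries a stable ID; `EnrichmentAgent` and its in-memory store are hypothetical, and a real agent would need a durable store.

```python
class EnrichmentAgent:
    """Hypothetical receiver that tolerates redelivery of the same task."""

    def __init__(self):
        # task_id -> stored result; in-memory only for the sake of the sketch
        self._results = {}
        self.executions = 0

    def handle(self, task_id, payload):
        # A redelivered task returns the stored result instead of running again
        if task_id in self._results:
            return self._results[task_id]
        self.executions += 1
        result = {"task_id": task_id, "enriched": payload["records"]}
        self._results[task_id] = result
        return result


agent = EnrichmentAgent()
first = agent.handle("task-1", {"records": 1500})
second = agent.handle("task-1", {"records": 1500})  # simulated redelivery
```

Here the second call returns the stored result and `agent.executions` stays at 1, so a redelivery does not enrich the same records twice.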
When You Need This
A2A alone works fine for:
- Request/response between two always-on agents
- Demo scenarios where crashes don't happen
- Internal tooling where "restart and try again" is acceptable
You need a lifecycle layer when:
- Agent B might be down during deploys
- Tasks take minutes or hours, not milliseconds
- A human needs to approve before Agent B acts
- You need to prove a task completed (compliance, audit)
- "It probably worked" isn't good enough
Try It
Working example - Agent A sends an enrichment task to Agent B with crash recovery, timeout, and lifecycle tracking:
github.com/AxmeAI/a2a-with-durable-lifecycle
Built with AXME - durable lifecycle for agent operations. Alpha - feedback welcome.