Mar 29, 2026 · cloud
Your Agent Has Been Stuck for 3 Hours and Nobody Knows
API call hangs. Agent blocks. No timeout fires. No one gets paged. The default behavior in most agent frameworks is silence. Here's how to fix that.
Your agent calls an external API. The API hangs. Not a timeout - it just stops responding. TCP connection is open, but nothing comes back.
In most frameworks, your agent is now stuck. No error. No timeout. No alert. Just... waiting.
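To make the failure mode concrete, here is a minimal sketch using only the Python standard library. The names `start_silent_server` and `call_with_read_timeout` are made up for illustration. The first accepts TCP connections and never sends a byte; the second gives up after a read timeout instead of waiting forever.

```python
import socket
import threading


def start_silent_server() -> tuple[socket.socket, int]:
    """Accept connections but never send a byte, imitating a hung API."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen()
    held = []  # keep accepted sockets referenced so they stay open

    def accept_forever() -> None:
        while True:
            try:
                conn, _ = server.accept()
            except OSError:
                return  # server socket was closed
            held.append(conn)

    threading.Thread(target=accept_forever, daemon=True).start()
    return server, server.getsockname()[1]


def call_with_read_timeout(port: int, read_timeout: float) -> str:
    """Return 'ok' if the peer answers, 'hung' if nothing arrives in time."""
    with socket.create_connection(("127.0.0.1", port), timeout=read_timeout) as sock:
        sock.sendall(b"GET /sync HTTP/1.1\r\nHost: localhost\r\n\r\n")
        try:
            sock.recv(1024)
            return "ok"
        except socket.timeout:
            return "hung"
```

Calling `call_with_read_timeout(port, 0.2)` against the silent server returns `"hung"` after roughly 0.2 seconds. Without a timeout on the socket, the same `recv` call would block for as long as the connection stays open.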
How Agents Handle Stuck Tasks Today
LangGraph: Agent blocks on tool call. No built-in timeout.
CrewAI: max_execution_time exists. But no notification. No escalation.
Raw Python: requests.get(url, timeout=30)? Great. Now what?
You caught the timeout. The agent retried. Still failing.
Who gets notified? Nobody. What happens next? Nothing.
The problem isn't detecting the timeout. timeout=30 is one line. The problem is everything after:
- Retry - try again, maybe with different parameters
- Timeout on retries - don't retry forever
- Notify - tell someone the agent is stuck
- Remind - if that person doesn't respond in 10 minutes, remind them
- Escalate - if they still don't respond, notify someone else
- Audit - record what happened, when, who responded
That's not a timeout. That's a workflow.
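The steps in the list above can be sketched as a small state machine. This is an illustration under stated assumptions, not a reference implementation: `EscalationWorkflow` and its fields are hypothetical names, the 20-minute escalation delay is an assumed value, and `run_task` and `notify` are callables you would supply. Time is passed in explicitly so the logic can be exercised without sleeping.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class EscalationWorkflow:
    """Hypothetical sketch: retry, then notify, remind, escalate, and audit."""
    run_task: Callable[[], bool]         # returns True on success
    notify: Callable[[str, str], None]   # (recipient, message)
    owner: str
    backup: str
    max_retries: int = 3
    remind_after_s: int = 600            # 10 minutes, as in the list above
    escalate_after_s: int = 1200         # assumed value, not from the article
    state: str = "running"
    notified_at: Optional[float] = None
    audit: list = field(default_factory=list)  # (timestamp, event) pairs

    def _log(self, now: float, event: str) -> None:
        self.audit.append((now, event))

    def attempt(self, now: float) -> None:
        """Retry a bounded number of times, then hand off to a human."""
        for n in range(1, self.max_retries + 1):
            if self.run_task():
                self.state = "done"
                self._log(now, f"attempt {n} succeeded")
                return
            self._log(now, f"attempt {n} failed")
        self.state = "needs_human"
        self.notified_at = now
        self.notify(self.owner, "agent stuck")
        self._log(now, f"notified {self.owner}")

    def tick(self, now: float) -> None:
        """Call periodically: remind the owner once, then escalate to the backup."""
        if self.state not in ("needs_human", "reminded"):
            return
        waited = now - self.notified_at
        if waited >= self.escalate_after_s:
            self.state = "escalated"
            self.notify(self.backup, "escalation: agent stuck")
            self._log(now, f"escalated to {self.backup}")
        elif waited >= self.remind_after_s and self.state == "needs_human":
            self.state = "reminded"
            self.notify(self.owner, "reminder: agent stuck")
            self._log(now, f"reminded {self.owner}")

    def acknowledge(self, now: float, who: str) -> None:
        """A human responded; stop reminding and escalating."""
        self.state = "acknowledged"
        self._log(now, f"acknowledged by {who}")
```

Even this toy version needs persistent state, a scheduler to call `tick`, and a notification channel, which is the point: the timeout itself is the small part.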
What You Actually Build
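# timeout_watcher.py - runs as a separate process
# db, retry_task, and send_slack_alert are application-specific helpers.
from datetime import datetime, timezone

def check_stuck_tasks():
    stuck = db.query(
        "SELECT * FROM agent_tasks WHERE status = 'running' "
        "AND started_at < NOW() - INTERVAL '5 minutes'"
    )
    for task in stuck:
        if task.retry_count < 3:
            retry_task(task)
        else:
            send_slack_alert(task.owner, f"Agent stuck: {task.name}")
            # Record when the owner was alerted; the escalation job keys off notified_at.
            db.update(task.id, status="needs_human",
                      notified_at=datetime.now(timezone.utc))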
# escalation_cron.py - runs every 10 minutes
# db, send_slack_alert, and page_oncall are application-specific helpers.
def check_unacknowledged():
    unacked = db.query(
        "SELECT * FROM agent_tasks WHERE status = 'needs_human' "
        "AND notified_at < NOW() - INTERVAL '30 minutes'"
    )
    for task in unacked:
        if task.escalation_level == 0:
            send_slack_alert(task.backup_owner, f"Escalation: {task.name}")
            db.update(task.id, escalation_level=1)
        elif task.escalation_level == 1:
            page_oncall(f"Agent stuck, nobody responding: {task.name}")
            # Mark the page as sent so the next run does not page again.
            db.update(task.id, escalation_level=2)
Two cron jobs. Two database queries. Slack integration. PagerDuty integration. Schema for escalation levels. And this is the simplified version.
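For a sense of what the "schema for escalation levels" involves, here is one possible table, sketched with the standard-library `sqlite3` module so it runs anywhere. The column names mirror the fields used in the snippets above, but the exact types are assumptions. Timestamps are stored as Unix seconds because SQLite has no `NOW() - INTERVAL` syntax. The helpers `find_stuck` and `find_unacknowledged` are hypothetical names.

```python
import sqlite3

# Assumed schema: one row per agent task, with the fields both cron jobs read.
SCHEMA = """
CREATE TABLE agent_tasks (
    id               INTEGER PRIMARY KEY,
    name             TEXT NOT NULL,
    status           TEXT NOT NULL DEFAULT 'running',
    owner            TEXT NOT NULL,
    backup_owner     TEXT NOT NULL,
    retry_count      INTEGER NOT NULL DEFAULT 0,
    escalation_level INTEGER NOT NULL DEFAULT 0,
    started_at       REAL NOT NULL,   -- Unix seconds
    notified_at      REAL             -- NULL until a human is alerted
);
"""


def find_stuck(conn: sqlite3.Connection, now: float, max_age_s: float = 300) -> list:
    """Tasks still 'running' that started more than max_age_s seconds ago."""
    return conn.execute(
        "SELECT id, name, retry_count FROM agent_tasks "
        "WHERE status = 'running' AND started_at < ?",
        (now - max_age_s,),
    ).fetchall()


def find_unacknowledged(conn: sqlite3.Connection, now: float, max_wait_s: float = 1800) -> list:
    """Tasks waiting on a human for more than max_wait_s seconds."""
    return conn.execute(
        "SELECT id, name, escalation_level FROM agent_tasks "
        "WHERE status = 'needs_human' AND notified_at < ?",
        (now - max_wait_s,),
    ).fetchall()
```

Passing `now` as a parameter instead of calling the clock inside the query keeps both functions easy to test.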
What This Should Look Like
import os

from axme import AxmeClient, AxmeClientConfig

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))
intent_id = client.send_intent({
    "intent_type": "intent.ops.stuck_task.v1",
    "to_agent": "agent://myorg/production/task-runner",
    "payload": {
        "task": "sync_inventory",
        "api_endpoint": "https://api.warehouse.internal/sync",
        "error": "Connection timeout after 30s",
    },
})
result = client.wait_for(intent_id)
The scenario definition handles the rest:
- Agent gets 60 seconds to complete the task
- If it fails, platform retries (up to 3 attempts)
- After all retries fail, escalation fires: human approval gate opens
- Human gets notified immediately
- Reminder at 3 minutes if no response
- Repeat reminder every 30 minutes
- 30-minute total timeout on human response
- Full audit trail of every state transition
No cron jobs. No escalation table. No Slack webhook code.
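To reason about the numbers in that list, it can help to write them down as plain data. The sketch below is not AXME's scenario format. `EscalationPolicy` is a hypothetical Python container for the listed values, and it assumes "up to 3 attempts" means three attempts in total with no backoff between them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationPolicy:
    """Hypothetical container for the numbers listed above, in seconds."""
    task_timeout_s: int = 60          # agent gets 60 seconds per attempt
    max_attempts: int = 3             # assumed: three attempts in total
    first_reminder_s: int = 3 * 60    # reminder at 3 minutes
    repeat_reminder_s: int = 30 * 60  # repeat reminder every 30 minutes
    human_timeout_s: int = 30 * 60    # total wait for a human response

    def reminder_offsets(self) -> list[int]:
        """Seconds after the first notification at which reminders would fire."""
        offsets = []
        t = self.first_reminder_s
        while t < self.human_timeout_s:
            offsets.append(t)
            t += self.repeat_reminder_s
        return offsets

    def worst_case_before_escalation_s(self) -> int:
        """Upper bound on time spent retrying, ignoring any backoff."""
        return self.task_timeout_s * self.max_attempts
```

Under these assumptions, `EscalationPolicy().worst_case_before_escalation_s()` is 180 seconds, and `reminder_offsets()` yields a single reminder at 180 seconds, since the first repeat would land after the 30-minute human timeout. A longer human timeout makes room for the repeats.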
Timeline: Without vs With
Without AXME:
00:00 Agent calls API
00:30 Timeout. Agent retries.
01:00 Timeout again. Agent retries.
01:30 Timeout again. Agent gives up. Logs error.
...
03:00 Someone notices the dashboard is red.
03:15 Engineer checks logs, finds the stuck task.
03:30 Engineer manually restarts.
With AXME:
00:00 Agent calls API via intent
00:30 Timeout. Platform retries.
01:00 Timeout again. Platform retries.
01:30 Third failure. Platform escalates to human.
01:30 On-call engineer gets notification.
01:33 Reminder (no response yet).
01:35 Engineer approves manual override from phone.
01:35 Agent resumes with override instruction.
The difference: 3 hours of silence vs a human notified within 1.5 minutes of the first call.
Try It
Working example - agent detects stuck task, automatic retry, escalation to human with reminders:
github.com/AxmeAI/agent-timeout-and-escalation
Built with AXME - timeout, escalation, and human approval built into the lifecycle. Alpha - feedback welcome.