Agent-Native Durability: Why I Built AXME Code

A few months ago, I was caught in a cycle of zero-knowledge cold starts. Every time I ran Claude, it was Day Zero. I was spending my own tokens re-explaining my stack, my architectural preferences, and my safety requirements on every session. I felt like I was working with a genius architect who had severe short-term memory loss. I was tired of being the "Context Engineer" for my own tools. So I built AXME Code.

The Infrastructure: The Project Oracle

AXME doesn't just store strings. It manages a persistent, structured knowledge base in your project root, which I call the Project Oracle. It extracts decisions and patterns from your code, configs, and history so your agent starts every session with senior-dev-level context. Not a database. Not one long markdown file. A typed store with explicit decisions, memories, safety rules, and a per-session handoff, all readable as plain markdown and YAML, all versionable, all greppable.
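To make "typed store" concrete, here is a minimal sketch of what one decision record could look like. This is not AXME's actual schema: the `Decision` class, the `Enforcement` level names, and the front-matter layout are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Enforcement(Enum):
    # Hypothetical level names; the real ones in AXME Code may differ.
    REQUIRED = "required"
    ADVISORY = "advisory"


@dataclass(frozen=True)
class Decision:
    slug: str
    title: str
    enforcement: Enforcement
    rationale: str

    def to_markdown(self) -> str:
        """Render the decision as markdown with a YAML-style front matter block."""
        return (
            "---\n"
            f"slug: {self.slug}\n"
            f"enforcement: {self.enforcement.value}\n"
            "---\n"
            f"# {self.title}\n\n"
            f"{self.rationale}\n"
        )


# An illustrative record, not one taken from a real project.
decision = Decision(
    slug="no-force-push-main",
    title="Never force push to main",
    enforcement=Enforcement.REQUIRED,
    rationale="Protected branch; history must stay auditable.",
)
print(decision.to_markdown())
```

Because the rendered record is plain text, it can be committed next to the code and found with an ordinary `grep enforcement:`.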

Why Not Existing Market Solutions?

I looked at the market for a solution, but everything I found, from Mastra to Zep, treated agent memory as a search problem. They weren't building for the non-deterministic, long-running reality of software engineering. They lacked the hard guardrails and the structured persistence needed for production work.

<div class="not-prose my-10" style="width: min(95vw, 1024px); margin-left: 50%; transform: translateX(-50%);"> <div class="overflow-x-auto"> <table class="w-full text-sm border-collapse"> <thead> <tr class="border-b border-gray-800"> <th class="text-left py-3 px-3 text-gray-400 font-normal"></th> <th class="py-3 px-3 text-emerald-400 font-semibold">AXME Code</th> <th class="py-3 px-3 text-gray-400 font-normal">MemPalace</th> <th class="py-3 px-3 text-gray-400 font-normal">Mastra</th> <th class="py-3 px-3 text-gray-400 font-normal">Zep</th> <th class="py-3 px-3 text-gray-400 font-normal">Mem0</th> <th class="py-3 px-3 text-gray-400 font-normal">Supermemory</th> </tr> </thead> <tbody class="text-gray-300"> <tr><td colspan="7" class="pt-4 pb-2 text-xs uppercase tracking-wider text-gray-500 font-semibold">Capabilities</td></tr> <tr class="border-b border-gray-900"><td class="py-2 px-3">Structured decisions with enforcement levels</td><td class="py-2 px-3 text-center text-emerald-400">✓</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td></tr> <tr class="border-b border-gray-900"><td class="py-2 px-3">Pre-execution safety hooks</td><td class="py-2 px-3 text-center text-emerald-400">✓</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-yellow-500">partial</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td></tr> <tr class="border-b border-gray-900"><td class="py-2 px-3">Structured session handoff</td><td class="py-2 px-3 text-center text-emerald-400">✓</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center 
text-gray-600">—</td><td class="py-2 px-3 text-center text-yellow-500">partial</td><td class="py-2 px-3 text-center text-gray-600">—</td></tr> <tr class="border-b border-gray-900"><td class="py-2 px-3">Automatic knowledge extraction</td><td class="py-2 px-3 text-center text-emerald-400">✓</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-emerald-400">✓</td><td class="py-2 px-3 text-center text-emerald-400">✓</td><td class="py-2 px-3 text-center text-emerald-400">✓</td><td class="py-2 px-3 text-center text-emerald-400">✓</td></tr> <tr class="border-b border-gray-900"><td class="py-2 px-3">Project map and codebase context</td><td class="py-2 px-3 text-center text-emerald-400">✓</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td></tr> <tr class="border-b border-gray-900"><td class="py-2 px-3">Multi-repo workspace</td><td class="py-2 px-3 text-center text-emerald-400">✓</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td></tr> <tr class="border-b border-gray-900"><td class="py-2 px-3">Local-only storage</td><td class="py-2 px-3 text-center text-emerald-400">✓</td><td class="py-2 px-3 text-center text-emerald-400">✓</td><td class="py-2 px-3 text-center text-emerald-400">✓</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td></tr> <tr class="border-b border-gray-900"><td class="py-2 px-3">Semantic memory search</td><td class="py-2 px-3 text-center text-emerald-400">✓</td><td 
class="py-2 px-3 text-center text-emerald-400">✓</td><td class="py-2 px-3 text-center text-emerald-400">✓</td><td class="py-2 px-3 text-center text-emerald-400">✓</td><td class="py-2 px-3 text-center text-emerald-400">✓</td><td class="py-2 px-3 text-center text-emerald-400">✓</td></tr> <tr class="border-b border-gray-900"><td class="py-2 px-3">Multi-client support</td><td class="py-2 px-3 text-center text-emerald-400">✓</td><td class="py-2 px-3 text-center text-emerald-400">✓</td><td class="py-2 px-3 text-center text-emerald-400">✓</td><td class="py-2 px-3 text-center text-emerald-400">✓</td><td class="py-2 px-3 text-center text-emerald-400">✓</td><td class="py-2 px-3 text-center text-emerald-400">✓</td></tr> <tr class="border-b border-gray-800"><td class="py-2 px-3 font-semibold text-white">Capabilities total</td><td class="py-2 px-3 text-center text-emerald-400 font-bold">9/9</td><td class="py-2 px-3 text-center">3/9</td><td class="py-2 px-3 text-center">4/9</td><td class="py-2 px-3 text-center">3/9</td><td class="py-2 px-3 text-center">3/9</td><td class="py-2 px-3 text-center">3/9</td></tr> <tr><td colspan="7" class="pt-5 pb-2 text-xs uppercase tracking-wider text-gray-500 font-semibold">Benchmarks</td></tr> <tr class="border-b border-gray-900"><td class="py-2 px-3">ToolEmu safety (accuracy)</td><td class="py-2 px-3 text-center text-emerald-400 font-semibold">100.00%</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td></tr> <tr class="border-b border-gray-900"><td class="py-2 px-3">ToolEmu safety (FPR)</td><td class="py-2 px-3 text-center text-emerald-400 font-semibold">0.00%</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td><td 
class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td></tr> <tr class="border-b border-gray-900"><td class="py-2 px-3">LongMemEval E2E</td><td class="py-2 px-3 text-center text-emerald-400 font-semibold">89.20%</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center">84.23% / 94.87%</td><td class="py-2 px-3 text-center">71.20%</td><td class="py-2 px-3 text-center">49.00%</td><td class="py-2 px-3 text-center">85.40%</td></tr> <tr class="border-b border-gray-900"><td class="py-2 px-3">LongMemEval R@5</td><td class="py-2 px-3 text-center text-emerald-400 font-semibold">97.80%</td><td class="py-2 px-3 text-center">96.60%</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center text-gray-600">—</td></tr> <tr><td class="py-2 px-3">Tokens per correct</td><td class="py-2 px-3 text-center text-emerald-400 font-semibold">~10K</td><td class="py-2 px-3 text-center text-gray-600">—</td><td class="py-2 px-3 text-center">~105–119K</td><td class="py-2 px-3 text-center">~70K</td><td class="py-2 px-3 text-center">~31K</td><td class="py-2 px-3 text-center">~29K</td></tr> </tbody> </table> </div> <p class="text-xs text-gray-500 mt-4 italic">AXME values and MemPalace R@5 are measured. Competitor LongMemEval scores are from published results; token counts are estimates from each system's methodology. Full breakdown, footnotes, and reproduction instructions in the <a href="https://github.com/AxmeAI/axme-code/blob/main/benchmarks/README.md" class="text-emerald-400 hover:underline">benchmarks README</a>.</p> </div>

While other tools provide basic semantic search, they miss the critical primitives I needed: structured decisions with enforcement, pre-execution safety hooks, and multi-repo workspace awareness. I didn't want a "memory palace"; I wanted a durable harness.

Deterministic Guardrails: Safety Hooks

Prompts are a weak defense. If an agent has terminal access, I need a deterministic blocking layer. Relying on an agent's "best effort" to remember a safety rule is how your Friday night gets ruined.

AXME Code implements hook-based safety guardrails — pre-tool-use intercepts at the Claude Code harness level, independent of the agent's prompt. In our ToolEmu benchmark, AXME achieved 100% safety accuracy with 0% false positives on the tested cases.

Agent: "I'll force push to update the branch..."
$ git push --force origin main

❌ BLOCKED by pre-tool-use hook
   Rule:    force push denied on protected branch 'main'
   Result:  the command was NOT executed

The hook fires before the tool call reaches the shell. There is no path that lets the agent talk its way around it — even in bypassPermissions mode.
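The core of such a check can be sketched as a pure function that looks at the command string before anything runs. This is an illustration under assumptions, not AXME's hook code: `check_command` and `PROTECTED_BRANCHES` are hypothetical names, and the wiring that feeds the tool call in from the harness and reports the block back is left out.

```python
import shlex
from typing import Optional

PROTECTED_BRANCHES = {"main"}  # hypothetical rule set for illustration


def check_command(command: str) -> Optional[str]:
    """Return the rule that blocks `command`, or None if it may run.

    Only inspects a single plain `git push` invocation. Chained commands,
    options that take a separate value, and pushes with no explicit
    refspec are not handled in this sketch.
    """
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Design choice for the sketch: fail closed on unparseable input.
        return "unparseable command denied"
    if tokens[:2] != ["git", "push"]:
        return None

    args = tokens[2:]
    forced = any(
        a in ("--force", "-f") or a.startswith("--force-with-lease")
        for a in args
    )
    positional = [a for a in args if not a.startswith("-")]
    refspecs = positional[1:]  # the first positional argument is the remote

    for ref in refspecs:
        plus_forced = ref.startswith("+")  # "+main" forces that one ref
        target = ref.lstrip("+").split(":")[-1]
        if target.startswith("refs/heads/"):
            target = target[len("refs/heads/"):]
        if target in PROTECTED_BRANCHES and (forced or plus_forced):
            return f"force push denied on protected branch '{target}'"
    return None


print(check_command("git push --force origin main"))
print(check_command("git push origin main"))
```

The point of the design is that the decision is plain code over the command text, so the same input always yields the same verdict regardless of what the prompt says.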

Performance at Scale: ~10x Fewer Tokens per Correct Answer

One thing that always bothered me about existing memory systems was the background LLM noise. Tools built on an Observer/Reflector pattern have the agent constantly narrating its own thoughts to maintain state. By our estimates, that can cost 100K+ tokens per correct answer (roughly 105K to 119K for Mastra).

I engineered AXME Code on a semantic Reader + Judge architecture: 2 LLM calls per question at evaluation time, no continuous narration loop.

<div class="not-prose my-10" style="width: min(95vw, 1024px); margin-left: 50%; transform: translateX(-50%);"> <div class="bg-gray-900 border border-gray-800 rounded-xl p-6"> <h3 class="text-white font-semibold mb-3 text-center">Token efficiency on LongMemEval</h3> <img src="/benchmarks-token-efficiency.svg" alt="LongMemEval token efficiency chart — AXME Code sits in the top-right corner (high accuracy, fewer tokens per correct answer) vs competitors." class="w-full rounded-lg" /> <p class="text-xs text-gray-500 mt-4 text-center italic">Top-right = best. In our LongMemEval-based benchmark, AXME used ~10× fewer tokens per correct answer than Mastra while delivering competitive accuracy. Model-agnostic — pricing changes, token counts don't.</p> </div> </div>

AXME's tokens are measured from a 500-question run; competitor tokens are estimated from each system's published methodology (Observer per turn, Reflector aggregation, fact extraction, graph construction). Full methodology and reproduction instructions in the benchmarks README.
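The arithmetic behind the "~10x" headline is simple enough to check by hand. The sketch below assumes that "tokens per correct" means total tokens for the run divided by the number of correct answers, and that the 89.20% E2E score comes from the same 500-question run; the total used in the last line is an illustrative figure implied by the rounded ~10K, not a reported measurement.

```python
def tokens_per_correct(total_tokens: int, num_correct: int) -> float:
    """Assumed definition: all tokens spent on the run / correct answers."""
    if num_correct <= 0:
        raise ValueError("need at least one correct answer")
    return total_tokens / num_correct


# Figures quoted in this post.
questions = 500
axme_accuracy = 0.892                   # LongMemEval E2E
axme_tpc = 10_000                       # ~10K tokens per correct answer (measured)
mastra_tpc_range = (105_000, 119_000)   # estimated from published methodology

correct = round(questions * axme_accuracy)
low, high = (t / axme_tpc for t in mastra_tpc_range)

print(f"correct answers: {correct}")
print(f"Mastra vs AXME: {low:.1f}x to {high:.1f}x tokens per correct answer")

# Illustrative total only: 446 correct answers at ~10K tokens each.
print(tokens_per_correct(4_460_000, correct))
```

Since one side of the ratio is measured and the other is estimated, the result is best read as an order-of-magnitude comparison rather than a precise multiplier.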

With no narration loop running in the background, the token budget goes to actual code generation instead of state upkeep.

The Anxiety of the Pause: Block, Handoff, Resume

In a professional repo, work is rarely synchronous. But in a stateless shell, "waiting" is where context goes to die. I used to feel a constant pressure to answer the agent immediately. I knew that if I closed the terminal while I went to find an API key, I'd return to a "cold" session and pay the re-onboarding tax all over again.

AXME's approach is more boring than "asynchronous human-in-the-loop (HITL)", and that's the point:

1. A safety hook blocks the dangerous command and returns the rule that triggered it.
2. When the session closes, the background auditor writes a structured handoff: what was being attempted, what blocked it, what the in-progress state is.
3. When you start the next session, the agent loads that handoff before doing anything else. It picks up at the exact point of the block, not at the top of the conversation.
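The write-at-close, load-at-start cycle above can be sketched in a few lines. Everything here is an assumption for illustration: the `Handoff` fields mirror the three items named in the steps, the file name is hypothetical, and JSON stands in for the markdown and YAML files AXME actually uses so that the example needs only the standard library.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass
class Handoff:
    attempted: str    # what the agent was trying to do
    blocked_by: str   # the rule that stopped it
    in_progress: str  # state of the unfinished work


def write_handoff(path: Path, handoff: Handoff) -> None:
    """Persist the handoff when the session closes."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(asdict(handoff), indent=2), encoding="utf-8")


def load_handoff(path: Path) -> Optional[Handoff]:
    """Load the previous handoff at session start, if there is one."""
    if not path.exists():
        return None
    return Handoff(**json.loads(path.read_text(encoding="utf-8")))


with tempfile.TemporaryDirectory() as root:
    handoff_file = Path(root) / "handoff.json"  # hypothetical location
    write_handoff(
        handoff_file,
        Handoff(
            attempted="git push --force origin main",
            blocked_by="force push denied on protected branch 'main'",
            in_progress="branch rebased locally, not yet pushed",
        ),
    )
    resumed = load_handoff(handoff_file)
    print(resumed.blocked_by)
```

A missing file simply yields `None`, so a first-ever session and a resumed one go through the same startup path.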

The result is what you actually want: you can walk away mid-task without losing context, and when you come back the agent already knows what's unresolved. The work doesn't have to be synchronous to be durable.

Conclusion: A Durable Harness, Not a Memory Palace

AXME Code is my response to the exhaustion of zero-knowledge development. We've moved past treating agent memory as just "store the chat log" — toward a durable harness where the Project Oracle provides the ground truth, structured decisions carry enforcement levels, and hooks turn safety from a polite request into a hard intercept.

It's time to move from prompting agents to running them on infrastructure that survives the gaps between sessions — the way our codebases already do.

code.axme.ai · github.com/AxmeAI/axme-code