The Amnesia Problem
Every time you start a new conversation with an AI assistant, it forgets everything. Not just the last conversation: every conversation, every lesson learned, every mistake corrected. Andrej Karpathy gave this problem its sharpest framing in a widely shared 2025 talk at YC AI Startup School:
"LLMs are a bit like a coworker with anterograde amnesia β they don't consolidate or build long-running knowledge or expertise once training is over, and all they have is short-term memory (context window)."
The diagnosis is precise. LLM weights are frozen at training time, and the context window (the model's working memory) resets entirely at the start of every session. Whatever reasoning, corrections, or domain knowledge accumulates during a conversation is discarded. For years, the standard response to this structural limitation was Retrieval-Augmented Generation (RAG): index a human-built knowledge base and retrieve relevant chunks on demand. It addressed the symptom without touching the underlying architecture, and as agents take on longer, more complex tasks, its limitations have become harder to ignore.
Why RAG Is Not Enough
RAG has a fundamental architectural flaw: it has no write path. Every query begins from the same fixed document index. The system cannot learn that two documents are connected, cannot flag contradictions discovered during a session, and cannot accumulate the kind of tacit understanding that distinguishes an expert from a novice. The Letta team (creators of MemGPT) put it plainly: RAG is "a stateless, read-only retrieval pipeline where every query begins fresh."
The numbers now back this up. Mem0's State of AI Agent Memory 2026 report benchmarked memory approaches on the LOCOMO multi-session conversation dataset:
| Approach | Accuracy | P95 Latency | Tokens/Query |
|---|---|---|---|
| Full-context (all history) | 72.9% | 17.12s | ~26,000 |
| Mem0 graph memory | 68.4% | 2.59s | ~1,800 |
| Mem0 vector memory | 66.9% | 1.44s | ~1,800 |
| RAG | 61.0% | 0.70s | n/a |
| OpenAI built-in memory | 52.9% | n/a | n/a |
RAG scores 61% accuracy, well below graph-based agent memory at 68.4% and nearly 12 points below full context. But full context costs roughly 26,000 tokens per query; as the report concludes, "A system that scores well on accuracy but requires 26,000 tokens per query is not production-viable." The implication is that structured, LLM-generated memory, close to full context in accuracy and close to RAG in cost, is the missing middle ground.
There is also a subtler problem that benchmarks rarely capture: human-written knowledge bases are optimised for human readers, not for LLMs. Humans write in narrative prose with cultural context, implicit assumptions, and rhetorical structure. An LLM reading such text must spend representational capacity on decoding all of that before extracting the underlying information. This is inefficient, and it introduces noise.
The Core Insight: Let the LLM Write for Itself
The key insight that motivates a new generation of memory architectures is deceptively simple: memory generated by an LLM is more efficiently read by an LLM than memory written by a human.
When an LLM writes a summary, a structured note, or an entry in its own knowledge base, it naturally organises information in the format it finds most useful for retrieval β using the vocabulary, abstraction level, and logical structure that map most directly onto its own internal representations. Reading this back is fundamentally cheaper than parsing human prose.
Karpathy formalised this idea in his April 2026 GitHub Gist llm-wiki. He proposes replacing stateless RAG with an LLM-maintained wiki, a structured collection of Markdown files that the agent incrementally builds, cross-references, and keeps current:
"The wiki is a persistent, compounding artifact β contradictions already flagged, cross-references already there. A single source might touch 10β15 wiki pages. Good answers can be filed back into the wiki as new pages."
The proposed architecture is three-layer: raw/ (immutable source documents) → wiki/ (LLM-generated Markdown, continuously updated) → CLAUDE.md (schema and instructions for the agent). The agent reads from the wiki, not the raw sources; synthesis and cross-referencing are already done and stored, not repeated on every query.
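A minimal sketch of how that loop might look in code. The directory names follow the gist; the `ingest` helper and the `llm_file_into_wiki` call are hypothetical stand-ins for whatever LLM invocation the agent actually uses:

```python
from pathlib import Path

RAW = Path("raw")            # immutable source documents
WIKI = Path("wiki")          # LLM-generated Markdown, continuously updated
SCHEMA = Path("CLAUDE.md")   # schema and instructions the agent follows

def ingest(source: Path, llm_file_into_wiki) -> list[Path]:
    """File one raw source into the wiki. `llm_file_into_wiki` is a
    hypothetical LLM call that reads the source plus the current wiki
    pages and returns {page_name: updated_markdown}; a single source
    may touch many pages."""
    pages = {p.stem: p.read_text() for p in WIKI.glob("*.md")}
    updates = llm_file_into_wiki(source.read_text(), pages)
    touched = []
    for name, markdown in updates.items():
        page = WIKI / f"{name}.md"
        page.write_text(markdown)   # wiki pages are rewritten in place;
        touched.append(page)        # raw/ is never modified
    return touched
```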
The community response was immediate. Researcher Yuchen Jin wrote:
"Karpathy's 'LLM Wiki' pattern: stop using LLMs as search engines over your docs. Use them as tireless knowledge engineers who compile, cross-reference, and maintain a living wiki. Humans curate and think."
Karpathy's follow-up praised one user's implementation, "Farzapedia", a personal Wikipedia built on the same pattern, and noted his preference for this explicit, transparent approach over commercial AI systems that claim to "get better the more you use them" through opaque background learning.
Industry Convergence: Everyone Is Building Memory
The recognition that persistent, self-written memory is essential has driven rapid movement across the entire AI industry in 2025–2026.
Claude Context Management: Memory Tool + Context Editing
Anthropic shipped two developer tools addressing the memory problem directly. The Memory Tool gives Claude agents a file-based system for storing and retrieving information outside the context window that persists across conversations. Context Editing automatically removes stale tool calls as the conversation nears token limits. Combined, they delivered a 39% performance improvement and an 84% reduction in token consumption in a 100-turn web search evaluation. Anthropic's framing: "Context windows have limits, but real work doesn't." Automatic memory (conversation summarisation) rolled out to Claude Pro users in October 2025 and to Enterprise in early 2026.
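The underlying pattern is small enough to sketch. The handler below is illustrative only: the command names and file layout are assumptions, not Anthropic's actual Memory Tool schema, but they show the key property that the model, not the developer, decides what gets written, and the files outlive any single context window:

```python
from pathlib import Path

MEMORY_DIR = Path("memories")  # persists across conversations

def handle_memory_command(cmd: dict) -> str:
    """Execute one memory command issued by the model. The 'op' names
    (write/read/list) are illustrative, not Anthropic's schema."""
    MEMORY_DIR.mkdir(exist_ok=True)
    if cmd["op"] == "write":
        (MEMORY_DIR / cmd["path"]).write_text(cmd["content"])
        return f"saved {cmd['path']}"
    if cmd["op"] == "read":
        return (MEMORY_DIR / cmd["path"]).read_text()
    return "\n".join(p.name for p in sorted(MEMORY_DIR.iterdir()))

# e.g. the model ends a session with:
# handle_memory_command({"op": "write", "path": "project-notes.md",
#                        "content": "- user prefers pytest over unittest"})
```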
ChatGPT Cross-Session Memory
OpenAI expanded ChatGPT's memory to reference all past conversations, not just the current session, for Plus and Pro users in April 2025, then for free users in June 2025. The feature attracted both praise and notable criticism: Simon Willison wrote "I really don't like ChatGPT's new memory dossier," flagging opacity and privacy concerns about the implicit profile being built without explicit user review. This highlighted a key open problem in agent memory design: users need visibility and control over what the agent remembers.
Letta Code: Memory-First Coding Agent
Letta launched Letta Code, a coding agent built explicitly around persistent memory, in December 2025; it ranked as the #1 model-agnostic open-source coding agent on TerminalBench. Their key insight: "The more you work with an agent, the more context and memory it accumulates, and the better it becomes." In March 2026 they announced a shift to git-backed filesystem memory ("MemFS"): "You own the memory. You choose the model. You can see how the system works, down to the exact input and output tokens of the LLM itself."
Gemini Code Assist: Dynamic Coding Standards Memory
Google built persistent memory into Gemini Code Assist for pull request reviews, described as "a dynamic, evolving memory of team coding standards, style, and best practices derived from interactions and feedback within PRs." This is a concrete production example of LLM-generated institutional memory: the system learns your team's conventions from usage, not from a human-written style guide it must re-read every time.
The Agent Harness: Memory as Infrastructure
The 2025–2026 period has seen the emergence of a new engineering discipline: context engineering, or more broadly, harness engineering. LangChain's influential July 2025 post defined context engineering as "the delicate art and science of filling the context window with just the right information for the next step," noting that at Cognition (makers of Devin), it is described as "effectively the #1 job of engineers building AI agents."
(Diagram: new information is distilled into a structured note, written to the memory store, and read back into the agent's own context.)
Paul Iusztin's March 2026 piece "Agentic Harness Engineering: LLMs as the New OS" defines the harness as "every piece of code, configuration, and execution logic that is not the model itself," and offers a striking result: changing only the harness, not the model, lifted an agent from outside the top 30 to the top 5 on TerminalBench 2.0. The model is necessary but not sufficient; the harness is where competitive advantage lives.
The OpenDev agent paper (arXiv:2603.05344, March 2026) formalises this with a dual-memory architecture: episodic memory (session history) paired with working memory (current iteration state), using "adaptive context compaction" to progressively reduce older observations as the context fills. The key principle: "Memory is always flushed to disk before being discarded from context." The filesystem is not a fallback; it is the primary memory substrate.
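A minimal sketch of that flush-before-discard principle, assuming a simple character-count budget and a hypothetical `summarise` LLM call (the paper's actual compaction policy is more sophisticated):

```python
import json
from pathlib import Path

EPISODIC_LOG = Path("episodic.jsonl")  # durable session history on disk

def total_chars(messages: list[dict]) -> int:
    return sum(len(m["content"]) for m in messages)

def compact(messages: list[dict], budget: int, summarise) -> list[dict]:
    """Evict the oldest messages when over budget, writing each one to
    the episodic log BEFORE it leaves the in-context list, then insert
    a single LLM-condensed note in their place."""
    evicted = []
    while total_chars(messages) > budget and len(messages) > 2:
        evicted.append(messages.pop(1))      # index 0 is the system prompt
    if evicted:
        with EPISODIC_LOG.open("a") as f:    # flush to disk first...
            for m in evicted:
                f.write(json.dumps(m) + "\n")
        note = summarise(evicted)            # ...then compress for context
        messages.insert(1, {"role": "user",
                            "content": f"[compacted history] {note}"})
    return messages
```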
ICLR 2026 dedicated an entire workshop to this theme (MemAgents: Memory for LLM-Based Agentic Systems), a signal that the field has crossed from engineering practice into mainstream ML research.
Key Research Systems
Reflexion: Learning Through Verbal Self-Reflection
Reflexion was one of the earliest demonstrations that LLM-generated memory outperforms human-provided context. After each failed attempt, the agent writes a verbal self-critique in natural language, stores it in an episodic memory buffer, and re-injects it on the next attempt. No gradient updates, no model retraining; purely self-written memory. Reflexion achieved 91% pass@1 on HumanEval, surpassing GPT-4's 80% at the time.
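The loop is simple enough to sketch end to end. `attempt`, `evaluate`, and `reflect` below are stand-ins for the agent's LLM calls and the environment's pass/fail signal (for HumanEval, the unit tests):

```python
def reflexion_loop(task: str, attempt, evaluate, reflect, max_trials: int = 4):
    """Sketch of the Reflexion pattern: no weight updates, just a growing
    buffer of self-written critiques re-injected on every retry."""
    memory: list[str] = []                 # episodic buffer of critiques
    for _ in range(max_trials):
        solution = attempt(task, memory)   # memory is prepended as context
        ok, feedback = evaluate(solution)  # e.g. run the test suite
        if ok:
            return solution
        # The agent writes a verbal self-critique of its own failure,
        # which becomes context for the next attempt.
        memory.append(reflect(task, solution, feedback))
    return solution
```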
A-MEM: Agentic Memory with Zettelkasten
A-MEM draws on the Zettelkasten note-taking philosophy to build a self-organising memory network. When a new memory is stored, the agent generates structured notes, keywords, and cross-links to existing memories, and can retroactively update older memories as new information arrives. The knowledge base is never static; it evolves through the agent's own operations. A-MEM outperforms all baselines across six foundation models.
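A toy sketch of the insert-and-link operation; the field names are illustrative rather than A-MEM's actual schema, and where A-MEM uses an LLM to judge relatedness, keyword overlap stands in here:

```python
from dataclasses import dataclass, field

@dataclass
class Note:
    """One Zettelkasten-style memory note."""
    text: str
    keywords: list[str]
    links: list[int] = field(default_factory=list)  # ids of related notes

def add_memory(store: dict[int, Note], new: Note) -> int:
    """Insert a note and retroactively link it to related memories."""
    new_id = max(store, default=-1) + 1
    for nid, old in store.items():
        if set(old.keywords) & set(new.keywords):
            old.links.append(new_id)   # older notes are updated in place:
            new.links.append(nid)      # the network is never static
    store[new_id] = new
    return new_id
```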
OpenMemory MCP: Local-First Private Memory
OpenMemory MCP is a local MCP memory server compatible with Claude, Cursor, and Windsurf. All memory is stored on your machine; nothing goes to the cloud. It exposes standardised tools: add_memories, search_memory, list_memories, delete_all_memories. This addresses the privacy gap that critics like Simon Willison identified in ChatGPT's opaque memory system.
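Assuming the official `mcp` Python SDK, a toy server exposing the same four tool names might look like the sketch below; OpenMemory's real implementation persists to a local vector store rather than an in-memory list:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("toy-memory")
STORE: list[str] = []  # OpenMemory persists these locally instead

@mcp.tool()
def add_memories(text: str) -> str:
    STORE.append(text)
    return f"stored memory #{len(STORE)}"

@mcp.tool()
def search_memory(query: str) -> list[str]:
    # OpenMemory uses embedding search; substring match is a stand-in
    return [m for m in STORE if query.lower() in m.lower()]

@mcp.tool()
def list_memories() -> list[str]:
    return STORE

@mcp.tool()
def delete_all_memories() -> str:
    STORE.clear()
    return "memory wiped"

if __name__ == "__main__":
    mcp.run()  # speaks MCP over stdio to Claude, Cursor, or Windsurf
```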
A Concrete Example: Claude Code's Own Memory
An instructive real-world example of LLM-generated memory is Claude Code's own memory system, the system used to write this very article. Rather than asking the user to maintain a context document, Claude Code writes its own Markdown memory files as it works, capturing user preferences, project conventions, key decisions, and lessons learned. In subsequent sessions, it reads these self-generated files to restore context.
Karpathy's 2025 year-in-review specifically highlighted Claude Code as "the first convincing demonstration of what an LLM agent looks like: something that in a loopy way strings together tool use and reasoning for extended problem solving." The memory system is central to why it works: it is not re-reading a user manual on every session, but reading notes it wrote itself, in a format it chose, about the specific project it is working on.
Nous Research took note of the pattern immediately after Karpathy's gist went viral:
"The LLM Wiki by @karpathy is now a built-in skill in Hermes-Agent, give it a try"
Limitations and Open Challenges
LLM-generated memory is not without problems. If the model writes incorrect information into its memory store, those errors persist and compound; the system can become confidently wrong over time. The Mem0 2026 report identifies three open challenges the field has not yet solved: cross-session identity resolution (recognising that the same concept appears across different sessions under different names), memory staleness detection at scale (knowing when a remembered fact is no longer true), and privacy and consent governance (giving users meaningful control over what is retained).
Rohit Ghumare's April 2026 extension of Karpathy's llm-wiki pattern addresses some of these directly, adding memory lifecycle management, confidence scoring, knowledge graphs, and forgetting curves, a direction that suggests the next generation of agent memory systems will be not just persistent, but actively maintained and curated by the agent itself.
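A forgetting curve over a confidence score can be as simple as exponential decay; the half-life and the re-verification behaviour below are illustrative assumptions, not Ghumare's actual parameters:

```python
import math
import time

def retention(confidence: float, last_confirmed: float,
              half_life_days: float = 30.0) -> float:
    """Decay a memory's confidence score by how long it has gone
    unconfirmed. Half-life of 30 days is an arbitrary illustration."""
    age_days = (time.time() - last_confirmed) / 86_400
    return confidence * math.exp(-math.log(2) * age_days / half_life_days)

# A fact confirmed 90 days ago at confidence 0.9 decays to ~0.11,
# so the agent would re-verify it before trusting or citing it.
```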
There is also the question of what the agent does not know to write down. A human expert knows which facts are important enough to document; an LLM agent writing its own memory may systematically under-represent information that seems unimportant at write time but turns out to matter later. Solving this requires either proactive memory writing (anticipating future needs) or retroactive enrichment (updating memory when a past gap becomes apparent), both of which A-MEM and similar systems have begun to address.
Conclusion
The shift from human-built knowledge bases to LLM-generated memory represents a structural change in how capable agents are built β one the entire AI industry has now converged on simultaneously. OpenAI, Anthropic, Google, Letta, LangChain, and the open-source community are all building toward the same core architecture: an agent that writes, organises, and retrieves its own memory, compounding knowledge across sessions rather than starting from zero each time.
The competitive implication is clear. Performance differences between agents in 2026 are increasingly driven not by model size or parameter count, but by memory architecture and harness design. The Mem0 benchmark data shows a nearly 12-point accuracy gap between RAG and full-context approaches, a gap that structured, LLM-generated memory can close at a fraction of the token cost. The intelligence is not just in the model weights. It is in what surrounds them: the memory system, the self-written context, and the engineering decisions that determine what the model knows at each step.