Recursive Language Models: Why Teaching AI to Take Notes Beats Giving It a Bigger Brain
Recursive Language Models outperform massive context windows by teaching AI to take structured notes. Learn why GPT-5-mini beats GPT-5 in long-context reasoning.

I've been obsessing over a problem lately. Not in a healthy way. More in a "staying up too late reading research papers" kind of way.
Here's the problem: we keep making AI context windows bigger. 100k tokens. 1 million. 10 million. The marketing numbers sound impressive, but there's a dirty secret nobody talks about at conferences. The effective context length, what the model actually understands and reasons over, lags way behind those flashy numbers.
You can shove an entire library into the prompt. The model will still get "lost in the middle." It'll hallucinate. It'll confuse details from page one with facts from page ten thousand. We call this phenomenon context rot, and it's been the elephant in the room for anyone building serious AI applications.
Then a paper dropped from MIT CSAIL in late December 2025 that made me sit up straight: "Recursive Language Models" by Alex Zhang, Tim Kraska, and Omar Khattab. It proposes something elegantly counterintuitive. Instead of making the brain bigger, teach it how to take notes.
The Ten Million Page Book Problem
Let me paint you a picture. Imagine someone hands you a book. Not a novel. Not a pamphlet. A book that is ten million pages long. Now I ask you to find one specific sentence about a red bicycle on page four million, two hundred thousand.
How would you approach this?
If you're a current AI model, you try to read the whole book at once. You shove all ten million pages into your short-term memory and pray nothing falls out. Spoiler: something always falls out. The more you cram in, the more confused you get. Your "attention" becomes diffuse, like trying to listen to a thousand people whispering simultaneously.
But if you're a human? You don't memorize the Library of Congress. You learn how to use the card catalog.
This is the fundamental insight behind Recursive Language Models. What if the AI didn't have to read the whole book at once? What if it could write a computer program to open the book, navigate to the specific page, read just that paragraph, and then close it?
Treating Context Like a Variable, Not Like Memory
Here's where things get interesting. In the RLM paradigm, the massive text isn't fed directly into the neural network. Instead, it's stored as a variable in a Python REPL environment. The AI sees the variable name, not the content.
So if I have that ten-million-page book, the model knows: "I have a variable called context, and it contains a lot of text." The model never tries to hold all that text in its attention simultaneously. Instead, it writes code to interact with it.
# The AI might write something like this
chunk = context[4_000_000:4_200_000]  # Grab a specific section
answer = search_for_keyword(chunk, "red bicycle")

This is treating text processing like a data engineering task. And here's where the "Recursive" part comes in. The model can call itself, or a smaller, cheaper model, inside that loop.
Think of it like a CEO delegating tasks. The CEO doesn't read every email in the mailroom. The CEO hires a manager, who hires analysts to read the emails and report back just the important stuff. The main AI (let's call it the Root Agent) orchestrates the work. It writes a script that says: "Take chunk number one. Send this chunk to a sub-agent and ask: Does this text support the user's hypothesis? Return Yes or No."
The researchers tested this using GPT-5 as the Root LLM and GPT-5-mini as the sub-agent. You don't need the smartest, most expensive model to scan for keywords or do basic summarization. You use the "mini" model for the heavy lifting in the loop, and the big model just synthesizes the final answer.
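That delegation loop can be sketched in a few lines. Everything here is illustrative, not the paper's implementation: the `llm_call(model, prompt)` wrapper is a hypothetical stand-in for whatever chat-completion API you use, and the caller passes it in.

```python
def map_reduce_answer(context, question, llm_call, chunk_size=50_000):
    """Root agent: slice the context, delegate chunks to a cheap sub-agent,
    then let the expensive model synthesize only the distilled findings."""
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
    findings = []
    for n, chunk in enumerate(chunks):
        # Cheap sub-agent scans each chunk in a fresh, focused context
        verdict = llm_call("gpt-5-mini", f"Relevant to '{question}'?\n{chunk}")
        if verdict.startswith("YES"):
            findings.append(f"[chunk {n}] {verdict}")
    # Expensive root model never sees the raw 10 million tokens,
    # only the short list of evidence the sub-agents returned
    return llm_call("gpt-5", f"Question: {question}\n" + "\n".join(findings))
```

The key property: the root model's prompt grows with the number of relevant findings, not with the size of the input.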
The Numbers That Made Me Rethink Everything
Usually, when we introduce complexity like loops and code generation, things break. So does this actually work better than just shoving text into a massive context window?
Significantly better. And I don't mean marginally better. I mean the kind of better that makes you question your assumptions.
On the OOLONG benchmark (a brutal test of long-context reasoning), the RLM built on GPT-5-mini more than doubled the number of correct answers from standard GPT-5. Let that sink in. The "dumber" mini model, when organized recursively, beat the "smarter" model that was reading linearly.
Structure beats raw size.
Perhaps the most shocking finding? Performance didn't degrade even as they scaled the input to over 10 million tokens. In a traditional model, the more text you add, the noisier the signal gets. The RLM didn't care. Because it wasn't reading 10 million tokens. It was running queries over them.
On BrowseComp-Plus, a benchmark requiring synthesis across multiple documents, RLM with GPT-5 achieved 91.33% accuracy compared to 70.47% for the summarization baseline and 51% for a CodeAct agent with retrieval. That's not an incremental improvement. That's a paradigm shift.
Why Complexity Matters More Than Length
Not all long-context tasks are created equal. The researchers categorized tasks by how their processing cost scales with input size, and this distinction turned out to be critical.
Finding a single needle in a haystack? That's constant complexity. Double the haystack, same needle. Models handle this reasonably well.
Aggregating information from every line? Linear complexity. Double the input, double the work. This is where traditional models start to sweat.
Analyzing relationships between pairs of entries? Quadratic complexity. Double the input, quadruple the work. This is where things get brutal.
Here's the kicker: on quadratic tasks like OOLONG-Pairs, base models scored less than 0.1% F1. Essentially random. Flipping a coin would be competitive. The RLM? It hit 58%.
The base model doesn't just struggle with quadratic tasks. It completely fails. The recursive architecture's ability to decompose problems and delegate work to sub-agents is the only thing that makes these problems tractable at all. When every entry needs to be compared against every other entry, you can't hold that in your head. You need a system.
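A toy example makes the quadratic-to-tractable shift concrete. This is my own illustration, not from the paper: counting entry pairs that share an author is quadratic if you compare every pair, but a program can restructure it into a single grouping pass.

```python
from collections import Counter
from itertools import combinations

def naive_pair_count(entries):
    # Compare every entry against every other entry: O(n^2)
    return sum(1 for a, b in combinations(entries, 2)
               if a["author"] == b["author"])

def grouped_pair_count(entries):
    # Restructure: group in one linear pass, then count pairs per group
    counts = Counter(e["author"] for e in entries)
    return sum(n * (n - 1) // 2 for n in counts.values())
```

An RLM can discover and write this kind of restructuring itself; a model reading linearly has to hold every pairwise comparison in attention at once.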
Why RAG Falls Short (And When It Matters)

"Wait," you might be thinking. "Isn't this just fancy RAG?" Retrieval Augmented Generation is the current industry standard for handling large contexts. You convert text into vectors, search for semantically similar chunks, and feed those to the model.
RAG works great when the answer looks like the question. But what happens when the answer depends on aggregating widely scattered facts? Or when you need to perform some transformation across the entire structure of a document?
Imagine I asked: "How many times did the sentiment change in this novel?"
RAG can't find that. There isn't one sentence that contains the answer. You need to analyze the whole book systematically.
An RLM can simply write a script:
# Divide novel into chapters
chapters = split_by_chapter(context)
sentiments = []
for chapter in chapters:
    sentiment = analyze_sentiment(chapter)  # Sub-LLM call
    sentiments.append(sentiment)
# Count sentiment changes
changes = count_transitions(sentiments)

RAG retrieves snippets. RLMs execute logic over structure. That's a fundamental difference.
The Attention Mechanism's Dirty Secret
To understand why this matters, we need to talk about the technical heart of transformers: the attention mechanism.
When you have a million tokens, the model calculates relationships between every token and every other token. The "attention map" gets diffuse. It's trying to maintain relevance scores across an impossibly large space, and the specific signal gets lost in the noise.
The RLM solves this by ensuring the model never listens to more than a few "people" at a time. The sub-call only sees a small chunk. Its attention is sharp and focused. Then it passes a compressed summary back to the parent. So the parent isn't dealing with raw noise; it's dealing with refined information.
This is reinventing virtual memory management, but for semantic understanding. We're building an operating system for attention.
The Honest Trade-offs
I'm not going to pretend this is magic. There are real trade-offs.
Latency. If the model has to write a script, execute it, spin up ten sub-agents to read chunks, and aggregate the answers, you're not getting a response in two seconds. The authors acknowledge that their implementation uses blocking, sequential calls. For a 10-million-token task, you might wait three minutes.
But consider the alternative. A human would take months. A standard model would fail or hallucinate. If the RLM takes three minutes but gives you a verifiably correct answer, that's a massive win for heavy-duty research tasks.
This isn't for chatting about cookie recipes. This is for analyzing massive legal discovery dumps. For finding contradictions across twenty years of medical research papers. For deep due diligence on complex topics.
Cost Variance. RLM costs have high variance. Most runs are comparable or even cheaper than base model calls, but some trajectories get expensive. If the model decides to spawn thousands of sub-calls (which Qwen3-Coder was prone to doing), your API bill can spike. You need guardrails: maximum recursion depth, maximum budget per query.
Coding Capability is the Bottleneck. The whole system depends on the model's ability to write good Python to slice and manipulate data. If your "CEO" can't write a clear job description, the interns don't know what to do. The researchers found that models without sufficient coding abilities struggled as RLMs.
This implies something interesting about the future of reasoning models. They might need to be trained primarily as coding models. Code is the language of logic and structure. Natural language is messy. By forcing the model to interact with its memory via code, you're forcing it to be logical and structured.
Emergent Patterns: How RLMs Actually Solve Problems
Even without explicit training for recursive behavior, some fascinating patterns emerged in RLM trajectories.
Filtering with model priors. The RLM would use regex queries to search for keywords it knew to look for based on the question, combined with phrases it had priors about. It doesn't just blindly scan; it uses what it already knows to narrow the search space.
Chunking and recursive decomposition. For information-dense problems, models would chunk by newlines or headers, then spawn a sub-call per chunk. The choice of decomposition strategy matters enormously for task performance.
Answer verification through sub-calls. RLMs would double-check their answers by spawning additional sub-calls with small, focused contexts. This implicitly avoids context rot by using fresh, uncontaminated contexts for verification.
Building outputs through variables. For tasks requiring long outputs (beyond the base model's generation limit), RLMs would construct answers iteratively in REPL variables, stitching together sub-call outputs programmatically.
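That last pattern can be sketched directly. This is a hypothetical illustration, with `summarize` standing in for a sub-LLM call whose individual outputs are short; the final answer is assembled in a REPL variable, so it can exceed any single generation limit.

```python
def build_long_report(sections, summarize):
    """Stitch together many short sub-call outputs into one long answer."""
    report_parts = []  # REPL variable accumulates partial outputs
    for title, text in sections:
        # Each sub-call sees only one section's text, in a fresh context
        report_parts.append(f"## {title}\n{summarize(text)}")
    # Final output assembled programmatically, not generated in one pass
    return "\n\n".join(report_parts)
```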
The Sorcerer's Apprentice Problem

There's a risk here that's worth naming explicitly. If we have models recursively calling themselves, we could accidentally create an infinite loop that burns through API credits in seconds. The while True loop that keeps spawning sub-agents is the AI equivalent of the Sorcerer's Apprentice enchanting brooms to carry water without knowing how to stop them.
The implementation needs strict guardrails. Maximum recursion depth. Budget limits per query. The authors are clear: this isn't magic. It requires careful engineering of the environment.
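Those guardrails can be as simple as a small budget tracker threaded through every sub-call. The class and limits below are illustrative assumptions, not anything specified in the paper:

```python
class BudgetExceeded(RuntimeError):
    """Raised when a trajectory exceeds its recursion or call budget."""

class Guardrails:
    def __init__(self, max_depth=3, max_calls=100):
        self.max_depth = max_depth
        self.max_calls = max_calls
        self.calls = 0

    def check(self, depth):
        # Call this before every sub-agent spawn; fail fast instead of
        # letting a runaway loop burn through API credits
        self.calls += 1
        if depth > self.max_depth or self.calls > self.max_calls:
            raise BudgetExceeded(f"depth={depth}, calls={self.calls}")
```

Failing fast with an exception is deliberate: a partial answer plus a clear error beats a silent multi-thousand-dollar bill.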
Where This Goes Next
This paper dropped just weeks ago. What do I expect to see in the next six months?
Native integration. Right now, RLM is a wrapper running on top of the API. But imagine if model providers built this into the inference engine itself. You send a 10-million-token prompt, and behind the scenes, the server automatically spins up an RLM process to handle it. The user doesn't even know it's happening. They just get a really smart answer to a really long question.
Training for recursion. GPT-5 wasn't explicitly trained to use recursive loops. It just happens to be smart enough to do it. But if we fine-tune models specifically to be good at decomposing tasks and writing memory-management scripts? The efficiency gains could be exponential. We'd be training models to be better managers of their own minds.
Self-reflective attention management. This is the next frontier. Models that don't just process information but actively decide what to load into their "cognitive core," optimizing their own intake in real-time.
The Takeaway
For anyone dealing with massive datasets, whether legal, medical, scientific, or just the endless archives of the internet, this paper represents a pivotal moment.
We stopped trying to make the brain bigger. We started teaching it how to take notes.
It's not about the size of the window. It's about what you do with the view.
Just remember to check your API budget before you run that loop.
The full paper "Recursive Language Models" by Alex L. Zhang, Tim Kraska, and Omar Khattab is available on arXiv. If you're building systems that need to process large contexts, it's worth the read.