Recursive Language Models, in production.
This is a field report. Not a paper review, not a theoretical argument, not a pitch deck. A specific job ran on a specific evening last week and produced specific numbers, and the architecture underneath those numbers is the same architecture the MIT CSAIL team described in arXiv:2512.24601 eight weeks earlier. The point of writing this is to put real production data against the academic claim and tell you what held, what didn’t, and what the implications are for any engineering team thinking about how AI fits into the actual day-to-day of shipping software.
The job: refactor a five-year-old authentication implementation in a mid-sized TypeScript/Node monorepo from a legacy OAuth2 implicit flow with session cookies to OAuth2 authorization code flow with PKCE, refresh token rotation, and an updated token storage model. The codebase is roughly 180k lines across services. Auth touchpoints are scattered across roughly 60 files — middleware, route handlers, client libraries, test suites, fixture builders, and the migration scripts that touch any of the above. A competent engineer who has done this kind of migration before would estimate three to five days of focused work, plus testing, plus the inevitable follow-up week of catching the edge cases that didn’t show up in dev.
We ran it on HotCopy. The job completed in 6 minutes and 14 seconds. Total inference cost: $1.34. The output was a working migration whose tests passed on the first run, after one recursion cycle to fix a single edge case the orchestrator flagged before declaring done.
The next 14 minutes of your reading are the technical explanation of why those numbers are achievable on a managed RLM stack running on Cloudflare Workers AI, what specifically the orchestrator-plus-scout pattern compresses out of the traditional engineering workflow, and where this approach falls apart — because it does, in specific cases worth naming.
What the paper actually proposed
The Zhang, Kraska, and Khattab paper from MIT CSAIL (arXiv:2512.24601, December 2025) does not propose a new model. It proposes an inference strategy. The key reframing: treat the prompt as an external, programmable environment that the model can decompose, examine, and work through by recursively calling itself on snippets, rather than as a static input the model has to consume in one forward pass.
The paper’s headline empirical claim is that a Recursive Language Model implementation running on a smaller base model can process inputs two orders of magnitude beyond the model’s nominal context window, and can outperform vanilla frontier LLMs and common long-context scaffolds on diverse long-context tasks at comparable or lower per-query cost. The specific benchmark numbers in the paper are interesting but not the point I want to make. The point I want to make is structural: the paper validates an architectural pattern that has been kicking around the agent-frameworks community for two years, but treats it as a serious inference paradigm rather than a hack. That’s the unlock.
The pattern, in plain language: a more capable “orchestrator” model holds the high-level plan and decomposes the work. Less expensive “scout” models do the parallel sub-tasks — read this file, summarize that function, validate this constraint, draft this implementation. Results return to the orchestrator. The orchestrator synthesizes, validates, and either declares done or recurses on the failing parts. No single model ever has to hold the whole problem at once.
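To make the control flow concrete, here is a minimal sketch of that loop: fan out, collect, accept what passed, and recurse only on what failed. It is an illustration of the pattern, not HotCopy's implementation. `run_scout`, `orchestrate`, the mission names, and the simulated first-pass failure are all hypothetical, and a real scout would call a model instead of returning a canned report.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Report:
    mission: str
    ok: bool
    findings: str


def run_scout(mission: str, attempt: int) -> Report:
    """Stand-in for one cheap model call over a small slice of context."""
    # Hypothetical failure: the "middleware" mission needs a second pass.
    ok = not (mission == "middleware" and attempt == 0)
    return Report(mission, ok, f"{mission}: attempt {attempt}")


def orchestrate(missions, max_rounds=3):
    """Fan out in parallel, keep what passed, re-dispatch only what failed."""
    accepted = {}
    pending = list(missions)
    with ThreadPoolExecutor(max_workers=8) as pool:
        for attempt in range(max_rounds):
            if not pending:
                break
            # One wave: every pending mission runs at the same time.
            reports = list(pool.map(run_scout, pending, [attempt] * len(pending)))
            accepted.update({r.mission: r for r in reports if r.ok})
            pending = [r.mission for r in reports if not r.ok]
    if pending:
        raise RuntimeError(f"unresolved missions: {pending}")
    return accepted


results = orchestrate(["touchpoints", "token-storage", "middleware"])
```

The recursion here is deliberately narrow: the second round carries only the failed mission, so the cost of a retry scales with what went wrong, not with the size of the job.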
The reason this matters in production is that the engineering workflows we want to compress with AI are not bounded by model capability anymore. They are bounded by context — the amount of code, history, and surrounding system that has to be understood to make a correct change. The traditional approach of “stuff more into the context window” has been hitting diminishing returns since mid-2024. Context rot is real. Models get worse at recall as the window fills, regardless of what the marketing says about token counts. The RLM pattern routes around the problem by making sure no single model invocation has to look at more context than it can actually use.
The execution, step by step
Here is what actually happened during those 6 minutes and 14 seconds.
Seconds 0–8: indexing and dispatch. The CLI shipped the local codebase manifest to the orchestrator — file paths, dependency graph, test surface map. The orchestrator received the task description (the OAuth2 → PKCE migration prompt) and produced an initial dispatch plan. The plan identified 9 parallel scout missions: map all current auth touchpoints, audit token storage patterns, audit session cookie usage, identify all redirect URI configurations, build PKCE code-verifier/challenge generation strategy, identify breaking changes for downstream services, draft the new middleware implementation, identify test surface that needs updating, and identify migration script patterns.
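For readers who have not done this migration, the PKCE piece that one of those missions had to plan is small. The sketch below shows the S256 verifier and challenge derivation as defined in RFC 7636. It is written in Python for brevity and is not the code the scouts produced; the function names are my own.

```python
import base64
import hashlib
import secrets


def make_code_verifier(num_bytes: int = 32) -> str:
    """Random URL-safe verifier; 32 random bytes encode to 43 characters."""
    raw = secrets.token_bytes(num_bytes)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def make_code_challenge(verifier: str) -> str:
    """S256 method: BASE64URL(SHA256(ASCII(verifier))), padding stripped."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


# The client keeps the verifier, sends the challenge with the authorization
# request, and sends the verifier itself when exchanging the code for tokens.
verifier = make_code_verifier()
challenge = make_code_challenge(verifier)
```

The server recomputes the challenge from the verifier at token exchange, which is what lets a public client prove it is the same party that started the flow.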
Seconds 8–94: parallel scout execution, first wave. The 9 scouts ran simultaneously. Each scout received only the files relevant to its mission — between 4 and 18 files per scout, never more than 40k tokens of code. Each scout produced a structured output back to the orchestrator: a findings document with specific file paths, line ranges, current patterns, and proposed changes. Total wall-clock for this wave: 86 seconds. If you imagine those nine missions run sequentially by a single big model, with each one waiting on the previous, the same work takes between 12 and 18 minutes on a frontier model with a 200k-token context, and the cost is substantially higher because each call carries the full project context whether it needs it or not.
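The per-scout context cap is the mechanism doing the work here, so a sketch of it may help. The 40k-token ceiling is the figure from the run; everything else below is an assumption for illustration, including the four-characters-per-token heuristic, the function names, and the file paths.

```python
def estimate_tokens(text: str) -> int:
    """Crude heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def build_scout_context(files, relevant, budget=40_000):
    """Pick files for one mission without exceeding the token budget.

    `relevant` is assumed to be ordered by relevance already.
    Returns (included paths, skipped paths, tokens used).
    """
    included, skipped, used = [], [], 0
    for path in relevant:
        cost = estimate_tokens(files[path])
        if used + cost <= budget:
            included.append(path)
            used += cost
        else:
            # Surfaced rather than dropped silently, so the orchestrator
            # can split the mission instead of starving the scout.
            skipped.append(path)
    return included, skipped, used


# Hypothetical file sizes, in characters.
files = {
    "auth/middleware.ts": "x" * 60_000,
    "auth/session.ts": "x" * 80_000,
    "auth/tokens.ts": "x" * 40_000,
}
included, skipped, used = build_scout_context(files, list(files))
```

With these sizes the first two files fit and the third is reported as skipped, which is the signal to dispatch a second scout instead of widening the first one's context.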
Seconds 94–106: synthesis turn one. The orchestrator received all nine scout reports. It produced a consolidated migration plan: the order of changes, the dependency relationships between them, the rollback strategy, and the test sequencing. This is the part of the work that benefits from a more capable model holding the whole picture — there are roughly nine documents to reconcile, with constraints flowing between them, and the orchestrator has to produce a coherent execution order that respects all of them. 12 seconds of compute on the orchestrator, holding maybe 60k tokens of synthesized scout output. No raw code in this context.
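Producing an execution order that respects every dependency is, at its core, a topological sort. The sketch below uses Python's standard `graphlib` to turn a dependency map into waves of steps that can run in parallel. The step names and their dependencies are hypothetical, and the real orchestrator reasons over scout reports, not a hand-written graph.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each step maps to the set of steps it depends on.
dependencies = {
    "token-storage": set(),
    "pkce-helpers": set(),
    "middleware": {"token-storage", "pkce-helpers"},
    "route-handlers": {"middleware"},
    "migration-script": {"token-storage"},
    "tests": {"route-handlers", "migration-script"},
}


def execution_waves(deps):
    """Group steps into waves; every step in a wave can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the plan is circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(dependencies)
```

For this graph the result is four waves: the two independent foundations first, then the middleware and the migration script, then the route handlers, then the tests.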
Seconds 106–298: parallel scout execution, second wave. Twelve scouts dispatched to actually write the changes. Each scout received the migration plan section relevant to it, the current contents of the files it would modify, and the relevant test fixtures. Each scout produced a unified diff for its files. Some scouts were trivially fast (single-file middleware update, 14 seconds). Some scouts were slower (the migration script generator took 92 seconds because it had to produce SQL that respected the existing schema and the new token rotation model). The wave completed when the last scout returned, at 298 seconds elapsed total.
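The token rotation model is what made the migration script the slow scout, so it is worth showing what rotation means. The sketch below is the generic pattern of rotate-on-use with reuse detection, not the schema or code from this job. The class names and the in-memory dictionaries are stand-ins for whatever table a real service would use.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class TokenFamily:
    used: set = field(default_factory=set)  # tokens already exchanged
    revoked: bool = False


class RefreshTokenStore:
    """In-memory stand-in for the token table."""

    def __init__(self):
        self._families = {}  # family id -> TokenFamily
        self._owner = {}     # refresh token -> family id

    def issue(self) -> str:
        family_id = secrets.token_hex(8)
        token = secrets.token_urlsafe(32)
        self._families[family_id] = TokenFamily()
        self._owner[token] = family_id
        return token

    def rotate(self, presented: str) -> str:
        """Exchange a refresh token for a new one; the old one becomes invalid."""
        family_id = self._owner.get(presented)
        if family_id is None:
            raise PermissionError("unknown refresh token")
        family = self._families[family_id]
        if family.revoked:
            raise PermissionError("token family revoked")
        if presented in family.used:
            # A spent token came back: treat it as theft and kill the family.
            family.revoked = True
            raise PermissionError("refresh token reuse detected")
        family.used.add(presented)
        new_token = secrets.token_urlsafe(32)
        self._owner[new_token] = family_id
        return new_token


store = RefreshTokenStore()
first = store.issue()
second = store.rotate(first)  # normal refresh: `first` is now spent
```

Revoking the whole family on reuse is the design choice that matters: once a spent token reappears, the store cannot tell the legitimate client from the attacker, so both are forced to sign in again.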
Seconds 298–342: synthesis turn two and validation. The orchestrator received twelve unified diffs. It validated them against the migration plan: did every planned change get made, are there any conflicts in the diffs that would prevent clean application, are there any imports or references that one scout introduced that another scout removed. This is the part of the work that prevents the multi-agent system from producing the classic failure mode: six agents making locally correct changes that globally don’t compose. The orchestrator caught one such conflict: the new middleware implementation referenced a utility function that the token storage refactor had renamed. It dispatched a tiny corrective scout (the recursion the paper talks about) to fix the import, and that scout returned at 342 seconds.
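The cross-scout check for references that one scout introduced and another removed can be sketched as a set comparison. This assumes each scout reports which symbols its diff removes or renames away and which symbols its new code references. That reporting format, the `ChangeSet` type, and the symbol names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeSet:
    scout: str
    removes: set = field(default_factory=set)     # symbols deleted or renamed away
    references: set = field(default_factory=set)  # symbols the new code imports or calls


def find_conflicts(change_sets):
    """Return (referencing scout, symbol, removing scout) for each broken reference."""
    removed_by = {}
    for cs in change_sets:
        for symbol in cs.removes:
            removed_by[symbol] = cs.scout
    conflicts = []
    for cs in change_sets:
        for symbol in sorted(cs.references):
            remover = removed_by.get(symbol)
            if remover is not None and remover != cs.scout:
                conflicts.append((cs.scout, symbol, remover))
    return conflicts


# Mirrors the shape of the conflict described above, with made-up names.
changes = [
    ChangeSet("token-storage", removes={"getSessionToken"}),
    ChangeSet("middleware", references={"getSessionToken", "verifyPkce"}),
]
conflicts = find_conflicts(changes)
```

Each conflict tuple is enough to scope a corrective scout: one symbol, one referencing file set, one rename to follow.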
Seconds 342–374: test run and final check. The orchestrator applied the synthesized changes to a sandbox copy of the codebase and ran the existing test suite plus the new tests that the test-surface scout had drafted in the first wave. All tests passed. The orchestrator declared the job complete and surfaced the unified diff for review. Total wall-clock: 6 minutes 14 seconds. Total scout invocations: 22 (9 + 12 + 1 corrective). Total orchestrator turns: 3 (initial dispatch, synthesis one, synthesis two with validation).
The numbers, broken out:
- Scout invocations, 22 total: $0.92
- Orchestrator turns, 3 total: $0.42
- Total inference: $1.34
All of that for a refactor that would otherwise consume between three and five engineer-days, at a fully loaded cost of somewhere between $4,000 and $8,000.
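The arithmetic is easy to check from the figures above. The one assumption in this sketch is the eight-hour engineer-day, which the report does not define; the phase durations are read off the timeline.

```python
scout_cost, orchestrator_cost = 0.92, 0.42
total_cost = round(scout_cost + orchestrator_cost, 2)

# Phase durations in seconds, taken from the timeline boundaries.
phases = {
    "dispatch": 8,
    "wave 1": 86,
    "synthesis 1": 12,
    "wave 2": 192,
    "synthesis 2": 44,
    "tests": 32,
}
wall_clock = sum(phases.values())  # 374 s, i.e. 6 min 14 s

HOURS_PER_DAY = 8  # assumption: the report does not define an engineer-day
manual_seconds = [days * HOURS_PER_DAY * 3600 for days in (3, 5)]

# Ratios for the low and high ends of the manual estimate.
speedup = [round(s / wall_clock) for s in manual_seconds]
cost_ratio = [round(c / total_cost) for c in (4000, 8000)]

print(total_cost, wall_clock, speedup, cost_ratio)
```

On those assumptions the wall-clock ratio is roughly 230× to 385× and the cost ratio is roughly 3,000× to 6,000×, before counting the engineer's review time on the resulting diff.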
Why Workers AI matters specifically
The architectural argument for running this pattern on Cloudflare Workers AI rather than on traditional cloud inference is more subtle than the GPU-utilization argument I made in last month’s post, and worth being specific about.
The orchestrator-plus-scout pattern is bursty. For 86 seconds near the start of the job, nine scouts ran in parallel. During the second wave, twelve were dispatched at once, and the fast ones finished long before the slow ones. During the synthesis turns and the test run, almost no scout capacity was in use at all. If you provision GPU capacity for the peak fan-out (and you have to, because if you don’t, your scouts queue and your wall-clock blows up), you provision for 12-scout parallel and much of that capacity sits idle for much of the job, then idle entirely between jobs. Multiply this across thousands of jobs running asynchronously across many users, and the average utilization lands in the neighborhood of the 5% Cast AI documented as the production-cluster average.
Pay-per-inference inverts this. We pay for the 22 scout invocations and the 3 orchestrator turns that actually ran. We do not pay for the capacity those invocations could have used but didn’t. The fan-out is free as long as the work fits — twelve scouts in parallel costs twelve scouts in parallel, not “a reserved 12-GPU cluster sitting idle when the scouts complete.” For a workload shape this bursty, the pricing model is the difference between $1.34 in inference and $50 in amortized cluster cost for the same job.
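A rough way to see the utilization problem is to compute it for a reserved pool sized to the peak fan-out. The sketch below uses the timeline from this job to get an upper bound on scout busy time, by assuming every scout ran for its entire wave (it did not). The one-job-per-hour load is a hypothetical figure, chosen only to show how idle time between jobs dominates.

```python
def utilization(busy_slot_seconds: float, slots: int, window_seconds: float) -> float:
    """Fraction of reserved capacity that did useful work."""
    return busy_slot_seconds / (slots * window_seconds)


# Upper bound: 9 scouts for 86 s, 12 scouts for 192 s, 1 corrective scout for 44 s.
# This overstates busy time, since at least one second-wave scout took 14 seconds.
busy_upper_bound = 9 * 86 + 12 * 192 + 1 * 44  # 3122 slot-seconds

# Twelve reserved slots held for the duration of the job only.
within_job = utilization(busy_upper_bound, slots=12, window_seconds=374)

# Hypothetical load: the same pool stays reserved but sees one such job per hour.
one_job_per_hour = utilization(busy_upper_bound, slots=12, window_seconds=3600)

print(f"{within_job:.1%} within the job, {one_job_per_hour:.1%} at one job per hour")
```

Even the generous bound stays under 70% inside the job, and it falls to single digits once the pool has to sit reserved between jobs. With pay-per-inference, that denominator is simply not billed.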
The second reason Workers AI matters specifically is placement. The CLI is shipping diffs back to the engineer’s machine in real time. The orchestrator state is in a Durable Object close to wherever the developer is: Cloudflare runs in 330+ cities, and the latency to the nearest one is typically single-digit milliseconds. The scouts run in regional inference centers near the orchestrator. The data path from “engineer presses enter” to “first scout returns” is short by construction, because the entire stack is co-located on the same edge network. On a traditional setup, the CLI calls an API gateway in us-east-1, which calls an orchestrator service, which calls an inference cluster, which routes to a GPU, which returns up the same chain. The round-trip latency overhead alone can double the wall-clock time of a job this short.
The third reason — and this is the one most engineers underestimate until they’ve run it — is that the orchestrator itself is a Durable Object. The full transcript of the job, every scout dispatch, every scout return, every synthesis turn, every recursion decision, is recorded in the DO’s embedded SQL database at the moment it happens. When the developer comes back to the CLI three hours later to inspect what got changed and why, the entire reasoning trace is still there. No separate logging service. No “where did the orchestrator state go” question. The same primitive that runs the orchestration also records the orchestration, by virtue of being the same Durable Object. I wrote about this primitive specifically last quarter; the application to RLM orchestration is one of the cleanest payoffs of the DO-based architecture I’ve found.
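The idea of state and transcript living in the same place can be sketched as an append-only event table. The example uses Python's `sqlite3` as a stand-in for the Durable Object's SQLite-backed storage; it is not the Workers API, and the table layout, class name, and event kinds are hypothetical.

```python
import json
import sqlite3
import time


class JobTranscript:
    """Append-only event log stored next to the orchestration state."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
            " recorded_at REAL NOT NULL,"
            " kind TEXT NOT NULL,"
            " payload TEXT NOT NULL)"
        )

    def record(self, kind: str, **payload) -> None:
        """Write one event and commit it at the moment it happens."""
        with self.conn:
            self.conn.execute(
                "INSERT INTO events (recorded_at, kind, payload) VALUES (?, ?, ?)",
                (time.time(), kind, json.dumps(payload)),
            )

    def replay(self):
        """Return every event in the order it was recorded."""
        rows = self.conn.execute("SELECT kind, payload FROM events ORDER BY seq")
        return [(kind, json.loads(payload)) for kind, payload in rows]


transcript = JobTranscript(sqlite3.connect(":memory:"))
transcript.record("dispatch", scouts=9)
transcript.record("scout_return", mission="token-storage")
transcript.record("recursion", reason="renamed utility import")
```

Calling `transcript.replay()` three hours later returns the same three events in order, which is the whole point: the reasoning trace needs no pipeline of its own.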
What the speed actually means
Six minutes for a three-day refactor is the headline number. It is also, on its own, the wrong way to think about what changed.
The interesting thing about compressing a multi-day workflow into a six-minute job is not that the work happens faster. It is that the cost of attempting the work drops to nearly zero. An engineer evaluating whether to do a three-day refactor goes through a cost-benefit calculation: is this worth my next three days, given everything else on my plate, given the political work of getting buy-in from stakeholders, given the risk that I’ll discover halfway through that the refactor is harder than I thought and I have to either ship something half-done or abandon a week of work. The cost-benefit math kills a lot of refactors that should happen. Technical debt accumulates not because engineers are lazy, but because the activation energy of paying it down is too high relative to the perceived benefit.
When the same refactor costs six minutes and $1.34 to attempt, the calculation changes entirely. The engineer dispatches the job, goes to get coffee, comes back to a unified diff that either solves the problem or surfaces exactly what’s hard about solving the problem. If it solves it, great, ship it. If it doesn’t, the engineer has spent six minutes and a dollar to develop a much more detailed understanding of why the refactor is hard, which informs the manual approach that follows. The exploratory cost has collapsed.
This is the part of the AI-for-engineering story that the productivity-multiplier framing misses. The metric that matters is not “how much faster does each individual task go.” The metric is “how many tasks that previously didn’t get done at all now get attempted, because the cost of attempting collapsed.” Refactors that engineers used to defer indefinitely. Audits that nobody had time for. Test coverage backfilling that always lost out to feature work. Documentation generation that was nominally a priority and actually never happened. The work that was being deferred is the work that gets unlocked.
I have specific evidence of this in my own development workflow over the last six months. The number of refactors I’ve shipped is up roughly 4× year-over-year. The number of new features is up modestly, maybe 30%. The improvement has landed mostly in codebase quality, not feature velocity. Which, frankly, is the right place for it to land: features are downstream of a codebase you can reason about.
Where this falls apart
Honest about the limits, because they exist and they matter.
The pattern works best when the task decomposes cleanly into parallel subtasks. Refactoring is one of the cleanest cases — different parts of the codebase can be analyzed and modified independently, and the orchestrator’s job is mostly to keep them coherent. Tasks that don’t decompose cleanly — fundamentally sequential reasoning chains, where each step depends on the previous step’s output — don’t benefit from the fan-out and don’t see the same compression. Debugging a complex production incident is often more sequential than parallel; the RLM pattern helps less there.
The pattern requires the underlying codebase to be amenable to local reasoning. If your codebase has implicit state everywhere, undocumented coupling between modules, or behavior that depends on environment-specific configuration not visible to the agents, the scouts will produce locally correct changes that globally don’t work. The pattern is not a substitute for code quality. It works dramatically better on a codebase where files have meaningful boundaries and dependencies are mostly explicit. Which, to be fair, is a property you should want for human engineers anyway, but the AI pattern surfaces the absence of that property faster.
The orchestrator can be wrong. When the orchestrator makes a bad dispatch decision (decomposes the work in a way that doesn’t actually parallelize, or sends scouts after the wrong subgoals), the entire job degrades, sometimes silently. The recursion mechanism helps with this (the orchestrator can notice it didn’t get a clean synthesis and re-dispatch), but the failure mode of “orchestrator confidently produces a coherent-looking but wrong plan” is real. The mitigation is to run the orchestrator at a higher-capability tier than the scouts, which is what the pattern recommends anyway. The cost is acceptable because the orchestrator accounts for only a few turns of the total work.
The pattern doesn’t help with the work the engineer should be doing. Architectural decisions, product judgment, deciding what the right thing to refactor toward — these are not tasks you dispatch to a scout. The RLM pattern compresses execution. It does not compress judgment. An engineering org that uses this pattern well will find their engineers spending more time on the work that requires actual thought, because the execution overhead of the work that didn’t require thought has dropped. An org that misuses it will produce a lot of automated technical debt very efficiently.
Audit and verification still matter. The orchestrator validates that the scouts didn’t conflict with each other and that the test suite passes. It does not validate that the test suite is sufficient. It does not validate that the refactor was the right refactor to do. The engineer reviewing the diff is still doing the work of “is this actually what I wanted.” If you skip that step, you are running the pattern wrong, and you will eventually ship something subtly broken at high speed. The compression is in the execution. The judgment is still yours.
What this implies for 2026
I’ll close with the structural implication, because the field-report data is interesting on its own but the question I keep getting from engineering leaders is “what does this mean for how I should staff and structure my team in 2026.”
The RLM-on-edge-inference pattern compresses execution time for code-heavy tasks by something like 50× relative to manual work (the job above did considerably better on raw wall-clock: over 200× against three eight-hour days), with a cost reduction that is larger still. This is not a productivity improvement of the kind that requires reorganization. It is a change in the unit economics of certain categories of engineering work, and the categories where it applies most strongly (refactors, audits, test backfilling, documentation, migrations) are categories that most engineering orgs have historically been chronically under-resourced on.
The first-order effect, which I think most orgs will internalize in 2026, is that one engineer becomes capable of operating at the throughput of a small team for these categories of work. The second-order effect, which most orgs will internalize in 2027, is that the bottleneck shifts from execution to direction-setting. When ten engineers can each do five times as much execution as they used to, the constraint becomes “what should they be doing.” That is a leadership problem, not an engineering problem, and most engineering leaders are not yet structured to operate at the increased decision throughput their teams now require of them.
The third-order effect, which I’d bet on for 2028 but won’t insist on, is that the optimal team size for many software products gets meaningfully smaller. Not because engineers are being replaced, but because the coordination overhead of a large team starts to exceed the marginal output it produces when each engineer’s individual throughput has multiplied. The mid-sized engineering org of 2023 — 40 to 80 engineers, multiple layers of management, dedicated platform and infrastructure teams — is not obviously the right shape for a 2028 product company. I would not bet against 15-person companies shipping product surface that today requires 80.
The architecture under all of this — orchestrator-plus-scout running on pay-per-inference edge compute, with stateful coordination in durable lightweight runtimes — is the architecture that makes the unit economics work. Recursive Language Models are the inference pattern that makes the work tractable. The combination is what’s new. The implications are still being worked out by everyone, including me.
This is the field report. The numbers are real. The pattern works. The implications are interesting. The work continues.