Agent Memory and Grounding on Cloudflare: Long-Term Memory vs. RAG, and When to Use Each

Tony Adams 10 min read

An agent needs two different things that builders constantly run together, and getting them straight is half the battle. One is memory — what the agent remembers about you: your preferences, decisions from past conversations, facts it has learned over time. The other is grounding, or RAG — what the agent can look up in an authoritative body of documents: your policies, your product docs, your knowledge base. They’re not the same, they have different lifespans, and they’re solved by different tools. On Cloudflare, memory is the managed Agent Memory service and grounding is AI Search (the renamed AutoRAG), both built on Vectorize, the vector database you can also drop down to directly. Cloudflare draws the line the same way: AI Search finds results across your files, Agent Memory recalls context derived from sessions, and “an agent can use both.”

This is the memory-and-grounding deep dive behind our guide to building AI agents on Cloudflare, and it’s the answer to the question the agent-as-Durable-Object spoke hands off: when do you graduate beyond the agent’s built-in state?

Memory, RAG, working state, and the context window — four different things

It’s worth being precise, because these get conflated constantly:

// Four things builders conflate — memory, RAG, working state, and the context window
Layer | What it is | Lifespan
The context window | The model’s finite input buffer | A single call
Built-in agent state | The agent’s working data for the current session | The session, until eviction
Agent Memory | Learned facts about a user, team, or project, retrieved by relevance | Across sessions and restarts
Grounding / RAG | Retrieval over an authoritative document corpus | As durable as the corpus

The agent’s built-in working state (the this.state and SQL covered in the agent-as-Durable-Object spoke, AG-A) is short-term, current-session memory, and this article won’t re-derive it. The question is what you reach for when that isn’t enough.

The reason you can’t just pour everything into the prompt is context rot: as the input gets longer, model performance degrades — and it degrades well before the window is full, even on the newest large-context models. Chroma’s 2025 study of eighteen leading models, including the then-current GPT, Claude, and Gemini frontier models, found every one of them losing accuracy as input length grew. Cloudflare makes the same point bluntly about its memory service: even past a million tokens of context, context rot is still unsolved. The fix isn’t a bigger window; it’s retrieving only what’s relevant.

Agent Memory: long-term memory as a managed service

Agent Memory, announced on the final day of Agents Week 2026, extracts structured memories from conversations and surfaces only the relevant ones on demand. Memories live in a named profile with a small API: remember to store a fact on the spot, ingest for the bulk path (typically called when the agent’s harness compacts its context), recall to run retrieval and get a synthesized answer back, plus list and forget. Under the hood it classifies what it stores into four types (facts, events, instructions, and tasks) and keys facts and instructions so a newer memory on the same topic supersedes the old one, which is how the agent’s picture of you stays current instead of accumulating contradictions.
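To make supersession concrete, here is a minimal TypeScript sketch of keyed replacement. It illustrates the idea only and is not Agent Memory’s implementation: the ProfileSketch class, the Memory shape, and the key and storedAt fields are hypothetical names chosen for the example.

```typescript
// Hypothetical sketch of keyed supersession; not the Agent Memory implementation.
type MemoryType = "fact" | "event" | "instruction" | "task";

interface Memory {
  type: MemoryType;
  text: string;
  storedAt: number; // epoch milliseconds
  key?: string;     // topic key, used here only for facts and instructions
}

class ProfileSketch {
  private keyed = new Map<string, Memory>();
  private unkeyed: Memory[] = [];

  store(memory: Memory): void {
    const supersedable = memory.type === "fact" || memory.type === "instruction";
    if (supersedable && memory.key !== undefined) {
      const id = memory.type + ":" + memory.key;
      const existing = this.keyed.get(id);
      // Keep only the newest memory for each key.
      if (existing === undefined || memory.storedAt >= existing.storedAt) {
        this.keyed.set(id, memory);
      }
    } else {
      // Events and tasks accumulate; nothing supersedes them in this sketch.
      this.unkeyed.push(memory);
    }
  }

  list(): Memory[] {
    return Array.from(this.keyed.values()).concat(this.unkeyed);
  }
}

const profile = new ProfileSketch();
profile.store({ type: "fact", key: "contact-preference", text: "Prefers phone", storedAt: 1 });
profile.store({ type: "fact", key: "contact-preference", text: "Prefers email", storedAt: 2 });
profile.store({ type: "event", text: "Asked about refunds", storedAt: 3 });

const remembered = profile.list().map((m) => m.text);
console.log(remembered);
```

The second fact replaces the first because both share a key, while the event is simply appended. That is the behavior that keeps a profile from holding two contradictory contact preferences.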

The part that’s genuinely hard to replicate is the retrieval. Rather than a single vector lookup, Agent Memory runs several channels in parallel — full-text search, exact fact-key lookup, raw-message search, direct vector search, and a hypothetical-answer vector search that catches cases where you and the stored memory used different words — and fuses them with reciprocal rank fusion before a model synthesizes the result. It’s built entirely on Cloudflare’s own primitives: a Durable Object per profile holds the messages and classified memories (giving each tenant hard isolation), Vectorize does the vector search, and Workers AI runs the extraction and synthesis models. Notably, Cloudflare found that a bigger model wasn’t always better — it defaults to a mid-sized model for extraction and classification and reserves a large one only for the final synthesis step.
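Reciprocal rank fusion itself is simple enough to sketch. Each channel returns its own best-first list, and every item earns 1 / (k + rank) from each list it appears in; the summed scores decide the final order. The TypeScript below is a generic illustration rather than Agent Memory’s code: the function name, the sample memory IDs, and k = 60 (a value commonly used for RRF) are assumptions.

```typescript
// Generic reciprocal rank fusion (RRF): merge several best-first lists into one.
// k = 60 is a commonly used default, assumed here; it is not a documented
// Agent Memory setting.
function reciprocalRankFusion(rankings: string[][], k: number = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      // index is 0-based, so the rank is index + 1
      const contribution = 1 / (k + index + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map((entry) => entry[0]);
}

// Hypothetical outputs from three retrieval channels for one query.
const fullText = ["m3", "m1", "m7"];
const factKey = ["m1"];
const vector = ["m1", "m3", "m9"];

const fused = reciprocalRankFusion([fullText, factKey, vector]);
console.log(fused);
```

Because m1 appears in all three lists it ranks first even though full-text search alone placed it second. Rewarding agreement across channels over any single channel’s ordering is the point of the fusion step.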

When is the managed service worth it over rolling your own memory on the agent’s SQLite? The built-in Sessions API already gives you conversation history and a searchable context store, which is plenty for a single session and modest cross-session needs. Agent Memory earns its place when you need relevance-ranked recall across many sessions, structured extraction and supersession you’d otherwise hand-build, or memory shared across several agents — and when you can accept its current status, which brings us to the honest caveat: it’s private beta, access is by waitlist, and pricing hasn’t been announced. Prototype against it; don’t build production on it yet.

AI Search: grounding in your own documents

AI Search is Cloudflare’s managed, end-to-end RAG service. It began life as AutoRAG and was renamed in September 2025 — a change that was more than cosmetic, since it added support for external model providers through AI Gateway alongside the Workers AI models. You point it at a data source — an R2 bucket, a crawled website, or files you upload directly — and it converts content to Markdown, chunks it, embeds it, and indexes it; at query time you call /search to get back the relevant chunks, or /ai-search to get a generated answer with those chunks. A major April 2026 upgrade added hybrid retrieval (vector search and keyword BM25 search run together and fused), built-in per-instance storage so there’s no R2 bucket to wire up, and an ai_search_namespaces binding that lets you create instances at runtime — the primitive for spinning up one search index per customer.

On capacity, there’s a discrepancy worth knowing about: the docs limits page is stale, and the current figures are in the April 2026 launch announcement. For new managed instances that means, on the paid plan, up to 5,000 instances and a million files per instance — though enabling hybrid search halves that file ceiling to 500,000 — with a 4 MB per-file cap; the free plan allows 100 instances and 100,000 files. AI Search is open beta and free to enable today; Cloudflare has committed to at least 30 days’ notice before any billing starts, and instances created before the April 2026 upgrade are being migrated to managed infrastructure on a set date in June 2026, after which the bundled crawling stops billing separately. Workers AI and AI Gateway usage are always billed separately.
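If you want to catch these ceilings before an upload fails, a pre-flight check is easy to sketch. The TypeScript below hard-codes the paid-plan figures quoted above, which may change while the service is in beta. The function and type names are hypothetical, and reading “4 MB” as 4 × 1024 × 1024 bytes is an assumption.

```typescript
// Paid-plan figures for new managed instances, as quoted in the article.
const PAID_MAX_FILES = 1000000;
const PAID_MAX_FILES_HYBRID = 500000; // hybrid search halves the file ceiling
// Assumption: "4 MB" is read as 4 * 1024 * 1024 bytes.
const MAX_FILE_BYTES = 4 * 1024 * 1024;

interface CorpusFile {
  name: string;
  sizeBytes: number;
}

// Returns a list of human-readable problems; an empty list means the corpus fits.
function checkCorpus(files: CorpusFile[], hybridSearch: boolean): string[] {
  const found: string[] = [];
  const ceiling = hybridSearch ? PAID_MAX_FILES_HYBRID : PAID_MAX_FILES;
  if (files.length > ceiling) {
    found.push(files.length + " files exceeds the ceiling of " + ceiling);
  }
  for (const file of files) {
    if (file.sizeBytes > MAX_FILE_BYTES) {
      found.push(file.name + " is larger than the per-file cap");
    }
  }
  return found;
}

const report = checkCorpus(
  [
    { name: "handbook.pdf", sizeBytes: 2500000 },
    { name: "archive.pdf", sizeBytes: 9000000 },
  ],
  true,
);
console.log(report);
```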

Vectorize: the primitive underneath

Both managed services are built on Vectorize, Cloudflare’s vector database, which has been generally available since late 2024 and is the production-ready, build-it-yourself option. You reach for it directly when you need full control that the managed services don’t expose — your own chunking strategy, a specific embedding model or distance metric, or custom retrieval that joins a vector hit to an R2 object or a D1 row. A single index holds up to ten million vectors at up to 1,536 dimensions, with metadata filtering, and it bills purely on queried and stored vector dimensions rather than on instances or uptime. The trade-off is that you’re now operating the chunk-embed-index-retrieve pipeline yourself — which is exactly the plumbing AI Search exists to remove. Choose Vectorize for control; choose AI Search for convenience.
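If you do take over chunking yourself, the usual starting point is fixed-size windows with some overlap, so a sentence that straddles a boundary still appears whole in one chunk. The sketch below is one such strategy under stated assumptions: it counts characters rather than tokens, and the function name and the default size and overlap are arbitrary choices for illustration.

```typescript
// Split text into fixed-size character windows that overlap by `overlap` characters.
// Sizes are in characters, not tokens; the defaults are arbitrary.
function chunkText(text: string, size: number = 800, overlap: number = 100): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("need size > 0 and 0 <= overlap < size");
  }
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    // Stop once this window has reached the end of the text.
    if (start + size >= text.length) break;
  }
  return chunks;
}

const sample = chunkText("abcdefghij", 4, 1);
console.log(sample);
```

With a size of 4 and an overlap of 1, each chunk begins on the last character of the previous one. In practice you would tune both numbers against your retrieval evals rather than guess them.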

So which do you use?

// What you need vs. which Cloudflare memory or retrieval primitive to reach for
What you need | Reach for
Short-term, current-session working data | The agent’s built-in state (AG-A)
Long-term, relevance-ranked memory about a user or team | Agent Memory (managed; private beta)
Retrieval/grounding over your own documents | AI Search (managed; open beta)
Full control over chunking, embedding, and retrieval | Vectorize (the GA primitive)

And the direct answer to “when do you graduate beyond built-in state”: when history outgrows what you want to hand-roll in SQL; when you need relevance-ranked recall rather than replaying the whole conversation; when memory has to span sessions or agents; or when you need to ground answers in a document corpus. The first three point to Agent Memory, the last to AI Search or Vectorize.
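The same decision rule can be written down as a small function, which is sometimes handy in an architecture review. This is a sketch of the reasoning above, not an official selector: the Needs fields and the function name are hypothetical, and the needsPipelineControl flag stands in for “you need chunking, embedding, or retrieval control the managed service does not expose.”

```typescript
type Primitive = "built-in state" | "Agent Memory" | "AI Search" | "Vectorize";

// Hypothetical checklist mirroring the four graduation triggers in the text.
interface Needs {
  historyOutgrowsSql: boolean;
  needsRelevanceRankedRecall: boolean;
  spansSessionsOrAgents: boolean;
  groundsInDocuments: boolean;
  needsPipelineControl: boolean;
}

function choosePrimitives(needs: Needs): Primitive[] {
  const picks: Primitive[] = [];
  // The first three triggers all point to Agent Memory.
  if (needs.historyOutgrowsSql || needs.needsRelevanceRankedRecall || needs.spansSessionsOrAgents) {
    picks.push("Agent Memory");
  }
  // Grounding points to AI Search, or to Vectorize when you need full control.
  if (needs.groundsInDocuments) {
    picks.push(needs.needsPipelineControl ? "Vectorize" : "AI Search");
  }
  // No trigger fired: the agent's built-in state is enough.
  if (picks.length === 0) picks.push("built-in state");
  return picks;
}

const supportAgent = choosePrimitives({
  historyOutgrowsSql: false,
  needsRelevanceRankedRecall: true,
  spansSessionsOrAgents: true,
  groundsInDocuments: true,
  needsPipelineControl: false,
});
console.log(supportAgent);
```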

A few of the sibling pieces connect here directly. Indexing a large corpus is multi-step work that benefits from running as a durable Workflow so a failed batch retries cleanly. Both Agent Memory and AI Search can be exposed to an agent as MCP tools. The Browser Run crawler is one of the ways you feed documents into AI Search. And voice agents need memory just as much as text ones do.

Why Cloudflare for memory and grounding

The argument is the same colocation one that runs through the whole stack, applied to data. Memory and retrieval run on the same global network as the agent, the embedding and generation models on Workers AI, and the storage in R2, Vectorize, and D1 — so a retrieval doesn’t pay a cross-cloud hop. The managed services remove the parts most teams underestimate: AI Search collapses the vector index, the indexing pipeline, the keyword index, and the fusion logic into one instance, and Agent Memory hides the extraction, verification, and multi-channel retrieval entirely. Cloudflare’s own proof point is that it built a productionized internal version of Agent Memory on these primitives in under a month.

< 1 month
for Cloudflare to build a productionized internal version of Agent Memory on its own primitives — Durable Objects, Vectorize, and Workers AI — Cloudflare Agent Memory blog

You also get per-use economics and first-class multitenancy — one AI Search instance per customer created at runtime, one isolated Durable Object and vector index per memory profile. That’s the pillar’s memory-and-grounding block made concrete.

The honest trade-offs

  • Agent Memory is private beta with no pricing. This is a real “not for production yet” caveat, not a formality. Its retrieval quality is the draw; its maturity is the constraint. Join the waitlist and prototype, but don’t ship a business on it until it’s GA and priced.
  • AI Search is open beta. Free now, but pricing is coming on at least 30 days’ notice, and pre-upgrade instances are being force-migrated to managed infrastructure in June 2026. Plan for both.
  • Managed means less control. You give up fine control over chunking, embedding model, and distance metric versus dropping down to Vectorize. If a measured retrieval gap traces to that, that’s your signal to go lower-level.
  • RAG quality is hard regardless of platform. Chunking, retrieval relevance, and evaluation all matter, and no managed service does the evaluation for you. Build a retrieval eval harness — golden question-to-expected-chunk pairs — on day one. It’s the highest-leverage and most platform-independent thing you can do.
  • Vendor concentration. Agent Memory says every memory is exportable, but exporting raw facts isn’t the same as porting a tuned retrieval pipeline. Weigh the lock-in against how much plumbing the managed services save you.
  • Capacity ceilings are real. Hybrid search halves the per-instance file cap; a single Vectorize index tops out at ten million vectors. Very large corpora may need sharding or a limit increase.
  • Memory raises privacy questions. You’re persisting facts about people. Decide deliberately what you store, set retention, and lean on the explicit remember path for critical facts rather than trusting automatic ingestion to capture everything — and treat PII in stored memory as the sensitive data it is.
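The retrieval eval harness mentioned in the list above can start very small. The sketch below scores “hit rate at k”: the share of golden questions whose expected chunk shows up in the top k results. It is deliberately platform-independent, so it takes already-retrieved chunk IDs as input; the type names, the sample questions, and the chunk IDs are all hypothetical.

```typescript
interface GoldenPair {
  question: string;
  expectedChunkId: string;
}

// Fraction of questions whose expected chunk appears in the top k retrieved IDs.
// `retrieved` maps each question to the chunk IDs your retriever returned, best first.
function hitRateAtK(pairs: GoldenPair[], retrieved: Record<string, string[]>, k: number): number {
  if (pairs.length === 0) return 0;
  let hits = 0;
  for (const pair of pairs) {
    const topK = (retrieved[pair.question] || []).slice(0, k);
    if (topK.indexOf(pair.expectedChunkId) !== -1) hits += 1;
  }
  return hits / pairs.length;
}

// Hypothetical golden set and retriever output.
const golden: GoldenPair[] = [
  { question: "What is our refund window?", expectedChunkId: "policy-refunds#2" },
  { question: "Do we ship overseas?", expectedChunkId: "policy-shipping#1" },
];
const retrievedIds: Record<string, string[]> = {
  "What is our refund window?": ["policy-refunds#1", "policy-refunds#2"],
  "Do we ship overseas?": ["faq#4", "policy-returns#3"],
};

const hitRate = hitRateAtK(golden, retrievedIds, 2);
console.log(hitRate);
```

Run the same golden set against AI Search and against a hand-built Vectorize pipeline and you have the measured gap that the trade-offs above tell you to look for before dropping a level.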

Concrete patterns

Querying AI Search from a Worker is a single call — get the binding and ask for an answer or for chunks:

// AI Search: generated answer with citations
const result = await env.AI.aiSearch("acme-docs").aiSearch({ query: "What is our refund window?" });
// or just the retrieved chunks:
const { data } = await env.AI.aiSearch("acme-docs").search({ query: "refund window" });

Agent Memory follows a remember/recall shape (treat this as a sketch — the API is experimental and the import paths may change before it graduates):

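// One named profile per user keeps each user's memories separate
const profile = env.MEMORY.getProfile(`user:${userId}`);
// Store a critical fact explicitly instead of relying on bulk ingestion
await profile.remember("Prefers email over phone; timezone is Mountain.");
// Run retrieval and get a synthesized answer back
const { answer } = await profile.recall("How does this user like to be contacted?");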

And dropping to Vectorize directly gives you the raw insert-and-query:

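// Insert the vector, or overwrite the existing one with the same id
await env.VECTORIZE.upsert([{ id: "doc-1", values: embedding, metadata: { source: "policy" } }]);
// Nearest-neighbor query: the five closest matches, with their metadata
const matches = await env.VECTORIZE.query(queryEmbedding, { topK: 5, returnMetadata: true });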

For multitenancy, create one AI Search instance per customer through the ai_search_namespaces binding rather than filtering one shared index — it keeps each tenant’s corpus and ranking isolated.

Frequently asked

What’s the difference between agent memory and RAG?
Memory is what the agent remembers about a user across conversations: preferences, decisions, learned facts. RAG (grounding) is retrieval over an authoritative document corpus: your policies, docs, knowledge base. Memory is personal and derived from sessions; RAG draws on your source-of-truth documents. On Cloudflare they are Agent Memory and AI Search respectively, and an agent often uses both.

When should I move off the agent’s built-in state?
When conversation history outgrows what you want to manage in SQL, when you need relevance-ranked recall instead of replaying everything, when memory must persist across sessions or be shared between agents, or when you need to ground answers in documents. The first three point to Agent Memory; the last to AI Search or Vectorize.

Should I use the managed services or Vectorize directly?
Start managed (AI Search for documents, Agent Memory for long-term memory) because the managed services remove the chunk-embed-index-retrieve plumbing. Drop to Vectorize when you hit a real limit on chunking, embedding choice, or retrieval logic that the managed services do not expose, and you can measure the gap on your evals.

Is this production-ready?
Vectorize is generally available and production-ready. AI Search is open beta and free, solid enough to build on with the understanding that pricing is coming. Agent Memory is private beta with no pricing, so treat it as prototype only. Build on the mature pieces and adopt the newer ones deliberately.

Working with Truvisory

If you’d rather have an agent with real memory and well-grounded retrieval built and shipped — with the evals and the data model done right — this is the work we do: senior-engineer-led, fixed-scope agent systems on Cloudflare, and we’ll be honest about which pieces are ready for production today. See how we deliver agent systems, or read the pillar guide to the full stack.