
Building Production AI Agents on Cloudflare: The Complete Stack

Tony Adams · 10 min read · May 2026

Most “AI agent” demos are a single model call wearing a costume. A production agent is something else entirely: it holds state across interactions, runs multi-step work that can fail and recover, calls real tools and external systems, pauses for a human on high-stakes actions, and keeps running across restarts and deploys. The hard part was never the model. It’s everything around the model — the state store, the retry logic, the scheduler, the tool permissions, the cost controls, the observability — that turns a clever prompt into a system you can put in front of a customer.

On Cloudflare you build that system as a stateful Durable Object using the Agents SDK: each agent is its own addressable micro-server with embedded SQL, it runs durable multi-step work through Workflows, it acts through MCP tools, a browser, or sandboxed code, and it reaches models through one unified inference binding — Workers AI plus AI Gateway. After Cloudflare’s first Agents Week in April 2026, those pieces stopped being a loose collection of products and became a coherent nine-part stack. This article maps that stack, shows how each part answers a specific production requirement, and is honest about where the trade-offs are.

// THE SHORT VERSION

A production agent needs seven things a demo skips: durable state, error recovery, observability, cost control, human-in-the-loop, tool security, and bounded scope. Cloudflare maps each to a native primitive — that mapping is the whole argument for building agents here.
The fit is structural, not marketing. An agent is one-to-one — one instance, one task, one user — and the stateless-function model the cloud was built on doesn’t serve that shape well. A Durable Object gives each agent persistent state and identity with zero cost while idle.
Most agent projects fail, and the data is blunt about it. The failures cluster on reliability, state, and scope — not model quality — which is exactly the case for building them durable, observable, and bounded from day one.
Be clear-eyed about maturity: several primitives are GA, but Agent Memory is in private beta, AI Search and MCP portals are in open beta, and Project Think is a preview. Status changes monthly; verify before you commit a roadmap to any one of them.

Why the cloud’s original model breaks for agents

The cloud was built for applications that serve many users from shared, stateless infrastructure. Agents invert that. Each agent is a distinct instance doing one task, and if a meaningful fraction of knowledge workers each run a few in parallel, you need capacity for tens of millions of simultaneous sessions. In Cloudflare’s own framing, the one-app-serves-many-users model “doesn’t work for that.”

Stateless functions like AWS Lambda make this awkward: no GPU, no persistent per-instance memory, and a hard execution timeout, so you bolt on a database, a queue, a cron service, a session store, and a vector index, then operate all five. Always-on VMs or containers solve the state problem but bill you around the clock for capacity that sits idle most of the time. A Durable Object is the middle path: each agent is a stateful micro-server with its own embedded SQLite, it hibernates when idle (costing nothing), wakes on an event, and resumes exactly where it left off. Cloudflare’s own illustration is stark — ten thousand agents each active one percent of the time means ten thousand always-on instances on a VM model, but only about a hundred active at any moment as Durable Objects. The compute-and-cost math behind that is its own subject; we cover it in depth in the GPU and edge-inference economics breakdown and Workers vs. Lambda for AI inference.
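The arithmetic behind that illustration is easy to check. The sketch below uses only the two numbers quoted above (ten thousand agents, a one percent duty cycle). The function name is made up for illustration, and it models instance counts only, not prices.

```typescript
// Expected number of running instances under two hosting models.
// Inputs come from the illustration above; nothing here is a price.
function expectedRunningInstances(agents: number, activeFraction: number) {
  return {
    // Always-on VMs or containers: every agent holds an instance all the time.
    alwaysOn: agents,
    // Hibernating instances: only the active share is running at any moment.
    hibernating: Math.round(agents * activeFraction),
  };
}

const fleet = expectedRunningInstances(10000, 0.01);
// fleet.alwaysOn is 10000 and fleet.hibernating is 100
```

The ratio between the two numbers is simply the inverse of the duty cycle, which is why the gap widens as agents spend more of their time idle.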

What separates a production agent from a demo

Seven requirements decide whether an agent survives contact with real users. The architecture’s value is that each maps directly onto a first-party primitive, rather than something you assemble and maintain yourself.

// The seven production-agent requirements, each mapped to a Cloudflare primitive
Requirement	What it means	Cloudflare primitive
Durable state	Survives restarts, deploys, failures	Agents SDK (Durable Object + SQL), Agent Memory
Error recovery	Automatic, idempotent, step-level retries	Workflows
Observability	Logs, traces, token and cost analytics per call	AI Gateway
Cost control	Scale-to-zero, caching, budget caps, no idle spend	Workers AI + AI Gateway
Human-in-the-loop	Approval gates that can wait hours or days	Workflows `waitForEvent`
Tool security	Scoped OAuth, least privilege, egress control	MCP + Access + Mesh
Bounded scope	A defined task boundary and guardrails	Code Mode, scoped tokens

The rest of this article walks the stack that delivers those seven.

The Cloudflare agent stack, layer by layer

State and execution

The Agents SDK is the foundation: an agent is a Durable Object, which gives it a stable identity, its own embedded SQLite for state, WebSocket support with hibernation so it costs nothing while idle, and built-in scheduling via alarms. As of late 2025 the SDK shipped long-running sessions that don’t get evicted mid-task — enough for an agent to clone a repository, run a full test suite, and open a pull request in a single continuous session. How that state model actually works in code is the subject of the agent-as-Durable-Object deep dive, and the underlying primitive is covered in our Durable Objects explainer.
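To make the identity-plus-hibernation idea concrete, here is a minimal in-memory model. It is a sketch under loud assumptions: `AgentHost`, `send`, and `hibernate` are hypothetical names, a `Map` stands in for the embedded SQLite store, and none of this is the Agents SDK API. It only shows the behavior described above: one instance per name, nothing live while idle, and state intact on wake.

```typescript
// In-memory model of "one addressable, stateful instance per agent".
// Hypothetical names throughout; this is NOT the Agents SDK API.
type AgentState = { messages: string[] };

class AgentHost {
  private storage = new Map<string, string>();  // stand-in for durable storage
  private live = new Map<string, AgentState>(); // instances currently in memory

  // Return the live instance for a name, restoring it from storage if needed.
  private wake(name: string): AgentState {
    let state = this.live.get(name);
    if (state === undefined) {
      const saved = this.storage.get(name);
      state = saved === undefined ? { messages: [] } : (JSON.parse(saved) as AgentState);
      this.live.set(name, state);
    }
    return state;
  }

  // Handle one event: wake, mutate state, persist. Returns the message count.
  send(name: string, message: string): number {
    const state = this.wake(name);
    state.messages.push(message);
    this.storage.set(name, JSON.stringify(state));
    return state.messages.length;
  }

  // Evict the live instance; only the stored state remains.
  hibernate(name: string): void {
    this.live.delete(name);
  }

  liveCount(): number {
    return this.live.size;
  }
}

const host = new AgentHost();
host.send("user-42", "clone the repo");
host.hibernate("user-42");                           // idle: nothing is running
const idleCount = host.liveCount();                  // 0
const total = host.send("user-42", "run the tests"); // wakes with prior state, so total is 2
```

The design point is that the caller only ever addresses the agent by name. Whether an instance is currently in memory is the host's concern, not the caller's.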

Workflows handles durable execution: multi-step processes where each step is independently retryable and replay-safe, so a failure resumes from the last successful checkpoint rather than restarting the whole job. Its rearchitected control plane now supports fifty thousand concurrent instances and three hundred new instances per second, and a step can sleep — or wait for an external event — for up to a year at no cost while paused, which is how you build human approval gates. The full picture, including the AgentWorkflow class that lets a workflow and an agent talk to each other, is in durable execution for agents with Workflows. For agents that orchestrate other agents recursively, see recursive orchestration in production.
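The checkpoint-and-resume behavior is easier to see in miniature. The following is a synchronous, in-memory stand-in, not the Workflows API: `DurableRun`, its method signatures, and the refund scenario are all hypothetical, and the real step functions are asynchronous. It shows three things from the paragraph above: a step retried after a transient failure, a pause at an approval gate, and a replay that skips work already checkpointed.

```typescript
// Sync, in-memory model of durable steps. Hypothetical names; not the Workflows API.
type WaitResult = { done: true; value: string } | { done: false };

class DurableRun {
  private checkpoints = new Map<string, string>(); // step name -> saved output
  private events = new Map<string, string>();      // event type -> payload
  calls: string[] = [];                            // every step body execution, in order

  // Return the saved output if the step already succeeded; otherwise run it,
  // retrying on a thrown error up to maxAttempts times.
  step(name: string, maxAttempts: number, body: () => string): string {
    const saved = this.checkpoints.get(name);
    if (saved !== undefined) return saved; // replay: skip completed work
    let lastError: unknown;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        this.calls.push(name);
        const value = body();
        this.checkpoints.set(name, value);
        return value;
      } catch (err) {
        lastError = err;
      }
    }
    throw lastError;
  }

  // Approval gate: report the payload if the event has arrived, else "not yet".
  waitForEvent(type: string): WaitResult {
    const payload = this.events.get(type);
    return payload === undefined ? { done: false } : { done: true, value: payload };
  }

  deliver(type: string, payload: string): void {
    this.events.set(type, payload);
  }
}

let loadAttempts = 0;
function refundWorkflow(run: DurableRun): string {
  const order = run.step("load-order", 3, () => {
    loadAttempts += 1;
    if (loadAttempts === 1) throw new Error("transient failure"); // first attempt fails
    return "order-1001";
  });
  const approval = run.waitForEvent("manager-approval");
  if (!approval.done) return "paused";
  return run.step("issue-refund", 3, () => `refunded ${order} (${approval.value})`);
}

const run = new DurableRun();
const first = refundWorkflow(run);  // retries load-order once, then pauses at the gate
run.deliver("manager-approval", "approved");
const second = refundWorkflow(run); // replay: load-order is skipped, the refund runs once
```

Because completed steps return their saved output on replay, re-running the whole function after a crash or a long wait is safe, provided each step body is itself idempotent against the outside world.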

Reaching models: inference

Workers AI runs serverless GPU inference across Cloudflare’s global network, and the unified env.AI.run() binding now reaches models from roughly a dozen-plus providers — Cloudflare’s own materials cite between twelve and fourteen — through one line of code, with a one-line switch between them and a single set of credits. Open-weight models in the catalog are economically serious: Cloudflare runs an internal security agent processing more than seven billion tokens a day on Kimi K2.5 at an estimated 77% less than a mid-tier proprietary model would cost. Why edge inference changes the latency and cost equation is covered in edge AI vs. centralized inference and the Workers vs. Lambda cost analysis.
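As a rough sketch of what the one-line switch looks like from application code: the function below depends only on a `run(model, inputs)` shape, so changing providers means changing one string. The `AiBinding` interface here is a deliberately minimal assumption written for this example, and the model identifiers are placeholders. Check the Workers AI documentation for the real binding types and the identifiers available to your account.

```typescript
// Minimal structural type for what this sketch needs from the binding.
// Assumption: only the run(model, inputs) call shape is relied on.
interface AiBinding {
  run(model: string, inputs: { prompt: string }): Promise<{ response?: string }>;
}

// Placeholder identifiers for illustration only; these are not real catalog entries.
const PRIMARY_MODEL = "provider-a/model-x";
const ALTERNATE_MODEL = "provider-b/model-y";

// The model is just an argument, so switching providers is a one-line change
// at the call site rather than a new client library and credential set.
async function summarize(ai: AiBinding, model: string, text: string): Promise<string> {
  const result = await ai.run(model, { prompt: `Summarize in one sentence: ${text}` });
  return result.response ?? "";
}
```

In a Worker, the first argument would be the binding from the environment (`env.AI` in the text above), and the call would look like `summarize(env.AI, PRIMARY_MODEL, text)`.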

AI Gateway sits in front of all of it as the observability and reliability layer: logging, caching, rate limiting, unified billing, automatic failover to another provider on an outage, and streaming resilience — it buffers a model’s streaming response independently of the agent’s lifetime, so an interrupted agent can reconnect and retrieve the output without re-running the inference or paying for the same tokens twice. One honest caveat worth stating plainly: AI Gateway is not an agent-egress firewall, and prompt-injection detection is not a documented Guardrails feature today. Treat tool security as an architectural problem, not something the gateway solves for you.
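The streaming-resilience idea can be modeled in a few lines. This is an illustration of the concept only: `BufferedStream` and its methods are hypothetical, and AI Gateway's actual mechanism is not shown here. The point is that the buffer, not the consumer, owns the response, so a consumer that restarts can pick up from the offset it last saw.

```typescript
// In-memory model of a response buffer that outlives its consumer.
// Hypothetical names; this is not AI Gateway's API.
class BufferedStream {
  private chunks: string[] = [];
  private finished = false;

  // Called by the producer (the model stream), whether or not anyone is listening.
  append(chunk: string): void {
    if (this.finished) throw new Error("stream already finished");
    this.chunks.push(chunk);
  }

  finish(): void {
    this.finished = true;
  }

  // A consumer passes the number of chunks it has already received.
  readFrom(offset: number): { chunks: string[]; next: number; finished: boolean } {
    const chunks = this.chunks.slice(offset);
    return { chunks, next: offset + chunks.length, finished: this.finished };
  }
}

const stream = new BufferedStream();
stream.append("The ");
const beforeCrash = stream.readFrom(0); // the agent receives "The " and is then interrupted
stream.append("answer ");               // the model keeps streaming in the meantime
stream.append("is 42.");
stream.finish();
const afterRestart = stream.readFrom(beforeCrash.next); // resume without a second inference
const fullText = beforeCrash.chunks.concat(afterRestart.chunks).join("");
```

The only thing the agent must persist across a restart is the offset, which fits naturally in the per-agent state described earlier.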

Acting on the world: tools, browser, and code

MCP hosting is how agents call tools. You build a remote MCP server as a Durable Object, secure it with OAuth, and front it with an MCP Server Portal that centralizes authentication and observability through Cloudflare Access. The standout efficiency win is Code Mode: rather than loading every tool’s schema into the context window, the model writes code against a typed SDK — Cloudflare’s own API server collapses 2,500-plus endpoints into two tools at roughly a thousand tokens instead of over a million. Building and hosting your own is covered end to end in host an MCP server on Cloudflare, and the architectural case for MCP-first design is in why we build MCP-first.
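A quick back-of-the-envelope check on those figures, with one labeled assumption: the average of 400 tokens per tool schema is chosen here so that 2,500 schemas come to about a million tokens, matching the numbers quoted above. Real schemas vary in size.

```typescript
// Context cost of exposing an API as tools, using the figures quoted above.
// tokensPerSchema is an ASSUMED average, picked so 2,500 schemas total 1,000,000 tokens.
function contextTokens(toolCount: number, tokensPerSchema: number): number {
  return toolCount * tokensPerSchema;
}

const perEndpointTools = contextTokens(2500, 400);  // 1,000,000 tokens: one schema per endpoint
const codeModeTools = 1000;                         // two tools at roughly a thousand tokens
const reduction = perEndpointTools / codeModeTools; // 1000 times less context for tool definitions
```

Whatever the exact per-schema size, the per-endpoint approach grows linearly with the API surface, while the Code Mode figure stays roughly flat.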

Browser Run (renamed from Browser Rendering during Agents Week) is the web action layer — now with four times the concurrency at 120 simultaneous browsers, plus Live View, human-in-the-loop handoff, session recordings, a direct Chrome DevTools Protocol endpoint, and WebMCP support. Sandboxes, which reached general availability in April 2026, give each agent a real isolated Linux computer — shell, filesystem, background processes — that starts on demand and bills only for the CPU cycles it actually uses, with an egress proxy so the agent itself never sees a credential. Both the browser and code-execution layers are covered in browser and sandboxed-code agents.
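The egress-proxy pattern is worth seeing in miniature, because it explains how an agent can call an authenticated API without ever holding the secret. Everything below is hypothetical (the type, the host, the token, the function name) and is not the Sandbox API. It is a sketch of the shape: the agent emits a plain request, and the proxy attaches the credential or refuses the destination.

```typescript
// Model of credential injection at an egress proxy.
// Hypothetical names, host, and token; this is not the Sandbox API.
type OutboundRequest = { host: string; path: string; headers: Record<string, string> };

// Secrets live with the proxy, keyed by the one host each is valid for.
const proxySecrets = new Map<string, string>([["api.example.com", "secret-token-123"]]);

function egressProxy(req: OutboundRequest): OutboundRequest {
  const token = proxySecrets.get(req.host);
  if (token === undefined) {
    // Unknown destinations are refused rather than passed through.
    throw new Error(`egress to ${req.host} is not allowed`);
  }
  // Copy the request and attach the credential on the way out.
  return { ...req, headers: { ...req.headers, Authorization: `Bearer ${token}` } };
}

// What the agent's code writes: no credential anywhere.
const fromAgent: OutboundRequest = { host: "api.example.com", path: "/v1/orders", headers: {} };
const onTheWire = egressProxy(fromAgent);
```

Because the proxy returns a copy, the request object the agent holds never contains the token, so a prompt-injected agent has nothing to leak.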

Memory and grounding

Agent Memory is a managed service that extracts what matters from an agent’s conversations and surfaces only the relevant pieces later, instead of stuffing everything back into the context window — it is, in effect, long-term memory as a managed primitive. AI Search (the successor to AutoRAG) is the retrieval primitive: hybrid semantic-plus-keyword search over your own documents, with storage and indexing built in and instances you can create at runtime, one per customer or tenant. This is where retrieval-augmented generation lives on the platform now. Both are covered in agent memory and grounding — with the important caveat that Agent Memory is in private beta and AI Search in open beta, so confirm their status before you design around them.
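For readers who have not worked with hybrid retrieval, the core move is merging a semantic ranking with a keyword ranking. The sketch below uses reciprocal rank fusion, a well-known generic technique. It is an assumption for illustration, not a statement of how AI Search combines results, and the document IDs are invented.

```typescript
// Reciprocal rank fusion (RRF): merge several ranked lists into one.
// Each document scores 1 / (k + rank) per list it appears in; k = 60 is a common default.
// Assumption: RRF is used as a generic example, not as AI Search's documented algorithm.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  rankings.forEach((ranking) => {
    ranking.forEach((docId, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank));
    });
  });
  const ids: string[] = [];
  scores.forEach((_score, docId) => ids.push(docId));
  // Highest fused score first.
  return ids.sort((a, b) => (scores.get(b) ?? 0) - (scores.get(a) ?? 0));
}

// Invented document IDs: one list from vector similarity, one from keyword match.
const semantic = ["doc-refunds", "doc-returns", "doc-shipping"];
const keyword = ["doc-returns", "doc-warranty", "doc-refunds"];
const fused = reciprocalRankFusion([semantic, keyword]);
```

Documents that rank well in both lists rise to the top, which is the practical benefit of hybrid search: exact-term matches and semantic matches reinforce each other instead of competing.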

Security and identity

Tool security is the requirement most demos ignore and most production incidents trace back to. The platform addresses it at the architecture level: Managed OAuth for Access lets an agent authenticate on behalf of a user to internal apps without insecure shared service accounts, Cloudflare Mesh with the Workers VPC binding grants an agent scoped reach into a private network through one line of config, resource-scoped permissions enforce least privilege, and the Sandbox egress proxy keeps credentials out of the agent’s hands entirely. The practical patterns for locking down a tool-calling agent are covered alongside host an MCP server on Cloudflare.
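Least privilege is simple to state and easy to get wrong, so a small sketch helps. The grant format below (`action:resource-prefix`) and the ticket resources are hypothetical conventions invented for this example, not a Cloudflare permission syntax. The idea it shows is deny-by-default: a tool call proceeds only when an explicit grant covers both the action and the resource.

```typescript
// Deny-by-default permission check for tool calls.
// The "action:resource-prefix" grant format is a hypothetical convention.
function isAllowed(grants: string[], action: string, resource: string): boolean {
  return grants.some((grant) => {
    const sep = grant.indexOf(":");
    if (sep < 0) return false; // malformed grants never match
    const grantedAction = grant.slice(0, sep);
    const prefix = grant.slice(sep + 1);
    // The action must match exactly and the resource must sit under the granted prefix.
    return grantedAction === action && resource.indexOf(prefix) === 0;
  });
}

// A support agent that may read any ticket but write only to one customer's tickets.
const supportAgentGrants = ["read:tickets/", "write:tickets/acme/"];

const canRead = isAllowed(supportAgentGrants, "read", "tickets/globex/17");        // true
const canWriteOwn = isAllowed(supportAgentGrants, "write", "tickets/acme/9");      // true
const canWriteOther = isAllowed(supportAgentGrants, "write", "tickets/globex/17"); // false
const canDelete = isAllowed(supportAgentGrants, "delete", "tickets/acme/9");       // false
```

Running a check like this in the tool server, rather than in the prompt, means a manipulated model still cannot exceed the grants it was issued.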

Most agent projects fail. Here’s why that argues for building them right.

The market data is unforgiving, and it should shape how you build. Gartner predicts that more than 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear value, and inadequate risk controls — and notes that of thousands of vendors claiming “agentic” capabilities, only around 130 are real.

40%

of agentic AI projects Gartner predicts will be canceled by the end of 2027, on cost, unclear value, and inadequate risk controls (Gartner, June 2025)

A 2025 MIT study found that the large majority of integrated generative-AI pilots showed no measurable profit-and-loss impact, attributing the gap to systems that don’t learn or persist rather than to weak models. McKinsey’s 2025 survey found nearly nine in ten organizations using AI somewhere, but fewer than a quarter scaling agents in any single function. The high performers were almost three times as likely to have defined human-in-the-loop validation.

Read together, the failures are not about model quality. They’re about reliability, state, scope, and governance — precisely the seven requirements above. That is the argument for durable execution, persistent state, observability, scoped tool security, approval gates, and bounded scope as the starting point rather than the afterthought. None of this is theoretical demand, either: Salesforce reported its Agentforce and Data 360 line reaching roughly $1.4 billion in annual recurring revenue, up 114% year over year, across more than 9,500 paid deals.

$1.4B

Salesforce Agentforce + Data 360 annual recurring revenue, up 114% year over year across 9,500+ paid deals (Salesforce FY26 Q3, December 2025)

The money is real; the discipline is what separates the projects that keep it from the projects that get canceled. We make the broader version of this case in why AI pilots fail in the mid-market.

Cloudflare vs. the alternatives

If you build directly on a framework like LangGraph, you get battle-tested orchestration but you bring your own hosting, state store, scheduler, scaling, and operations — the parts that are most of the work. AWS Bedrock AgentCore is the closest hyperscaler equivalent, but it is heavier and AWS-centric, and its components have been moving from preview toward general availability on a different timeline. The Vercel AI SDK is complementary rather than competing — Cloudflare’s own agent classes use it under the hood.

The honest trade-off with Cloudflare is vendor concentration: Durable Objects have no drop-in equivalent on AWS or GCP today, and their cold-start behavior is something to design around for latency-sensitive cases like live voice. For most commercial workloads that isn’t a real constraint, and the mitigation is straightforward — keep your application logic and data model portable, and treat Agent Memory’s exportability as a hedge. The full comparison, including where each option genuinely wins, is in Cloudflare agents vs. LangGraph.

Go deeper

Each layer of the stack has a dedicated breakdown:

The agent as a stateful Durable Object — identity, embedded SQL, hibernation, and scheduling.
Durable execution for agents with Workflows — checkpoints, retries, and human-in-the-loop gates.
Host an MCP server on Cloudflare — building, securing, and scaling tool servers with Code Mode.
Browser and sandboxed-code agents — the action layer for the web and for running code.
Cloudflare agents vs. LangGraph — and where AWS Bedrock AgentCore fits.
Real-time voice agents — continuous speech-to-text and text-to-speech, and the latency considerations.
Agent memory and grounding — persistent memory and retrieval over your own data.

Frequently asked

Do I need all nine parts of the stack to build an agent?

No. A simple agent might be a Durable Object with state and a couple of tools. The stack is the menu for what production hardening looks like as the agent takes on more responsibility — you add Workflows when steps can fail, Agent Memory when context matters across sessions, and so on.

Is this production-ready or still experimental?

It's mixed, and you should know which is which. The core — Agents SDK, Workflows, Workers AI, AI Gateway, Sandboxes — is generally available. Agent Memory is in private beta, AI Search and MCP portals are in open beta, and Project Think (the next SDK edition) is a preview. Build on the GA primitives and treat the betas as roadmap, not foundation.

How does this compare to just using OpenAI's or Anthropic's agent tooling directly?

Those give you excellent models and a development framework, but not the hosting, durable state, scheduling, and tool-security layers. Cloudflare provides the runtime around the model; you still call whichever models you want through it.

What's the fastest way to know if my use case fits?

If your agent needs to remember things, run for a while, retry on failure, act on tools, and stay within budget, it fits. If it is a one-shot question-and-answer with no state, you may not need an agent platform at all.

Working with Truvisory

If you’d rather have a production agent built and shipped than assemble this stack yourself, this is the work we do: senior-engineer-led, fixed-scope agent systems on Cloudflare, with the seven requirements built in from day one. We don’t deliver strategy decks; we ship software you own — and we’ll tell you honestly which primitives are GA and which are still beta before we design around them. If you’re weighing an agent build, start with a scoping call.

Tony Adams is the founder of Truvisory®. He builds Cloudflare-native AI systems for federal and commercial clients. SBA-verified SDVOSB and VOSB, SAM.gov-registered.

