Durable Execution for AI Agents: How Cloudflare Workflows Makes Multi-Step Work Survive Failure
A demo agent makes one model call and returns. A production agent chains ten — classify, plan, call a tool, call another, write to a database, send an email — and any one of them can fail: a rate limit, a timeout, a dropped connection, a redeploy mid-run. Durable execution is what keeps that chain alive. On Cloudflare you get it from Workflows: every step’s result is checkpointed, so when something fails, the run resumes from the last successful step instead of starting over — no re-charging the customer, no re-sending the email, no re-paying for a model call you already made. Workflows is how an agent’s multi-step work survives contact with the real world, and it’s the platform’s answer to two of the seven requirements a production agent has to meet: error recovery and human-in-the-loop.
This is the deep dive behind the durable-execution layer of our guide to building production AI agents on Cloudflare, and the companion to the agent-as-Durable-Object spoke, which handed off exactly here: for work that should fail and retry one step at a time rather than run as one long session, the right tool is a Workflow.
What durable execution means, and how Workflows does it
A Workflow is a class extending WorkflowEntrypoint with one method, run(event, step). Inside it, you wrap each unit of work in step.do(name, callback). The engine runs the callback, persists its return value, and moves on; if the workflow is interrupted and later resumes, completed steps are not re-executed, and their saved results are returned instead. Each step.do() is, in effect, a checkpoint. Under the hood, every workflow instance runs on a SQLite-backed Durable Object that stores those checkpoints. That is the same primitive the Durable Objects deep dive covers, so we won’t re-explain it here.
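To make the resume behavior concrete, here is a toy model of step checkpointing. It is a sketch for intuition only, not how the Workflows engine is implemented: ToyStepRunner, the in-memory Map standing in for durable storage, and the simulated crash are all assumptions made for illustration.

```typescript
// Toy model of step checkpointing. This is NOT the Workflows engine:
// a Map stands in for the durable, SQLite-backed store.
type Checkpoints = Map<string, unknown>;

class ToyStepRunner {
  private readonly checkpoints: Checkpoints;

  constructor(checkpoints: Checkpoints) {
    this.checkpoints = checkpoints;
  }

  // Run `callback` once per step name; on later runs return the saved result.
  async do<T>(name: string, callback: () => Promise<T>): Promise<T> {
    if (this.checkpoints.has(name)) {
      return this.checkpoints.get(name) as T; // completed earlier: skip the callback
    }
    const result = await callback();
    this.checkpoints.set(name, result); // save the result before moving on
    return result;
  }
}

// Hypothetical two-step run that crashes once, after the first step completes.
const calls: string[] = [];
let crashOnce = true;

async function run(step: ToyStepRunner): Promise<string> {
  const plan = await step.do("plan", async () => {
    calls.push("plan");
    return "plan-v1";
  });
  return step.do("execute", async () => {
    calls.push("execute");
    if (crashOnce) {
      crashOnce = false;
      throw new Error("simulated crash");
    }
    return `done:${plan}`;
  });
}

// The checkpoint store survives the "crash"; only the runner is rebuilt.
async function runWithResume(): Promise<string> {
  const store: Checkpoints = new Map();
  try {
    return await run(new ToyStepRunner(store));
  } catch {
    return await run(new ToyStepRunner(store)); // resume: "plan" is not re-executed
  }
}
```

After runWithResume() finishes, calls records "plan" once and "execute" twice: the step that had already completed was served from its checkpoint, and only the failed step ran again.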
A minimal workflow shows the shape:
```typescript
import { WorkflowEntrypoint, WorkflowEvent, WorkflowStep } from "cloudflare:workers";
import { NonRetryableError } from "cloudflare:workflows";

// Env comes from your Worker's generated types; fetchOrder, chargeCustomer,
// and fulfill are application helpers defined elsewhere.
type Params = { orderId: string };

export class OrderWorkflow extends WorkflowEntrypoint<Env, Params> {
  async run(event: WorkflowEvent<Params>, step: WorkflowStep) {
    // Bad input is terminal, so fail without consuming the retry budget.
    const order = await step.do("validate order", async () => {
      if (!event.payload.orderId) throw new NonRetryableError("orderId required");
      return await fetchOrder(event.payload.orderId);
    });

    // Transient payment failures retry this step only, with backoff.
    await step.do(
      "charge payment",
      { retries: { limit: 3, delay: "5 seconds", backoff: "exponential" } },
      async () => chargeCustomer(order),
    );

    await step.sleep("cooldown", "2 days");
    await step.do("fulfill", async () => fulfill(order));
  }
}
```

Two things in that snippet do real work. The retry config (limit, delay, and a backoff of constant, linear, or exponential) turns a transient failure into an automatic retry of just that step. And NonRetryableError is the escape hatch: throw it for a terminal error like bad input or a failed auth, and the engine stops retrying immediately instead of burning through the budget.
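For a feel of how the three backoff modes spread retries out, the sketch below computes a delay schedule. The formulas are the textbook definitions and are an assumption here: this article does not state how the Workflows engine derives its delays, and retryDelayMs and retrySchedule are hypothetical helper names.

```typescript
type Backoff = "constant" | "linear" | "exponential";

// Delay before retry number `attempt` (1-based), in milliseconds.
// ASSUMPTION: textbook formulas, used for illustration only; the Workflows
// engine may compute its delays differently.
function retryDelayMs(baseMs: number, backoff: Backoff, attempt: number): number {
  switch (backoff) {
    case "constant":
      return baseMs;
    case "linear":
      return baseMs * attempt;
    case "exponential":
      return baseMs * 2 ** (attempt - 1);
  }
}

// Delays for every retry up to `limit`.
function retrySchedule(baseMs: number, backoff: Backoff, limit: number): number[] {
  return Array.from({ length: limit }, (_, i) => retryDelayMs(baseMs, backoff, i + 1));
}
```

Under these assumed formulas, a limit of 3 with a 5-second base delay gives waits of 5, 10, and 20 seconds for exponential backoff, against 5, 10, and 15 for linear.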
The one discipline durable execution demands is idempotency. Because a step can run more than once, its body should be safe to repeat — and before a non-reversible side effect like charging a card or sending an email, you check whether it already happened. Cloudflare’s own guidance is blunt about it: don’t cram all your logic into one step, and don’t hold state outside a step, because a workflow can hibernate and lose everything in memory.
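One way to sketch that check-before-acting guard is an idempotency key per side effect. Everything named here is hypothetical: chargeCard stands in for a payment call, and the Map stands in for a durable record such as a database row or a payment provider's own idempotency-key support.

```typescript
// In-memory stand-in for a durable record of completed side effects.
// ASSUMPTION: a real system would use a database table or the payment
// provider's idempotency keys, not a Map.
const completed = new Map<string, string>();
let chargesMade = 0;

// Hypothetical payment call; counts how many times money actually moves.
async function chargeCard(orderId: string, amountCents: number): Promise<string> {
  chargesMade += 1;
  return `receipt-${orderId}-${amountCents}`;
}

// Safe to call repeatedly: the charge happens at most once per key.
async function chargeOnce(orderId: string, amountCents: number): Promise<string> {
  const key = `charge:${orderId}`;
  const existing = completed.get(key);
  if (existing !== undefined) return existing; // already done on an earlier attempt

  const receipt = await chargeCard(orderId, amountCents);
  completed.set(key, receipt);
  return receipt;
}
```

A local guard like this still leaves a window between the charge succeeding and the record being written. Passing the same key to a provider that deduplicates on its side is the usual way to close that gap.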
The demo that breaks on the first flaky call
Here’s why this matters for agents specifically. String together ten tool and model calls, each individually reliable, and the end-to-end run still fails at a meaningful rate. Without checkpointing, one flaky call restarts the whole thing — re-running the expensive model calls and, worse, re-firing side effects that should happen exactly once. Cloudflare’s reference pattern for a durable agent wraps each model turn and each tool call in its own step.do(), precisely so that after the model responds, the response is saved and a resume skips straight to tool execution; and after a tool runs, its result is saved so a later failure doesn’t re-run it. An agent that sends an email twice because it crashed and retried isn’t a rounding error to the person who got the email.
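A quick calculation puts a rough number on that failure rate. The per-call success rates below are illustrative figures, not measurements from the article, and the sketch assumes calls fail independently of each other.

```typescript
// Probability that every call in a chain of independent calls succeeds.
function chainSuccessRate(perCallSuccess: number, calls: number): number {
  return perCallSuccess ** calls;
}

const tenCallsAt99 = chainSuccessRate(0.99, 10); // about 0.904
const tenCallsAt95 = chainSuccessRate(0.95, 10); // about 0.599
```

At 99% per call, roughly one run in ten fails somewhere along a ten-call chain. At 95% per call, about four in ten do.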
This is the pillar’s error-recovery requirement made concrete. The flashy demo that impresses in the room and dies on the first upstream rate limit is the default outcome when there’s no durable execution underneath. Exponential backoff on a checkpointed step turns those upstream hiccups — rate limits, brief outages, dropped connections — into recovery the user never sees.
Agent or Workflow? Where each piece of logic goes
The most common confusion is which layer owns what. Cloudflare’s own rule of thumb: agents are for real-time communication and state; workflows are for durable execution, retries, and waiting. In practice:
| Put it in… | When |
|---|---|
| The Agent (Durable Object) | Live chat, per-agent state, WebSocket streaming, quick (under ~30s) calls |
| A Workflow | Multi-step pipelines that retry step-by-step, run for minutes to weeks, wait for approval, or must survive a crash |
| A long-running agent session | The work is genuinely one continuous session, not discrete retryable steps |
The agent owns state; the workflow owns the pipeline. The deep treatment of agent state and the long-running-session option lives in the agent-as-Durable-Object spoke — this spoke won’t re-derive it. What connects the two is the AgentWorkflow class, added to the Agents SDK in early 2026, which lets an agent spawn a workflow and receive progress, state updates, and approvals back. An agent kicks off a run and tracks it:
```typescript
const instanceId = await this.runWorkflow("EXPENSE_WORKFLOW", params, { metadata: { userId } });
// later, from a human action:
await this.approveWorkflow(instanceId, { metadata: { approvedBy: userId } });
```

Inside the workflow, reportProgress() streams non-durable updates to connected clients, while durable step methods like step.updateAgentState() persist changes that won’t replay on retry.
Pausing for a human, a clock, or an event
The capability that separates durable execution from a retry loop is the ability to stop and wait without holding resources. A workflow can call step.waitForEvent() and pause until an external event arrives — a webhook, a callback, or a human clicking approve — with a timeout you set anywhere from one second to a year (the default is 24 hours). You deliver the event to a waiting instance through the Workers binding, the REST API, or Wrangler, and events that arrive early are buffered until the workflow reaches the wait. The Agents SDK wraps this in a waitForApproval() helper that returns the approval payload or throws if the request is rejected — Cloudflare’s example uses a seven-day window for an expense approval.
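An approval gate built on that wait might look like the sketch below. StepLike is a local stand-in modeled on step.waitForEvent(name, { type, timeout }), so it can run outside the Workers runtime. The payload shape, the "approval" event type, and treating a timeout as a thrown error are assumptions to verify against the current Workflows docs.

```typescript
// Local stand-in for the part of the step API this sketch uses.
// ASSUMPTION: shaped after step.waitForEvent(name, { type, timeout }); check
// the Workflows docs for the exact signature and return type.
interface ApprovalEvent {
  payload: { approved: boolean; approvedBy: string };
}

interface StepLike {
  waitForEvent(name: string, opts: { type: string; timeout: string }): Promise<ApprovalEvent>;
}

// Pause until a human decides, then report whether the action may proceed.
async function gateOnApproval(step: StepLike, action: string): Promise<boolean> {
  try {
    const event = await step.waitForEvent(`approve ${action}`, {
      type: "approval",
      timeout: "7 days",
    });
    return event.payload.approved;
  } catch {
    return false; // ASSUMPTION: a timeout surfaces as a thrown error; treat it as "not approved"
  }
}

// Fake steps for trying the gate locally (hypothetical, test-only).
const approvingStep: StepLike = {
  waitForEvent: async () => ({ payload: { approved: true, approvedBy: "alice" } }),
};
const timedOutStep: StepLike = {
  waitForEvent: async () => {
    throw new Error("timed out");
  },
};
```

Defaulting to "not approved" on timeout is a deliberate choice for high-stakes actions: silence should never be read as consent.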
The economic point is the one that makes this practical: a workflow that is sleeping, waiting for a retry, or waiting for an event does not count against your concurrency limit and burns no CPU time while it waits. So gating a high-stakes action — a money transfer, a data deletion, a public post — behind a human approval costs nothing while the approval sits in someone’s inbox for three days. The same mechanism handles scheduled sequences: step.sleep("wait", "2 days") or sleepUntil(date) pauses for up to a year, which is how you build a multi-day onboarding or follow-up flow without a separate cron service.
Limits, pricing, and status
A 2026 rearchitecture of the Workflows control plane raised the ceilings substantially: 50,000 concurrent instances, 300 new instances per second per account, and two million queued instances per workflow. A single instance can run up to 10,000 steps by default — raisable to 25,000 — each step can persist up to 1 MiB of state, and a step can sleep for up to 365 days.
Pricing follows the Workers model rather than wall-clock: you pay for CPU time, invocations, and stored state. A workflow that’s sleeping, waiting on an API, or paused for approval consumes no CPU and costs nothing while idle. Production runs on the Workers Paid plan, which carries a $5-per-month account minimum, and persisted state is billed per gigabyte-month. Workflows has been generally available since April 2025, with the control-plane rearchitecture landing in April 2026.
One housekeeping note for anyone checking the docs alongside this: the published limits reference table lagged the 2026 rearchitecture at the time of writing, still showing the older concurrency numbers — the rearchitecture announcement is the authoritative source for the current figures. Re-verify both before relying on a specific number.
Workflows vs. the alternatives
Durable execution isn’t unique to Cloudflare; the question is operational cost and fit.
| Engine | Model | Ops burden | The trade-off |
|---|---|---|---|
| Cloudflare Workflows | Code-first TS/JS on Workers | None — no cluster to run | Cloudflare concentration; young platform |
| AWS Step Functions | JSON-based state-machine DSL | Managed by AWS | AWS-centric; you write a DSL, not code |
| Temporal | Code, language-agnostic | Operate a cluster, or pay Temporal Cloud | Very powerful; heavier day-one cost |
| Inngest / DBOS | Code-first | Low (Inngest serverless; DBOS uses your Postgres) | Best for event-driven or single-service apps |
The honest summary: Workflows is the lowest-operations durable-execution engine if you are already on Cloudflare, and it’s the only one with a first-class Agents-SDK integration — the cost is Cloudflare concentration and a platform that’s only been GA since 2025. Step Functions is a reasonable default inside AWS but locks you into its DSL and ecosystem; Temporal is the right answer for heavy cross-service orchestration or multi-cloud portability, if you can absorb the operational weight. The deeper comparison against agent frameworks like LangGraph and AWS Bedrock AgentCore lives in its own spoke.
The honest trade-offs
- Idempotency is your job. Because steps retry, you guard non-reversible side effects yourself. This is real design work, not a checkbox.
- The 1 MiB step-output cap is the common foot-gun. A step.do() return value over 1 MiB fails; large outputs go to R2 with a reference returned instead.
- Vendor concentration. Workflows is Cloudflare-specific, so there’s no porting to Step Functions or Temporal without a rewrite. Weigh that against the operational simplicity it buys.
- Young platform, fast change. GA in 2025, rearchitected in 2026, with limits and APIs that move monthly. The AgentWorkflow integration rides on the pre-1.0 Agents SDK, which changes often. Pin versions and expect to re-check.
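For the step-output cap in particular, the store-and-return-a-reference pattern can be sketched as follows. BucketLike is a minimal stand-in that models only a put(key, value) call, and toStepOutput, StepOutput, and the in-memory bucket are hypothetical names. Measuring the JSON-encoded size is also an assumption, since the engine's own serialization may count bytes differently.

```typescript
// Minimal stand-in for an object-store binding.
// ASSUMPTION: only a put(key, value) call is modeled here.
interface BucketLike {
  put(key: string, value: string): Promise<unknown>;
}

const MAX_STEP_OUTPUT_BYTES = 1024 * 1024; // the 1 MiB cap on a step's return value

type StepOutput<T> = { kind: "inline"; value: T } | { kind: "ref"; key: string };

// Return small results inline; park large ones in the bucket and return a key.
async function toStepOutput<T>(bucket: BucketLike, key: string, value: T): Promise<StepOutput<T>> {
  const json = JSON.stringify(value);
  const bytes = new TextEncoder().encode(json).byteLength;
  if (bytes < MAX_STEP_OUTPUT_BYTES) return { kind: "inline", value };

  await bucket.put(key, json);
  return { kind: "ref", key };
}

// In-memory bucket for local experimentation (hypothetical).
const stored = new Map<string, string>();
const memoryBucket: BucketLike = {
  put: async (key, value) => {
    stored.set(key, value);
  },
};
```

A later step would read the object back by key, so the checkpoint itself stays small no matter how large the underlying result grows.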
Frequently asked
How is this different from writing my own retry logic in the agent?
Does a workflow cost money while it waits for an approval?
When should I use a long-running agent session instead of a Workflow?
Is Cloudflare Workflows production-ready?
Working with Truvisory
If you’d rather have a durable, retry-safe agent pipeline built and shipped than wire up the step model and approval gates yourself, this is the work we do: senior-engineer-led, fixed-scope agent systems on Cloudflare, with durable execution designed in from the first step. See how we deliver agent systems, or read the pillar guide to the full stack.