
Durable Execution for AI Agents: How Cloudflare Workflows Makes Multi-Step Work Survive Failure

Tony Adams · 8 min read · May 2026

A demo agent makes one model call and returns. A production agent chains ten — classify, plan, call a tool, call another, write to a database, send an email — and any one of them can fail: a rate limit, a timeout, a dropped connection, a redeploy mid-run. Durable execution is what keeps that chain alive. On Cloudflare you get it from Workflows: every step’s result is checkpointed, so when something fails, the run resumes from the last successful step instead of starting over — no re-charging the customer, no re-sending the email, no re-paying for a model call you already made. Workflows is how an agent’s multi-step work survives contact with the real world, and it’s the platform’s answer to two of the seven requirements a production agent has to meet: error recovery and human-in-the-loop.

This is the deep dive behind the durable-execution layer of our guide to building production AI agents on Cloudflare, and the companion to the agent-as-Durable-Object spoke, which handed off exactly here: for work that should fail and retry one step at a time rather than run as one long session, the right tool is a Workflow.

// THE SHORT VERSION

Durable execution means an agent’s multi-step work survives failure. Workflows persists the result of every completed step, so a crash, a redeploy, or a flaky API resumes from the last good step instead of re-running the whole pipeline.
The Agent and the Workflow split the job: the Agent (a Durable Object) owns live conversation and per-agent state; the Workflow owns the durable, retryable, long-running pipeline. The AgentWorkflow class wires them together.
A workflow can pause — for a clock, an external event, or a human approval — indefinitely and at no compute cost, which is how you gate high-stakes actions like sending money or deleting data.
Status is solid but young: Workflows is generally available, and a 2026 rearchitecture raised the ceilings to 50,000 concurrent instances. The numbers and the API move often, so pin versions and re-check the docs.

What durable execution means, and how Workflows does it

A Workflow is a class extending WorkflowEntrypoint with one method, run(event, step). Inside it, you wrap each unit of work in step.do(name, callback). The engine runs the callback, persists its return value, and moves on; if the workflow is interrupted and later resumes, completed steps are not executed again, and the engine returns their saved results instead. Each step.do() is, in effect, a checkpoint. Under the hood every workflow instance runs on a SQLite-backed Durable Object that stores those checkpoints. That is the same primitive the Durable Objects deep dive covers, so we won’t re-explain it here.
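To make the replay behavior concrete, here is a toy, in-memory model of checkpointing. It is not the Workflows engine or its API: CheckpointStore, runStep, and pipeline are hypothetical names, the store is a plain Map rather than durable storage, and steps are synchronous for brevity. It only illustrates the idea that a completed step's saved result is returned on a second pass instead of running the callback again.

```typescript
// Toy model of step checkpointing. All names are hypothetical; the real
// engine persists results durably, this sketch uses an in-memory Map.
type CheckpointStore = Map<string, unknown>;

// Run a named step once; on later passes return the saved result instead.
function runStep<T>(store: CheckpointStore, name: string, work: () => T): T {
  if (store.has(name)) {
    return store.get(name) as T; // completed earlier: skip the callback
  }
  const result = work(); // may throw, in which case nothing is saved
  store.set(name, result);
  return result;
}

// Counters let us observe how many times each callback actually ran.
const calls = { plan: 0, tool: 0 };
let toolShouldFail = true;

function pipeline(store: CheckpointStore): string {
  const plan = runStep(store, "plan", () => {
    calls.plan += 1;
    return "look up order";
  });
  const toolResult = runStep(store, "call tool", () => {
    calls.tool += 1;
    if (toolShouldFail) throw new Error("upstream timeout");
    return "order found";
  });
  return `${plan} -> ${toolResult}`;
}

const store: CheckpointStore = new Map();
let firstAttemptFailed = false;
try {
  pipeline(store); // first pass: "plan" is saved, "call tool" throws
} catch {
  firstAttemptFailed = true;
}

toolShouldFail = false;
const finalResult = pipeline(store); // second pass: "plan" is not re-run
```

After the second pass the "plan" callback has run once and the "call tool" callback twice, which is the property the real engine provides across crashes and redeploys rather than within one process.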

A minimal workflow shows the shape:

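import { WorkflowEntrypoint, WorkflowEvent, WorkflowStep } from "cloudflare:workers";
import { NonRetryableError } from "cloudflare:workflows";

// Shape of the payload supplied when an instance is created (assumed here).
type Params = { orderId: string };

// Env is the Worker's bindings type; fetchOrder, chargeCustomer, and fulfill
// are application helpers defined elsewhere.
export class OrderWorkflow extends WorkflowEntrypoint<Env, Params> {
  async run(event: WorkflowEvent<Params>, step: WorkflowStep) {
    // Terminal validation failure: NonRetryableError stops retries at once.
    const order = await step.do("validate order", async () => {
      if (!event.payload.orderId) throw new NonRetryableError("orderId required");
      return await fetchOrder(event.payload.orderId);
    });

    // Transient failures retry this step only, up to three times with backoff.
    await step.do("charge payment",
      { retries: { limit: 3, delay: "5 seconds", backoff: "exponential" } },
      async () => chargeCustomer(order));

    // The run pauses here and resumes after the delay.
    await step.sleep("cooldown", "2 days");
    await step.do("fulfill", async () => fulfill(order));
  }
}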

Two things in that snippet do real work. The retry config — limit, delay, and a backoff of constant, linear, or exponential — turns a transient failure into an automatic retry of just that step. And NonRetryableError is the escape hatch: throw it for a terminal error like bad input or a failed auth, and the engine stops retrying immediately instead of burning through the budget.
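As a rough illustration of what the three backoff modes mean for wait times, the sketch below computes the delay before each retry. The formulas (fixed, delay times attempt number, and doubling) are common conventions and an assumption on our part; the exact arithmetic Workflows applies is not stated here, so check the docs before relying on specific timings. retryDelaysMs is a hypothetical helper, not part of any Cloudflare API.

```typescript
// Sketch of the wait before each retry under the three backoff modes named
// above. The formulas are common conventions and an assumption here, not
// Cloudflare's documented arithmetic.
type Backoff = "constant" | "linear" | "exponential";

function retryDelaysMs(limit: number, baseDelayMs: number, backoff: Backoff): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt <= limit; attempt++) {
    if (backoff === "constant") {
      delays.push(baseDelayMs); // same wait every time
    } else if (backoff === "linear") {
      delays.push(baseDelayMs * attempt); // wait grows by one base delay per attempt
    } else {
      delays.push(baseDelayMs * Math.pow(2, attempt - 1)); // wait doubles per attempt
    }
  }
  return delays;
}

// Mirrors the config in the snippet: limit 3, delay "5 seconds".
const constantDelays = retryDelaysMs(3, 5_000, "constant");
const linearDelays = retryDelaysMs(3, 5_000, "linear");
const exponentialDelays = retryDelaysMs(3, 5_000, "exponential");
```

Under these assumptions the exponential schedule waits 5, 10, and 20 seconds, which gives a rate-limited upstream progressively more room to recover than a constant 5-second retry would.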

The one discipline durable execution demands is idempotency. Because a step can run more than once, its body should be safe to repeat — and before a non-reversible side effect like charging a card or sending an email, you check whether it already happened. Cloudflare’s own guidance is blunt about it: don’t cram all your logic into one step, and don’t hold state outside a step, because a workflow can hibernate and lose everything in memory.
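A minimal sketch of the "check whether it already happened" guard, under stated assumptions: ChargeLedger, chargeOnce, and callPaymentProvider are hypothetical names, and the ledger is an in-memory Map standing in for something durable such as a database row or a payment provider's own idempotency-key support.

```typescript
// Sketch of an idempotency guard around a non-reversible side effect.
// All names are hypothetical; a real ledger must be durable, not a Map.
type ChargeReceipt = { chargeId: string; amountCents: number };
type ChargeLedger = Map<string, ChargeReceipt>;

let providerCalls = 0;

// Stand-in for the real payment call; counts how often it is hit.
function callPaymentProvider(amountCents: number): ChargeReceipt {
  providerCalls += 1;
  return { chargeId: `ch_${providerCalls}`, amountCents };
}

// Charge at most once per idempotency key, however often the step body runs.
function chargeOnce(ledger: ChargeLedger, idempotencyKey: string, amountCents: number): ChargeReceipt {
  const existing = ledger.get(idempotencyKey);
  if (existing) return existing; // already charged on an earlier attempt
  const receipt = callPaymentProvider(amountCents);
  ledger.set(idempotencyKey, receipt);
  return receipt;
}

const ledger: ChargeLedger = new Map();
const firstAttempt = chargeOnce(ledger, "order-123:charge", 4_999);
const retriedAttempt = chargeOnce(ledger, "order-123:charge", 4_999); // simulated retry
```

Note the remaining gap in this check-then-act shape: a crash after the provider call but before the ledger write would still charge twice on retry. Passing the same key to a provider that deduplicates on its side is the usual way to close that window.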

The demo that breaks on the first flaky call

Here’s why this matters for agents specifically. String together ten tool and model calls, each individually reliable, and the end-to-end run still fails at a meaningful rate. Without checkpointing, one flaky call restarts the whole thing — re-running the expensive model calls and, worse, re-firing side effects that should happen exactly once. Cloudflare’s reference pattern for a durable agent wraps each model turn and each tool call in its own step.do(), precisely so that after the model responds, the response is saved and a resume skips straight to tool execution; and after a tool runs, its result is saved so a later failure doesn’t re-run it. An agent that sends an email twice because it crashed and retried isn’t a rounding error to the person who got the email.
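The compounding is easy to underestimate, so here is the arithmetic as a short sketch. The 99% per-call success rate is an illustrative assumption rather than a measurement, and the calls are treated as independent.

```typescript
// Probability that a chain of independent calls completes with no failure.
// The 99% per-call figure is an illustrative assumption, not a measurement.
function chainSuccessRate(perCallSuccess: number, calls: number): number {
  return Math.pow(perCallSuccess, calls);
}

const tenCallSuccess = chainSuccessRate(0.99, 10); // about 0.904
const tenCallFailure = 1 - tenCallSuccess;         // about 0.096
```

With ten calls at 99% each, roughly one run in ten hits at least one failure. Without checkpointing, every one of those runs starts again from the top.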

This is the pillar’s error-recovery requirement made concrete. The flashy demo that impresses in the room and dies on the first upstream rate limit is the default outcome when there’s no durable execution underneath. Exponential backoff on a checkpointed step turns those upstream hiccups — rate limits, brief outages, dropped connections — into recovery the user never sees.

Agent or Workflow? Where each piece of logic goes

The most common confusion is which layer owns what. Cloudflare’s own rule of thumb: agents are for real-time communication and state; workflows are for durable execution, retries, and waiting. In practice:

// Which layer owns which agent logic — Agent vs Workflow vs long-running session
Put it in…	When
The Agent (Durable Object)	Live chat, per-agent state, WebSocket streaming, quick (under ~30s) calls
A Workflow	Multi-step pipelines that retry step-by-step, run for minutes to weeks, wait for approval, or must survive a crash
A long-running agent session	The work is genuinely one continuous session, not discrete retryable steps

The agent owns state; the workflow owns the pipeline. The deep treatment of agent state and the long-running-session option lives in the agent-as-Durable-Object spoke — this spoke won’t re-derive it. What connects the two is the AgentWorkflow class, added to the Agents SDK in early 2026, which lets an agent spawn a workflow and receive progress, state updates, and approvals back. An agent kicks off a run and tracks it:

const instanceId = await this.runWorkflow("EXPENSE_WORKFLOW", params, { metadata: { userId } });
// later, from a human action:
await this.approveWorkflow(instanceId, { metadata: { approvedBy: userId } });

Inside the workflow, reportProgress() streams non-durable updates to connected clients, while durable step methods like step.updateAgentState() persist changes that won’t replay on retry.

Pausing for a human, a clock, or an event

The capability that separates durable execution from a retry loop is the ability to stop and wait without holding resources. A workflow can call step.waitForEvent() and pause until an external event arrives — a webhook, a callback, or a human clicking approve — with a timeout you set anywhere from one second to a year (the default is 24 hours). You deliver the event to a waiting instance through the Workers binding, the REST API, or Wrangler, and events that arrive early are buffered until the workflow reaches the wait. The Agents SDK wraps this in a waitForApproval() helper that returns the approval payload or throws if the request is rejected — Cloudflare’s example uses a seven-day window for an expense approval.
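The approval gate described above can be sketched as follows. To keep the example runnable outside Workers, it is written against WaitStep, a hypothetical structural stand-in for the single step method it uses. The (name, { type, timeout }) call shape follows the description above, and the assumption that the wait rejects when the timeout elapses should be verified against the current docs. gateTransfer, fakeStep, and the "transfer-approval" event type are illustrative names.

```typescript
// Minimal stand-in for the one step method used here, so the sketch runs
// outside Workers. Treat the details as assumptions and check the docs.
interface WaitStep {
  waitForEvent<T>(name: string, options: { type: string; timeout?: string }): Promise<{ payload: T }>;
}

type ApprovalDecision = { approved: boolean; approvedBy: string };
type GateOutcome = "sent" | "rejected" | "timed out";

// Gate a high-stakes action behind a human decision. A timeout and a
// rejection both end with the action not running.
async function gateTransfer(step: WaitStep, send: () => void): Promise<GateOutcome> {
  const decision = await step
    .waitForEvent<ApprovalDecision>("wait for approval", {
      type: "transfer-approval", // hypothetical event type
      timeout: "7 days",
    })
    .then(
      (event) => event.payload,
      () => null, // assumed: the wait rejects when the timeout elapses
    );
  if (decision === null) return "timed out";
  if (!decision.approved) return "rejected";
  send();
  return "sent";
}

// In-memory fake used to exercise the gate without a real workflow.
function fakeStep(outcome: ApprovalDecision | "timeout"): WaitStep {
  return {
    waitForEvent<T>(): Promise<{ payload: T }> {
      return outcome === "timeout"
        ? Promise.reject(new Error("timeout"))
        : Promise.resolve({ payload: outcome as unknown as T });
    },
  };
}
```

The design choice worth copying is that the side effect sits after the wait and behind both checks, so neither a missing decision nor an explicit rejection can reach it.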

The economic point is what makes this practical: a workflow that is sleeping, waiting for a retry, or waiting for an event does not count against your concurrency limit and burns no CPU time while it waits. So gating a high-stakes action (a money transfer, a data deletion, a public post) behind a human approval costs nothing in compute while the approval sits in someone’s inbox for three days. The same mechanism handles scheduled sequences: step.sleep("wait", "2 days") or step.sleepUntil("wait", date) pauses for up to a year, which is how you build a multi-day onboarding or follow-up flow without a separate cron service.

Limits, pricing, and status

A 2026 rearchitecture of the Workflows control plane raised the ceilings substantially: 50,000 concurrent instances, 300 new instances per second per account, and two million queued instances per workflow. A single instance can run up to 10,000 steps by default — raisable to 25,000 — each step can persist up to 1 MiB of state, and a step can sleep for up to 365 days.

50,000 concurrent Workflow instances per account after the 2026 control-plane rearchitecture, plus 300 new instances/second and 2M queued per workflow (source: Cloudflare Workflows v2 announcement)

Pricing follows the Workers model rather than wall-clock: you pay for CPU time, invocations, and stored state. A workflow that’s sleeping, waiting on an API, or paused for approval consumes no CPU and costs nothing while idle. Production runs on the Workers Paid plan, which carries a $5-per-month account minimum, and persisted state is billed per gigabyte-month. Workflows has been generally available since April 2025, with the control-plane rearchitecture landing in April 2026.

One housekeeping note for anyone checking the docs alongside this: the published limits reference table lagged the 2026 rearchitecture at the time of writing, still showing the older concurrency numbers — the rearchitecture announcement is the authoritative source for the current figures. Re-verify both before relying on a specific number.

Workflows vs. the alternatives

Durable execution isn’t unique to Cloudflare; the question is operational cost and fit.

// Durable-execution engines compared — model, ops burden, and trade-off
Engine	Model	Ops burden	The trade-off
Cloudflare Workflows	Code-first TS/JS on Workers	None — no cluster to run	Cloudflare concentration; young platform
AWS Step Functions	JSON-based state-machine DSL	Managed by AWS	AWS-centric; you write a DSL, not code
Temporal	Code, language-agnostic	Operate a cluster, or pay Temporal Cloud	Very powerful; heavier day-one cost
Inngest / DBOS	Code-first	Low (Inngest serverless; DBOS uses your Postgres)	Best for event-driven or single-service apps

The honest summary: Workflows is the lowest-operations durable-execution engine if you are already on Cloudflare, and it’s the only one with a first-class Agents-SDK integration — the cost is Cloudflare concentration and a platform that’s only been GA since 2025. Step Functions is a reasonable default inside AWS but locks you into its DSL and ecosystem; Temporal is the right answer for heavy cross-service orchestration or multi-cloud portability, if you can absorb the operational weight. The deeper comparison against agent frameworks like LangGraph and AWS Bedrock AgentCore lives in its own spoke.

The honest trade-offs

Idempotency is your job. Because steps retry, you guard non-reversible side effects yourself. This is real design work, not a checkbox.
The 1 MiB step-output cap is the common foot-gun. A step.do() return value over 1 MiB fails; large outputs go to R2 with a reference returned instead.
Vendor concentration. Workflows is Cloudflare-specific — there’s no porting to Step Functions or Temporal without a rewrite. Weigh that against the operational simplicity it buys.
Young platform, fast change. GA in 2025, rearchitected in 2026, with limits and APIs that move monthly. The AgentWorkflow integration rides on the pre-1.0 Agents SDK, which changes often. Pin versions and expect to re-check.
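For the step-output cap in particular, the "store it elsewhere and return a reference" pattern can be sketched like this. BlobStore is a hypothetical stand-in for an object store such as R2 (here just a Map), toStepOutput and serializedBytes are illustrative helpers, and measuring JSON byte length is only an approximation of how the engine sizes a step result.

```typescript
// Sketch of keeping a step's return value under the 1 MiB cap by offloading
// large payloads and returning a small reference instead. Names are
// hypothetical; JSON byte length only approximates the engine's measurement.
const STEP_OUTPUT_LIMIT_BYTES = 1024 * 1024; // 1 MiB

type BlobStore = Map<string, string>;
type StepOutput<T> =
  | { kind: "inline"; value: T }
  | { kind: "ref"; key: string; bytes: number };

// UTF-8 size of the JSON form of a value.
function serializedBytes(value: unknown): number {
  return new TextEncoder().encode(JSON.stringify(value)).length;
}

// Return small values inline; park large ones in the store and return a pointer.
function toStepOutput<T>(blobs: BlobStore, key: string, value: T): StepOutput<T> {
  const bytes = serializedBytes(value);
  if (bytes <= STEP_OUTPUT_LIMIT_BYTES) return { kind: "inline", value };
  blobs.set(key, JSON.stringify(value)); // in practice: an object-store write
  return { kind: "ref", key, bytes };
}

const blobs: BlobStore = new Map();
const smallOutput = toStepOutput(blobs, "run-1/summary", { summary: "ok" });
const largeOutput = toStepOutput(blobs, "run-1/transcript", { text: "x".repeat(2 * 1024 * 1024) });
```

A later step then reads the object back by key, so the checkpoint itself stays a few dozen bytes regardless of how large the underlying payload is.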

Frequently asked

How is this different from writing my own retry logic in the agent?

Hand-rolled retries live in memory and die with the process — a crash or redeploy mid-run loses everything and re-runs from the top, including side effects. Workflows persists each step result durably, so a resume skips completed work and never re-fires a step that already succeeded. You also get pausing, scheduling, and human approval for free, which retry loops do not provide.

Does a workflow cost money while it waits for an approval?

No. A workflow that is sleeping or waiting for an event burns no CPU time and does not count against your concurrency limit. You can leave an approval pending for days at no compute cost.

When should I use a long-running agent session instead of a Workflow?

Use a long-running session when the work is genuinely one continuous task; use a Workflow when it is a series of discrete steps that should each be able to fail and retry independently. The long-running-session option is covered in the agent-as-Durable-Object spoke.

Is Cloudflare Workflows production-ready?

Workflows itself has been generally available since 2025 and was rearchitected for higher scale in 2026. The piece to watch is the Agents-SDK AgentWorkflow integration, which is newer and rides on a fast-moving pre-1.0 SDK — solid, but pin your versions.

Working with Truvisory

If you’d rather have a durable, retry-safe agent pipeline built and shipped than wire up the step model and approval gates yourself, this is the work we do: senior-engineer-led, fixed-scope agent systems on Cloudflare, with durable execution designed in from the first step. See how we deliver agent systems, or read the pillar guide to the full stack.

Tony Adams is the founder of Truvisory®. He builds Cloudflare-native AI systems for federal and commercial clients. SBA-verified SDVOSB and VOSB, SAM.gov-registered.

