Real-Time Voice Agents on Cloudflare: The Streaming Speech-to-Text → LLM → Text-to-Speech Loop

Tony Adams 9 min read

A real-time voice agent — the kind that answers a phone call or holds a spoken conversation — is, underneath, a streaming loop: speech comes in and is transcribed, a language model decides what to say, and the reply is synthesized back to audio, all while the conversation’s history lives in a stateful agent. On Cloudflare that loop is the experimental @cloudflare/voice add-on for the Agents SDK, paired with Workers AI speech models and a WebSocket or WebRTC transport — and the whole thing can run colocated in the edge location the caller connects to. That colocation is the point, because most voice latency isn’t model time; in Cloudflare’s own words, it’s “the cost of bouncing audio and text between different services in different places.”

This is the real-time application of the stack the pillar describes. Voice is “just another input to the same agent,” so this spoke leans on its siblings: the agent itself is a stateful Durable Object — and the Durable Object cold-start caveat matters more for live voice than for anything else, which is why we tie the latency discussion back to it.

The @cloudflare/voice add-on

Cloudflare shipped @cloudflare/voice during Agents Week 2026; the launch post calls it experimental and the docs label it Beta — either way, treat it as early. It adds real-time voice to the same Agent class you’d already write, over a single WebSocket, with conversation history in the agent’s embedded SQLite. The minimal server is genuinely short:

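import { Agent, routeAgentRequest } from "agents";
import { withVoice, WorkersAIFluxSTT, WorkersAITTS, type VoiceTurnContext } from "@cloudflare/voice";

// Mix real-time voice into the standard Agent class.
const VoiceAgent = withVoice(Agent);

// Env is assumed to declare the Workers AI binding as `AI`.
export class MyAgent extends VoiceAgent<Env> {
  // Speech-to-text and text-to-speech providers, both running on Workers AI.
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  // Runs when the speech model signals end of turn; the returned text is synthesized.
  async onTurn(transcript: string, context: VoiceTurnContext) {
    return `You said: ${transcript}`;
  }
}

export default {
  // Route incoming requests (including the voice WebSocket) to the agent.
  async fetch(request: Request, env: Env) {
    return (await routeAgentRequest(request, env)) ?? new Response("Not found", { status: 404 });
  }
} satisfies ExportedHandler<Env>;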

You mix withVoice into an Agent, set a transcriber and a TTS provider, and implement onTurn. There’s a withVoiceInput variant for transcription-only use (dictation, voice search) and React hooks (useVoiceAgent) that handle mic capture, playback, and interrupt detection on the client. On the wrangler side you need an AI binding, a Durable Object binding for the agent, and a SQLite migration — the same setup any stateful agent uses.

How the pipeline works, step by step

Once a call starts, the flow is:

  • The browser captures microphone audio and streams it as binary WebSocket frames.
  • The agent opens a single transcription session that lives for the whole call, and audio streams into it continuously.
  • The speech model fires an end-of-turn event when it detects you’ve stopped talking.
  • onTurn runs, typically an LLM call.
  • The reply is split into sentences and synthesized as each sentence completes, so audio starts coming back before the full response is generated.
  • Every message persists to SQLite, so it survives reconnects and deploys.

Two design choices in there carry most of the quality. First, streaming at every stage. If onTurn returns a stream rather than a finished string, the pipeline chunks it into sentences and starts synthesis on the first one — what Cloudflare calls optimizing for time-to-first-audio, the single biggest lever on whether a reply feels fast. Second, model-driven turn detection. The client doesn’t tell the system when you’ve stopped speaking; the speech model decides, using both the sound and the content of your speech rather than a crude silence timer. Interruption — barge-in — is handled too: if the user speaks while the agent is talking, the current response is cancelled and the turn restarts, with the cancellation surfaced as an abort signal your onTurn code can respect.
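To make the sentence-chunking idea concrete, here is a minimal sketch of how a token stream could be grouped into sentences for synthesis. It is not the add-on's implementation: splitIntoSentences is a hypothetical helper, the boundary rule (terminal punctuation followed by whitespace) is an assumption, and the optional AbortSignal stands in for the barge-in cancellation described above.

```typescript
// Hypothetical helper, not part of @cloudflare/voice: groups streamed text
// into sentences so each one can be handed to synthesis as soon as it is complete.
async function* splitIntoSentences(
  tokens: AsyncIterable<string>,
  signal?: AbortSignal,
): AsyncGenerator<string> {
  let buffer = "";
  // Assumed boundary rule: ., ! or ? followed by whitespace ends a sentence.
  const boundary = /[.!?]+\s+/;

  for await (const token of tokens) {
    // Barge-in: once the turn is aborted, stop emitting anything further.
    if (signal?.aborted) return;
    buffer += token;

    // Emit every complete sentence currently sitting in the buffer.
    let match = boundary.exec(buffer);
    while (match !== null) {
      const end = match.index + match[0].length;
      const sentence = buffer.slice(0, end).trim();
      if (sentence) yield sentence;
      buffer = buffer.slice(end);
      match = boundary.exec(buffer);
    }
  }

  // Flush whatever is left when the stream ends without trailing whitespace.
  const rest = buffer.trim();
  if (rest && !signal?.aborted) yield rest;
}
```

The first sentence is yielded as soon as its boundary arrives, which is what lets audio start before the model has finished. A rule this simple will split on abbreviations such as "Dr. Smith", so treat it as an illustration of the shape rather than a production tokenizer.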

The speech and model layer

The defaults run entirely on Workers AI, so you can build without signing up for a third-party speech vendor. Transcription defaults to Deepgram Flux — a model built specifically for voice agents that folds end-of-turn detection into the recognizer itself, reporting end-of-turn decisions in roughly a quarter-second — with Nova-3 available when you want dictation-style accuracy; synthesis defaults to Deepgram Aura. Whisper and MeloTTS are also on Workers AI if you want alternatives. Workers AI prices speech by usage — Flux transcription runs about three-quarters of a cent per audio minute and Aura synthesis about a cent and a half per thousand characters — which are inference costs only; the full-stack economics belong on the cluster’s cost pages, not here.
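As a rough worked example of those inference prices, the sketch below estimates the speech cost of a single call. The two rates are the approximate figures quoted above, not authoritative pricing. The call shape (five minutes of audio, 2,000 characters of spoken reply) is hypothetical, and the estimate assumes billed audio minutes roughly equal call length. LLM tokens are not included.

```typescript
// Approximate per-unit prices from the text above, in USD. Check current pricing.
const FLUX_USD_PER_AUDIO_MINUTE = 0.0075; // "about three-quarters of a cent"
const AURA_USD_PER_1K_CHARACTERS = 0.015; // "about a cent and a half"

// Speech-only inference cost for one call (model tokens are excluded).
function estimateSpeechCostUsd(audioMinutes: number, spokenCharacters: number): number {
  const transcription = audioMinutes * FLUX_USD_PER_AUDIO_MINUTE;
  const synthesis = (spokenCharacters / 1000) * AURA_USD_PER_1K_CHARACTERS;
  return transcription + synthesis;
}

// Hypothetical call: 5 minutes of audio, 2,000 characters of synthesized reply.
// 5 * 0.0075 + 2 * 0.015 = 0.0375 + 0.03, roughly 0.0675 USD.
const exampleCallCostUsd = estimateSpeechCostUsd(5, 2000);
console.log(exampleCallCostUsd.toFixed(4));
```

Multiplying that per-call figure by expected concurrent calls and call volume gives a first-order view of the speech line item before model tokens are added.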

Two layers around the models matter. AI Gateway added a realtime WebSockets API that can sit in the inference path for streaming voice, adding caching, observability, rate limiting, and failover, and it proxies realtime providers including OpenAI’s Realtime API, Gemini Live, ElevenLabs, Cartesia, and Deepgram on Workers AI. And the provider interfaces are deliberately small, so bringing your own is straightforward — there are first-party adapters for Deepgram, ElevenLabs, and Twilio, a community Telnyx package, and an open invitation to wire in others. The speech-to-speech pattern (OpenAI’s Realtime API over WebSockets) is a separate option you can also route through the gateway.

Getting audio in and out: transport and telephony

By default the add-on uses a single WebSocket, which is the simplest path for one-to-one voice and needs no meeting infrastructure. For live client conversations, WebRTC is technically better — it rides UDP to avoid head-of-line blocking, its codec adapts to packet loss, and it brings built-in echo cancellation — and Cloudflare has the WebRTC pieces (its Realtime product and the RealtimeKit SDKs that came out of the Dyte acquisition), with the ability to pipe WebRTC audio into a Worker as raw audio. That path is real but still maturing; the voice add-on is WebSocket-native today and ships SFU utilities for the WebRTC route.

For phone calls, the thing to be clear-eyed about is that Cloudflare is not a telephone company — phone numbers and PSTN connectivity come from a third party like Twilio or Telnyx. The first-party path is a Twilio adapter: a phone call hits Twilio, Twilio streams the call’s audio over a WebSocket to your Worker, and the adapter bridges it to your voice agent. There’s one integration gotcha worth stating plainly because it will bite you: Twilio’s media streams are 8 kHz mulaw audio, but the default Workers AI synthesis returns MP3, which the Workers runtime can’t decode to the raw format the phone line needs. For production telephony you swap in a TTS provider that emits raw PCM or mulaw directly. This phone-call use case is central in practice — it’s the difference between a demo in a browser tab and an agent a customer can actually call. We run a live one ourselves: the voice agent on Truvisory.com answers a real phone number, built on exactly this pipeline.
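To show what the format gap involves, here is a minimal sketch of G.711 mulaw companding, the step that turns 16-bit linear PCM samples into the 8-bit values a phone line carries. It assumes the audio is already mono and resampled to 8 kHz. The function names are hypothetical, and nothing here claims to be how the Twilio adapter handles conversion internally.

```typescript
// G.711 mu-law constants: bias added before the segment search, and the clip level.
const MULAW_BIAS = 0x84;
const MULAW_CLIP = 32635;

// Encode one signed 16-bit PCM sample as one mu-law byte.
function pcm16SampleToMulaw(sample: number): number {
  const sign = (sample >> 8) & 0x80; // 0x80 for negative samples, 0 otherwise
  let magnitude = sign !== 0 ? -sample : sample;
  if (magnitude > MULAW_CLIP) magnitude = MULAW_CLIP;
  magnitude += MULAW_BIAS;

  // Find the segment (exponent): the position of the highest set bit, from bit 14 down.
  let exponent = 7;
  for (let mask = 0x4000; (magnitude & mask) === 0 && exponent > 0; mask >>= 1) {
    exponent--;
  }
  const mantissa = (magnitude >> (exponent + 3)) & 0x0f;

  // mu-law bytes are transmitted bit-inverted.
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}

// Encode a whole buffer of 8 kHz mono PCM16 samples.
function encodePcm16ToMulaw(pcm: Int16Array): Uint8Array {
  const out = new Uint8Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    out[i] = pcm16SampleToMulaw(pcm[i]);
  }
  return out;
}
```

The encoding is cheap arithmetic on raw samples, which is why raw PCM output is workable for telephony while a compressed MP3 stream is not: the MP3 would first have to be decoded, and that is the part the runtime cannot do.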

Latency engineering, and the cold-start caveat

Voice is unforgiving about latency in a way text never is. Human conversation has turn-taking gaps around 200 milliseconds, and a perceived response under roughly 500 to 800 milliseconds is what feels natural; past that, the agent feels laggy. The budget accumulates across the whole loop — network in, transcription including the wait to be sure you’ve finished, the model’s time to its first token, synthesis time to first byte, and playback — and the two contributors people underestimate are turn detection (how long the system waits before deciding you’re done) and, again, the cost of data bouncing between services in different places. The platform’s answers are the ones already described: stream at every stage so audio starts on the first sentence, use model-driven turn detection rather than a silence timer, and colocate transcription, model, and synthesis so there are no cross-region hops.

~500–800 ms: the round-trip response time that feels conversational. Past it, a voice agent reads as laggy; human turn-taking gaps sit near 200 ms. (Source: Cloudflare voice-agents blog)
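One way to keep that budget honest is to write it down per stage and add it up. The sketch below does that for the stages named above. Every millisecond figure in it is a hypothetical placeholder rather than a measurement, and the 800 ms ceiling is simply the upper end of the range quoted here.

```typescript
// One stage of the voice loop and its share of the response budget.
interface LatencyStage {
  stage: string;
  ms: number;
}

// Hypothetical per-stage figures, for illustration only. Measure your own.
const exampleBudget: LatencyStage[] = [
  { stage: "network in", ms: 40 },
  { stage: "transcription and end-of-turn decision", ms: 260 },
  { stage: "model time to first token", ms: 250 },
  { stage: "synthesis time to first byte", ms: 150 },
  { stage: "network out and playback start", ms: 60 },
];

// Sum the stages into a perceived response time.
function totalLatencyMs(stages: LatencyStage[]): number {
  return stages.reduce((sum, s) => sum + s.ms, 0);
}

// True when the summed budget exceeds the conversational ceiling.
function exceedsCeiling(stages: LatencyStage[], ceilingMs: number): boolean {
  return totalLatencyMs(stages) > ceilingMs;
}

const warmTotalMs = totalLatencyMs(exampleBudget); // 760
// Adding a hypothetical 300 ms cold start pushes the first reply to 1060 ms.
const coldFirstTurn = [...exampleBudget, { stage: "cold start", ms: 300 }];
console.log(warmTotalMs, exceedsCeiling(coldFirstTurn, 800));
```

With these placeholder numbers the warm path fits under the ceiling and the cold first turn does not, which is the shape of the problem even if your real figures differ.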

Then there’s the honest caveat, and it’s specific to voice. Every voice agent is a Durable Object, and a Durable Object that’s been idle hibernates and is eventually evicted from memory; waking a cold one re-runs its setup and takes a beat. For most agent workloads that’s irrelevant, but for live voice it lands in the worst possible place: the first response of a call. The voice pipeline keeps the agent alive during an active call, so the risk is the cold start at the very beginning. You mitigate it the way you’d expect: keep high-traffic agents warm with a low-rate heartbeat, and pre-warm the specific agent at call setup, on the inbound webhook, before audio starts flowing. The full Durable Object lifecycle and cold-start behavior are AG-A’s subject; the point here is just that voice is where they matter most, and you design around them.
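The pre-warm step can be sketched generically. The helper below is hypothetical and deliberately platform-agnostic: ping stands for whatever cheap request wakes your specific agent, and the deadline keeps a slow wake-up from delaying the telephony provider's webhook response. How you actually address the agent from the webhook depends on your setup and is not shown.

```typescript
type PrewarmResult = "warm" | "timed-out" | "failed";

// Hypothetical helper: fire a wake-up request, but never wait longer than deadlineMs.
async function prewarmWithDeadline(
  ping: () => Promise<unknown>,
  deadlineMs: number,
): Promise<PrewarmResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<PrewarmResult>((resolve) => {
    timer = setTimeout(() => resolve("timed-out"), deadlineMs);
  });

  try {
    // Whichever settles first wins: the agent answering, or the deadline.
    return await Promise.race([
      ping().then((): PrewarmResult => "warm"),
      deadline,
    ]);
  } catch {
    // A failed ping should not fail call setup; the call can still proceed cold.
    return "failed";
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

In an inbound-call webhook you would call this before returning the response that starts the media stream, then log the result, so you can see how often calls begin cold.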

Why Cloudflare for voice agents

The argument is the colocation one, made concrete. A voice agent’s pipeline has a lot of moving parts — transport, transcription, the model, synthesis, the stateful agent, any tools it calls mid-conversation, and the work that happens after the call — and on Cloudflare those can all run on one global network, close to the caller, rather than scattered across services and regions whose round-trips are exactly what blow the latency budget. You get per-use, scale-to-zero economics (an idle agent costs effectively nothing, and WebSocket hibernation keeps connections cheap), and because conversation state lives in the agent, a voice call, a text chat, and a phone call can share one history and one set of tools without re-architecting. That’s the pillar’s real-time and action layer applied to speech: the same agent, with audio as the interface.

The honest trade-offs

  • The voice add-on is brand-new and experimental. The launch post and the docs don’t even agree on the label. Package names, defaults, and APIs will move; pin versions and date your assumptions.
  • Cold start hits the first response. Real, and worst for voice. Engineer keep-warm and pre-warm at call setup; it’s a solvable problem but not a free one.
  • Voice latency is genuinely hard. The edge removes cross-region hops, but you still own a real budget across transcription, model, and synthesis. There’s no setting that makes it disappear.
  • Telephony needs a third party, with a format gotcha. Phone numbers come from Twilio or Telnyx, and the default MP3 synthesis is incompatible with a phone line’s mulaw audio — switch the TTS format for telephony.
  • Naturalness is provider-dependent. How human the agent sounds depends on the chosen speech models, not on Cloudflare.
  • Live sessions cost real money at volume. Per-minute transcription plus per-character synthesis plus model tokens add up across concurrent calls — model the unit economics before you scale.
  • Vendor concentration and a maturing realtime stack. Durable Objects run only on Cloudflare, and the WebRTC/RealtimeKit path into the voice add-on is recent and still settling. The WebSocket path is solid today.

Concrete patterns

The browser path is the thirty-line agent above plus the useVoiceAgent hook. The phone path adds the Twilio adapter (with a PCM- or mulaw-capable TTS in place of the MP3 default). And because a voice agent is still an ordinary Agent, three things compose without re-derivation here: it can call MCP tools mid-conversation to look something up or complete a booking; it can trigger a durable Workflow after the call ends for transcription, summarization, and CRM updates; and it can hand off to the browser and sandbox action layer when it needs to act on the web. If you’re building a customer-facing phone line, our contact-center automation work is the business-side companion to this technical guide.

Frequently asked

How hard is it to get started?
The minimal voice agent is about thirty lines and needs no external API keys, because the default transcription and synthesis run on Workers AI. You can prove the full speech-in, speech-out loop in a browser before you touch telephony.
Can the voice agent answer a real phone number?
Yes, through a third-party telephony provider like Twilio or Telnyx — Cloudflare is not a telco. There is a first-party Twilio adapter; just remember to switch the default MP3 synthesis to a raw-audio format, since a phone line uses mulaw.
What makes a voice agent feel slow, and how do I fix it?
Latency, almost always. Keep responses streaming end to end so audio starts on the first sentence, lean on model-driven turn detection rather than a silence timer, keep the model replies concise, and keep the agent warm so a cold start does not delay the first response.
Is this production-ready?
The runtime underneath — Workers AI, the Agents SDK, Durable Objects — is mature. The @cloudflare/voice add-on itself is experimental/Beta and moving fast. If you need a contractually stable, fully GA voice SDK today, that is a reason to wait; otherwise build on it deliberately and pin your versions.

Working with Truvisory

If you’d rather have a real-time voice agent built and shipped — browser or phone, with the latency budget engineered and the cold-start handled — this is the work we do: senior-engineer-led, fixed-scope agent systems on Cloudflare, and we’ll tell you honestly where the experimental edges still are. See how we deliver agent systems, or read the pillar guide to the full stack.