Latency Engineering for AI on Cloudflare: Cache the Hot Path, Stream the Rest, Route to Faster Models

Tony Adams 9 min read

AI applications feel slow for one reason: a model call takes seconds, and your user is staring at a spinner the whole time. You make them fast with four levers — cache the hot, repeated paths so they return from the edge in milliseconds instead of a full provider round-trip; stream responses so the first words appear almost immediately even when the full answer is still generating; route latency-sensitive work to faster models; and trim output, because output length dominates generation time. Then you prove each win with the one number the gateway records per request: duration. None of these is exotic, and all of them are a header or a binding away on Cloudflare.

This is the performance spoke of our AI observability, cost, and evaluation cluster. A fair caveat up front: the cluster’s named third axis is evaluation, not speed, so latency sits a little to the side of the triad. But it belongs in the same conversation, because the real optimization problem for any AI feature is balancing cost, quality, and speed at once — and speed is the lever this piece owns.

Caching for speed: what it actually buys you

On a cache hit, AI Gateway returns the stored response directly from Cloudflare’s edge, skipping the round-trip to the model provider entirely — the documentation lists reduced latency as the first benefit of caching. That’s the win, and it’s a big one in raw terms: milliseconds from a nearby edge location versus the multi-second wait for a model to generate from scratch. You’ll see Cloudflare’s marketing cite “up to 90%” latency reduction; treat that as a vendor claim rather than a measured benchmark for your workload, because the figure is generic caching copy, not an AI-Gateway-specific number, and your actual reduction depends entirely on your hit rate.

And hit rate is where honesty matters, because it caps everything. AI Gateway caching is exact-match only: the cache key is a hash of the entire request, so any difference at all — a changed word, a reordered parameter, a different system prompt — is a separate entry and a miss. Semantic caching, which would match similar-but-not-identical prompts, is not shipped; Cloudflare has said it’s planned. The practical consequence for latency is direct: caching makes hot, repeated, deterministic paths fast — a support bot answering the same bounded questions, a generation step with fixed inputs — and does almost nothing for free-form user queries where no two requests are byte-identical. (The cost side of this same mechanism — that a cache hit costs nothing in provider tokens — is the cost spoke’s territory; here it’s purely about speed.)
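
Because matching is exact, one practical way to raise the hit rate on a hot path is to make sure logically identical requests are also byte-identical before they leave your code. The sketch below assumes you build the JSON body yourself. `canonicalBody` and `sortKeys` are hypothetical helper names, not part of any Cloudflare API, and the way the gateway actually derives its cache key is its own business; this only controls what you send.

```typescript
// Hypothetical helper: serialize a request body the same way every time,
// so two logically identical requests do not differ by key order alone.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

function sortKeys(value: Json): Json {
  if (Array.isArray(value)) {
    // Array order is meaningful (e.g. message order), so keep it as-is.
    return value.map(sortKeys);
  }
  if (value !== null && typeof value === "object") {
    const sorted: { [key: string]: Json } = {};
    for (const key of Object.keys(value).sort()) {
      sorted[key] = sortKeys(value[key]);
    }
    return sorted;
  }
  return value;
}

function canonicalBody(body: Json): string {
  return JSON.stringify(sortKeys(body));
}

// Same request, different key order: identical bytes after canonicalization.
const first = canonicalBody({
  model: "example-model", // placeholder name for illustration
  max_tokens: 256,
  messages: [{ role: "user", content: "How do I reset my password?" }],
});
const second = canonicalBody({
  messages: [{ content: "How do I reset my password?", role: "user" }],
  max_tokens: 256,
  model: "example-model",
});

// A single changed word is still a different request, and would be a miss.
const third = canonicalBody({
  model: "example-model",
  max_tokens: 256,
  messages: [{ role: "user", content: "How do I change my password?" }],
});
```

Here `first` and `second` are the same string, while `third` differs. Canonicalization removes accidental differences; it does nothing for genuinely different wording, which is the gap semantic caching would fill.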

Two more realities to design around. The first request on any key is always a miss and pays full latency — caching only helps the Nth identical request, never the first. And the cache is volatile: if two identical requests arrive simultaneously, the first may not be cached in time for the second, so both miss. There’s no request coalescing that collapses in-flight duplicates into one. So the strategy is to cache deliberately — identify your genuinely hot, repeatable paths, set a TTL on them with cf-aig-cache-ttl (anywhere from sixty seconds to a month), and accept that the long tail of unique requests won’t benefit. Longer TTLs mean more hits and more speed, but staler answers — a fine trade for a stable FAQ, a bad one for anything time-sensitive or personalized.
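
A minimal sketch of "cache deliberately", assuming you decide TTLs per path in your own code and attach the header to the request you send through the gateway. The path names and TTL values are made up for illustration, and "a month" is taken here as 30 days, which is an assumption rather than a documented constant.

```typescript
// Bounds from the text above: sixty seconds to a month.
const MIN_TTL_SECONDS = 60;
const MAX_TTL_SECONDS = 30 * 24 * 60 * 60; // assumption: "a month" treated as 30 days

// Build the cf-aig-cache-ttl header, clamped into the allowed range.
function cacheTtlHeaders(ttlSeconds: number): Record<string, string> {
  const clamped = Math.min(
    MAX_TTL_SECONDS,
    Math.max(MIN_TTL_SECONDS, Math.round(ttlSeconds)),
  );
  return { "cf-aig-cache-ttl": String(clamped) };
}

// Hypothetical policy: only genuinely hot, repeatable paths get a TTL.
const TTL_BY_PATH: Record<string, number> = {
  "faq": 24 * 60 * 60,      // stable answers: a long TTL is a fine trade
  "product-summary": 3600,  // fixed inputs, refreshed hourly
};

// Paths not listed (free-form chat, personalized output) send no TTL header.
function headersForPath(path: string): Record<string, string> {
  const ttl = TTL_BY_PATH[path];
  return ttl === undefined ? {} : cacheTtlHeaders(ttl);
}
```

Spread the result into the headers of the request you send through the gateway. Keeping the policy in one table makes the freshness trade explicit for each path instead of leaving it to a global default.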

Streaming: the biggest win for how fast it feels

If you do one thing for AI latency, stream. Total generation time for a long answer is fixed by the model, but a user who sees the first words in a few hundred milliseconds experiences a fast app; a user who waits eight seconds for the whole thing to appear at once experiences a broken one, even when the total time is identical. This is the gap between actual and perceived latency. Decades of human-factors research put the limit for keeping a user’s flow of thought uninterrupted at about one second and the limit for holding attention at around ten seconds.

<1s: the response-time threshold that keeps a user's flow of thought uninterrupted. Streaming the first tokens is how an AI app stays on the right side of it (10s is the limit for holding attention). Source: Nielsen Norman Group, response-time limits.

Streaming is how you stay on the right side of that line. Workers AI streams every catalog LLM: set stream: true and tokens arrive as server-sent events as they’re generated rather than in one block at the end. AI Gateway passes that stream through in real time, and the requests are still fully logged. The architecture cooperates — Cloudflare Workers have no effective limit on how long a streaming response can run, so long generations aren’t cut off.

Here’s the trap that silently erases the entire benefit: streaming only helps if no layer buffers the whole response before sending it. It’s easy to wrap a stream in a framework helper that quietly accumulates every token and flushes them all at once after generation completes — at which point your time-to-first-token is back to the full multi-second wait and every byte arrives in a single burst, exactly as if you hadn’t streamed at all. The fix is to return the readable stream directly in your Response and keep the path uncacheable so nothing along the way re-buffers it. Verify it by watching the bytes arrive incrementally, not by trusting that stream: true did the job.
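
A minimal sketch of the non-buffering path, assuming a Workers AI binding whose `run` call returns a readable stream when `stream: true` is set. The `AiBinding` interface is declared locally so the example is self-contained, the model name is only an example, and `handleChat` and `timeToFirstChunkMs` are hypothetical names; you would call `handleChat` from your Worker's fetch handler.

```typescript
// Local stand-in for the Workers AI binding shape used in this sketch.
interface AiBinding {
  run(model: string, inputs: Record<string, unknown>): Promise<ReadableStream<Uint8Array>>;
}

const MODEL = "@cf/meta/llama-3.1-8b-instruct"; // example model name; substitute your own

// Hand the stream straight to the Response: nothing here reads or accumulates it.
function sseResponse(stream: ReadableStream<Uint8Array>): Response {
  return new Response(stream, {
    headers: {
      "content-type": "text/event-stream",
      "cache-control": "no-store", // keep the path uncacheable so nothing re-buffers it
    },
  });
}

async function handleChat(prompt: string, ai: AiBinding): Promise<Response> {
  const stream = await ai.run(MODEL, {
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });
  // Anti-pattern to avoid: `await new Response(stream).text()` would wait for
  // the whole generation and turn this back into a single late burst.
  return sseResponse(stream);
}

// Verification aid: how long until the first chunk actually arrives?
async function timeToFirstChunkMs(res: Response): Promise<number> {
  if (!res.body) throw new Error("response has no body to stream");
  const started = Date.now();
  const reader = res.body.getReader();
  await reader.read();   // resolves as soon as the first chunk is available
  await reader.cancel(); // this check does not need the rest of the stream
  return Date.now() - started;
}
```

If `timeToFirstChunkMs` against your deployed endpoint comes back close to the total generation time, some layer between the model and the client is buffering.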

The other latency levers

Beyond caching and streaming, a few more knobs are worth knowing.

Faster models. Cloudflare publishes speed-optimized variants — model names carrying -fast or fp8-fast — and has shipped inference improvements (speculative decoding, prefix caching) that it credits with speeding up generation on its Llama models by meaningful multiples. Beyond its own catalog, the same one-line gateway reaches specialist fast-inference providers like Groq and Cerebras, which independent benchmarks rank among the fastest available for open models. The point for this spoke is the rationale — when a task is latency-sensitive and a smaller or faster-served model is good enough, speed is a legitimate reason to route to it. The mechanics of routing and failover are the routing spoke’s subject; the quality question — whether the faster model is actually good enough — is the evals spoke’s, and you should answer it before you cut over, because a speed win that quietly degrades answers is a regression, not an improvement.

Edge proximity. Cloudflare runs in hundreds of cities and reaches the large majority of internet users within tens of milliseconds, and Workers start in single-digit milliseconds on V8 isolates rather than booting a container. When your inference runs on Workers AI, your Worker calls the model through a binding on Cloudflare’s own network, with no extra hop out to a third-party provider. The deeper compute-latency comparison (Workers versus a traditional serverless platform for AI inference, cold starts included) lives in our Workers-versus-Lambda inference piece; here it’s enough to note that running close to the user removes network latency that a centralized deployment pays on every request.

Output size. Output tokens are generated one after another, so a long answer is a slow answer. Capping max_tokens and asking for concise output is one of the most reliable latency reductions available. The cost spoke pulls this same knob for a different reason; this piece covers the speed side and leaves the cost side to that spoke.
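
The arithmetic behind that is simple enough to sketch. Assuming tokens come out at a roughly steady rate after the first one, total time is about the first-token delay plus output tokens divided by throughput. The throughput and first-token figures below are illustrative placeholders, not measurements of any model, and `estimateGenerationMs` and `withOutputCap` are hypothetical helper names.

```typescript
// Rough model: time to first token, then tokens arrive at a steady rate.
function estimateGenerationMs(
  outputTokens: number,
  tokensPerSecond: number,
  firstTokenMs: number,
): number {
  return firstTokenMs + (outputTokens / tokensPerSecond) * 1000;
}

// Add an output cap to whatever inputs you already send to the model.
function withOutputCap(
  inputs: Record<string, unknown>,
  maxTokens: number,
): Record<string, unknown> {
  return { ...inputs, max_tokens: maxTokens };
}

// Illustrative numbers only: 50 tokens/s and 300 ms to first token.
const uncappedMs = estimateGenerationMs(800, 50, 300); // 300 + 16000 = 16300
const cappedMs = estimateGenerationMs(200, 50, 300);   // 300 + 4000 = 4300

const cappedInputs = withOutputCap(
  { messages: [{ role: "user", content: "Summarize this ticket in two sentences." }] },
  200,
);
```

Under these assumed rates, cutting the answer from 800 to 200 tokens takes about twelve seconds off the total, which is why the cap and a prompt that asks for brevity work together.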

Retries and timeouts. The gateway can automatically retry failed upstream requests — configurable attempts, delay, and backoff via cf-aig-max-attempts, cf-aig-retry-delay, and cf-aig-backoff — and you can set a request timeout with cf-aig-request-timeout that’s measured against when the first part of the response comes back, which is streaming-friendly. The honest framing: retries buy reliability, but they add latency on the failure paths, so they’re a tradeoff to tune, not a free win.
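
A sketch of tuning that tradeoff, assuming the delay and timeout headers take milliseconds and the backoff header takes one of `constant`, `linear`, or `exponential`; check those details against the current gateway docs before relying on them. The backoff formulas in `worstCaseBackoffMs` are my assumption about how each mode grows, used only to estimate the waiting a failure path could add, and both function names are hypothetical.

```typescript
type Backoff = "constant" | "linear" | "exponential";

interface RetryPolicy {
  maxAttempts: number;      // total attempts, including the first
  retryDelayMs: number;     // base delay between attempts (assumed milliseconds)
  backoff: Backoff;
  requestTimeoutMs: number; // measured to the first part of the response
}

// Translate a policy into the gateway's request headers.
function retryHeaders(policy: RetryPolicy): Record<string, string> {
  return {
    "cf-aig-max-attempts": String(policy.maxAttempts),
    "cf-aig-retry-delay": String(policy.retryDelayMs),
    "cf-aig-backoff": policy.backoff,
    "cf-aig-request-timeout": String(policy.requestTimeoutMs),
  };
}

// Estimated waiting between attempts if every attempt but the last fails.
// ASSUMPTION: constant = d, linear = d * n, exponential = d * 2^(n-1) for retry n.
function worstCaseBackoffMs(policy: RetryPolicy): number {
  let total = 0;
  for (let retry = 1; retry < policy.maxAttempts; retry++) {
    if (policy.backoff === "constant") total += policy.retryDelayMs;
    else if (policy.backoff === "linear") total += policy.retryDelayMs * retry;
    else total += policy.retryDelayMs * Math.pow(2, retry - 1);
  }
  return total;
}

const interactive: RetryPolicy = {
  maxAttempts: 2,
  retryDelayMs: 250,
  backoff: "constant",
  requestTimeoutMs: 5000,
};
```

For a latency-sensitive path, a policy like `interactive` keeps the extra waiting small; a background job can afford more attempts and a steeper backoff.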

Measuring the win

You can’t tune what you can’t see, and the gateway records the number you need: each request’s duration is captured in its log, alongside a cf-aig-cache-status header that tells you HIT or MISS. Pair those two and the latency story becomes concrete — compare hit durations against miss durations to quantify what caching is actually saving, and compare durations across models and providers to see which is fastest for your traffic. One thing to expect: latency is not a headline number on the analytics dashboard, which leads with requests, tokens, cost, errors, and cache-hit rate; duration lives at the per-request log level. The mechanics of reading and exporting all this belong to the observability spoke — this spoke is about what to do with the number once you have it.
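
Pairing the two signals can be as small as the sketch below. It assumes you have already pulled per-request records into an array; the `LogEntry` shape and field names are hypothetical, not the gateway's actual log schema. It uses medians because a handful of very slow misses would skew an average.

```typescript
// Hypothetical record shape: one entry per logged request.
interface LogEntry {
  durationMs: number;
  cacheStatus: "HIT" | "MISS";
}

function median(values: number[]): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

interface CacheSummary {
  hitRate: number;
  hitMedianMs: number;
  missMedianMs: number;
  savedPerRequestMs: number; // versus every request being a miss
}

function summarizeCache(entries: LogEntry[]): CacheSummary {
  const hits = entries.filter((e) => e.cacheStatus === "HIT").map((e) => e.durationMs);
  const misses = entries.filter((e) => e.cacheStatus === "MISS").map((e) => e.durationMs);
  const hitRate = entries.length === 0 ? 0 : hits.length / entries.length;
  const hitMedianMs = median(hits);
  const missMedianMs = median(misses);
  // Without both hits and misses there is nothing to compare yet.
  const comparable = hits.length > 0 && misses.length > 0;
  const savedPerRequestMs = comparable ? hitRate * (missMedianMs - hitMedianMs) : 0;
  return { hitRate, hitMedianMs, missMedianMs, savedPerRequestMs };
}

// Made-up sample data for illustration.
const sample: LogEntry[] = [
  { durationMs: 20, cacheStatus: "HIT" },
  { durationMs: 30, cacheStatus: "HIT" },
  { durationMs: 2000, cacheStatus: "MISS" },
  { durationMs: 3000, cacheStatus: "MISS" },
];
const summary = summarizeCache(sample);
```

The `savedPerRequestMs` figure is the hit rate multiplied by the hit-versus-miss gap, which makes the earlier point concrete: a large per-hit saving still averages out to very little if the hit rate is low. The same grouping works for comparing models or providers if you key on those fields instead of cache status.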

The honest trade-offs

  • Exact-match caching caps the benefit. Semantic caching isn’t shipped, so the latency win is real only for hot, repeated, deterministic paths — not free-form queries.
  • The first request is always slow. Cold cache means a miss and full latency; caching helps the Nth identical request, never the first.
  • Concurrent duplicates can both miss. The cache is volatile and there’s no in-flight coalescing.
  • The gateway adds a proxy hop. A third-party integration doc estimates roughly ten to fifty milliseconds for routing through Cloudflare’s network; that’s an illustrative outside estimate, not a Cloudflare service guarantee, and usually small against a multi-second model call.
  • Longer TTLs trade freshness for speed — wrong for time-sensitive or personalized content.
  • Faster models can be worse. Pair any speed-motivated model swap with evaluation; quality-check before you ship it.
  • Streaming improves perceived, not actual, latency — and a buffering misconfiguration erases even that.
  • Retries add latency on failure paths — a reliability/latency tradeoff, not a free lunch.
  • Most latency is the model’s, not Cloudflare’s. The biggest single factor is which model you call and how it’s served, much of which is outside Cloudflare’s control — especially for off-platform frontier models, where you pay both their token fees and their latency.
  • Vendor concentration. Putting one gateway in front of all your AI traffic is convenient, and it also makes that gateway a single point of dependency worth naming.

Concrete patterns

Cache a hot path for speed by sending cf-aig-cache-ttl: 3600, then read cf-aig-cache-status on the response to confirm hits and to measure the duration difference between a hit and a miss. Stream a Workers AI response by passing stream: true and returning the readable stream directly in your Response with a text/event-stream content type — and resist wrapping it in anything that might buffer, since that’s what kills time-to-first-token. Cap latency on long answers by setting a sensible max_tokens. And where reliability matters, configure cf-aig-max-attempts and a cf-aig-request-timeout, knowing the timeout keys off the first byte so it plays well with streaming. Keep the caching deliberate, verify streaming actually streams, and let the per-request duration tell you whether each change moved the number.

Frequently asked

What's the single biggest thing I can do to make my AI app feel faster?
Stream the response. Total generation time is fixed by the model, but showing the first tokens in a few hundred milliseconds is the difference between a responsive app and a broken-feeling one — as long as nothing in your stack buffers the stream before sending it.
Does AI Gateway caching make my app faster?
On a cache hit, yes — the response comes from Cloudflare's edge in milliseconds instead of a full provider round-trip. But matching is exact-only (no semantic caching yet), so it speeds up repeated, deterministic requests and does little for varied, naturally-worded ones. Check your cf-aig-cache-status hit rate before counting on it.
How much latency does the gateway itself add?
A third-party integration doc estimates roughly ten to fifty milliseconds for the proxy hop — an outside estimate, not a Cloudflare guarantee, and typically small next to a multi-second model call. On a cache hit, the gateway makes you net faster, not slower.
Can I just route everything to a faster model?
You can route latency-sensitive work to a faster or smaller model through the same gateway, but check quality first — a faster model that gives worse answers is a regression. Decide the routing on measured quality, not just speed.
Where do I see my latency numbers?
In the per-request logs, as a duration field; it's not a headline dashboard metric. Pair duration with the cache-status header to quantify what caching saves and to compare model and provider speed.

Working with Truvisory

If your AI feature is fast enough in a demo but slow in production, see how we build low-latency AI systems on Cloudflare — caching the hot paths, streaming the rest, and measuring every change against real per-request timings.

Truvisory is a Denver-based AI and automation consultancy run by a senior operator — a combat veteran and former PE-backed operating executive — who ships working software, not strategy decks. Cloudflare-native by default, for both AI delivery and the back-office automation where the ROI lives.