AI Observability, Cost, and Evaluation on Cloudflare: How AI Gateway Stops You Flying Blind

Tony Adams · 8 min read · June 2026

Most teams ship an AI feature and then fly blind. The model call is a black box: you can’t see the prompts, the tokens, the latency, or the errors. The bill arrives at the end of the month as a surprise. And “is it actually any good?” gets answered by vibes, because nobody set up a way to measure it. Those are three different problems (visibility, cost, and quality), and on Cloudflare they have one front door: AI Gateway. It sits between your application and your model providers and makes all three instrumented and controllable with a one-line endpoint change, colocated with the models and storage on the same network. This is the hub for that story; the pieces below each go deep on one layer.

This hub sits under our broader argument for why teams build AI on Cloudflare. It is the operational half of that case: not just why you’d build here, but how you’d see, control, and trust what you built once it’s live.

The short version

Shipping AI leaves you blind on three distinct problems: visibility (no observability), cost (surprise bills), and quality (vibes-based evaluation). They’re different problems, and they get different solutions.
AI Gateway is the control plane that addresses all three — analytics, logging, caching, rate limiting, retries and fallbacks, multi-provider routing — added with a one-line change to your SDK’s base URL. Its core features are free.
The same logging layer that gives a commercial team cost and quality visibility doubles as an audit trail for regulated buyers. That dual use is the most useful thing about it.
Be precise about the edges: the cost numbers are estimates, caching is exact-match only today, several features are still beta, and AI Gateway is not a security or egress firewall. Knowing the limits is part of using it well.

The three problems you can’t see after you ship

Visibility. A call to a model provider is, by default, opaque. You fire a request into an API and get a response back, and unless you’ve built your own instrumentation you have no view into what prompt actually went out, how many tokens it burned, how long it took, whether it errored, or whether you’re paying to ask the same question a thousand times. Multiply that across several models and providers and you have a system whose behavior you’re guessing at.

Cost. Token spend is unpredictable in a way that surprises teams used to fixed infrastructure costs. A feature that looked cheap in testing scales to a bill nobody forecast, the damage is already done by the time the invoice lands, and you often can’t even attribute the spend — which model, which feature, which customer drove it. Without per-request cost visibility, budgeting AI is guesswork.

Quality. This is the one teams most often skip, because seeing requests feels like understanding them — and it isn’t. Instrumentation tells you what happened; it says nothing about whether the answer was right. Without an evaluation harness, quality regressions ship silently: a prompt change or a model swap quietly makes things worse, and you find out from users. The fix is evaluation, and it’s a genuinely different discipline from observability.

The Cloudflare answer

AI Gateway is a proxy that sits between your application and your model providers. You point your existing SDK at it by changing the base URL — one line — and from then on every request flows through a control plane that adds the things you’d otherwise build yourself. It’s been generally available (GA) since Developer Week in 2024, and the core of it is free.
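To make the one-line change concrete, here is a minimal sketch that builds the gateway base URL you would hand to an SDK. The URL shape follows Cloudflare's documented pattern as we understand it at the time of writing, so check the current docs before relying on it. The account ID, the gateway name, and the helper `gateway_base_url` are placeholders for illustration.

```python
# Assumed URL pattern: https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/{provider}
GATEWAY_HOST = "https://gateway.ai.cloudflare.com/v1"


def gateway_base_url(account_id: str, gateway_id: str, provider: str) -> str:
    """Return the gateway endpoint that replaces a provider's default base URL."""
    parts = {"account_id": account_id, "gateway_id": gateway_id, "provider": provider}
    for name, value in parts.items():
        # Each part becomes one path segment, so reject empty values and slashes.
        if not value or "/" in value:
            raise ValueError(f"{name} must be a non-empty path segment")
    return f"{GATEWAY_HOST}/{account_id}/{gateway_id}/{provider}"


# Placeholder identifiers; substitute your own account ID and gateway name.
base_url = gateway_base_url("YOUR_ACCOUNT_ID", "my-gateway", "openai")

# With an OpenAI-style SDK the change would then look roughly like:
#   client = OpenAI(api_key=..., base_url=base_url)
```

Everything else in the application stays as it was; only the address the SDK talks to changes.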

On the observability side, the dashboard surfaces requests, token usage, an estimated cost, errors, and cache-hit rate, with per-model and per-provider breakdowns, and every request can be logged with its full prompt, response, token count, duration, and status. On cost, exact-match caching cuts the bill for repeated requests and rate limiting caps runaway spend. On resilience and flexibility, it retries failed calls, falls back across providers, and fronts more than twenty model providers behind one OpenAI-compatible endpoint, with bring-your-own-key support. It even has the beginnings of native evaluation — building datasets from your logs and scoring them on cost, speed, and human feedback. And alongside the gateway, the platform’s own tools — Workers Logs, Analytics Engine for custom metrics, and Logpush for export — fill in the full picture.
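The retry-and-fallback behavior is easier to reason about with a small model of it. The sketch below is our own illustration of the pattern, not AI Gateway's implementation: the gateway does this for you at the proxy. The function `call_with_fallback`, its parameters, and the provider callables are all hypothetical.

```python
import time
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider has exhausted its attempts."""


def call_with_fallback(
    providers: Sequence[tuple[str, Callable[[str], str]]],
    prompt: str,
    retries: int = 2,
    backoff_s: float = 0.0,
) -> tuple[str, str]:
    """Try each provider in order, retrying each one before moving to the next.

    Returns (provider_name, response) from the first call that succeeds.
    """
    errors = []
    for name, call in providers:
        for attempt in range(retries + 1):
            try:
                return name, call(prompt)
            except Exception as exc:  # a sketch: real code would catch narrower errors
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
                if attempt < retries:
                    # Exponential backoff between retries on the same provider.
                    time.sleep(backoff_s * (2 ** attempt))
    raise AllProvidersFailed("; ".join(errors))
```

The ordering of `providers` is the routing policy: the first entry is the primary, and later entries are only reached once the earlier ones have failed every attempt.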

The reason to run all of this on Cloudflare is colocation. The gateway, the models on Workers AI, and the storage in R2 and D1 sit on one network, so observability, inference, caching, and audit storage aren’t stitched together across vendors. Cloudflare runs its own AI engineering on exactly this stack.

~20M requests routed through AI Gateway on Cloudflare's own internal AI stack, alongside 241 billion tokens processed, serving 3,600+ internal users. (Source: Cloudflare blog, Agents Week 2026)

That’s the same control plane available to you, integrated with a single line of code rather than a separate platform to operate.

Three problems, three answers — keep them straight

The cleanest way to think about this cluster is as three distinct questions that people constantly run together:

Observability is seeing it — requests, tokens, latency, errors, cache-hit rate, and full request and response logs.
Cost control is predicting and reducing it — tracking token spend, caching to kill redundant calls, rate limiting to cap the bill, and routing to cheaper models where they’ll do.
Evaluation is measuring quality — golden datasets, model-as-judge scoring, human feedback, and regression testing.

The single most common mistake is collapsing the first and the third. Seeing every request go by feels like quality control, but instrumentation only tells you what happened, not whether it was correct. They’re separate problems with separate tools, and the spokes below treat them that way.
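To show how evaluation differs from observability, here is a minimal sketch of a golden-dataset regression check. It uses exact-match scoring as a deliberately simple stand-in for model-as-judge or human feedback. The dataset, the two stub "models", and the function names are invented for illustration; in practice the callables would wrap real model calls.

```python
from typing import Callable

# A tiny hypothetical golden set: prompts paired with known-good answers.
GOLDEN_SET = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "What is 2 + 2?", "expected": "4"},
    {"prompt": "What is the chemical formula for water?", "expected": "H2O"},
]


def normalize(text: str) -> str:
    """Ignore case and surrounding whitespace when comparing answers."""
    return text.strip().lower()


def accuracy(model: Callable[[str], str], dataset: list[dict]) -> float:
    """Fraction of golden cases the model answers exactly right."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    correct = sum(
        normalize(model(case["prompt"])) == normalize(case["expected"])
        for case in dataset
    )
    return correct / len(dataset)


def has_regressed(baseline, candidate, dataset, tolerance: float = 0.0) -> bool:
    """True if the candidate scores worse than the baseline by more than tolerance."""
    return accuracy(candidate, dataset) < accuracy(baseline, dataset) - tolerance


# Stub models standing in for "before" and "after" a prompt or model change.
def baseline_model(prompt: str) -> str:
    answers = {case["prompt"]: case["expected"] for case in GOLDEN_SET}
    return answers.get(prompt, "")


def candidate_model(prompt: str) -> str:
    if "water" in prompt:
        return "HO2"  # the kind of silent regression logs alone would not flag
    return baseline_model(prompt)
```

Both stub models would produce healthy-looking logs: requests succeed, latency is fine, tokens are counted. Only the comparison against known-good answers shows that the candidate got worse.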

The same layer that proves cost also proves compliance

Here’s the part that earns this cluster its keep for two very different buyers at once. A commercial team turns on AI Gateway logging to understand cost and quality. A regulated buyer needs an audit trail — a durable, tamper-evident record of what the AI did. It’s the same logging layer. Pair AI Gateway’s persistent logs with R2’s object-lock storage, where written records can’t be deleted or altered for a set retention period, and the visibility tooling you wanted for operational reasons becomes the evidence trail you need for compliance.

One honest boundary on that: this is a FedRAMP-aware architecture, not a FedRAMP-authorized one. Cloudflare for Government carries FedRAMP authorization for a defined set of services, but AI Gateway and Workers AI are not on that authorized list as of this writing. Cloudflare has stated an intent to bring them into the boundary, but intent isn’t authorization. The audit pattern is sound and worth building on; the compliance claim has to stay accurate. The keystone spoke below builds this out properly, and deliberately keeps CMMC out of it.

Go deeper

This hub frames the three problems; each spoke owns one layer in depth.

AI Gateway as your observability layer — the metrics, the logs, and what a full operational view actually looks like.
Controlling AI cost on Cloudflare — the model-and-token economics: caching, rate limiting, smaller-model routing, and reading the cost numbers honestly. (For infrastructure cost — the Workers-versus-Lambda billing math and R2 egress — see our existing cost comparison; this spoke is about model spend, not compute.)
Evaluating AI quality on Cloudflare — the home for the “build evals on day one” discipline: golden datasets, model-as-judge, offline versus online, regression testing.
Caching and latency for AI on the edge — the performance side of caching and the latency budget.
Multi-provider AI: failover, routing, and bring-your-own-key — provider-agnostic routing and not getting locked to one model vendor.
Audit and logging for AI systems — the dual-track keystone: AI Gateway logs plus R2 object-lock as a compliance-ready evidence trail.

The honest trade-offs

It adds a proxy hop. Every request now passes through the gateway, and turning on content-moderation guardrails adds roughly half a second while a safety model runs. Usually negligible against model latency, but real — measure it on your workload.
Logs have a ceiling, and hitting it is a blind spot. There’s a cap on stored logs per gateway, and when it fills, new logs silently stop being saved unless you’ve enabled automatic deletion. The free tier holds fewer than the paid tier before you must delete or upgrade, and log export is a paid add-on. The thing you turned on for visibility can quietly go dark if you don’t manage it.
The cost number is an estimate, not your bill. AI Gateway’s cost figure is computed from token counts; Cloudflare is explicit that you should treat it as an estimate and reconcile against your provider’s own dashboard for the real number. You can override it with negotiated rates, but it’s still an estimate.
Several features are beta. Dynamic routing, data-loss-prevention scanning, guardrails, bring-your-own-key, and the native evaluations are all beta as of this writing — expect change and pin behavior.
Observability is not evaluation. Seeing every request does not tell you whether the answers are good. You still need an eval harness, which is why it gets its own spoke.
It is not a security or egress firewall. AI Gateway governs your calls to model APIs; it does not police what an agent’s tools reach out to, and its guardrails are content moderation, not prompt-injection defense. Securing agents is a separate subject we’ll treat on its own; don’t mistake this layer for it.
Caching is exact-match only today. Semantically similar but differently-worded questions won’t hit the cache yet, which limits hit rates on naturally phrased queries. Semantic caching is on Cloudflare’s roadmap, not shipped.
Consolidation concentrates risk. Routing everything through one gateway is convenient, and it also centralizes a dependency in one place; weigh both.
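The exact-match caching limit in the list above is easy to see in a toy model. The sketch below keys a cache on a hash of the full request, so an identical request hits and a reworded one misses. The keying scheme and the `ExactMatchCache` class are assumptions made for illustration, not a description of how AI Gateway computes its cache keys.

```python
import hashlib
import json
from typing import Callable


class ExactMatchCache:
    """Toy cache keyed on the exact request contents."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model: str, messages: list[dict]) -> str:
        # Canonical JSON, so the same request always produces the same key.
        payload = json.dumps(
            {"model": model, "messages": messages},
            sort_keys=True,
            separators=(",", ":"),
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(
        self,
        model: str,
        messages: list[dict],
        call: Callable[[str, list[dict]], str],
    ) -> str:
        k = self.key(model, messages)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        response = call(model, messages)  # only reached on a cache miss
        self._store[k] = response
        return response
```

Asking "What is your refund policy?" twice yields one miss and one hit. Asking "How do refunds work?" yields another miss even though the intent is the same, which is why hit rates on naturally phrased queries tend to be lower than teams expect.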

Frequently asked

What is AI Gateway?

A proxy that sits between your app and your AI model providers. You change your SDK's base URL to point at it, and it adds observability, logging, caching, rate limiting, retries, and multi-provider routing without you building any of that yourself.

Is it free?

The core — the dashboard analytics, caching, and rate limiting — is free. You pay for model inference itself, for log volume above the included tier, and for log export. Budget for inference and overflow logging, not for the gateway.

Will it tell me exactly what my AI is costing?

It gives you a close estimate, computed from token usage, broken down by model and provider — which is enough to spot trends and runaway spend. For your actual invoice, reconcile against the provider's own billing; Cloudflare is clear the gateway number is an estimate.
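As a rough picture of how a token-based estimate works, the sketch below multiplies token counts by per-model rates and totals the result by provider and model. The record shape, the model names, and the prices are all hypothetical. Passing your own rate table stands in for overriding with negotiated rates, and the output is still an estimate to reconcile against the provider's invoice.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LogRecord:
    """Hypothetical per-request log entry with the fields needed for costing."""
    provider: str
    model: str
    input_tokens: int
    output_tokens: int


# Hypothetical rates in USD per million tokens: (input, output).
EXAMPLE_PRICES = {
    "small-model": (1.0, 2.0),
    "large-model": (10.0, 30.0),
}


def estimate_cost(records, prices):
    """Return estimated USD spend keyed by (provider, model)."""
    totals = defaultdict(float)
    for r in records:
        in_rate, out_rate = prices[r.model]  # KeyError if a model has no rate
        cost = (r.input_tokens * in_rate + r.output_tokens * out_rate) / 1_000_000
        totals[(r.provider, r.model)] += cost
    return dict(totals)


records = [
    LogRecord("provider-a", "small-model", 1_000_000, 500_000),
    LogRecord("provider-a", "small-model", 1_000_000, 500_000),
    LogRecord("provider-b", "large-model", 100_000, 100_000),
]
breakdown = estimate_cost(records, EXAMPLE_PRICES)
```

The breakdown is good for spotting which model is driving spend. It will drift from the real bill wherever the rate table or the token counts differ from what the provider actually charges for.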

Does it evaluate quality for me?

Partly, and it's early. It can build datasets from your logs and score them on cost, speed, and human thumbs-up/down, with automated judging signaled as coming. For real quality measurement you'll build an eval harness on the platform — that's what the evals spoke covers.

Is it FedRAMP authorized?

No. Treat the architecture as FedRAMP-aware, not FedRAMP-authorized. Cloudflare for Government holds authorization for a specific set of services, and AI Gateway and Workers AI aren't on that list today, though Cloudflare has signaled intent to add them. The audit pattern is sound; an authorization claim isn't, so don't make it.

Working with Truvisory

If you’ve shipped AI and you’re flying blind on what it’s doing, what it costs, or whether it’s any good — see how we build instrumented, observable AI systems on Cloudflare, with the cost and quality controls wired in from the start.

Truvisory is a Denver-based AI and automation consultancy run by a senior operator — a combat veteran and former PE-backed operating executive — who ships working software, not strategy decks. Cloudflare-native by default, for both AI delivery and the back-office automation where the ROI lives. Federal buyers: we’re SDVOSB set-aside eligible — see the federal AI modernization pillar.

Tony Adams is the founder of Truvisory®. He builds Cloudflare-native AI systems for federal and commercial clients. SBA-verified SDVOSB and VOSB, SAM.gov-registered.
