AI Observability, Cost, and Evaluation on Cloudflare: How AI Gateway Stops You Flying Blind
Most teams ship an AI feature and then fly blind. The model call is a black box — you can’t see the prompts, the tokens, the latency, or the errors. The bill arrives at the end of the month as a surprise. And “is it actually any good?” gets answered by vibes, because nobody set up a way to measure it. Those are three different problems — visibility, cost, and quality — and on Cloudflare they have one front door: AI Gateway, which sits between your application and your model providers and turns all three into instrumented, controllable things with a one-line endpoint change, colocated with the models and storage on the same network. This is the hub for that story; the pieces below each go deep on one layer.
It sits under our broader argument for why teams build AI on Cloudflare — this is the operational half of that case: not just why you’d build here, but how you’d see, control, and trust what you built once it’s live.
The three problems you can’t see after you ship
Visibility. A call to a model provider is, by default, opaque. You fire a request into an API and get a response back, and unless you’ve built your own instrumentation you have no view into what prompt actually went out, how many tokens it burned, how long it took, whether it errored, or whether you’re paying to ask the same question a thousand times. Multiply that across several models and providers and you have a system whose behavior you’re guessing at.
Cost. Token spend is unpredictable in a way that surprises teams used to fixed infrastructure costs. A feature that looked cheap in testing scales to a bill nobody forecast, the damage is already done by the time the invoice lands, and you often can’t even attribute the spend — which model, which feature, which customer drove it. Without per-request cost visibility, budgeting AI is guesswork.
Quality. This is the one teams most often skip, because seeing requests feels like understanding them — and it isn’t. Instrumentation tells you what happened; it says nothing about whether the answer was right. Without an evaluation harness, quality regressions ship silently: a prompt change or a model swap quietly makes things worse, and you find out from users. The fix is evaluation, and it’s a genuinely different discipline from observability.
The Cloudflare answer
AI Gateway is a proxy that sits between your application and your model providers. You point your existing SDK at it by changing the base URL — one line — and from then on every request flows through a control plane that adds the things you’d otherwise build yourself. It’s been generally available (GA) since Developer Week in 2024, and the core of it is free.
On the observability side, the dashboard surfaces requests, token usage, an estimated cost, errors, and cache-hit rate, with per-model and per-provider breakdowns, and every request can be logged with its full prompt, response, token count, duration, and status. On cost, exact-match caching cuts the bill for repeated requests and rate limiting caps runaway spend. On resilience and flexibility, it retries failed calls, falls back across providers, and fronts more than twenty model providers behind one OpenAI-compatible endpoint, with bring-your-own-key support. It even has the beginnings of native evaluation — building datasets from your logs and scoring them on cost, speed, and human feedback. And alongside the gateway, the platform’s own tools — Workers Logs, Analytics Engine for custom metrics, and Logpush for export — fill in the full picture.
The reason to run all of this on Cloudflare is colocation. The gateway, the models on Workers AI, and the storage in R2 and D1 sit on one network, so observability, inference, caching, and audit storage aren’t stitched together across vendors. Cloudflare runs its own AI engineering on exactly this stack.
That’s the same control plane available to you, integrated with a single line of code rather than a separate platform to operate.
Three problems, three answers — keep them straight
The cleanest way to think about this cluster is as three distinct questions that people constantly run together:
- Observability is seeing it — requests, tokens, latency, errors, cache-hit rate, and full request and response logs.
- Cost control is predicting and reducing it — tracking token spend, caching to kill redundant calls, rate limiting to cap the bill, and routing to cheaper models where they’ll do.
- Evaluation is measuring quality — golden datasets, model-as-judge scoring, human feedback, and regression testing.
The single most common mistake is collapsing the first and the third. Seeing every request go by feels like quality control, but instrumentation only tells you what happened, not whether it was correct. They’re separate problems with separate tools, and the spokes below treat them that way.
The same layer that proves cost also proves compliance
Here’s the part that earns this cluster its keep for two very different buyers at once. A commercial team turns on AI Gateway logging to understand cost and quality. A regulated buyer needs an audit trail — a durable, tamper-evident record of what the AI did. It’s the same logging layer. Pair AI Gateway’s persistent logs with R2’s object-lock storage, where written records can’t be deleted or altered for a set retention period, and the visibility tooling you wanted for operational reasons becomes the evidence trail you need for compliance.
One honest boundary on that: this is a FedRAMP-aware architecture, not a FedRAMP-certified one. Cloudflare for Government carries FedRAMP authorization for a defined set of services, but AI Gateway and Workers AI are not on that authorized list as of this writing — Cloudflare has stated an intent to bring them into the boundary, but intent isn’t authorization. The audit pattern is sound and worth building on; the compliance claim has to stay accurate. The keystone spoke below builds this out properly, and deliberately keeps CMMC out of it.
Go deeper
This hub frames the three problems; each spoke owns one layer in depth.
- AI Gateway as your observability layer — the metrics, the logs, and what a full operational view actually looks like.
- Controlling AI cost on Cloudflare — the model-and-token economics: caching, rate limiting, smaller-model routing, and reading the cost numbers honestly. (For infrastructure cost — the Workers-versus-Lambda billing math and R2 egress — see our existing cost comparison; this spoke is about model spend, not compute.)
- Evaluating AI quality on Cloudflare — the home for the “build evals on day one” discipline: golden datasets, model-as-judge, offline versus online, regression testing.
- Caching and latency for AI on the edge — the performance side of caching and the latency budget.
- Multi-provider AI: failover, routing, and bring-your-own-key — provider-agnostic routing and not getting locked to one model vendor.
- Audit and logging for AI systems — the dual-track keystone: AI Gateway logs plus R2 object-lock as a compliance-ready evidence trail.
The honest trade-offs
- It adds a proxy hop. Every request now passes through the gateway, and turning on content-moderation guardrails adds roughly half a second while a safety model runs. Usually negligible against model latency, but real — measure it on your workload.
- Logs have a ceiling, and hitting it is a blind spot. There’s a cap on stored logs per gateway, and when it fills, new logs silently stop being saved unless you’ve enabled automatic deletion. The free tier holds fewer than the paid tier before you must delete or upgrade, and log export is a paid add-on. The thing you turned on for visibility can quietly go dark if you don’t manage it.
- The cost number is an estimate, not your bill. AI Gateway’s cost figure is computed from token counts; Cloudflare is explicit that you should treat it as an estimate and reconcile against your provider’s own dashboard for the real number. You can override it with negotiated rates, but it’s still an estimate.
- Several features are beta. Dynamic routing, data-loss-prevention scanning, guardrails, bring-your-own-key, and the native evaluations are all beta as of this writing — expect change and pin behavior.
- Observability is not evaluation. Seeing every request does not tell you whether the answers are good. You still need an eval harness, which is why it gets its own spoke.
- It is not a security or egress firewall. AI Gateway governs your calls to model APIs; it does not police what an agent’s tools reach out to, and its guardrails are content moderation, not prompt-injection defense. Securing agents is a separate subject we’ll treat on its own; don’t mistake this layer for it.
- Caching is exact-match only today. Semantically similar but differently-worded questions won’t hit the cache yet, which limits hit rates on naturally phrased queries. Semantic caching is on Cloudflare’s roadmap, not shipped.
- Consolidation concentrates risk. Routing everything through one provider is convenient and centralizes a dependency at the same time — weigh both.
Frequently asked
What is AI Gateway?
Is it free?
Will it tell me exactly what my AI is costing?
Does it evaluate quality for me?
Is it FedRAMP authorized?
Working with Truvisory
If you’ve shipped AI and you’re flying blind on what it’s doing, what it costs, or whether it’s any good — see how we build instrumented, observable AI systems on Cloudflare, with the cost and quality controls wired in from the start.
Truvisory is a Denver-based AI and automation consultancy run by a senior operator — a combat veteran and former PE-backed operating executive — who ships working software, not strategy decks. Cloudflare-native by default, for both AI delivery and the back-office automation where the ROI lives. Federal buyers: we’re SDVOSB set-aside eligible — see the federal AI modernization pillar.
