AI Evals on Cloudflare: How to Measure Whether Your AI Is Actually Good
Observability tells you what your AI did — the prompts, tokens, latency, cost, and errors. Evaluation tells you whether what it did was any good. Those are two different disciplines, and you need both. You measure quality with golden datasets and an automated judge run before every deploy (offline, as a regression test), plus human feedback and guardrail checks on live traffic (online). On Cloudflare you get a small set of native evaluators — cost, speed, and thumbs-up human feedback — plus the raw building blocks to assemble the rest yourself: a Workers AI model as your judge, R2 or D1 to hold golden datasets, Workflows to run durable eval jobs, and the AI Gateway feedback API to capture production signal. This is the deep dive on how to do that — and on where Cloudflare’s native evals genuinely fall short.
This is the evaluation spoke of our AI observability, cost, and evaluation cluster — the “measure quality” layer that both the observability spoke and the cost spoke point to. It’s the highest-leverage thing on this whole list, and the one most teams skip.
Observability tells you what happened; evals tell you if it was good
This is the distinction the whole cluster turns on, so it’s worth being precise. The observability spoke covers seeing your traffic — every request’s prompt, response, token count, latency, and cost, logged and queryable. That answers what happened. It does not answer was the answer right — a response can be fast, cheap, well-formed, and completely wrong, and observability will show you a clean green row for it. Quality is a separate measurement, and you only get it by evaluating outputs against some notion of what “good” means.
Why this matters in practice: the moment you change anything — edit a system prompt, upgrade a model, adjust a retrieval step — you risk a silent regression, where the metrics still look healthy but the answers got worse. Fluency hides it; a confident, fluent answer reads fine until someone checks whether it’s true. The only defense is systematic evaluation, run on every change, which is why the recurring advice across our agents work is to build evals on day one rather than bolting them on after the first quality incident. Vibes-based testing — a few people eyeballing outputs and judging by tone — is how those incidents happen.
The bridge between the two disciplines is one field. Every AI Gateway request returns a log ID (cf-aig-log-id), and you attach an evaluation — a score, a thumbs-up, structured feedback — to a specific call by that ID. The logging mechanics belong to the observability spoke; here, that ID is the hook evaluation hangs on.
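As a minimal sketch of that hook, the helper below pulls the log ID out of a gateway response's headers so you can store it next to the conversation turn it belongs to. The `HeaderReader` type and the `requireLogId` name are illustrative assumptions; the only detail taken from the text above is the `cf-aig-log-id` header name.

```typescript
// Minimal structural type so the sketch works with a fetch Response's headers
// or with a test stub. (Hypothetical helper type, not a Cloudflare API.)
interface HeaderReader {
  get(name: string): string | null;
}

// Return the AI Gateway log ID for a response, or throw if it is missing.
function requireLogId(headers: HeaderReader): string {
  const logId = headers.get("cf-aig-log-id");
  if (logId === null || logId.trim() === "") {
    throw new Error("No cf-aig-log-id header: was this request routed through AI Gateway?");
  }
  return logId.trim();
}
```

In a Worker you would call something like `requireLogId(response.headers)` right after the gateway call and persist the result, so that a later thumbs-up from the user can be attached to the exact request that produced the answer.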
What AI Gateway’s native evals give you today
Be clear-eyed about the current state, because it’s the thing most write-ups get wrong. AI Gateway has a native Evaluations feature, and it’s in open beta with exactly three evaluators: Cost (average inference cost across a dataset), Speed (average request duration), and Human feedback (the percentage of thumbs-up on the logs you’ve annotated). You build a dataset by filtering your logs — by model, provider, status, cost, metadata, and so on — and run the evaluators against it; results show the metrics and how many logs fed each one.
Here’s the honest callout, and it’s load-bearing: an automated or LLM-as-judge evaluator has not shipped. Cloudflare has signaled that richer evaluators using a model as a judge are coming, but as of this writing the native set is still cost, speed, and human thumbs-up, and the documentation still describes additional evaluators as a future addition. So if you want anything beyond those three — semantic correctness, faithfulness, rubric scores, regression gates — you build it yourself or integrate a dedicated tool. Treat that as the current reality and re-check it before you rely on it, because it’s exactly the kind of thing Cloudflare ships on a monthly cadence.
A few sharp edges worth knowing while you’re in there. Native datasets are capped at ten per gateway, and the dataset filters use AND logic — one value per filter, so you can’t build “model A or model B” in a single set. Datasets refresh as new logs match the filter, but evaluations don’t re-run automatically; you create a fresh evaluation to score new logs. And the feedback that drives the Human-feedback metric is captured three ways — from the dashboard, from a PATCH on the log ID via the API, or from the Worker binding patchLog() — where feedback is -1, 0, or 1, alongside an optional 0–100 score and a metadata object. One precise point: the native metric is computed from the binary thumbs-up/down only; the 0–100 score is stored but not surfaced as a native evaluator, so score-based dashboards are something you build.
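Since score-based dashboards are left to you, here is a small sketch of the aggregation you might run over feedback you have mirrored into your own store. The `FeedbackRecord` and `FeedbackSummary` types are hypothetical, and excluding unrated logs (feedback `0`) from the thumbs-up denominator is an assumption about how the percentage is defined, not a documented formula.

```typescript
// Shape of the feedback fields described above. The record type itself is a
// hypothetical local type, for example rows you mirror into D1.
interface FeedbackRecord {
  feedback: -1 | 0 | 1; // -1 negative, 0 not evaluated, 1 positive
  score?: number;       // optional 0-100 score
}

interface FeedbackSummary {
  thumbsUpPct: number | null; // share of positive votes among rated logs
  meanScore: number | null;   // average of the 0-100 scores that were supplied
  rated: number;              // logs with feedback of -1 or 1
}

function summarizeFeedback(records: FeedbackRecord[]): FeedbackSummary {
  let up = 0;
  let rated = 0;
  let scoreSum = 0;
  let scored = 0;
  for (const r of records) {
    if (r.feedback !== 0) {
      rated += 1;
      if (r.feedback === 1) up += 1;
    }
    if (typeof r.score === "number") {
      scoreSum += r.score;
      scored += 1;
    }
  }
  return {
    thumbsUpPct: rated === 0 ? null : (100 * up) / rated,
    meanScore: scored === 0 ? null : scoreSum / scored,
    rated,
  };
}
```

Returning `null` rather than `0` when nothing has been rated keeps an empty dataset from looking like a bad one.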
```typescript
// Capture production signal: attach feedback to a specific call by its log id
await env.AI.gateway("my-gateway").patchLog(logId, {
  feedback: 1, // -1 negative, 1 positive, 0 not evaluated
  score: 92, // 0-100, stored but not a native metric
  metadata: { reviewer: "ops" }
});
```

Building the rest yourself
Because the native evals stop at three, the substance of evaluation on Cloudflare is the build-it-yourself pattern — and the pieces are all on the platform, which is the real argument for doing it here. You need a judge, a dataset, and a runner.
The judge is a model that scores an output against a rubric. Run a Workers AI model with JSON mode so it returns a structured verdict — a score and its reasoning — rather than prose you have to parse, or route the judge call through AI Gateway to a frontier model if you want a stronger evaluator. The golden dataset — trusted inputs paired with ideal or reference outputs — lives in R2 as JSON or JSONL for large sets, or in D1 (serverless SQLite with native JSON functions) when you want to query inputs and results as structured rows. The runner is Cloudflare Workflows: a durable execution engine where each step’s result is persisted, so an eval job that iterates a dataset, generates a candidate, judges it, and stores the score survives failures and resumes mid-run instead of restarting. (The Workflows internals — steps, retries, sleeps — are the subject of our agents cluster’s durable-execution spoke; here it’s simply the eval runner.)
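To make the structured verdict concrete, the sketch below defines a verdict shape, a JSON Schema for it, and a validator for whatever the judge returns. The field names, the 1-5 scale, and the idea of passing the schema as `response_format: { type: "json_schema", json_schema: verdictSchema }` are assumptions to check against the current Workers AI JSON mode documentation.

```typescript
// Illustrative verdict shape: an integer score from 1 to 5 plus the judge's reasoning.
interface Verdict {
  score: number;
  reasoning: string;
}

// JSON Schema for the verdict, intended to be handed to the judge call so the
// model is constrained to this structure (assumed usage, see the note above).
const verdictSchema = {
  type: "object",
  properties: {
    score: { type: "integer", minimum: 1, maximum: 5 },
    reasoning: { type: "string" },
  },
  required: ["score", "reasoning"],
} as const;

// Validate the judge's output instead of trusting it; accepts a JSON string or an object.
function parseVerdict(raw: unknown): Verdict {
  const value: unknown = typeof raw === "string" ? JSON.parse(raw) : raw;
  if (typeof value !== "object" || value === null) {
    throw new Error("Judge verdict is not an object");
  }
  const { score, reasoning } = value as Record<string, unknown>;
  if (typeof score !== "number" || !Number.isInteger(score) || score < 1 || score > 5) {
    throw new Error("Judge score must be an integer from 1 to 5");
  }
  if (typeof reasoning !== "string" || reasoning.length === 0) {
    throw new Error("Judge reasoning must be a non-empty string");
  }
  return { score, reasoning };
}
```

Validating on the way in matters even with JSON mode: a verdict that fails to parse should count as a failed eval step to retry, not as a silent zero.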
```typescript
// Offline eval loop as a Workflow (conceptual)
const cases = await step.do("load-golden-set", () => loadFromR2(env));
for (const c of cases) {
  const out = await step.do(`generate-${c.id}`, () => generate(env, c.input));
  const score = await step.do(`judge-${c.id}`, () => judge(env, c.input, out, c.reference));
  await step.do(`persist-${c.id}`, () => saveToD1(env, c.id, score));
}
// fail the run if the aggregate score drops below your regression threshold
```

That loop is the heart of offline evaluation: run it in CI on every prompt or model change, and fail the build if the aggregate score regresses.
The concepts you need
Offline versus online. These are the two halves of a real eval practice, and Cloudflare gives you a home for each:
| | Offline eval | Online eval |
|---|---|---|
| When | Before deploy, on every change | In production, on live traffic |
| Against | A fixed golden dataset | Real user requests |
| Signal | An automated judge and metrics | Human feedback, guardrail checks |
| On Cloudflare | Workers + Workflows | The AI Gateway feedback API |
| Catches | Regressions before they ship | Drift and real-world failures |
Golden datasets are the foundation, and they’re real work — trusted inputs with ideal outputs, usually human-labeled, versioned, and refreshed with hard cases pulled from production. This is not a one-afternoon task; teams routinely invest dozens of hours building a few hundred well-chosen cases, and that dataset is what later lets you swap to a cheaper model and prove quality held.
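A golden set stored as JSONL is easy to load and check before a run. The sketch below assumes one JSON object per line with `id`, `input`, and `reference` fields; the `GoldenCase` type and field names are illustrative, chosen to match the eval loop shown earlier.

```typescript
// Hypothetical golden-case shape (id, input, reference).
interface GoldenCase {
  id: string;
  input: string;
  reference: string;
}

// Parse JSONL text (one JSON object per line), rejecting malformed or duplicate cases.
function parseGoldenSet(jsonl: string): GoldenCase[] {
  const cases: GoldenCase[] = [];
  const seen = new Set<string>();
  jsonl.split("\n").forEach((line, index) => {
    if (line.trim() === "") return; // tolerate blank lines
    const row = (JSON.parse(line) ?? {}) as Partial<GoldenCase>;
    if (
      typeof row.id !== "string" ||
      typeof row.input !== "string" ||
      typeof row.reference !== "string"
    ) {
      throw new Error(`Line ${index + 1}: expected string fields id, input, reference`);
    }
    if (seen.has(row.id)) {
      throw new Error(`Line ${index + 1}: duplicate id ${row.id}`);
    }
    seen.add(row.id);
    cases.push({ id: row.id, input: row.input, reference: row.reference });
  });
  return cases;
}
```

Failing loudly on duplicates and missing fields is deliberate: a quietly corrupted golden set undermines every score computed from it. Unique IDs also matter because the eval loop uses them to name its durable steps.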
LLM-as-judge, and its biases. Using a model to grade outputs scales evaluation, and a strong judge can agree with human preferences well over 80% of the time — about the rate at which humans agree with each other. But it is not ground truth, and it has measured biases you have to design around: position bias (favoring the first answer shown), verbosity bias (favoring longer answers), and self-preference (favoring its own family’s style). Position bias in particular is large — in one study, simply reordering two responses flipped the verdict, letting a weaker model “beat” a stronger one on 66 of 80 queries purely from order. The mitigations are concrete: swap the order and only count a win if it holds both ways, score against a reference answer, and use a different model family (or an ensemble) as the judge rather than the model you’re evaluating. Judges also drift — a minor model upgrade can silently shift scores while your CI gate keeps passing — so recalibrate periodically.
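The order-swap mitigation is simple enough to sketch. Here the judge is injected as a function, so the same logic works whether it wraps a Workers AI call or a frontier model behind the gateway. The `PairwiseJudge` signature and the `debiasedCompare` name are assumptions made for illustration.

```typescript
type Winner = "A" | "B" | "tie";

// A pairwise judge: given a question and two answers in display order, it says
// whether the first or the second is better. In practice this wraps a model call.
type PairwiseJudge = (
  question: string,
  first: string,
  second: string,
) => Promise<"first" | "second" | "tie">;

// Ask twice with the order swapped; only count a win if both orderings agree.
async function debiasedCompare(
  judge: PairwiseJudge,
  question: string,
  a: string,
  b: string,
): Promise<Winner> {
  const forward = await judge(question, a, b); // A shown first
  const reverse = await judge(question, b, a); // B shown first
  if (forward === "first" && reverse === "second") return "A";
  if (forward === "second" && reverse === "first") return "B";
  return "tie"; // disagreement across orderings is treated as no verdict
}
```

A judge that always favors whichever answer it sees first ends up with a tie under this scheme, which is the point. Order-swapping does nothing about verbosity bias, though: a judge that consistently prefers the longer answer gives the same verdict both ways, so reference scoring and a different judge family are still needed.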
Regression testing ties it together: the golden set plus the judge, run on every change, with a pass/fail threshold wired into CI. That’s what turns “we think this prompt is better” into evidence.
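One way to wire that threshold, sketched under assumed names and numbers: the gate passes only if the mean judge score clears an absolute floor and has not dropped more than a tolerated amount from the last accepted baseline. The `regressionGate` function and its parameters are illustrative; the thresholds are yours to pick.

```typescript
interface GateResult {
  passed: boolean;
  mean: number;
  reason: string;
}

// Pass only if the mean score clears `floor` AND has not dropped more than
// `maxDrop` below `baselineMean` (the mean from the last accepted run).
function regressionGate(
  scores: number[],
  baselineMean: number,
  floor: number,
  maxDrop: number,
): GateResult {
  if (scores.length === 0) {
    return { passed: false, mean: NaN, reason: "no scores: the eval run produced nothing" };
  }
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  if (mean < floor) {
    return { passed: false, mean, reason: `mean ${mean.toFixed(2)} is below floor ${floor}` };
  }
  const drop = baselineMean - mean;
  if (drop > maxDrop) {
    return { passed: false, mean, reason: `mean dropped ${drop.toFixed(2)} from baseline` };
  }
  return { passed: true, mean, reason: "ok" };
}
```

In CI, a failed result would translate into a non-zero exit code so the build stops. Treating an empty score list as a failure guards against an eval job that crashed early and would otherwise pass trivially.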
Evaluating RAG and agents
These are harder than scoring a single response, and it’s worth saying so plainly. For retrieval-augmented generation, you score two things separately, because they fail separately: retrieval quality (did the system fetch the right context?) and generation quality (was the answer grounded in that context and relevant to the question?). The widely used RAG triad names these as context relevance, faithfulness or groundedness, and answer relevance. An answer can be fluent and relevant yet unfaithful, confidently stating something the retrieved context doesn’t support, which is exactly the failure RAG eval exists to catch. On Cloudflare you pull the retrieved context from AI Search or Vectorize and judge faithfulness against it; the RAG internals themselves are covered in our agents cluster’s memory and grounding spoke, which argues at length for building RAG evals from the start.
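Scoring the two stages separately pays off when something fails, because you can attribute the failure. The sketch below assumes your judge produces the three triad scores normalized to the 0 to 1 range; the `RagScores` type, the `diagnoseRag` name, and the 0.7 threshold are all illustrative.

```typescript
// The three RAG-triad scores, each assumed to be normalized to 0..1 by your judge.
interface RagScores {
  contextRelevance: number; // did retrieval fetch the right context?
  faithfulness: number;     // is the answer supported by that context?
  answerRelevance: number;  // does the answer address the question?
}

type RagDiagnosis = "ok" | "retrieval" | "generation";

// Attribute a failure to the stage that caused it. Retrieval is checked first
// because generation cannot be judged fairly on top of bad context.
function diagnoseRag(scores: RagScores, threshold = 0.7): RagDiagnosis {
  if (scores.contextRelevance < threshold) return "retrieval";
  if (scores.faithfulness < threshold || scores.answerRelevance < threshold) {
    return "generation";
  }
  return "ok";
}
```

Counting diagnoses across the golden set shows where to spend effort: a pile of `"retrieval"` results points at chunking or indexing, while `"generation"` points at the prompt or the model.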
For agents, you evaluate the trajectory, not just the final answer — did it pick the right tools, call them with the right arguments, in a sensible order, and stop when it should? A multi-step agent has many more places to go wrong than a single call, and a correct final answer can mask a wasteful or fragile path. Score trajectories either deterministically (the expected tools in the expected order) or with a judge reviewing the path against a reference.
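The deterministic variant can be as small as two checks, sketched here with a hypothetical `ToolCall` record: a strict one that requires exactly the expected tools in order, and a lenient one that allows extra calls in between. Argument checks are left out and would be compared per call in the same way.

```typescript
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

// Strict check: the agent called exactly the expected tools, in order.
function exactToolOrder(expected: string[], actual: ToolCall[]): boolean {
  return (
    expected.length === actual.length &&
    actual.every((call, i) => call.tool === expected[i])
  );
}

// Lenient check: the expected tools appear in order, with extra calls allowed between them.
function containsToolSequence(expected: string[], actual: ToolCall[]): boolean {
  let next = 0;
  for (const call of actual) {
    if (next < expected.length && call.tool === expected[next]) next += 1;
  }
  return next === expected.length;
}
```

Reporting both is useful: a trajectory that passes the lenient check but fails the strict one reached the right answer by a wasteful path, which is the case a final-answer score would hide.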
Where Cloudflare sits — and why build evals here anyway
Be fair about the landscape. There’s a mature ecosystem of dedicated eval tooling — Promptfoo for CI-first testing (it has a native Workers AI provider), DeepEval for a large metric library, RAGAS as the RAG-eval standard, plus LangSmith, Braintrust, Arize Phoenix, and Langfuse for the full tracing-and-eval lifecycle. Cloudflare’s native evals are early by comparison, and the honest framing is that Cloudflare gives you building blocks plus a basic native feature — and the option to integrate any of those dedicated tools, since AI Gateway exports traces in OpenTelemetry format to backends like Langfuse and Phoenix. You are not locked into the native evals.
So why build here at all? Colocation. The judge (Workers AI or a frontier model via the gateway), the datasets (R2 or D1), the durable runner (Workflows), and the production feedback (the gateway’s log-ID API) all live in one account, on one network, on one bill, with per-use economics. Offline evaluation against golden sets and online evaluation from real traffic sit side by side rather than in a separate platform you have to wire up and pay for. Framed against the cluster: observability tells you what happened, cost control tells you what it costs, and evaluation tells you whether it’s good — and evaluation is the leg that lets you safely act on the other two, because it’s how you prove a cheaper model still holds quality before you switch to it.
The honest trade-offs
- Native evals are early and limited — cost, speed, and human thumbs-up only; no LLM-as-judge yet. Sophisticated evaluation is build-it-yourself or integrate a dedicated tool.
- The native dataset feature is constrained — ten per gateway, AND-only filters, and evaluations don’t auto-refresh.
- An LLM judge is not ground truth. It carries position, verbosity, and self-preference biases, and can occasionally prefer wrong answers outright. Mitigate with order-swapping, reference scoring, and a different judge model — don’t trust it blindly.
- RAG and agent evaluation are genuinely hard regardless of platform — retrieval and generation must be scored separately, and agent trajectories add whole new failure modes.
- Golden datasets are real work to build and maintain, and they drift; budget for it and refresh from production.
- Human feedback is sparse and noisy — the thumbs-up metric depends on users actually clicking, so low volume means high variance.
- Evaluation is ongoing, not one-time. Regression-test every change and recalibrate judges as models update.
- Consolidating judge, data, runner, and feedback on one platform is convenient, but it also concentrates a dependency; OpenTelemetry export is the hedge.
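To put a number on the "low volume means high variance" point in the list above, a confidence interval on the thumbs-up rate is a reasonable companion to the raw percentage. The sketch below uses the standard Wilson score interval; choosing it, and the 95% level (`z = 1.96`), is an assumption here rather than anything AI Gateway computes for you.

```typescript
// 95% Wilson score interval for a thumbs-up rate: `up` positives out of `rated` votes.
function wilsonInterval(
  up: number,
  rated: number,
  z = 1.96,
): { lower: number; upper: number } {
  if (rated <= 0 || up < 0 || up > rated) {
    throw new Error("need rated > 0 and 0 <= up <= rated");
  }
  const p = up / rated;
  const z2 = z * z;
  const denom = 1 + z2 / rated;
  const center = (p + z2 / (2 * rated)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / rated + z2 / (4 * rated * rated))) / denom;
  return { lower: center - half, upper: center + half };
}
```

With 8 thumbs-up out of 10 votes the interval runs from roughly 49% to 94%, so an "80% positive" headline on that sample says very little. The same rate over 1,000 votes narrows to a few points either side.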
Working with Truvisory
If you’re shipping AI and quality is still decided by vibes, see how we build evaluated AI systems on Cloudflare — with golden datasets, an automated judge, and regression gates wired in, so “is it good?” is a number you can defend rather than a feeling.
Truvisory is a Denver-based AI and automation consultancy run by a senior operator — a combat veteran and former PE-backed operating executive — who ships working software, not strategy decks. Cloudflare-native by default, for both AI delivery and the back-office automation where the ROI lives.
