
AI Evals on Cloudflare: How to Measure Whether Your AI Is Actually Good

Tony Adams · 11 min read · June 2026

Observability tells you what your AI did — the prompts, tokens, latency, cost, and errors. Evaluation tells you whether what it did was any good. Those are two different disciplines, and you need both. You measure quality with golden datasets and an automated judge run before every deploy (offline, as a regression test), plus human feedback and guardrail checks on live traffic (online). On Cloudflare you get a small set of native evaluators — cost, speed, and thumbs-up human feedback — plus the raw building blocks to assemble the rest yourself: a Workers AI model as your judge, R2 or D1 to hold golden datasets, Workflows to run durable eval jobs, and the AI Gateway feedback API to capture production signal. This is the deep dive on how to do that — and on where Cloudflare’s native evals genuinely fall short.

This is the evaluation spoke of our AI observability, cost, and evaluation cluster: the “measure quality” layer that both the observability spoke and the cost spoke point to. It’s the highest-leverage piece of the whole cluster, and the one most teams skip.

// THE SHORT VERSION

Observability and evaluation are different. Seeing every request tells you what happened; only evaluation tells you whether it was correct — and whether a change is safe to ship.
Build evals on day one. Without them, quality is decided by vibes, and regressions ship silently when you tweak a prompt or swap a model.
Cloudflare’s native AI Gateway evals are early and limited — cost, speed, and human thumbs-up only. An automated LLM-as-judge evaluator is on the roadmap but not shipped, so anything more sophisticated you build yourself or integrate from a dedicated tool.
The building blocks are all here and colocated: a judge model on Workers AI, golden datasets in R2 or D1, a durable runner in Workflows, and a production feedback API keyed to each request — offline and online evaluation in one account.

Observability tells you what happened; evals tell you if it was good

This is the distinction the whole cluster turns on, so it’s worth being precise. The observability spoke covers seeing your traffic: every request’s prompt, response, token count, latency, and cost, logged and queryable. That answers “what happened?” It does not answer “was the answer right?” A response can be fast, cheap, well-formed, and completely wrong, and observability will show you a clean green row for it. Quality is a separate measurement, and you only get it by evaluating outputs against some notion of what “good” means.

Why this matters in practice: the moment you change anything — edit a system prompt, upgrade a model, adjust a retrieval step — you risk a silent regression, where the metrics still look healthy but the answers got worse. Fluency hides it; a confident, fluent answer reads fine until someone checks whether it’s true. The only defense is systematic evaluation, run on every change, which is why the recurring advice across our agents work is to build evals on day one rather than bolting them on after the first quality incident. Vibes-based testing — a few people eyeballing outputs and judging by tone — is how those incidents happen.

The bridge between the two disciplines is one field. Every AI Gateway request returns a log ID (cf-aig-log-id), and you attach an evaluation — a score, a thumbs-up, structured feedback — to a specific call by that ID. The logging mechanics belong to the observability spoke; here, that ID is the hook evaluation hangs on.
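As a minimal sketch of that hook, the snippet below pulls the log ID off a gateway response and pairs it with an identifier your application already tracks, so feedback can be attached later. The `HeaderReader` and `LoggedCall` types and the `requestId` field are illustrative assumptions, not part of any Cloudflare API; only the `cf-aig-log-id` header name comes from the text above.

```typescript
// Minimal structural type so the sketch does not depend on a runtime's Headers class.
interface HeaderReader {
  get(name: string): string | null;
}

// Links your own request identifier to the gateway's log ID.
interface LoggedCall {
  requestId: string; // hypothetical: whatever ID your app already uses
  logId: string;     // value of the cf-aig-log-id response header
}

// Returns null when the header is missing (for example, the call bypassed the gateway).
function captureLogId(requestId: string, headers: HeaderReader): LoggedCall | null {
  const logId = headers.get("cf-aig-log-id");
  return logId ? { requestId, logId } : null;
}
```

Storing the pair at request time is what lets a thumbs-up that arrives minutes later be attached to the right call.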

What AI Gateway’s native evals give you today

Be clear-eyed about the current state, because it’s the thing most write-ups get wrong. AI Gateway has a native Evaluations feature, and it’s in open beta with exactly three evaluators: Cost (average inference cost across a dataset), Speed (average request duration), and Human feedback (the percentage of thumbs-up on the logs you’ve annotated). You build a dataset by filtering your logs — by model, provider, status, cost, metadata, and so on — and run the evaluators against it; results show the metrics and how many logs fed each one.

Here’s the honest callout, and it’s load-bearing: an automated or LLM-as-judge evaluator has not shipped. Cloudflare has signaled that richer evaluators using a model as a judge are coming, but as of this writing the native set is still cost, speed, and human thumbs-up, and the documentation still describes additional evaluators as a future addition. So if you want anything beyond those three — semantic correctness, faithfulness, rubric scores, regression gates — you build it yourself or integrate a dedicated tool. Treat that as the current reality and re-check it before you rely on it, because it’s exactly the kind of thing Cloudflare ships on a monthly cadence.

A few sharp edges worth knowing while you’re in there. Native datasets are capped at ten per gateway, and the dataset filters use AND logic — one value per filter, so you can’t build “model A or model B” in a single set. Datasets refresh as new logs match the filter, but evaluations don’t re-run automatically; you create a fresh evaluation to score new logs. And the feedback that drives the Human-feedback metric is captured three ways — from the dashboard, from a PATCH on the log ID via the API, or from the Worker binding patchLog() — where feedback is -1, 0, or 1, alongside an optional 0–100 score and a metadata object. One precise point: the native metric is computed from the binary thumbs-up/down only; the 0–100 score is stored but not surfaced as a native evaluator, so score-based dashboards are something you build.

// Capture production signal: attach feedback to a specific call by its log id
await env.AI.gateway("my-gateway").patchLog(logId, {
  feedback: 1,                  // -1 negative, 1 positive, 0 not evaluated
  score: 92,                    // 0-100, stored but not a native metric
  metadata: { reviewer: "ops" }
});
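Because the 0–100 score is stored but not surfaced natively, a score-based view is something you compute yourself. The sketch below summarizes a batch of feedback records under two assumptions that are ours, not Cloudflare's: records with `feedback: 0` are treated as unannotated and left out of the thumbs-up percentage, and the `FeedbackRecord` shape is a simplified stand-in for whatever you read back from your logs.

```typescript
interface FeedbackRecord {
  feedback: -1 | 0 | 1; // same encoding as patchLog: -1 negative, 0 not evaluated, 1 positive
  score?: number;       // optional 0-100 score
}

interface FeedbackSummary {
  annotated: number;          // records with an explicit thumbs-up or thumbs-down
  thumbsUpPct: number | null; // null when nothing has been annotated
  meanScore: number | null;   // null when no record carries a score
}

function summarizeFeedback(records: FeedbackRecord[]): FeedbackSummary {
  // Assumption: feedback 0 means "not evaluated", so it is excluded from the denominator.
  const annotated = records.filter((r) => r.feedback !== 0);
  const ups = annotated.filter((r) => r.feedback === 1).length;
  const scores = records
    .map((r) => r.score)
    .filter((s): s is number => typeof s === "number");
  const scoreSum = scores.reduce((a, b) => a + b, 0);
  return {
    annotated: annotated.length,
    thumbsUpPct: annotated.length ? (100 * ups) / annotated.length : null,
    meanScore: scores.length ? scoreSum / scores.length : null,
  };
}
```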

Building the rest yourself

Because the native evals stop at three, the substance of evaluation on Cloudflare is the build-it-yourself pattern — and the pieces are all on the platform, which is the real argument for doing it here. You need a judge, a dataset, and a runner.

The judge is a model that scores an output against a rubric. Run a Workers AI model with JSON mode so it returns a structured verdict — a score and its reasoning — rather than prose you have to parse, or route the judge call through AI Gateway to a frontier model if you want a stronger evaluator. The golden dataset — trusted inputs paired with ideal or reference outputs — lives in R2 as JSON or JSONL for large sets, or in D1 (serverless SQLite with native JSON functions) when you want to query inputs and results as structured rows. The runner is Cloudflare Workflows: a durable execution engine where each step’s result is persisted, so an eval job that iterates a dataset, generates a candidate, judges it, and stores the score survives failures and resumes mid-run instead of restarting. (The Workflows internals — steps, retries, sleeps — are the subject of our agents cluster’s durable-execution spoke; here it’s simply the eval runner.)

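// Offline eval loop as a Workflow (conceptual). Runs inside a Workflow's run(event, step);
// loadFromR2, generate, judge, saveToD1, and THRESHOLD are your own helpers and config.
const cases = await step.do("load-golden-set", () => loadFromR2(env));
let total = 0;
for (const c of cases) {
  const out   = await step.do(`generate-${c.id}`, () => generate(env, c.input));
  const score = await step.do(`judge-${c.id}`,    () => judge(env, c.input, out, c.reference));
  await step.do(`persist-${c.id}`, () => saveToD1(env, c.id, score));
  total += score; // assumes judge() resolves to a numeric score
}
// fail the run if the aggregate score drops below your regression threshold
const mean = cases.length ? total / cases.length : 0;
if (mean < THRESHOLD) throw new Error(`Eval regression: mean score ${mean} is below ${THRESHOLD}`);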

That loop is the heart of offline evaluation: run it in CI on every prompt or model change, and fail the build if the aggregate score regresses.
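The judge step in that loop could look something like the sketch below. Treat it as an illustration under stated assumptions: `AiLike` is a structural stand-in for the Workers AI binding, the model ID is a placeholder, the 1–5 rubric is ours, and the `response_format` request shape and the `response` field on the result should be checked against the current Workers AI JSON mode documentation. This version takes the binding directly and returns a structured verdict (or null when the output is malformed) rather than a bare number, so a caller would read its `score` field.

```typescript
// Structural stand-in for the Workers AI binding (env.AI); only run() is used here.
interface AiLike {
  run(model: string, input: unknown): Promise<unknown>;
}

interface Verdict {
  score: number;     // 1 (poor) to 5 (excellent), per the rubric in the prompt
  reasoning: string; // the judge's short justification
}

// Placeholder model ID: substitute whichever JSON-mode-capable model you use.
const JUDGE_MODEL = "@cf/meta/llama-3.1-8b-instruct";

// Validate the judge's output instead of trusting it; returns null on anything malformed.
function parseVerdict(raw: unknown): Verdict | null {
  let value: unknown = raw;
  if (typeof value === "string") {
    try {
      value = JSON.parse(value);
    } catch {
      return null;
    }
  }
  if (typeof value !== "object" || value === null) return null;
  const fields = value as Record<string, unknown>;
  const score = fields.score;
  const reasoning = fields.reasoning;
  if (typeof score !== "number" || !Number.isInteger(score) || score < 1 || score > 5) return null;
  if (typeof reasoning !== "string") return null;
  return { score, reasoning };
}

async function judge(ai: AiLike, input: string, output: string, reference: string): Promise<Verdict | null> {
  const result = await ai.run(JUDGE_MODEL, {
    messages: [
      { role: "system", content: "Grade the candidate answer against the reference. Score correctness from 1 to 5. Reply as JSON." },
      { role: "user", content: `Question:\n${input}\n\nReference:\n${reference}\n\nCandidate:\n${output}` },
    ],
    // Assumed request shape for JSON mode; verify against the current docs.
    response_format: {
      type: "json_schema",
      json_schema: {
        type: "object",
        properties: { score: { type: "integer" }, reasoning: { type: "string" } },
        required: ["score", "reasoning"],
      },
    },
  });
  // Assumption: the structured output arrives under a `response` field.
  const payload = (result as { response?: unknown } | null)?.response;
  return parseVerdict(payload);
}
```

Returning null on a malformed verdict, rather than throwing or guessing a score, lets the runner decide whether to retry the step or count the case as a failure.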

The concepts you need

Offline versus online. These are the two halves of a real eval practice, and Cloudflare gives you a home for each:

// Offline vs online evaluation — and where each lives on Cloudflare
Aspect	Offline eval	Online eval
When	Before deploy, on every change	In production, on live traffic
Against	A fixed golden dataset	Real user requests
Signal	An automated judge and metrics	Human feedback, guardrail checks
On Cloudflare	Workers + Workflows	The AI Gateway feedback API
Catches	Regressions before they ship	Drift and real-world failures

Golden datasets are the foundation, and they’re real work — trusted inputs with ideal outputs, usually human-labeled, versioned, and refreshed with hard cases pulled from production. This is not a one-afternoon task; teams routinely invest dozens of hours building a few hundred well-chosen cases, and that dataset is what later lets you swap to a cheaper model and prove quality held.
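For a JSONL golden set, a small parser that validates every row is cheap insurance against a silently broken dataset. The sketch below assumes a three-field case shape (`id`, `input`, `reference`); that schema and the function name are illustrative choices, not something the platform prescribes.

```typescript
interface GoldenCase {
  id: string;        // stable identifier, used to name workflow steps and result rows
  input: string;     // the trusted input
  reference: string; // the ideal or reference output
}

// Parses JSONL text into validated cases; throws with a line number on the first bad row.
function parseGoldenJsonl(text: string): GoldenCase[] {
  const cases: GoldenCase[] = [];
  const seen = new Set<string>();
  const lines = text.split(/\r?\n/);
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    if (line.trim() === "") continue; // tolerate blank lines
    const row = JSON.parse(line) as Partial<GoldenCase> | null;
    const id = row?.id;
    const input = row?.input;
    const reference = row?.reference;
    if (typeof id !== "string" || typeof input !== "string" || typeof reference !== "string") {
      throw new Error(`line ${i + 1}: expected string fields id, input, reference`);
    }
    if (seen.has(id)) throw new Error(`line ${i + 1}: duplicate id ${id}`);
    seen.add(id);
    cases.push({ id, input, reference });
  }
  return cases;
}
```

In a Worker, the text would come from reading the object out of your R2 bucket; rejecting duplicate IDs matters because the IDs end up in step names and result keys.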

>80%

rate at which a strong LLM judge agrees with human preferences (about the rate humans agree with each other), but with measured position, verbosity, and self-preference biases to design around. Source: Zheng et al., MT-Bench / Chatbot Arena (NeurIPS 2023)

LLM-as-judge, and its biases. Using a model to grade outputs scales evaluation, and a strong judge can agree with human preferences well over 80% of the time — about the rate at which humans agree with each other. But it is not ground truth, and it has measured biases you have to design around: position bias (favoring the first answer shown), verbosity bias (favoring longer answers), and self-preference (favoring its own family’s style). Position bias in particular is large — in one study, simply reordering two responses flipped the verdict, letting a weaker model “beat” a stronger one on 66 of 80 queries purely from order. The mitigations are concrete: swap the order and only count a win if it holds both ways, score against a reference answer, and use a different model family (or an ensemble) as the judge rather than the model you’re evaluating. Judges also drift — a minor model upgrade can silently shift scores while your CI gate keeps passing — so recalibrate periodically.
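The order-swap mitigation is simple enough to sketch. The code below runs a pairwise comparison twice with the answers in opposite positions and only declares a winner when both rounds agree; anything else is a tie. The `PairwiseJudge` signature is a hypothetical interface for whatever judge call you use.

```typescript
type Preference = "first" | "second" | "tie"; // which displayed answer the judge preferred
type Outcome = "A" | "B" | "tie";

// Hypothetical judge signature: compares two answers in the order they are shown.
type PairwiseJudge = (question: string, first: string, second: string) => Promise<Preference>;

// aShownFirst: the judge's preference when A was displayed first.
// bShownFirst: the judge's preference when B was displayed first.
function reconcile(aShownFirst: Preference, bShownFirst: Preference): Outcome {
  const aWonBoth = aShownFirst === "first" && bShownFirst === "second";
  const bWonBoth = aShownFirst === "second" && bShownFirst === "first";
  if (aWonBoth) return "A";
  if (bWonBoth) return "B";
  return "tie"; // the verdict flipped with position, or a round was a tie
}

async function compareWithSwap(judge: PairwiseJudge, question: string, a: string, b: string): Promise<Outcome> {
  const round1 = await judge(question, a, b);
  const round2 = await judge(question, b, a);
  return reconcile(round1, round2);
}
```

A judge that always prefers whichever answer it sees first yields `"first"` in both rounds, which this scheme records as a tie rather than a win for either side.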

Regression testing ties it together: the golden set plus the judge, run on every change, with a pass/fail threshold wired into CI. That’s what turns “we think this prompt is better” into evidence.
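A pass/fail gate can be as small as the sketch below. It assumes numeric per-case scores and two thresholds of your choosing: a minimum mean and an optional per-case floor. The names and the decision to fail an empty run are our assumptions about what a sensible gate does.

```typescript
interface CaseScore {
  id: string;
  score: number;
}

interface GateResult {
  mean: number;
  passed: boolean;
  belowFloor: string[]; // ids of cases that scored under the per-case floor
}

function regressionGate(scores: CaseScore[], minMean: number, minPerCase = 0): GateResult {
  if (scores.length === 0) {
    // An empty run should never pass silently.
    return { mean: 0, passed: false, belowFloor: [] };
  }
  const mean = scores.reduce((sum, s) => sum + s.score, 0) / scores.length;
  const belowFloor = scores.filter((s) => s.score < minPerCase).map((s) => s.id);
  return { mean, passed: mean >= minMean && belowFloor.length === 0, belowFloor };
}
```

In CI, a result with `passed: false` would typically translate into a non-zero exit code. The per-case floor catches the situation where the average holds steady while one important case collapses.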

Evaluating RAG and agents

These are harder than scoring a single response, and it’s worth saying so plainly. For retrieval-augmented generation, you score two things separately, because they fail separately: retrieval quality (did the system fetch the right context?) and generation quality (was the answer grounded in that context and relevant to the question?). The widely used RAG triad names these as context relevance, faithfulness or groundedness, and answer relevance. An answer can be fluent and relevant yet unfaithful, confidently stating something the retrieved context doesn’t support, which is exactly the failure RAG eval exists to catch. On Cloudflare you pull the retrieved context from AI Search or Vectorize and judge faithfulness against it; the RAG internals themselves are covered in our agents cluster’s memory and grounding spoke, which argues at length for building RAG evals from the start.
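The retrieval half can often be scored without a judge at all. The sketch below computes precision and recall over the top k retrieved chunks, assuming each golden case lists the IDs of the chunks a correct retrieval should return; that labeling is an assumption about your dataset. Faithfulness and answer relevance still need a judge or a dedicated metric library.

```typescript
interface RetrievalScore {
  precision: number; // share of the top-k results that are relevant
  recall: number;    // share of the relevant chunks that made the top k
}

function retrievalAtK(retrieved: string[], relevant: string[], k: number): RetrievalScore {
  const topK = retrieved.slice(0, k);
  const relevantSet = new Set(relevant);
  // Count distinct relevant IDs so a chunk returned twice is not double-counted.
  const hits = new Set(topK.filter((id) => relevantSet.has(id))).size;
  return {
    precision: topK.length ? hits / topK.length : 0,
    recall: relevantSet.size ? hits / relevantSet.size : 0,
  };
}
```

Tracking the two numbers separately from the generation score tells you which half of the pipeline regressed when the end-to-end quality drops.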

For agents, you evaluate the trajectory, not just the final answer — did it pick the right tools, call them with the right arguments, in a sensible order, and stop when it should? A multi-step agent has many more places to go wrong than a single call, and a correct final answer can mask a wasteful or fragile path. Score trajectories either deterministically (the expected tools in the expected order) or with a judge reviewing the path against a reference.
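The deterministic option can be sketched in a few lines. Two checks are shown: a strict one that requires exactly the expected tools in exactly the expected order, and a looser one that allows extra calls as long as the expected tools appear in order. The `ToolCall` shape is a hypothetical record of what your agent logged, and argument checking is left out for brevity.

```typescript
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

// Strict: the agent called exactly the expected tools, in exactly the expected order.
function exactToolMatch(actual: ToolCall[], expectedTools: string[]): boolean {
  return (
    actual.length === expectedTools.length &&
    actual.every((call, i) => call.tool === expectedTools[i])
  );
}

// Lenient: the expected tools appear in order, with extra calls allowed in between.
function followsExpectedOrder(actual: ToolCall[], expectedTools: string[]): boolean {
  let next = 0;
  for (const call of actual) {
    if (next < expectedTools.length && call.tool === expectedTools[next]) next++;
  }
  return next === expectedTools.length;
}
```

A trajectory that passes the lenient check but fails the strict one is a useful signal in itself: the agent got there, but by a longer path than the reference.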

Where Cloudflare sits — and why build evals here anyway

Be fair about the landscape. There’s a mature ecosystem of dedicated eval tooling — Promptfoo for CI-first testing (it has a native Workers AI provider), DeepEval for a large metric library, RAGAS as the RAG-eval standard, plus LangSmith, Braintrust, Arize Phoenix, and Langfuse for the full tracing-and-eval lifecycle. Cloudflare’s native evals are early by comparison, and the honest framing is that Cloudflare gives you building blocks plus a basic native feature — and the option to integrate any of those dedicated tools, since AI Gateway exports traces in OpenTelemetry format to backends like Langfuse and Phoenix. You are not locked into the native evals.

So why build here at all? Colocation. The judge (Workers AI or a frontier model via the gateway), the datasets (R2 or D1), the durable runner (Workflows), and the production feedback (the gateway’s log-ID API) all live in one account, on one network, on one bill, with per-use economics. Offline evaluation against golden sets and online evaluation from real traffic sit side by side rather than in a separate platform you have to wire up and pay for. Framed against the cluster: observability tells you what happened, cost control tells you what it costs, and evaluation tells you whether it’s good — and evaluation is the leg that lets you safely act on the other two, because it’s how you prove a cheaper model still holds quality before you switch to it.

The honest trade-offs

Native evals are early and limited — cost, speed, and human thumbs-up only; no LLM-as-judge yet. Sophisticated evaluation is build-it-yourself or integrate a dedicated tool.
The native dataset feature is constrained — ten per gateway, AND-only filters, and evaluations don’t auto-refresh.
An LLM judge is not ground truth. It carries position, verbosity, and self-preference biases, and can occasionally prefer wrong answers outright. Mitigate with order-swapping, reference scoring, and a different judge model — don’t trust it blindly.
RAG and agent evaluation are genuinely hard regardless of platform — retrieval and generation must be scored separately, and agent trajectories add whole new failure modes.
Golden datasets are real work to build and maintain, and they drift; budget for it and refresh from production.
Human feedback is sparse and noisy — the thumbs-up metric depends on users actually clicking, so low volume means high variance.
Evaluation is ongoing, not one-time. Regression-test every change and recalibrate judges as models update.
Consolidating judge, data, runner, and feedback on one platform is convenient but concentrates a dependency; OpenTelemetry export is the hedge.
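To put a number on the "low volume means high variance" point above, one option is to report the thumbs-up rate with a confidence interval instead of as a bare percentage. The sketch below uses the Wilson score interval, which is our choice of method rather than anything the native evaluator computes.

```typescript
interface Interval {
  low: number;
  high: number;
}

// Wilson score interval for a proportion; z = 1.96 corresponds to roughly 95% confidence.
function wilsonInterval(thumbsUp: number, total: number, z = 1.96): Interval {
  if (total === 0) return { low: 0, high: 1 }; // no data: the rate could be anything
  const p = thumbsUp / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = (p + z2 / (2 * total)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total))) / denom;
  return { low: Math.max(0, center - half), high: Math.min(1, center + half) };
}
```

With 8 thumbs-up out of 10 annotated logs the interval runs from roughly 49% to 94%, while 800 out of 1,000 narrows it to roughly 77% to 82%. The headline rate is 80% in both cases, but only the second is stable enough to gate a decision on.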

Frequently asked

Does Cloudflare have an LLM-as-judge evaluator?

Not natively, as of this writing. AI Gateway's native evaluators are cost, speed, and human thumbs-up, and an automated judge is signaled as coming but hasn't shipped. You build an LLM-as-judge yourself on Workers AI, or integrate a dedicated eval tool. Re-check the docs before relying on it.

What's the difference between offline and online evaluation?

Offline runs before deploy against a fixed golden dataset — your regression test, built on Workers and Workflows. Online runs in production on real traffic, via human feedback and guardrail checks captured through the AI Gateway feedback API. You want both.

How do I capture quality signal from production?

Every gateway request returns a log ID; attach feedback to it through the dashboard, a PATCH on the API, or the patchLog() Worker binding — feedback as -1/0/1 plus an optional 0–100 score and metadata. That feeds the human-feedback evaluator and your own dashboards.

How do I build a golden dataset?

Collect trusted inputs paired with ideal outputs — seeded by hand and grown with real, hard cases from your logs — label them, version them, and store them in R2 or D1. Plan for it to be ongoing work; it's the asset every other eval depends on.

How do I evaluate a RAG system?

Score retrieval and generation separately: did it fetch the right context, and was the answer faithful to that context and relevant to the question? Pull the retrieved context from AI Search or Vectorize and judge faithfulness against it.

Working with Truvisory

If you’re shipping AI and quality is still decided by vibes, see how we build evaluated AI systems on Cloudflare — with golden datasets, an automated judge, and regression gates wired in, so “is it good?” is a number you can defend rather than a feeling.

Truvisory is a Denver-based AI and automation consultancy run by a senior operator — a combat veteran and former PE-backed operating executive — who ships working software, not strategy decks. Cloudflare-native by default, for both AI delivery and the back-office automation where the ROI lives.

Tony Adams is the founder of Truvisory®. He builds Cloudflare-native AI systems for federal and commercial clients. SBA-verified SDVOSB and VOSB, SAM.gov-registered.


