Skip to main content
Truvisory
Cloudflare-Native

Controlling AI Model Costs on Cloudflare: The Levers That Actually Reduce Token Spend

Tony Adams 9 min read

AI inference cost is not a fixed tax — it’s a set of knobs, and most teams never touch them. On Cloudflare you turn them through a single one-line gateway plus Workers AI: cache identical requests so repeat calls never bill the provider; rate-limit and set spend limits that hard-stop runaway cost; route simple work to cheaper models and reserve flagships for the hard tasks; attribute spend per user, team, and feature so you know where the money goes; and treat the gateway’s cost number as an estimate you reconcile against the provider’s real invoice. Five levers, one control plane.

This is the cost-control spoke of our AI observability, cost, and evaluation cluster — the “predict and reduce it” layer. It’s the sibling of the observability spoke, which shows you cost as a metric; this piece is about the levers that bring that metric down.

What you actually pay for

Before the levers, the cost model. With third-party providers you pay per token — input and output separately, with output priced several times higher than input (Anthropic holds a flat 5x output-to-input ratio across Claude tiers; OpenAI’s ranges from roughly 3x to 8x by model). That single fact sets the highest-leverage knob: trimming output tokens moves the bill more than trimming input. With Cloudflare’s own Workers AI, you pay in Neurons — a unit of GPU compute — at a published per-model rate, with a small daily free allowance.

One boundary to set up front, because it’s a different question entirely: this piece is about model and inference cost — tokens and Neurons. The compute cost story — how Workers’ CPU-time billing beats Lambda’s wall-clock model, the egress math — lives in our existing Workers-versus-Lambda cost comparison. I won’t re-derive it here.

Lever one: cache to stop paying for the same answer twice

AI Gateway can serve an identical request from cache instead of calling the provider, and a cache hit costs you nothing in provider tokens — that’s the saving. The mechanism is the thing to understand, because it sets your expectations: the cache key is a hash of the entire request, so caching is exact-match. Any difference at all — a changed word, a different parameter, a reordered tool list — is a separate cache entry and a separate miss. You control it with a TTL header (cf-aig-cache-ttl, from sixty seconds up to a month), you can read whether a response was served from cache (cf-aig-cache-status returns HIT or MISS), and you can skip or custom-key the cache per request.

Here’s the honest limit, stated plainly because it determines whether caching is even worth turning on for you: semantic caching is not shipped. Cloudflare has said it plans to add similarity-based caching to improve hit rates, but today only identical requests hit. That makes exact-match caching excellent for deterministic or fixed-option workloads — a support bot with a bounded set of answers, repeated dev-and-test calls — and close to useless for naturally-worded user queries where no two prompts are byte-identical. (Don’t confuse this with AI Search, a separate Cloudflare product that does have similarity caching.) If your cache-hit rate sits near zero after a week, your prompts are too varied for this lever, and you lean on cheaper-model routing instead.

Caching reduces latency as well as cost, but that’s the caching-and-latency spoke’s subject — TTL strategy, staleness, and the performance side live there. Here it’s purely a cost lever.

Lever two: cap runaway cost with rate and spend limits

There are two different caps, and the distinction matters. Rate limiting caps request volume — N requests per window, fixed or sliding, returning a 429 when exceeded — which prevents the runaway-loop and abuse scenarios that produce shock bills. But volume is only a proxy for cost. The true dollar cap comes from Unified Billing’s spend limits: set a daily, weekly, or monthly limit, and the gateway automatically stops processing requests when you hit it, until the period resets or you raise the cap. That’s a real budget guardrail, not an approximation.

Two things to know about Unified Billing before you adopt it. First, there’s a fee: Cloudflare applies a 5% charge on credits purchased through it — a $100 credit purchase is billed at $105 — though the per-token provider rates themselves pass through with no markup. Second, it’s in open beta, and it only applies to third-party provider models; Workers AI models route around it and bill through Workers AI pricing instead. So if your workload is mostly on-platform models, Unified Billing’s spend caps won’t cover it, and you lean on rate limiting plus dashboard alerts.

A tuning note worth carrying into production: set the daily spend limit at roughly 1.5x your observed median daily spend, so legitimate spikes pass but a runaway halts — and pair it with alerting, because a hard stop you didn’t expect is its own kind of outage.

Lever three: attribute spend so you know where it goes

The gateway’s cost metric tells you what you’re spending in aggregate. To learn whose spending, or which feature’s, you tag each request with custom metadata via the cf-aig-metadata header — a small object of up to five string, number, or boolean values — and then filter your cost data by those keys. Cloudflare runs exactly this pattern on its own internal stack, mapping each user’s email to an anonymous identifier and passing only that, which gives per-user cost tracking without exposing identities to the provider or the logs. That’s the difference between “our AI spend is up” and “this one feature for this one segment regressed” — and it’s the step most teams skip.

For accurate internal chargeback there’s a second header: cf-aig-custom-cost lets you override the gateway’s default pricing with your actual negotiated per-token rates, so the numbers reflect the deal you signed rather than list price. The mechanics of the logs and dashboard that surface all this belong to the observability spoke; here the point is the use — attribution and chargeback.

Lever four: route simple work to cheaper models

Not every request needs your most expensive model. Routine work — classification, extraction, summarization, routing itself — runs fine on small, cheap models, and reserving flagships for genuinely hard reasoning is one of the largest levers available; routing by complexity alone is commonly reported to cut spend on the order of 40 to 70%, because most requests in a typical workload are routine.

40–70%
commonly reported model-spend reduction from routing routine requests to cheaper models — an illustrative industry estimate, not a Cloudflare guarantee — Industry cost-optimization analysis, 2026

The price spread is dramatic: Claude Haiku 4.5 runs $1 per million input tokens against Opus’s $5, and OpenAI’s GPT-4.1 Mini and Nano sit far below the full model. AI Gateway’s dynamic routing makes this conditional — route by metadata, enforce per-period budget caps, fall back when exceeded — but the routing mechanics and multi-provider strategy are the routing spoke’s territory. Here it’s the cost rationale.

The non-negotiable caveat: cheaper models can be worse, and you only know whether the downgrade held quality by measuring it. So this lever must be paired with evaluation — the evals spoke covers how. Ship a cheaper-model swap only once your evals confirm quality holds, and remember that a budget model needing two retries to reach a flagship’s single-pass answer isn’t actually cheaper.

Lever five: cheap inference on the platform itself

For many tasks the cheapest path is to move the model onto Cloudflare. Workers AI bills in Neurons at $0.011 per thousand, with ten thousand Neurons free every day, and the per-model rates put small models far below frontier third-party pricing — Llama 3.2 1B, for instance, runs about $0.027 per million input tokens and $0.20 per million output, an order of magnitude under the big hosted models. For text classification, extraction, and routing-style work, that’s often both the cheapest option and the tidiest one, since the inference stays inside the same control plane your caching and attribution already cover, with no cold starts and edge proximity. The free daily allowance is genuinely small, so it covers prototyping rather than production volume — but the per-Neuron economics hold as you scale.

The number you can’t fully trust (and what to do about it)

This is the most important honesty in the piece, so here it is without hedging: the gateway’s cost figure is an estimate, not your bill. Cloudflare states it directly — the cost metric is “an estimation based on the number of tokens sent and received,” and you should refer to your provider’s dashboard for accurate detail. Concretely, that means it’s computed from token counts against public list pricing, so it won’t reflect your negotiated or committed-use discounts unless you set custom costs; cache hits show as zero; free-tier usage skews it; and it only appears for models that return token data in their responses.

None of that makes it useless — it’s an excellent tool for spotting trends, forecasting, and catching a runaway early. It just isn’t ground truth. The discipline is to set your negotiated rates via custom costs so the estimate stops understating discounts, and to reconcile against each provider’s actual invoice monthly. If the estimate and the invoice diverge by more than about 10%, audit for missing custom costs, cache or free-tier skew, or unsupported models. Treat the gateway number as your daily instrument and the provider invoice as the source of record.

The honest trade-offs

  • Cost is an estimate, not billing truth — reconcile against provider invoices, and set custom costs so negotiated discounts show up.
  • Caching is exact-match only. Semantic caching isn’t shipped, so naturally-worded queries rarely hit and achievable savings are capped. Great for deterministic or fixed-option workloads, weak for varied prompts.
  • Cheaper models risk quality. This lever only works paired with evaluation; a budget model that retries can cost more than a flagship that doesn’t.
  • Unified Billing carries a 5% credit fee and is in open beta, and it doesn’t cover Workers AI models, which bill separately. Factor the fee and the boundary into the build-versus-buy math.
  • Rate limiting caps volume, not dollars, and can reject legitimate traffic if mis-tuned; spend limits hard-stop traffic when hit, so pair both with alerting.
  • The Workers AI free allowance is small — ten thousand Neurons a day covers prototyping, not production.
  • Every lever depends on observability to verify — without attribution you’re optimizing blind, which is why instrumentation comes first.
  • Consolidating cost control on one provider is convenient and concentrates a dependency. Worth weighing honestly.

Concrete patterns

Turn on caching with a TTL header — cf-aig-cache-ttl: 3600 — and read cf-aig-cache-status on the response to track your hit rate; skip per request with cf-aig-skip-cache: true. Make the cost numbers honest by sending your negotiated rates as a custom cost — cf-aig-custom-cost with per-token input and output values — noting that cache hits always record zero regardless. Cap runaway volume with a per-gateway rate limit (an interval, a limit, and fixed or sliding technique; over-limit returns 429), and cap actual dollars with Unified Billing daily/weekly/monthly spend limits in the dashboard. And attribute every request by stamping cf-aig-metadata with a team, user, and feature, then filtering your cost data by those keys. Keep the metadata deliberate, set custom costs to match your contracts, and instrument before you tune.

Frequently asked

What is the single biggest lever to cut AI cost?
For most teams, model routing — sending routine work to a cheap model and reserving the flagship for hard tasks, commonly cutting spend 40 to 70%. After that, capping output tokens (output is priced several times higher than input) and caching repeatable requests.
Does the gateway tell me exactly what I'm spending?
No — it's a close estimate from token counts on list pricing, which is great for spotting trends but won't reflect negotiated discounts unless you set custom costs, and won't match your invoice exactly. Reconcile monthly against the provider's billing.
Will caching actually save me money?
Only if your requests repeat byte-for-byte. AI Gateway caching is exact-match — semantic caching isn't shipped — so it's excellent for deterministic or fixed-option workloads and close to useless for varied, naturally-worded prompts. Check your cf-aig-cache-status hit rate before counting on it.
How do I stop a runaway bill?
Two caps. Rate limiting bounds request volume (a backstop against loops and abuse), and Unified Billing spend limits set a true daily/weekly/monthly dollar cap that auto-stops traffic when hit. Set the daily cap near 1.5x your median daily spend and pair it with alerting.
Is moving to cheaper models safe?
Only if you measure it. Cheaper models can degrade quality, so pair any downgrade with evaluation and ship the swap only once quality holds. A model that needs extra retries isn't the bargain it looks like.

Working with Truvisory

If your AI bill is climbing and you can’t see why, see how we build cost-instrumented AI systems on Cloudflare — with attribution, caching, and spend caps wired in so the bill is a number you control, not one you discover.

Truvisory is a Denver-based AI and automation consultancy run by a senior operator — a combat veteran and former PE-backed operating executive — who ships working software, not strategy decks. Cloudflare-native by default, for both AI delivery and the back-office automation where the ROI lives.