Featured · Build log

At 5% GPU utilization, the math doesn't work. Here's what does.

Tony Adams 14 min read May 2026

The Cast AI 2026 State of Kubernetes Optimization Report quietly buried the most important number in the AI infrastructure conversation this year.

Across roughly 23,000 production clusters running on AWS, Azure, and Google Cloud, average GPU utilization is 5%.

Not p10. Not “some teams.” Not “before they tuned it.” Average. Across the production fleet of the enterprise AI buildout. CPU utilization is 8%, down from 10% a year earlier. Memory utilization fell from 23% to 20%. The trend lines are pointing the wrong direction, and the GPU number is the one that will eat balance sheets alive.

If you are paying for reserved GPU and using 5% of it, you are buying a Ferrari to commute three miles, twice a week, with one passenger.

The reserved-capacity model that the entire enterprise AI stack was sold under — the model that justified the eleven-figure capex commitments Amazon, Microsoft, Alphabet, and Meta announced for 2026 — is, on the math, mostly empty space being expensed against P&L. The Cast AI team called the bottom line on this directly in their write-up: at an average utilization of 5%, the math doesn’t work, and the hoarding instinct that holds capacity you might not get back is what feeds the scarcity loop that drives prices higher in the first place.

This is the part of the AI infrastructure conversation that VCs, hyperscalers, and platform vendors don’t want named out loud, because the answer isn’t “we need better Kubernetes.” The answer is that the reserved-capacity model is structurally broken for the workloads we’re actually running. Pay-per-inference architecture isn’t a niche cost-saver — it’s the only model that survives a CFO doing the math at the next quarterly review. And the architectural pattern that makes pay-per-inference work — orchestrator-plus-scout, RLM-style, stateless inference with stateful coordination — is the same pattern that lets a principal-led team ship multi-agent systems that used to need a twelve-person infrastructure org.

Let me show you the math, the architecture, and the federal angle.

How we got here

Reserved GPU made sense in 2023. There was no other game in town.

If you weren’t on a yearly H100 contract by Q2, you weren’t shipping models. Capacity was scarce, the queue was long, and your competitor had a procurement officer on a first-name basis with somebody at the cloud provider’s account team. You bought big. You sized for the peak workload you might run next quarter, not the median workload you ran today. You sized to never be caught short. The capacity sat there ready, and the math was: a 60% utilized GPU you own beats an unavailable GPU you don’t.

That math held until two things happened simultaneously. First, the models got dramatically better at being smaller. Gemma, Llama 3, Nemotron, Mistral Small — the gap between a frontier model and a 7B-to-30B model on most enterprise workloads narrowed faster than anyone forecasted. Second, the inference-time tooling caught up. Speculative decoding, FlashAttention variants, paged KV cache, batched serving — the throughput per GPU-hour on production hardware roughly tripled between 2023 and 2025.

Both of those should have driven utilization up. Instead, utilization fell. Why?

Because every gain on the inference side made it easier for teams to provision more headroom, not the same headroom more efficiently. The infrastructure team shipped capacity to the application team. The application team built features against that capacity. Both teams optimized for “never run out.” Nobody got paid to leave GPUs unprovisioned. And the application teams still don’t know what their median load looks like, because the product is changing every two weeks and the workload mix is changing every quarter.

The result is what Cast AI’s data shows: organizations assigning roughly twenty times more GPU capacity than they actively use. Ninety-five percent of GPU capacity, on average, sitting idle. Not for a bursty afternoon. For the year.

The CFO arithmetic

An idle CPU core costs cents per hour. An idle GPU costs dollars.

In January 2026, AWS raised H200 Capacity Block prices by 15%. That is the part to sit with. GPU prices went up, not down. For the first time since EC2 launched in 2006, the unit economics of cloud compute moved the wrong direction for the customer. Cast AI noted this broke a two-decade precedent. The implication isn’t that AWS is being unreasonable — it’s that the demand-supply mismatch is structural now, and the price will continue to do what prices do in a constrained market: ration access through the wallet.

So let’s do the math for a realistic mid-market workload.

You’re running a customer-facing AI feature. Maybe it’s an in-product copilot, maybe it’s a document summarizer, maybe it’s a workflow agent. Your engineering org reserved two H100 nodes — call it $30k/month all-in across compute, networking, observability, and the headcount allocated to keeping the cluster healthy. At 5% utilization, you are doing actual productive AI inference for about 36 hours per month per node. Your effective cost per productive hour is roughly $416. Your CFO is paying Ferrari rates for a Honda Civic’s worth of trips, and the maintenance is on you.

Now run the same workload on a pay-per-inference platform. Workers AI on Cloudflare bills you for the tokens you actually generate. (I ran the same bill side-by-side for Workers vs AWS Lambda on an IO-bound AI workload — the gap is a multiple, not a rounding error.) The hardware management surface goes to zero — you don’t have a cluster, you don’t have a queue, you don’t have a node pool, you don’t have a capacity planner. You write code that calls a model. The model runs in one of 330+ cities of presence, close to wherever the request came from. If you do no inference this hour, you pay zero. If you do a million inferences this hour, you pay for a million inferences and the platform handles the placement, batching, and hardware.

The unit economics flip. You no longer amortize a box. You match cost to value, request by request. The CFO’s question goes from “are we using what we bought?” to “is each call worth what it costs?” — which is the question the CFO should have been asking all along. The CTO’s question goes from “did we size the cluster right?” to “does this feature make money?” — which is also the question the CTO should have been asking all along.

The hyperscalers will tell you reserved capacity is cheaper per unit at scale. That is true in the same sense that a Costco-sized jar of mayonnaise is cheaper per ounce: it is, until you account for the mayonnaise that goes bad in your fridge. At 5% utilization, 95% of your GPU mayonnaise is going bad in your fridge.

Why most teams can’t migrate (yet)

If pay-per-inference is so much better, why isn’t everyone already there?

Four reasons, in roughly the order they show up on a whiteboard when I’m consulting with a team that wants to move.

One: sunk hardware. A lot of teams signed multi-year reserved instances or made capex commitments to their cloud provider that they can’t unwind without paying through the nose. The contract was a bet on continued scarcity, and the bet is now underwater. Migration plans get written, then shelved, because nobody wants to be the VP who walked away from $4M of pre-paid capacity even when continuing to use it costs more than walking away.

Two: latency assumptions. “We need our model running in our region for latency.” That was a real constraint in 2023. It is mostly not a real constraint in 2026. The edge inference platforms — Workers AI specifically — run in hundreds of cities. The median first-token latency from a Workers AI inference is, in my testing on real production workloads, often better than from a centralized cluster in the same continent, because the request hits a model running ten milliseconds away instead of in a single us-east-1 datacenter behind a load balancer behind an API gateway. Latency is no longer the moat people think it is.

Three: vendor lock-in on inference. A lot of teams built their first AI feature against OpenAI’s API. Then they built the next one. Then the next one. Then they realized the bill was real and decided to “go self-hosted” for cost reasons, which is how they ended up with the 5%-utilized cluster in the first place. The migration to pay-per-inference requires a portable inference layer — which is exactly what MCP, OpenAI-compatible APIs, and standardized tool surfaces have quietly built over the last 18 months.

Four: skill atrophy. Your ML platform team has spent the last 18 months optimizing a stack that pay-per-inference dissolves. Kubernetes operators for GPU scheduling. Custom autoscalers. NCCL tuning. vLLM deployment pipelines. None of that is wrong — it’s just not the work that pays the rent anymore for most product teams. The platform team rationally resists the architectural shift that makes their day-to-day expertise less central. This is a human problem, not a technical one. The technical answer is to redeploy the platform team against orchestration and evaluation, where there’s still real engineering depth required. The human answer is to be honest about it.

None of those four reasons changes the math. They just slow the migration. The teams that move first capture a compounding advantage that the teams that move later have to pay to catch up to.

What architecture survives this

Here’s where it gets interesting, because the architectural answer to the 5% problem isn’t “smaller GPUs” or “better autoscalers.” It’s a different shape entirely.

The pattern I keep returning to — the one HotCopy is built on, the one that lets me ship recursive multi-agent systems solo — is orchestrator-plus-scout. One bigger, more expensive model does the planning. Many smaller, cheaper models do the parallel scouting and execution. The orchestrator decomposes the problem, dispatches sub-problems to scouts running in parallel, and recombines their outputs.

The architectural lineage of this pattern is now formalized. In December 2025, Alex Zhang, Tim Kraska, and Omar Khattab at MIT CSAIL published the Recursive Language Models paper (arXiv:2512.24601), which treats the prompt itself as an external programmable environment that the model can decompose, examine, and recursively call itself over. I put that pattern through a real production job on Workers AI — a multi-day refactor compressed to six minutes for under $3 of inference. The RLM paper demonstrates that this pattern handles inputs two orders of magnitude beyond model context windows and outperforms vanilla frontier LLMs on long-context tasks at comparable or lower per-query cost. The technical insight is significant in its own right. The infrastructure implication is the part most readers missed.

If you build an orchestrator-plus-scout system on reserved GPU, you are reserving capacity for the peak parallel scout fan-out you might hit. Which, given the variance in agent workloads, is enormous. You will provision for the worst case and run at single-digit utilization on the median case, and now you’ve reproduced the 5% problem inside your own architecture.

If you build the same system on pay-per-inference, the architecture you wrote is the architecture you pay for. Twenty scouts in parallel costs twenty units. Two scouts in parallel costs two units. The thing nobody told you about pay-per-inference is that it doesn’t just change the cost model — it changes what architectures are economically viable. Recursive, fan-out, multi-agent patterns become cheaper on pay-per-inference precisely because the platform absorbs the variance. On reserved capacity, those same patterns are economically punitive, which is why the agent frameworks that emerged in 2024 mostly ran sequentially against a single big model: the architecture was bent to fit the cost surface.

What you need underneath to make this work is three primitives. They exist. They’re documented. They run today.

Stateless inference at the edge. Workers AI for the inference itself. The platform handles model placement, batching, and hardware. Your code doesn’t know what GPU it’s running on, and that’s the point.

Stateful coordination in lightweight durable runtimes. Cloudflare Durable Objects, one per agent instance, each with its own embedded SQL database and hibernation semantics. The agent wakes when something happens, reads its durable state, does work, and hibernates when idle. You don’t run servers. You don’t reserve capacity. Each Durable Object stays alive as long as it’s processing, then goes dormant. The Agent class in the Cloudflare Agents SDK is built directly on this primitive — DurableObject > Server > Agent — and it’s the most important infrastructure design choice for multi-agent systems I’ve seen since Docker.

Standardized tool surfaces. Model Context Protocol (MCP) as the contract between your agent and whatever it’s reaching out to — your CRM, your filesystem, your search index, your billing system. MCP shipped in November 2024 and the ecosystem moved fast: thousands of servers, SDKs in every major language, adoption by OpenAI and Google. The clean tool-surface contract is finally making “agent that uses your tools” boring, which is exactly what you want. Boring infrastructure is shippable infrastructure.

That’s the stack. Stateless inference. Stateful coordination. Standard protocol surfaces. Three layers, three primitives, one principal can hold the whole thing in their head.

Why solo and small teams win this

The 12-person infrastructure org that justified itself in 2023 is now a tax on the P&L for a lot of mid-market companies.

This isn’t a knock on infrastructure engineers — I’ve been one, I’ve hired them, I’ve run teams of them. It’s an observation about what the work actually looks like in 2026 versus 2023. In 2023, somebody had to set up the Kubernetes cluster, write the GPU operator, tune the vLLM deployment, build the autoscaler, write the observability pipeline, manage the model registry, design the inference gateway, build the queueing layer. That was real, hard, valuable work. It justified a team.

In 2026, on a pay-per-inference + edge-stateful-runtime + MCP stack, most of those layers are either provided by the platform or replaced by a primitive that does the same job in one config file. A principal engineer who knows the platform can stand up a multi-agent system in two weeks that would have taken a team a quarter in 2023. The ratio of engineering output to engineering headcount has moved by a factor most companies haven’t internalized yet.

This is what HotCopy is. A managed recursive AI coding CLI, built on the orchestrator-plus-scout pattern, running on a pay-per-inference inference layer with stateful coordination in Durable Objects. The infrastructure is invisible to the user, because the user doesn’t have any. There’s no cluster. There’s no queue. There’s no node pool. There’s a CLI that does work, and behind the CLI is an architecture that pays for what it uses and goes dormant when it doesn’t.

The product matters less than the proof-of-concept the build represents. One person can do this now. Not “one person with venture funding and an SRE on retainer.” One person.

The federal angle nobody is talking about

Here is the part that has not yet registered with most federal contractors but is going to in 2026.

On April 3, 2025, the Office of Management and Budget published M-25-21 (Accelerating Federal Use of AI through Innovation, Governance, and Public Trust) and M-25-22 (Driving Efficient Acquisition of Artificial Intelligence in Government). Read together, the two memos describe a federal AI acquisition posture that explicitly prioritizes vendor-portable systems, fixed-scope deliverables, ongoing testing and monitoring rights, and protection against vendor lock-in. M-25-22 directs agencies to consider vendor lock-in at every stage of the AI acquisition lifecycle — initial demonstrations, solicitation provisions, contract awards, ongoing data access — and requires solicitation provisions for knowledge transfer, data and model portability, and licensing and pricing transparency.

If you have read those two memos as a federal AI vendor and you are still pitching reserved-cluster, vendor-specific, single-region AI deployments, you are mispricing your own risk. The acquisition guidance reads like a buying spec for the kind of architecture I just described. Stateless inference that can run anywhere. Standard protocol surfaces (MCP) that don’t lock the agency into one vendor’s tool ecosystem. Per-request cost transparency. Fixed-scope, deliverable-based engagements rather than reserved-capacity contracts that bill regardless of usage.

The SDVOSB and small-business set-aside lanes are about to be the most interesting place in federal AI procurement, because the small vendors are structurally better positioned to deliver on the new acquisition spec than the integrators that built their federal AI practices on reserved-cluster sales. The integrators will adapt — they always do — but the adaptation cycle is 18 to 24 months, and the small vendors that show up next quarter with the right architecture are going to capture the wedge.

This is the consulting wedge I’m pushing at Truvisory®. Not “we’ll help you do AI.” Specifically: federal AI modernization that is Cloudflare-native, MCP-first, pay-per-inference, vendor-portable, and structured to satisfy M-25-21/22 from day one. That’s a buying spec the federal market is going to be looking for, with a small-business set-aside attached, and most of the existing federal AI vendors are not positioned to bid against it.

The new arithmetic

The thing I want you to take from this isn’t “Cloudflare good, Kubernetes bad.” Kubernetes is a good piece of software. The issue isn’t with the orchestrator. The issue is with the acquisition model underneath the orchestrator.

The new arithmetic isn’t “how do we get GPU efficiency from 5% to 30%?” The new arithmetic is “do we need a GPU on the books at all?” For most teams shipping AI features in 2026, the answer is no. The platform owns the hardware. You own the application logic. You pay for what you used, you sleep when you’re not using it, you scale to zero when traffic dies, and you scale to whatever the workload requires when it doesn’t.

The teams that internalize this in 2026 will spend the next two years compounding an efficiency advantage their competitors can’t match. The teams that hold the reserved-capacity line will spend the same two years explaining to their CFOs why their AI cost per customer keeps going up while their utilization keeps going down. Those conversations are going to get short.

Cast AI’s report ends with the observation that workloads change, traffic patterns shift, and the configuration that was accurate six months ago is unlikely to remain accurate today. That’s true on Kubernetes, and it’s the reason 5% is the average. It’s also the reason the architecture that wins isn’t a smarter rightsizing pass on the same model. It’s a different model entirely, one where the platform absorbs the variance and you absorb the value.

Workers AI on Cloudflare. Durable Objects for coordination. MCP for tool surfaces. RLM-style orchestrator-plus-scout for the workload shape. Pay-per-inference for the bill.

The math, finally, works.

Tony Adams is the founder of HotCopy and Truvisory®. He builds Cloudflare-native AI systems for federal and commercial clients. Verified SDVOSB and VOSB, SAM.gov-registered.

At 5% GPU utilization, the math doesn't work. Here's what does.

How we got here

The CFO arithmetic

Why most teams can’t migrate (yet)

What architecture survives this

Why solo and small teams win this

The federal angle nobody is talking about

The new arithmetic

AI Audit Logging on Cloudflare: Building a Tamper-Evident, Compliance-Grade Record of Every AI Request

Latency Engineering for AI on Cloudflare: Cache the Hot Path, Stream the Rest, Route to Faster Models

Controlling AI Model Costs on Cloudflare: The Levers That Actually Reduce Token Spend

One email a month. Not a vendor blog.

How we got here

The CFO arithmetic

Why most teams can’t migrate (yet)

What architecture survives this

Why solo and small teams win this

The federal angle nobody is talking about

The new arithmetic

More from Cloudflare

AI Audit Logging on Cloudflare: Building a Tamper-Evident, Compliance-Grade Record of Every AI Request

Latency Engineering for AI on Cloudflare: Cache the Hot Path, Stream the Rest, Route to Faster Models

Controlling AI Model Costs on Cloudflare: The Levers That Actually Reduce Token Spend

One email a month. Not a vendor blog.