# Truvisory — full content corpus

> Verified-SDVOSB AI consultancy shipping production systems on Cloudflare's edge — federal set-aside delivery and fixed-scope commercial builds. Denver, Colorado. SAM.gov UEI: KNZKX28MLC42 · CAGE: 9X1L4 · NAICS 541512 primary.

## Pages

### / (Home)

§ 06 / Not sure which track? 

Talk to our AI Agent. Same one that answers the phone.

Tell it about your project, ask anything about Truvisory®, or call +1 (303) 495-5859 for the voice version. Either path routes to Tony with a 24-hour reply.

§ 01 / Who's at the table · Federal 

If you're the one who has to actually get an AI
capability on contract this fiscal year.

Not a fit if you need a 200-person body shop, you're shopping on rate alone, or your timeline is "next FY" with no funding line identified.

§ 01 / Who's at the table · Commercial 

If you're the operator who has to make AI actually show
up in the P&L next quarter.

Not a fit if you're pre-revenue, you want offshore staff aug, or you're looking for someone to write a roadmap you'll never execute.

// Tony Adams · Founder 
SGT (E-5) · Infantry · Afghanistan 

§ 04 / Founder 

"I'm not a former Big 4 consultant who studied AI. I ran the businesses, automated
them, and ship the systems. Twenty-five years of operator scar tissue — from a salon I
built to a PE-backed franchise I ran — is the difference between a deck and a deployment."

Tony Adams, MBA · Founder · Truvisory® · U.S. Army Combat Veteran

★★★★★ 
5.0 
47 verified · Google 

Read the long version →

Mission-grade AI systems. Built and shipped on Cloudflare's edge.


Track A · Federal & DoD 

You're a contracting officer with an AI modernization budget and an SDVOSB box to check.
We're a direct-award answer that doesn't require a recompete.

Built for: KOs, PMs & PEOs at DoD, VA, DHS, and civilian agencies — plus primes hunting an SDVOSB sub for OASIS+, CIO-SP4, SEWP, ITES.

Enter Federal track → 

UEI · KNZKX28MLC42 · NAICS 541512 / 541511 / 541611

Track B · Commercial & PE 

You're a founder, COO, or PE partner at a small business or a $20M–$500M operator. You've
paid for one AI deck too many. You want shipped software in 60 days.

Built for: Founders, COOs, CTOs & VPs Ops at small businesses, mid-market services, logistics,
healthcare, and financial — and PE portfolio operators with a 1–2 quarter mandate.

Enter Commercial track → 

Free 30-min AI Audit Calendar / Discovery

§ 05 / Built. Shipped. Used. 

Proof of capability is product, not PowerPoint.

Q: How does the new pricing model work for accounts <$10K?

Slides 12–14. Same per-seat ladder, $99 floor.

Will this be on the recording?

Yes — sent at 4pm CT.

PRESENGAGE SMS · LIVE 

Case 01 · Multi-platform AI 

PresEngage — patent-pending AI Co-Presenter for live audiences.

Real-time SMS Q&A, trained on the speaker's own deck. Microsoft Teams, Google Meet,
and Zoom integrations.

Patent pending Production 

HOTCOPY.AI RLM · MoE · 1T 

Case 02 · Cloudflare-native 

HotCopy — a Recursive Language Model coding CLI on Cloudflare's edge.

Kimi K2.6 (1T MoE) orchestrating Gemma scout workers, derived from MIT CSAIL
recursive-LM research. Running production tooling, sub-50ms response from 330+ cities.

Stack: Workers AI · Durable Objects · Vectorize 2026 → 

▸ Incoming +1 (303) 495-5859 
● Live · 24/7 

"Hi, I'm Truvisory's voice agent. Tell me about your project — I'll route you,
schedule a call, or answer pricing questions right now."

TRUVISORY VOICE · SMS · 24/7

Case 03 · Live demo · Call now 

AI Telephony & SMS Agents — talk to one right now.

Call +1 (303) 495-5859 to talk to our knowledgeable AI voice agent about
your project. Same engine powers our client telephony and SMS automations.

Tap card to dial · 24/7 Live →

§ 03 / Two tracks 

One mission. Two audiences. Pick the door that fits.

Federal SDVOSB · DSBS 

AI & Cloudflare-native systems, delivered under SDVOSB set-aside.

a]:w-auto"
>
Capability Statement → 
Enter Federal Track 

Commercial Mid-market 

Stop buying AI strategy decks. Start shipping AI systems.

a]:w-auto"
>
Book AI Audit → 
Enter Commercial Track 

Watch · 62s

### /federal/

§ Federal Track / SDVOSB 

AI and Cloudflare-native systems, delivered under SDVOSB set-aside.

Truvisory® LLC is a Service-Disabled Veteran-Owned Small Business
(SDVOSB) and Veteran-Owned Small Business (VOSB) with active SAM.gov
registration, ready for direct award under VA Veterans First and
SBA SDVOSB sole-source pathways up to $5M.

Download One-Page Capability Statement ↓ 

Schedule 30-min Briefing → 

// Dossier 

| Field | Value |
|---|---|
| Legal | Truvisory® LLC |
| UEI | KNZKX28MLC42 |
| CAGE | 9X1L4 |
| State | Colorado · 2018 |
| SAM.gov | Active · Verified |
| SDVOSB | Certified · VetCert |
| VOSB | Certified |
| CMMC | L1 Self-Assessed · L2 roadmap |
| Primary NAICS | 541512 |
| Additional NAICS | 541511 / 541611 / 541613 / 541618 / 541690 / 541990 / 611420 |
| SBA | Small Business · SDVOSB · VOSB |

Experience

| Organization | Period | Role | Scope | Outcome | Reference |
|---|---|---|---|---|---|
| MapMatix · CRM & automation · CTO · Denver | Jul 2025 – May 2026 | CTO | Architect of the technology platform behind a CRM-optimization & business-automation practice: custom integrations, workflows, and full custom CRM builds. | $32K avg recovered per CRM audit. | POC on req. |
| Daddy's Chicken Shack Franchises · PE-backed multi-unit franchise | Mar – Oct 2024 | President | P&L, growth, and tech end-to-end. Site overhaul on Cloudflare-secured static stack; AI/ML inventory + KPI platform; automated marketing infra; trademarks Food Like The Photos® & Franchise Your Future®. | 10× web traffic; 80,000+ signups; Zapier Award winner. | POC on req. |
| Port of Subs · Acquired sub-sandwich franchisor | Apr 2023 – Oct 2024 | CTO | 100% of pre-acquisition tech diligence + complete post-acquisition tech overhaul: loyalty (PAR Punchh), site, kiosks, 2FA, ERP migration to cloud. | 3× web traffic; legacy → cloud ERP. | POC on req. |
| Area 15 Ventures · PE-backed franchising portfolio (500+ locations) | Jun 2022 – Oct 2024 | VP · Chief of Staff | Mortgage-franchise partnership architecture, custom pro-forma engine, M&A navigation across portfolio. Promoted to President of Daddy's Chicken Shack. | Portfolio M&A executed. | POC on req. |
| Motto Mortgage Plus · Multi-state mortgage broker | Oct 2020 – Nov 2022 | COO | Built an internal platform that tripled processor output; created a pre-approval app + lead-gen site. Mortgage operations at scale. | UWM Innovation Award; Fortune / GPTW. | POC on req. |
| RE/MAX (NYSE: RMAX) · 120,000+ agent franchisor, world HQ | Feb 2016 – Apr 2018 | Head of RR&CX | Launched the Recruiting / Retention / Customer Experience division; ran NPS programs across US & Canada presented to the Board; supported the Motto Mortgage launch. | 38% NPS response rate; 50 Motto franchises in yr 1. | POC on req. |
| U.S. Army · Infantry Team Leader & Combat Medic | Aug 2007 – Aug 2011 | Sergeant (E-5) | Led seven-soldier teams in deployment ops; combat tour during Afghan elections against heavy resistance. | Zero casualties during election ops. | DD-214 |

// Performance under FAR 15.305(a)(2)(iv): for federal entrants,
FedBiz Access guidance permits operator-track and PE-portfolio
references. References are real, recent, relevant, and contactable on
request under NDA.


Combat-veteran founder

U.S. Army Infantry Team Leader & Combat Medic. Sergeant (E-5).
Afghanistan. Discipline as an operating system, not as marketing
copy.


20+ yrs P&L

Franchising, PE-backed ops, COVID-era state economic response.
Operator scar tissue, not consulting theory.

Cloudflare-native

Workers AI, Agents SDK, Durable Objects, Vectorize. Rare in the
SDVOSB cohort.


Direct principal delivery

No junior staff aug, no offshore handoff. The principal you brief
is the principal who ships.

Direct

Teaming & IDIQ

The top eight federal agencies account for over 82% of all SDVOSB dollars in FY2025. We concentrate BD on agencies with mission alignment to
AI/automation modernization and an active set-aside posture.

§ 07 / Engage 

Three ways to start a conversation.

Capability Statement (PDF) ↓ 

30-min Briefing → 

Submit RFI / Sources Sought

### /commercial/

§ Commercial Track / Mid-market 

Stop buying AI strategy. 
Start shipping AI systems.

Truvisory® builds and ships AI agents, RAG systems, and Cloudflare-native automations for mid-market
operators — typically in 30 to 90 days, billed as fixed-scope
engagements, not retainers.

Book a 30-min AI Audit → 

See the Cloudflare stack → 

Available · Capacity open


§ 01 / The internal monologue 

What we hear in the first ten minutes of every sales call.

§ 02 / Productized engagements 

Three packages. Fixed scope. Transparent ranges.

Package A 2 weeks 

AI Audit & Automation Roadmap

$5K – $30K · fixed

▸ Process audit across sales, ops, CX

▸ Prioritized AI / automation backlog

▸ ROI sizing per opportunity

▸ Cloudflare or non-Cloudflare reference architecture

▸ Build · buy · partner recommendation

// Outcome: a build plan you can hand to any team, including ours.

Start with audit → 

Most chosen 

Package B 4–6 weeks 

Ship-It Sprint

$20K – $120K · fixed

▸ One production-grade AI agent or automation

▸ End-to-end: UX → model → orchestration → guardrails → observability → rollout

▸ Built on your stack or Cloudflare-native edge

▸ 30-day handover support window

▸ Runbook, deploy scripts, on-call playbook

// Outcome: a working system in production. Not a prototype.

Start a sprint → 

Package C Monthly 

Embedded Fractional CTO / AI Lead

$8K – $18K / mo

▸ Capped hours, transparent rate

▸ Weekly working time + monthly architecture review

▸ Hiring & vendor management

▸ Direct hands-on shipping when speed matters

▸ No FTE overhead, no retainer trap

// For founder-led $10M – $100M companies without an in-house
senior AI engineer.

Discuss fit → 

§ 03 / Service lines 

How the packages map to specific work.

§ 04 / Industries 

Where the operator résumé is the moat.

§ 05 / Comparison 

Why Truvisory® vs. an agency.

Truvisory® 
Typical AI agency 

§ 06 / In the founder's voice 

"If you've already paid for the strategy deck, the next call
shouldn't be another deck. It should be a build. We do builds."

— Tony Adams · Founder, Truvisory®

Book the AI Audit → 

See the work → 

§ 07 / Book a 30-min AI Audit 

A working call, not a discovery call.

You bring one process you wish were automated. We come with a
working hypothesis on the architecture, a stack pick, and a
fixed-scope ballpark. No SDR. No drip campaign. No "next steps" deck.

→ 30 min · Tony directly

→ Calendar booking — pick a time, name + email, no SDR drip

→ Pre-call prep: short Loom on the process you want to ship

→ Post-call: 24-hour written summary, no obligation

### /cloudflare/

§ Cloudflare-Native 

Cloudflare-native AI, by an engineer who lives on the platform.

We design, build, and ship production AI systems on Cloudflare's edge
— Workers, Workers AI, Agents SDK, Durable Objects, Vectorize, AI
Gateway, R2, D1, Queues, Containers — at sub-50ms latency in 330+
cities.

Start a Cloudflare Discovery → 

See HotCopy in action → 

Pattern A · RAG

Retrieval-Augmented Generation over private data

Hybrid semantic+keyword retrieval, AI-Gateway-fronted models,
R2-backed source corpus, audit trail end to end.

// Use case: agency knowledge bases, contract repositories,
customer support corpora

Worker (edge entry) → Vectorize (hybrid search) → AI Gateway (guardrails) → Workers AI (inference)

Backed by: R2 corpus · D1 metadata · Audit log
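The page doesn't say how the semantic and keyword result lists are merged. One common technique is Reciprocal Rank Fusion (RRF), sketched here as a framework-free TypeScript function. Assumptions: each retriever returns document IDs best-first (in a Cloudflare build, plausibly a Vectorize query on the semantic side and a keyword index such as D1 full-text search on the other), and every ID shown is made up.

```typescript
// Each retriever returns document IDs ordered best-first. In a Cloudflare
// build these lists might come from Vectorize (semantic) and a keyword index
// (for example D1 full-text search); here they are plain arrays.
type RankedList = readonly string[];

interface FusedResult {
  id: string;
  score: number;
}

// Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)),
// with ranks starting at 1. k = 60 is the commonly cited default; tune per corpus.
function reciprocalRankFusion(lists: readonly RankedList[], k = 60): FusedResult[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    // Highest fused score first; ties broken by ID so output is deterministic.
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id));
}

// Usage with made-up IDs: "kb-04" ranks high in both lists, so it comes first.
const semanticHits = ["contract-17", "kb-04", "kb-11"];
const keywordHits = ["kb-04", "contract-02", "contract-17"];
console.log(reciprocalRankFusion([semanticHits, keywordHits]).map((r) => r.id).join(", "));
// kb-04, contract-17, contract-02, kb-11
```

RRF only looks at ranks, so cosine similarities and keyword scores never have to be put on a common scale. That is the main reason it's a reasonable default for hybrid retrieval.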

Pattern B · AGENT

Stateful agent with tool use

Each agent is a Durable Object with its own SQL state and
lifecycle. MCP tool surface for clean external integrations. Hibernates when
idle.

// Use case: customer agents, internal assistants, long-running
automations

Agents SDK (lifecycle) ↔ Durable Object (SQL · memory) ↔ MCP (tool surface)

Calls out to: Workers AI · OpenAI / Anthropic / Gemini · External tools
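A framework-free sketch of the per-agent state idea. It assumes nothing about the real Agents SDK or MCP SDK APIs: `AgentInstance`, `getAgent`, and the `lookup_order` tool are hypothetical names. Each agent ID maps to one instance that owns its own history and a named tool surface, the same shape the Durable Object pattern gives you with durable SQL storage in place of an in-memory array.

```typescript
// Hypothetical sketch: no Cloudflare or MCP library calls. It models
// "one stateful instance per agent" plus a named tool surface.
type ToolHandler = (args: Record<string, unknown>) => string;

interface ToolDefinition {
  name: string;
  description: string;
  handler: ToolHandler;
}

class AgentInstance {
  readonly agentId: string;
  // In the Durable Object pattern this history would live in the object's SQL storage.
  private readonly history: { tool: string; result: string }[] = [];
  private readonly tools = new Map<string, ToolDefinition>();

  constructor(agentId: string) {
    this.agentId = agentId;
  }

  registerTool(tool: ToolDefinition): void {
    if (this.tools.has(tool.name)) throw new Error(`duplicate tool: ${tool.name}`);
    this.tools.set(tool.name, tool);
  }

  // What an external client would see when it lists the tool surface.
  listTools(): { name: string; description: string }[] {
    return [...this.tools.values()].map(({ name, description }) => ({ name, description }));
  }

  callTool(name: string, args: Record<string, unknown>): string {
    const tool = this.tools.get(name);
    if (!tool) throw new Error(`unknown tool: ${name}`);
    const result = tool.handler(args);
    this.history.push({ tool: name, result });
    return result;
  }

  get callCount(): number {
    return this.history.length;
  }
}

// One instance per agent ID, mirroring "each agent is its own Durable Object".
const agents = new Map<string, AgentInstance>();

function getAgent(agentId: string): AgentInstance {
  let agent = agents.get(agentId);
  if (!agent) {
    agent = new AgentInstance(agentId);
    agents.set(agentId, agent);
  }
  return agent;
}

// Usage with a hypothetical tool.
const support = getAgent("customer-42");
support.registerTool({
  name: "lookup_order",
  description: "Find an order by ID",
  handler: (args) => `order ${String(args.orderId)}: shipped`,
});
console.log(support.callTool("lookup_order", { orderId: "A-100" })); // order A-100: shipped
```

Keying instances by agent ID is what keeps one customer's memory from leaking into another's; in the Durable Object version, the platform enforces that routing for you.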

Pattern C · RLM

Recursive multi-agent orchestrator

Kimi K2.6 (1T MoE) as the orchestrator; small Gemma scout workers
run decomposed sub-tasks. Derived from MIT CSAIL's Recursive
Language Models work (arXiv:2512.24601). This is the HotCopy
architecture.

// Use case: code transformation, deep document understanding,
complex multi-step research

Kimi K2.6 (orchestrator · 1T MoE)

Spawns: Scout (Gemma) · Scout (Gemma) · Scout (Gemma) · +N

Substrate: Workers AI · Durable Objects · Queues
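The recursion can be illustrated with a deliberately simplified split-and-reduce loop. This is not the HotCopy code and not the full RLM method (as the original research describes it, the root model itself decides how to inspect and partition its context); it only shows the control flow of an orchestrator fanning slices out to scouts and recursing on their combined findings. `Scout`, `splitByLines`, `recursiveSolve`, and the character budget are illustrative assumptions, and the scout is a plain function standing in for a small-model call.

```typescript
// A scout turns one slice of context into a short finding. In the described
// architecture this would be a small-model call; here it is any function.
type Scout = (slice: string) => string;

// Pack whole lines into slices of at most maxChars (a single oversized line
// becomes its own slice), so no record is ever cut in half.
function splitByLines(text: string, maxChars: number): string[] {
  const slices: string[] = [];
  let current = "";
  for (const line of text.split("\n")) {
    const candidate = current === "" ? line : `${current}\n${line}`;
    if (candidate.length > maxChars && current !== "") {
      slices.push(current);
      current = line;
    } else {
      current = candidate;
    }
  }
  if (current !== "") slices.push(current);
  return slices;
}

// Orchestrator: answer directly if the context fits the budget; otherwise fan
// slices out to scouts and recurse on their joined findings. maxDepth guards
// against scouts whose output does not shrink.
function recursiveSolve(context: string, scout: Scout, maxChars: number, depth = 0, maxDepth = 8): string {
  if (context.length <= maxChars) return scout(context);
  if (depth >= maxDepth) throw new Error("recursion budget exhausted");
  const findings = splitByLines(context, maxChars).map((slice) => scout(slice));
  return recursiveSolve(findings.join("\n"), scout, maxChars, depth + 1, maxDepth);
}

// Usage: a toy scout that reports the largest number in its slice.
const maxScout: Scout = (slice) => String(Math.max(...slice.split("\n").map(Number)));
const numbers = Array.from({ length: 200 }, (_, i) => String(i + 1)).join("\n");
console.log(recursiveSolve(numbers, maxScout, 50)); // 200
```

The toy task works because "max of maxes" equals the overall max. Real decompositions need scouts whose partial answers combine the same way, which is most of the design work.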

Pattern D · FED

Federal-friendly edge deployment

Same Cloudflare primitives, hardened: AI Gateway guardrails,
region-pinning for data sovereignty, immutable audit logging,
role-based access, FedRAMP-aware deployment patterns.

// Use case: agency RAG, intra-agency assistants, OMB AI Action Plan rollouts

Zero Trust (auth) → Worker (region-pinned) → AI Gateway (guardrails · log) → Workers AI (inference)

Audit + sovereignty: Immutable log → R2 · Region-pin: US · Role-based access · CMMC posture
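Hash chaining is one way to make an audit log tamper-evident: each entry commits to the hash of the one before it, so editing or removing an earlier record breaks verification. The sketch below is an assumption about how such a log could be structured, not the deployed implementation. It uses `node:crypto` (available in Workers with the `nodejs_compat` compatibility flag), and `appendEntry`, `verifyChain`, and the entry fields are hypothetical. Writing entries to R2 and enforcing storage-level immutability are out of scope here.

```typescript
import { createHash } from "node:crypto";

interface AuditEntry {
  seq: number;
  timestamp: string;
  actor: string;
  action: string;
  prevHash: string; // hash of the previous entry, or GENESIS_HASH for the first
  hash: string; // SHA-256 over this entry's other fields
}

const GENESIS_HASH = "0".repeat(64);

// Hash the fields in a fixed order so verification is reproducible.
function hashEntry(entry: Omit<AuditEntry, "hash">): string {
  const canonical = JSON.stringify([entry.seq, entry.timestamp, entry.actor, entry.action, entry.prevHash]);
  return createHash("sha256").update(canonical).digest("hex");
}

// Append-only: callers never edit existing entries, they only add new ones.
function appendEntry(log: AuditEntry[], actor: string, action: string, timestamp: string): AuditEntry {
  const prevHash = log.length === 0 ? GENESIS_HASH : log[log.length - 1].hash;
  const unsigned = { seq: log.length, timestamp, actor, action, prevHash };
  const entry: AuditEntry = { ...unsigned, hash: hashEntry(unsigned) };
  log.push(entry);
  return entry;
}

// Recompute every hash and link. Editing, reordering, or deleting an entry
// fails verification; detecting a truncated tail additionally requires the
// latest hash to be stored somewhere outside the log.
function verifyChain(log: readonly AuditEntry[]): boolean {
  let expectedPrev = GENESIS_HASH;
  for (const [index, entry] of log.entries()) {
    const { hash, ...unsigned } = entry;
    if (entry.seq !== index || entry.prevHash !== expectedPrev || hashEntry(unsigned) !== hash) {
      return false;
    }
    expectedPrev = hash;
  }
  return true;
}

// Usage with hypothetical actors and actions.
const auditLog: AuditEntry[] = [];
appendEntry(auditLog, "analyst@agency.example", "query:contracts", "2026-01-05T14:00:00Z");
appendEntry(auditLog, "analyst@agency.example", "export:summary", "2026-01-05T14:02:10Z");
console.log(verifyChain(auditLog)); // true
```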

§ 04 / Live proof 

Run a Worker. From the closest edge.

This page hits a Truvisory®-deployed Cloudflare Worker on first
paint. The latency you see is your latency to the closest of 330+
Cloudflare data centers — typically the same region you live in.

→ Edge ping with location, RTT, and colo

→ Live JSON from a Workers AI inference call

→ Source on GitHub — verbatim, deployable in 90 seconds

View source → 

Start a Discovery → 

truvisory-edge.workers.dev 
● live 

// 1. Ping the closest Cloudflare edge

› curl https://edge.truvisory.com/whoami

// 2. Inference on Workers AI · Llama 3.3

› POST /infer

→ "CMMC L2 = 110 NIST 800-171 controls, third-party assessed for CUI."

// model: @cf/meta/llama-3.3-70b · 41ms p50 · cached: false
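The page doesn't publish the `/whoami` source, so the following is a hypothetical TypeScript reconstruction of what such a Worker can return. `colo`, `country`, `city`, and `clientTcpRtt` are fields Cloudflare attaches to incoming requests as `request.cf` (RTT only where Cloudflare provides it); the response shape, the `whoamiPayload` helper, and the cache header are assumptions. The payload builder is kept separate from the fetch handler so it can be exercised outside the Workers runtime.

```typescript
// Subset of the request.cf fields Cloudflare attaches to incoming requests.
// In a real project these types come from @cloudflare/workers-types.
interface EdgeCf {
  colo?: string;
  country?: string;
  city?: string;
  clientTcpRtt?: number;
}

interface WhoamiPayload {
  colo: string;
  country: string;
  city: string;
  clientTcpRttMs: number | null;
  servedAt: string;
}

// Pure helper so the payload can be tested without the Workers runtime.
function whoamiPayload(cf: EdgeCf | undefined, now: Date): WhoamiPayload {
  return {
    colo: cf?.colo ?? "unknown",
    country: cf?.country ?? "unknown",
    city: cf?.city ?? "unknown",
    clientTcpRttMs: cf?.clientTcpRtt ?? null,
    servedAt: now.toISOString(),
  };
}

const worker = {
  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);
    if (url.pathname !== "/whoami") {
      return new Response("Not found", { status: 404 });
    }
    // request.cf is Cloudflare-specific, so it is read through a widened type.
    const cf = (request as Request & { cf?: EdgeCf }).cf;
    return new Response(JSON.stringify(whoamiPayload(cf, new Date())), {
      headers: { "content-type": "application/json", "cache-control": "no-store" },
    });
  },
};

export default worker;
```

`cache-control: no-store` keeps one visitor's location and RTT from being cached and served to the next.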

Today

Cloudflare-Native Engineer

What we are right now.

Daily-driver builder on Workers, Workers AI, Durable Objects,
Vectorize, AI Gateway, R2, D1, Agents SDK, MCP. Production
deployments on file (HotCopy, PresEngage). The
engineer-on-the-platform claim is true and defensible.

Pursuing

▣ Cloudflare ASDP · Application Services

What we're earning, not claiming.

Cloudflare's ASDP designation requires rigorous technical
validation of security, performance, and reliability. We're in the
process — and we won't surface a partner badge on the site that
hasn't been earned. When it lands, you'll see it.

### /proof/

§ Proof / Receipts 

The work, the numbers, the
operator résumé behind it.

Three categories: shipped products under the Truvisory® brand,
commercial engagements, and the 25-year operator track that gives the
whole thing weight.


§ Operator timeline 

25 years of P&L, ops, and shipping. The AI practice sits on top
of it.

// Education: U. of Denver, Executive MBA · Daniels (2015 – 17).
Galvanize Web Dev & SWE Immersive (2018 – 19). U. of Denver, BS
Biochem & Biology (2011 – 14).

§ Brands worked alongside 

Operator résumé, in nameplates.

// Logos shown represent operator roles or engagements. They are not
endorsements of Truvisory®'s current AI practice.

### /trust/

§ Trust / Compliance posture 

Earned, not claimed.

A live status board of every certification, framework alignment, and
verifiable code we hold — current, in-progress, or planned. If a badge
isn't here, we don't display it.

§ 01 / Status board 

Where we are this quarter.

§ 02 / Verify the codes 

Don't take our word for it.

Every code below links to its official source of record. Federal
contracting officers and prime evaluators can verify the
registrations end-to-end without leaving SAM.gov.

Download Capability Statement ↓ 

Contracting officer line → 

// Truvisory® LLC · public registrations

§ 03 / Data & security ledger 

How customer data is handled, in plain English.

§ 04 / Subprocessors 

The vendors in the loop.

Responsible disclosure

If you've discovered a security issue that affects Truvisory® or a
customer environment we operate, write to security@truvisory.com. PGP key on request. We acknowledge within one business day and
target a 90-day fix-or-public-disclosure window.

Contracting questions

For COs, primes, evaluators, or anyone needing a SIG-Lite, COI, NDA
template, or specific certification artifact: contracting@truvisory.com. Single human inbox. 24-hour reply window.

### /about/

§ About / Founder 

Combat veteran. Two-decade operator. Cloudflare-native engineer.

Truvisory® is one person and the partners that person chooses, on
purpose. The whole point is that the principal you talk to is the
principal who builds.

Talk to Tony directly → 

See the operator timeline → 


// Tony Adams · Founder 
Sgt (E-5) · Infantry · Afghanistan 

§ 01 / Manifesto 

Most AI consulting sells you a roadmap. We'd rather sell you the system.

The mid-market and federal-mission worlds are sitting on a stack of
AI strategy decks that nobody knows how to operationalize. Three
vendors pitched. Five frameworks were considered. A 90-page roadmap
was delivered. And no working system shipped.

That gap — between the deck and the system — is the entire reason
Truvisory® exists. We are explicitly not a strategy firm. We are a
build shop with a strong opinion that the strategy is mostly
already obvious to the operator, and what's actually missing is
somebody who can ship.

The operating thesis: AI is now infrastructure. It's
edge-deployable, pay-per-inference, and frequently boring once
architected correctly. That favors operators who can size a problem
in P&L terms, engineers who live on the platform, and teams small enough to
hold the whole system in their heads.

That's the company. That's why every page on this site links back
to either a working system, a number you can verify, or a person
you can email directly. No SDR drip. No "discovery deliverables."
No retainer trap.

§ 02 / Operating principles 

Six rules. Hard-coded.

§ 03 / Founder

Tony Adams, MBA 

Combat veteran (U.S. Army Infantry Team Leader & Combat Medic,
Sergeant E-5, Afghanistan). 25-year multi-exit operator across
PE-backed franchising, mortgage, and salon ops. Executive MBA from
Daniels — University of Denver. Galvanize Web Dev & SWE Immersive. BS Biochem & Bio, also Denver. Lives in the Denver metro.

The operator résumé reads MapMatix (CTO, today), Daddy's Chicken Shack (President), Port of Subs (CTO),
Area 15 Ventures (VP / Chief of Staff), Motto Mortgage Plus (COO), RE/MAX HQ (Head of RR&CX), and Do the Bang Thing Salon
(co-founded, 9-year run). Then
PresEngage and Truvisory® on top. The intersection — operator who
builds — is the entire point.

§ 04 / Why now

The market has changed. Most firms haven't.

The traditional consulting playbook was: scope a 12-week strategy
engagement, hand the client a deck, and earn the implementation
contract on rebound. That model assumed AI was a strategy problem.

It isn't anymore. As of 2026, the platform questions are largely
settled (edge inference is cheaper and faster), the architecture
patterns are public (RLM, Agents SDK, MCP), and the model markets are competitive enough that picking the
wrong LLM is recoverable in a week.

What's left scarce is people who can hold the operator's P&L
and the production engineering in the same head, deliver in 30–90
days, and not need to staff up a 12-person team to do it. That's
the niche. Everything else on this site flows from that.

§ 05 / What we're reading 

The papers and posts that shape how we build.

§ 06 / Two doors 

If something on this page resonates, we should talk.

There are exactly two ways to start a conversation. Federal mission
owners and contracting officers go through one door; mid-market
commercial operators go through the other. Same person picks up.

Federal · Contracting → 

Commercial · Book AI Audit →

### /contact/

§ Contact / Two doors 

Two doors. One principal.

Federal contracting officers, primes, and mission owners go through the
left door. Mid-market commercial operators go through the right. Same
person reads both inboxes. 24-hour reply window.

▶ Door A · Federal & contracting 

Contracting officers, primes, evaluators.

For RFI/RFQ/RFP responses, capability briefings, sources-sought
replies, teaming inquiries, NDA & SIG-Lite requests, or any
artifact your acquisition team needs.

Name 

Title 

Agency / Prime 

.gov / .mil email 

Vehicle / inquiry type 

Sources sought 
RFI / RFQ / RFP 
Sole-source SDVOSB inquiry 
Teaming under prime IDIQ 
Capability briefing request 
NDA / SIG-Lite / COI request 
Other 

Solicitation # (optional) 

Mission context 

Send to contracting inbox → 

Direct line 

📧 contracting@truvisory.com 

📄 Capability Statement (PDF) 

// UEI KNZKX28MLC42 · CAGE 9X1L4 · SDVOSB verified

▶ Door B · Commercial & mid-market 

Founders, operators, COOs, fractional buyers.

For the 30-minute AI Audit, the Ship-It Sprint, or a working call
about whether the embedded fractional CTO model fits your stage. No
SDR. No drip campaign.

Name 

Email 

Company 

Role 

What are you exploring? 

AI Audit · 2-week, $5–30K 
Ship-It Sprint · 4–6 week, $20–120K 
Embedded Fractional CTO · monthly 
Cloudflare migration / architecture 
Just exploring — book a call 

Company stage 

$1M – $10M revenue 
$10M – $50M revenue 
$50M – $250M revenue 
$250M+ revenue 
Pre-revenue / venture 

The one process you wish were automated 

Send to commercial inbox → 

Or skip the form 

book a 30-min AI Audit directly 

📧 consulting@truvisory.com 

// 24-hour written reply guarantee. No drip campaign, ever.

§ Or skip the form entirely 

Pick a time. Done.

// CALENDAR 

Book a 30-minute call

Working call, not discovery call. Bring one process you wish were
automated; we bring a working hypothesis, a stack pick, and a
fixed-scope ballpark. Federal & commercial both routed to Tony
directly.

// 24h reply · no SDR drip · single-step booking 

Open booking calendar → 

§ FAQ 

The questions that come up before the call.

### /capability-statement/

// Capability Statement · v2026.05

Download PDF ↓ 

Capability Statement · 2026

SDVOSB · Verified 
UEI KNZKX28MLC42 
CAGE 9X1L4 
SAM.gov Active 
Denver, CO

A combat-veteran-led, Cloudflare-native AI & automation firm
built to ship — not to recommend — production AI systems for federal
mission owners.

★★★★★ 
5.0 / 5 
47 verified five-star reviews on
Google 
// Independent · public · verifiable 

Core Competencies 

Differentiators 

Service Lines 

Company Data 

NAICS Codes

Vehicles · Posture 

// truvisory.com · contracting@truvisory.com 
Page 1 / 2 

Capability Statement · 2026 · Past Performance

Contracting POC 
Tony Adams, Founder 
tony@truvisory.com 
+1 (303) 495-5859 
Denver, CO

Past performance & operator-track receipts


### /ai/

§ For agents and machines 

Truvisory is AI-ready.

Every page links to a complete map of what this site can do for an
agent — discovery files, API specs, runtime tools, and content-signal
preferences — published against open standards (RFC 8288, RFC 9727,
llmstxt.org, IAB Content Signals, MCP). Crawlers and agents can fetch
each artifact directly from the URLs below.
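RFC 8288 is what lets a response point agents at those artifacts through an HTTP `Link` header, not only through HTML. A minimal formatter, assuming two targets: `/.well-known/api-catalog`, the well-known location RFC 9727 defines (whose catalog it expects in the Linkset JSON format, `application/linkset+json`), and `/llms.txt` from the llmstxt.org convention. The `alternate` relation and `text/plain` type used for `/llms.txt` are illustrative choices rather than a registered standard, and `formatLinkHeader` is a hypothetical helper.

```typescript
interface DiscoveryLink {
  href: string;
  rel: string;
  type?: string;
}

// Serialize links as an RFC 8288 Link header value: <uri>; rel="..."; type="...", ...
// Values are assumed not to contain double quotes, so no escaping is done.
function formatLinkHeader(links: readonly DiscoveryLink[]): string {
  return links
    .map(({ href, rel, type }) => {
      const params = [`rel="${rel}"`];
      if (type) params.push(`type="${type}"`);
      return `<${href}>; ${params.join("; ")}`;
    })
    .join(", ");
}

const linkHeader = formatLinkHeader([
  { href: "/.well-known/api-catalog", rel: "api-catalog", type: "application/linkset+json" },
  { href: "/llms.txt", rel: "alternate", type: "text/plain" },
]);
console.log(linkHeader);
// </.well-known/api-catalog>; rel="api-catalog"; type="application/linkset+json", </llms.txt>; rel="alternate"; type="text/plain"
```

In a Worker, that string would be set as the `Link` response header on HTML pages, so an agent that only issues a `HEAD` request still finds the catalog.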

§ 01 / Discovery artifacts 

Five files. Everything an agent needs.

§ 02 / API documentation 

Four endpoints, OpenAPI 3.1.

§ 03 / In-browser tools 

WebMCP runtime tools.

If the browser supports the WebMCP draft via
navigator.modelContext 
(Chrome 142+ Canary as of 2026), Truvisory registers four tools at
page load. Agents can call them in-place without negotiating a
transport.

Probe from DevTools:
`navigator.modelContext.getTools().then(t => console.log(t.map(x => x.name)))`

// navigator.modelContext.provideContext( )

§ 04 / Standards complied with 

Open formats, every one of them.

Cloudflare's Markdown for Agents is also
enabled on this zone — sending Accept: text/markdown 
to any URL returns the page as Markdown, transformed at the edge.
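Cloudflare performs that conversion at the edge, so the site needs no code for it. For anyone wiring up something similar, here is a TypeScript sketch of the content-negotiation decision: parse the `Accept` header's media ranges and `q` weights, then serve Markdown only when the client rates `text/markdown` at least as highly as `text/html`. This is a simplified assumption (wildcards such as `text/*` and `*/*` are ignored), not Cloudflare's implementation, and `parseAccept` / `prefersMarkdown` are hypothetical names.

```typescript
interface MediaRange {
  type: string; // e.g. "text/markdown"
  q: number; // quality weight from 0 to 1, defaulting to 1
}

// Parse "text/markdown, text/html;q=0.8" into media ranges with q weights.
function parseAccept(header: string): MediaRange[] {
  return header
    .split(",")
    .map((part) => {
      const [rawType = "", ...params] = part.trim().split(";");
      let q = 1;
      for (const param of params) {
        const [key = "", value = ""] = param.trim().split("=");
        const parsed = Number(value);
        if (key.trim().toLowerCase() === "q" && value.trim() !== "" && !Number.isNaN(parsed)) q = parsed;
      }
      return { type: rawType.trim().toLowerCase(), q };
    })
    .filter((range) => range.type !== "");
}

// Serve Markdown only when text/markdown has a weight above zero that is at
// least as high as text/html's. Ties go to Markdown; wildcards are ignored.
function prefersMarkdown(acceptHeader: string | null): boolean {
  if (!acceptHeader) return false;
  const ranges = parseAccept(acceptHeader);
  const weightOf = (mediaType: string): number => ranges.find((r) => r.type === mediaType)?.q ?? 0;
  const markdown = weightOf("text/markdown");
  return markdown > 0 && markdown >= weightOf("text/html");
}

// A client opts in with, e.g.: fetch(url, { headers: { Accept: "text/markdown" } })
console.log(prefersMarkdown("text/markdown")); // true
console.log(prefersMarkdown("text/html, text/markdown;q=0.5")); // false
```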

### /privacy-policy/

Truvisory ("Company," "us," "we," or "Truvisory, LLC") recognizes the
importance of your privacy. This Privacy Policy ("Policy") discloses the
privacy practices for this website ("Site") as well as related products and
services we may offer to you (collectively referred to as the "Services").
This Policy also covers how personal and other information that we receive or
collect about you is treated. Please read the information below to learn the
following regarding your use of this Site. This Policy is designed to be read
in connection with the Site Terms of Use ("Terms"). By accessing or using this Site, you agree to be bound by the Terms
and this Policy.

We reserve the right to change this Policy from time to time. We will notify
you about significant changes in the way we treat personal information by
sending a notice to the primary email address specified in your account, by
placing a prominent notice on the Site, or by updating any privacy
information in this Policy. Your continued use of the Site and/or Services
after such modifications will constitute your acknowledgment of the modified
Policy and your agreement to abide and be bound by that Policy.

If you have any questions about this Policy, please contact us. 

BY USING THE SITE AND/OR SERVICES, YOU GIVE YOUR CONSENT THAT ALL PERSONAL
DATA THAT YOU SUBMIT MAY BE PROCESSED BY US IN THE MANNER AND FOR THE
PURPOSES DESCRIBED IN THIS POLICY. IF YOU DO NOT AGREE TO THE TERMS OF THIS
POLICY, DO NOT USE THE SITE.

Types of Information We Collect 

In order to better provide you with the Site and Services, we collect two
types of information about our users: Personally Identifiable Information
("PII") and Aggregate Information. PII refers to information that is specific
to you individually. When you engage in certain activities on the Site, such
as registering for an account, purchasing a Service, submitting content,
and/or sending us feedback, we may ask you to provide certain information
about yourself. Examples of PII include your first and last name, email
address, telephone number, and other identifying information. Aggregate
Information refers to information that does not by itself identify a specific
individual. We gather certain information about you based upon where you
visit on the Site and what other sites may have directed you to us. This
information, which is collected in a variety of different ways, is compiled
and analyzed on both a personal and an aggregated basis. This information may
include the Uniform Resource Locator ("URL") of the website you just came
from, which URL you go to after visiting the Site, what browser you are
using, and your Internet Protocol ("IP") address.

How We Collect and Use Information 

We do not collect any PII about you unless you voluntarily provide it to
us. However, you may be required to provide certain PII to us when you elect to
use certain Services available on the Site. These may include: (a) registering for an account
on the Site; (b) signing up for Services; (c) sending us an email message; (d)
submitting a form or transmitting other information; or (e) submitting your credit
card or other payment information. We will primarily use your PII to provide the
Site and offerings to you. We will also use certain forms of PII to enhance the
operation of the Site, improve our internal marketing and promotional efforts,
statistically analyze Site use, improve our offerings, and customize the Site's
content and layout. We may use PII to deliver information to you and to contact
you. Finally, we may use your PII to resolve disputes, troubleshoot problems,
and enforce our agreements with you, including our Terms and this Policy. We
and our third-party partners may also collect certain Aggregate Information.
For example, we may use your IP address to diagnose problems with our
servers and software, to administer the Site, and to gather demographic
information.

Cookies 

Depending on how you use the Site and Services, we may store cookies on your
computer or device in order to collect certain aggregate data and to
customize certain aspects of your specific user experience. A cookie is a
small data text file which is stored on your computer or device that uniquely
identifies you. Cookies may also include more personalized information, such
as your IP address, browser type, the server you are logged onto, the area
code and zip code associated with your server, and your name. We may use
cookies to perform tasks such as monitoring aggregate usage metrics, storing
and remembering your passwords (if you allow us to do so), storing your
preferences, and personalizing the Site and/or Services for you. We may also
use an outside advertising partner who may place a separate cookie on your
computer or device. We will not provide any third-party advertising partners with any of your
PII. 

Google Analytics 4 + Cloudflare Zaraz. 
We use Google Analytics 4 to understand how visitors discover and use this
site. Events are delivered server-side through Cloudflare Zaraz, which means
a Cloudflare edge worker forwards the data to Google on our behalf — your
browser does not load Google's analytics script directly. Your IP address is
forwarded to Google with the IP-anonymization flag set, so Google determines
approximate country, region, and city for geographic reporting and then
masks the address before storage; we never see or store your full IP. We do
not transmit names, email addresses, message bodies, or any other personally
identifying information through analytics. The data we collect is limited
to page views, click and form-submission counts, anonymous chat-engagement
milestones, video-play counts, and a randomly-generated client identifier
used to stitch together a session. To opt out you may use a browser-level
tracking-prevention setting or an extension such as the Google Analytics
opt-out add-on. This site does not present a cookie or consent banner.

Release of Information 

We will not sell, trade, or rent your PII to others. We do provide
some of our product and service offerings through contractual arrangements made
with affiliates, service providers, partners, and other third parties (collectively,
"Service Partners"). We and our Service Partners may need to use some PII in order
to perform tasks between our respective sites or to deliver services to you. The
use of your PII by our Service Partners is governed by the respective privacy policies
of those Service Partners and is not subject to our control. Except as otherwise
discussed in this Policy, this document only addresses the use and disclosure
of information we collect from you. Other websites accessible through this
Site, including our Service Partners, have their own privacy policies and data
collection, use, and disclosure practices. Please consult each site's privacy
policy. We are not responsible for the policies or practices of third parties.
If we are required by law enforcement or judicial authorities to provide PII
to governmental authorities, we may disclose PII upon receipt of a court
order, subpoena, or to cooperate with a law enforcement investigation. We
reserve the right to report to law enforcement agencies any activities that
we in good faith believe to be unlawful. We may also provide Aggregate
Information to third parties, but this Aggregate Information does not include
any PII.

Updating and Correcting Information 

You may change any of your PII in your account online at any time by
accessing your account in accordance with instructions posted on the Site.
You may also access and correct your personal information and privacy
preferences by emailing us at tony@truvisory.com or in writing to 2696 W. Grand Ave., Littleton, CO 80123. Please include your
name, address, and/or email address when you contact us. We encourage you to promptly update your PII if it changes. You
may ask to have the information on your account deleted or removed; however,
some information, such as past transactions, logs, or other information may
not be deleted. In addition, it may be impossible to completely delete your
information without some residual information because of backups.

Security of Your PII 

We take appropriate security measures to protect your PII and to prevent
unauthorized access to or unauthorized alteration, disclosure, or destruction
of your PII. We only use your PII for the purposes for which it was collected
or to comply with any applicable legal or ethical reporting or retention
requirements. We limit access to PII only to specific employees, contractors,
and agents who have a reasonable need to access your information. Credit card
transactions and order fulfillment processed through the Site are handled by
established third-party banking institutions and processing agents.
Unfortunately, no data transmission over the Internet or any wireless network can be
guaranteed to be 100% secure. As a result, while we strive to protect your PII, you acknowledge that: (a)
there are security and privacy limitations inherent to the Internet which are
beyond our control; and (b) the security, integrity, and privacy of any and all
information and data exchanged between you and us cannot be guaranteed.

Minors 

You must be at least 18 years old to have our permission to use this
Site. We do not knowingly collect, use, or disclose PII about minors. If you are between
the ages of 13 and 17, you may use the Site with the express permission of your
parent or legal guardian. Your parent or legal guardian controls your PII and
we will respond to communications from your parent or legal guardian regarding
your use of the Site.

Miscellaneous 

Please consult our Site Terms for other policies
regarding your use of the Site. If you have any questions, concerns, or
inquiries about this Policy, or our use of your PII, or our privacy practices,
please contact us by email at tony@truvisory.com or by conventional mail at
2696 W. Grand Ave., Littleton, CO 80123.

### /terms-of-use/

Acceptance of Terms 

Truvisory, LLC ("Truvisory," "Company," "us," or "we") provides the
https://truvisory.com website ("Site") subject to your compliance with the
following Terms of Use ("Terms"), as well as any other written agreement(s)
between us and you. We reserve the right to change these Terms from time to
time with or without notice to you. You acknowledge and agree that it is your
responsibility to periodically review this Site and these Terms. Your
continued use of this Site after such modifications will constitute
acknowledgement and acceptance of the modified Terms. As used in these Terms,
references to our "Affiliates" include our owners, licensees, assigns,
subsidiaries, affiliated companies, officers, directors, suppliers, partners,
sponsors, advertisers, and all parties involved in creating,
producing, and/or delivering this Site and/or contents available on this
Site.

BY USING THIS SITE, YOU AGREE TO BE BOUND BY THESE TERMS. IF YOU DO NOT WISH
TO BE BOUND BY THESE TERMS, PLEASE DO NOT USE THIS SITE. YOUR SOLE REMEDY FOR
DISSATISFACTION WITH THIS SITE OR PRODUCTS OR OFFERINGS AVAILABLE ON THIS SITE
OR THESE TERMS IS TO CEASE USING THIS SITE.

Temporary Interruptions 

You understand and agree that temporary interruptions of this Site may occur
as normal events that are out of our control. You also understand and agree
that we have no control over the third-party networks or services that we may
use to provide this Site. You agree that this Site is provided "as is" and
that we assume no responsibility for the timeliness, deletion, mis-delivery,
or failure to store any user communications or other information.

Payment 

This Site does not process credit cards or take other payment processing
information. Payment processing is handled through third-party services. We
are therefore not liable or responsible for your payment interactions on this
Site. Charges may be billed in advance of service.

Overdue Amounts 

If, for any reason, your credit card company declines or otherwise refuses to
pay the amount owed for your purchase, you agree that we may, at our option,
suspend or terminate your purchase and may require you to pay any overdue
amounts incurred (including any third-party chargeback fees or penalties) by
other means acceptable to us. In the event legal action is necessary to
collect on balances due, you agree to reimburse us for all expenses incurred
to recover sums due, including attorneys' fees and other legal expenses.

Third-Party Sites and Information 

This Site may redirect or link to other websites on the Internet, or may
otherwise include references to information, products, or services made
available by unaffiliated third parties. While we make every effort to work
with trusted, reputable providers, from time to time such sites may contain
information, material, or policies that may be incorrect or found to be
objectionable. You understand that we are not responsible for the accuracy,
completeness, appropriateness, or legality of content hosted by third-party
websites, nor are we responsible for errors or omissions in any references
made on those websites. The inclusion of such a link or reference is provided
merely as a convenience and does not imply endorsement of, or association
with, the site or party by us, or any warranty of any kind, either express or
implied.

Content 

For purposes of these Terms, "content" is defined as any information,
communications, software, published works, photos, video, graphics, music,
sounds, or other material that can be viewed by users on our Site and is
owned by Company or its Affiliates. By accepting these Terms, you agree that
all content presented to you on this Site is protected by any and all
intellectual property and/or other proprietary rights available within the
United States and is the sole property of Company or its Affiliates.

All custom graphics, icons, logos, and service names are registered
trademarks, trademarks, or service marks of Company or its Affiliates. All
other trademarks or service marks are property of their respective owners.
Nothing in these Terms grants you any right to use any trademark, service
mark, logo, and/or the name of Company or its Affiliates.

Limitations on Use of Content 

Except for a single copy made for personal use, you may not copy, reproduce,
modify, republish, upload, post, transmit, or distribute any content from
this Site in any form or by any means whatsoever without prior written
permission from us. Any unauthorized use of Site content violates our
intellectual property interests and could result in criminal or civil
penalties. Neither we nor our Affiliates warrant or represent that your use
of materials displayed on, or obtained through, this Site will not infringe
the rights of third parties.

Privacy & Security 

In order to access some of this Site, you may be asked to set up an account
and password. Our account registration page requests certain personal
information from you ("Registration Info"). You will have the ability to
maintain and periodically update your Registration Info as you see fit. By
registering, you agree that all information provided by you as Registration
Info is true and accurate and that you will maintain and update this
information as required in order to keep it current, complete, and accurate.
If you register for an account on this Site, you agree that you are
responsible for maintaining the security and confidentiality of your password
and that you are fully responsible for all activities or charges that are
incurred under your account. Therefore, you must take reasonable steps to
ensure that others do not gain access to your password or account.

Disclosure to Third Party Affiliates 

You hereby grant us the right to disclose to third parties certain
Registration Info about you. The information we obtain through your use of
this Site, including your Registration Info, is subject to our Privacy Policy, which is specifically incorporated by reference into these Terms.

Disclaimer 

ALL CONTENT ON THIS SITE IS PROVIDED ON AN "AS IS" AND "AS AVAILABLE" BASIS
WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE OR THE WARRANTY OF NON-INFRINGEMENT. WITHOUT LIMITING THE
FOREGOING, WE MAKE NO WARRANTY THAT (A) THE CONTENT WILL MEET YOUR
REQUIREMENTS, (B) THE CONTENT OR SITE WILL BE UNINTERRUPTED, TIMELY, SECURE,
OR ERROR-FREE, (C) THE RESULTS THAT MAY BE OBTAINED FROM THE USE OF THIS SITE
WILL BE EFFECTIVE, ACCURATE, OR RELIABLE, OR (D) THE QUALITY OF ANY CONTENT ON
THIS SITE WILL MEET YOUR EXPECTATIONS OR BE FREE FROM MISTAKES, ERRORS, OR
DEFECTS.

THIS SITE COULD INCLUDE TECHNICAL OR OTHER MISTAKES, INACCURACIES, OR
TYPOGRAPHICAL ERRORS. WE MAY MAKE CHANGES TO THE CONTENT, INCLUDING THE PRICES
AND DESCRIPTIONS OF ANY PRODUCTS LISTED HEREIN, AT ANY TIME WITHOUT NOTICE.
THE CONTENT AVAILABLE AT THIS SITE MAY BE OUT OF DATE AND WE MAKE NO
COMMITMENT TO UPDATE SUCH CONTENT. THE USE OF THIS SITE IS DONE AT YOUR OWN
DISCRETION AND RISK AND WITH YOUR AGREEMENT THAT YOU WILL BE SOLELY
RESPONSIBLE FOR ANY DAMAGE TO YOUR COMPUTER OR DEVICE OR LOSS OF DATA THAT
RESULTS FROM SUCH ACTIVITIES. WE MAKE NO WARRANTY REGARDING ANY TRANSACTIONS
EXECUTED THROUGH A THIRD PARTY OR IN CONNECTION WITH THIS SITE AND YOU
UNDERSTAND AND AGREE THAT SUCH TRANSACTIONS ARE CONDUCTED ENTIRELY AT YOUR OWN
RISK. ANY WARRANTY THAT IS PROVIDED IN CONNECTION WITH ANY CONTENT AVAILABLE
ON OR THROUGH THIS SITE FROM A THIRD PARTY IS PROVIDED SOLELY BY SUCH THIRD
PARTY, AND NOT BY US OR ANY OTHER OF OUR AFFILIATES. WE RESERVE THE SOLE RIGHT
TO MODIFY OR DISCONTINUE THIS SITE, INCLUDING ANY OFFERINGS OR FEATURES
THEREIN, AT ANY TIME WITH OR WITHOUT NOTICE TO YOU. WE SHALL NOT BE LIABLE TO
YOU OR ANY THIRD PARTY SHOULD WE EXERCISE SUCH RIGHT. ANY NEW FEATURES THAT
AUGMENT OR ENHANCE THE THEN-CURRENT SITE SHALL ALSO BE SUBJECT TO THESE TERMS.
SOME STATES OR JURISDICTIONS DO NOT ALLOW THE EXCLUSION OF CERTAIN WARRANTIES,
SO SOME OF THE ABOVE LIMITATIONS MAY NOT APPLY TO YOU. PLEASE CONSULT THE LAWS
IN YOUR JURISDICTION.

Limitation of Liability & Indemnification 

IN NO EVENT SHALL WE OR OUR AFFILIATES BE LIABLE TO YOU OR ANY THIRD PARTY FOR
ANY SPECIAL, PUNITIVE, INCIDENTAL, INDIRECT, OR CONSEQUENTIAL DAMAGES OF ANY
KIND, OR ANY DAMAGES WHATSOEVER, INCLUDING, WITHOUT LIMITATION, THOSE
RESULTING FROM LOSS OF USE, DATA, OR PROFIT, WHETHER OR NOT WE HAVE BEEN
ADVISED OF THE POSSIBILITY OF SUCH DAMAGES, AND ON ANY THEORY OF LIABILITY,
ARISING OUT OF OR IN CONNECTION WITH THE USE OF THIS SITE OR OF ANY WEBSITE
REFERENCED OR LINKED TO FROM THIS SITE. FURTHER, WE SHALL NOT BE LIABLE IN ANY
WAY FOR THIRD PARTY PROMISES REGARDING THIS SITE OR FOR ASSISTANCE IN
CONDUCTING COMMERCIAL TRANSACTIONS WITH THE THIRD PARTY THROUGH THIS SITE
INCLUDING, WITHOUT LIMITATION, THE PROCESSING OF ORDERS. SOME JURISDICTIONS
PROHIBIT THE EXCLUSION OR LIMITATION OF LIABILITY FOR CONSEQUENTIAL OR
INCIDENTAL DAMAGES, SO THE ABOVE LIMITATIONS MAY NOT APPLY TO YOU. PLEASE
CONSULT THE LAWS IN YOUR JURISDICTION.

Indemnification 

You agree to defend, indemnify, and hold us and our Affiliates harmless from
all liabilities, claims, and expenses, including attorneys' fees, that may
arise from your use or misuse of this Site. We reserve the right, at our own
expense, to assume the exclusive defense and control of any matter otherwise
subject to indemnification by you, in which event you will cooperate with us
in asserting any available defenses.

Termination of Use 

You agree that we may, at our sole discretion, terminate or suspend your
access to all or part of this Site with or without notice and for any reason
including, without limitation, breach of these Terms. Upon termination and
regardless of the reason(s) motivating such termination, your right to use
this Site will immediately cease. We shall not be liable to you or any third
party for any claims for damages arising out of any termination or suspension
or any other actions taken by us in connection therewith.

International Use 

Although this Site may be accessible worldwide, we make no representation that
this Site is or will be appropriate or available for use in locations outside
the United States. Those who choose to access this Site from other locations
do so at their own risk. If you choose to access this Site from outside the
United States, you are responsible for compliance with local laws in your
jurisdiction including, but not limited to, the taxation of products
purchased over the Internet.

Governing Law 

This Site (excluding any Third Party websites) is controlled by us from our
offices in Littleton, Colorado and the statutes and laws of the State of
Colorado shall be controlling, without regard to the conflicts of laws
principles thereof. You agree and hereby submit to the exclusive personal
jurisdiction and venue of the courts located in Littleton, Colorado.

Notices 

All notices to a party shall be in writing and shall be made either via email
or conventional mail. Notices to us must be sent to tony@truvisory.com if by email, or to our address at 2696 W. Grand Ave., Littleton, CO 80123,
if in hard copy. You agree to allow us to submit notices to you either through
the email address provided or to the address we have on record.

No Resale Right 

You agree not to sell, resell, reproduce, duplicate, distribute, copy, or use
for any commercial purposes any portion of this Site beyond the limited rights
granted to you under these Terms.

Force Majeure 

In addition to any excuse provided by applicable law, we shall be excused from
liability for non-delivery or delay in delivery of this Site arising from any
event beyond our reasonable control, whether or not foreseeable by either
party including, but not limited to: labor disturbance, war, fire, accident,
adverse weather, inability to secure transportation, governmental act or
regulation, and other causes or events beyond our reasonable control, whether
or not similar to those which are enumerated above.

Savings Clause 

If any part of these Terms is held invalid or unenforceable, that portion
shall be construed in a manner consistent with applicable law to reflect, as
nearly as possible, the original intentions of the parties, and the remaining
portions shall remain in full force and effect.

No Waiver 

Any failure by us to enforce or exercise any provision of these Terms or
related rights shall not constitute a waiver of that right or provision.

Entire Agreement 

These Terms constitute the entire agreement and understanding between you and
us concerning the subject matter hereof and supersede all prior agreements and
understandings between us with respect thereto. If any provision of these
Terms is held to be ineffective, unenforceable, or illegal for any reason, we
may reform such provision to the extent necessary to make it effective,
enforceable, and legal or such provision may be deemed severed and in either
case these Terms with such provision reformed or severed shall remain in full
force and effect to the fullest extent permitted by law. Our failure to
enforce any part or portion of these Terms shall not be considered a waiver of
such portion of these Terms. These Terms may not be altered, supplemented, or
amended by the use of any other document(s) other than as described above. To
the extent that anything in or associated with this Site is in conflict or
inconsistent with these Terms, these Terms shall take precedence.

## Insights

### At 5% GPU utilization, the math doesn't work. Here's what does. — /insights/gpu-math/

The Cast AI 2026 State of Kubernetes Optimization Report quietly buried the most important number in the AI infrastructure conversation this year.

Across roughly 23,000 production clusters running on AWS, Azure, and Google Cloud, average GPU utilization is **5%**.

Not p10. Not "some teams." Not "before they tuned it." **Average.** Across the production fleet of the enterprise AI buildout. CPU utilization is 8%, down from 10% a year earlier. Memory utilization fell from 23% to 20%. The trend lines are pointing the wrong direction, and the GPU number is the one that will eat balance sheets alive.

> If you are paying for reserved GPU and using 5% of it, you are buying a Ferrari to commute three miles, twice a week, with one passenger.

The reserved-capacity model that the entire enterprise AI stack was sold under — the model that justified the eleven-figure capex commitments Amazon, Microsoft, Alphabet, and Meta announced for 2026 — is, on the math, mostly empty space being expensed against P&L. The Cast AI team stated the bottom line directly in their write-up: at an average utilization of 5%, the math doesn't work, and the instinct to hoard capacity you might not get back is what feeds the scarcity loop that drives prices higher in the first place.

This is the part of the AI infrastructure conversation that VCs, hyperscalers, and platform vendors don't want named out loud, because the answer isn't "we need better Kubernetes." The answer is that the reserved-capacity model is structurally broken for the workloads we're actually running. Pay-per-inference architecture isn't a niche cost-saver — it's the only model that survives a CFO doing the math at the next quarterly review. And the architectural pattern that makes pay-per-inference work — orchestrator-plus-scout, RLM-style, stateless inference with stateful coordination — is the same pattern that lets a principal-led team ship multi-agent systems that used to need a twelve-person infrastructure org.

Let me show you the math, the architecture, and the federal angle.

## How we got here

Reserved GPU made sense in 2023. There was no other game in town.

If you weren't on a yearly H100 contract by Q2, you weren't shipping models. Capacity was scarce, the queue was long, and your competitor had a procurement officer on a first-name basis with somebody at the cloud provider's account team. You bought big. You sized for the peak workload you might run next quarter, not the median workload you ran today. You sized to never be caught short. The capacity sat there ready, and the math was: a 60% utilized GPU you own beats an unavailable GPU you don't.

That math held until two things happened simultaneously. First, the models got dramatically better at being smaller. Gemma, Llama 3, Nemotron, Mistral Small — the gap between a frontier model and a 7B-to-30B model on most enterprise workloads narrowed faster than anyone forecasted. Second, the inference-time tooling caught up. Speculative decoding, FlashAttention variants, paged KV cache, batched serving — the throughput per GPU-hour on production hardware roughly tripled between 2023 and 2025.

Both of those should have driven utilization _up_. Instead, utilization fell. Why?

Because every gain on the inference side made it easier for teams to provision more headroom, not the same headroom more efficiently. The infrastructure team shipped capacity to the application team. The application team built features against that capacity. Both teams optimized for "never run out." Nobody got paid to leave GPUs unprovisioned. And the application teams _still_ don't know what their median load looks like, because the product is changing every two weeks and the workload mix is changing every quarter.

The result is what Cast AI's data shows: organizations assigning roughly twenty times more GPU capacity than they actively use. Ninety-five percent of GPU capacity, on average, sitting idle. Not for a bursty afternoon. For the year.

## The CFO arithmetic

An idle CPU core costs cents per hour. An idle GPU costs dollars.

In January 2026, AWS raised H200 Capacity Block prices by 15%. That is the part to sit with. **GPU prices went up, not down.** For the first time since EC2 launched in 2006, the unit economics of cloud compute moved the wrong direction for the customer. Cast AI noted this broke a two-decade precedent. The implication isn't that AWS is being unreasonable — it's that the demand-supply mismatch is structural now, and the price will continue to do what prices do in a constrained market: ration access through the wallet.

So let's do the math for a realistic mid-market workload.

You're running a customer-facing AI feature. Maybe it's an in-product copilot, maybe it's a document summarizer, maybe it's a workflow agent. Your engineering org reserved two H100 nodes — call it $30k/month all-in across compute, networking, observability, and the headcount allocated to keeping the cluster healthy. Assuming a 720-hour month, 5% utilization means about 36 hours of productive inference per node, or 72 productive node-hours across the pair. Spread $30k over those 72 hours and your effective cost is roughly $416 per productive node-hour. Your CFO is paying Ferrari rates for a Honda Civic's worth of trips, and the maintenance is on you.
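
The arithmetic above fits in a few lines of Python. This is a minimal sketch using the article's illustrative inputs, not measured data; the 720-hour month (30 days × 24 hours) and the helper names are assumptions for illustration.

```python
# Illustrative inputs only; the 720-hour month is an assumption (30 days x 24 h).
HOURS_PER_MONTH = 720


def cost_per_productive_hour(monthly_cost: float, nodes: int, utilization: float,
                             hours_per_month: int = HOURS_PER_MONTH) -> float:
    """Spread the all-in monthly bill over the node-hours actually spent on inference."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    productive_node_hours = nodes * hours_per_month * utilization
    return monthly_cost / productive_node_hours


def overprovision_factor(utilization: float) -> float:
    """Units of capacity held per unit of capacity used."""
    return 1 / utilization


if __name__ == "__main__":
    # Two reserved nodes, $30k/month all-in, 5% utilization.
    print(f"${cost_per_productive_hour(30_000, nodes=2, utilization=0.05):,.2f}")  # $416.67
    print(f"{overprovision_factor(0.05):.0f}x")  # 20x
```

The same division explains the twenty-fold figure from the Cast AI data: at 5% utilization you hold 1 / 0.05 = 20 units of capacity for every unit you use.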

Now run the same workload on a pay-per-inference platform. Workers AI on Cloudflare bills you for the tokens you actually generate. The hardware management surface goes to zero — you don't have a cluster, you don't have a queue, you don't have a node pool, you don't have a capacity planner. You write code that calls a model. The model runs in one of 330+ cities of presence, close to wherever the request came from. If you do no inference this hour, you pay zero. If you do a million inferences this hour, you pay for a million inferences and the platform handles the placement, batching, and hardware.

The unit economics flip. You no longer amortize a box. You match cost to value, request by request. The CFO's question goes from "are we using what we bought?" to "is each call worth what it costs?" — which is the question the CFO should have been asking all along. The CTO's question goes from "did we size the cluster right?" to "does this feature make money?" — which is also the question the CTO should have been asking all along.

The hyperscalers will tell you reserved capacity is cheaper per unit at scale. That is true in the same sense that a Costco-sized jar of mayonnaise is cheaper per ounce: it is, until you account for the mayonnaise that goes bad in your fridge. At 5% utilization, 95% of your GPU mayonnaise is going bad in your fridge.
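
The mayonnaise point reduces to a breakeven utilization. A minimal sketch, assuming a flat reserved rate per node-hour and a higher, hypothetical pay-per-use rate per productive node-hour. The article gives no unit prices, and platforms like Workers AI meter usage rather than hours, so the per-hour equivalent is itself an abstraction. Under those assumptions, reserved capacity only wins once utilization exceeds the ratio of the two rates.

```python
HOURS_PER_MONTH = 720  # assumed 30-day month


def monthly_costs(utilization: float, reserved_rate: float, pay_per_use_rate: float,
                  hours: int = HOURS_PER_MONTH) -> tuple[float, float]:
    """Return (reserved, pay_per_use) monthly cost for one node's worth of work."""
    reserved = reserved_rate * hours                      # billed every hour, used or not
    pay_per_use = pay_per_use_rate * hours * utilization  # billed only for productive hours
    return reserved, pay_per_use


def breakeven_utilization(reserved_rate: float, pay_per_use_rate: float) -> float:
    """Utilization above which reserved capacity becomes the cheaper option."""
    return reserved_rate / pay_per_use_rate


if __name__ == "__main__":
    # Hypothetical rates: $20/node-hour reserved, a 3x premium ($60) for pay-per-use.
    reserved, ppu = monthly_costs(0.05, reserved_rate=20.0, pay_per_use_rate=60.0)
    print(f"reserved ${reserved:,.0f} vs pay-per-use ${ppu:,.0f}")
    print(f"breakeven at {breakeven_utilization(20.0, 60.0):.0%} utilization")
```

With a 3x premium the crossover sits at 33% utilization. For reserved capacity to win at 5%, the pay-per-use premium would have to exceed 20x.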

## Why most teams can't migrate (yet)

If pay-per-inference is so much better, why isn't everyone already there?

Four reasons, in roughly the order they show up on a whiteboard when I'm consulting with a team that wants to move.

**One: sunk hardware.** A lot of teams signed multi-year reserved instances or made capex commitments to their cloud provider that they can't unwind without paying through the nose. The contract was a bet on continued scarcity, and the bet is now underwater. Migration plans get written, then shelved, because nobody wants to be the VP who walked away from $4M of pre-paid capacity even when continuing to use it costs more than walking away.

**Two: latency assumptions.** "We need our model running in our region for latency." That was a real constraint in 2023. It is mostly not a real constraint in 2026. The edge inference platforms — Workers AI specifically — run in hundreds of cities. The median first-token latency from a Workers AI inference is, in my testing on real production workloads, often _better_ than from a centralized cluster on the same continent, because the request hits a model running ten milliseconds away instead of in a single region such as us-east-1, behind a load balancer behind an API gateway. Latency is no longer the moat people think it is.

**Three: vendor lock-in on inference.** A lot of teams built their first AI feature against OpenAI's API. Then they built the next one. Then the next one. Then they realized the bill was real and decided to "go self-hosted" for cost reasons, which is how they ended up with the 5%-utilized cluster in the first place. The migration to pay-per-inference requires a portable inference layer — which is exactly what MCP, OpenAI-compatible APIs, and standardized tool surfaces have quietly built over the last 18 months.
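
A portable inference layer can be as small as one interface that feature code depends on, with vendor-specific adapters behind it. A minimal sketch, assuming a hypothetical `InferenceBackend` protocol and a stub adapter; a real adapter would call a provider's OpenAI-compatible endpoint, which this example deliberately does not do so that it runs offline.

```python
from dataclasses import dataclass
from typing import Protocol


class InferenceBackend(Protocol):
    """The one surface feature code may depend on (hypothetical interface)."""

    def complete(self, prompt: str, *, max_tokens: int = 256) -> str: ...


@dataclass
class StubBackend:
    """Hypothetical adapter. A real one would call a provider's OpenAI-compatible
    chat-completions endpoint; this stub echoes the prompt instead."""

    name: str

    def complete(self, prompt: str, *, max_tokens: int = 256) -> str:
        # Truncates by characters as a crude stand-in for a token limit.
        return f"[{self.name}] {prompt[:max_tokens]}"


def summarize(doc: str, backend: InferenceBackend) -> str:
    # Feature code names a capability, never a vendor.
    return backend.complete(f"Summarize: {doc}", max_tokens=64)


if __name__ == "__main__":
    for backend in (StubBackend("hosted-api"), StubBackend("edge-platform")):
        print(summarize("Q3 churn report", backend))
```

Moving a feature between providers then means writing one adapter rather than rewriting every call site.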

**Four: skill atrophy.** Your ML platform team has spent the last 18 months optimizing a stack that pay-per-inference dissolves. Kubernetes operators for GPU scheduling. Custom autoscalers. NCCL tuning. vLLM deployment pipelines. None of that is wrong — it's just not the work that pays the rent anymore for most product teams. The platform team rationally resists the architectural shift that makes their day-to-day expertise less central. This is a human problem, not a technical one. The technical answer is to redeploy the platform team against orchestration and evaluation, where there's still real engineering depth required. The human answer is to be honest about it.

None of those four reasons changes the math. They just slow the migration. The teams that move first capture a compounding advantage, and the teams that move later pay to close the gap.

## What architecture survives this

Here's where it gets interesting, because the architectural answer to the 5% problem isn't "smaller GPUs" or "better autoscalers." It's a different shape entirely.

The pattern I keep returning to — the one HotCopy is built on, the one that lets me ship recursive multi-agent systems solo — is **orchestrator-plus-scout**. One bigger, more expensive model does the planning. Many smaller, cheaper models do the parallel scouting and execution. The orchestrator decomposes the problem, dispatches sub-problems to scouts running in parallel, and recombines their outputs.
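
The orchestrator-plus-scout loop can be sketched with `asyncio`. This is a minimal sketch under stated assumptions: `plan`, `scout`, and `recombine` are hypothetical stand-ins for model calls (the orchestrator's two steps would normally hit the larger model, each scout a smaller one), stubbed so the control flow runs on its own. It is not HotCopy's implementation.

```python
import asyncio
from typing import List


def plan(task: str) -> List[str]:
    """Hypothetical orchestrator step: decompose the task into sub-problems."""
    return [f"{task} / part {i}" for i in range(1, 4)]


async def scout(subtask: str) -> str:
    """Hypothetical small-model call; the sleep stands in for inference latency."""
    await asyncio.sleep(0.01)
    return f"findings for {subtask}"


def recombine(task: str, findings: List[str]) -> str:
    """Hypothetical orchestrator step: merge scout outputs into one answer."""
    return f"{task}: " + "; ".join(findings)


async def orchestrate(task: str) -> str:
    subtasks = plan(task)
    # Fan out: all scouts run concurrently; gather() returns results in input order.
    findings = await asyncio.gather(*(scout(s) for s in subtasks))
    return recombine(task, list(findings))


if __name__ == "__main__":
    print(asyncio.run(orchestrate("audit repo")))
```

Because the scouts run concurrently, wall-clock time tracks the slowest scout rather than the sum of all of them.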

The architectural lineage of this pattern is now formalized. In December 2025, Alex Zhang, Tim Kraska, and Omar Khattab at MIT CSAIL published the Recursive Language Models paper ([arXiv:2512.24601](https://arxiv.org/abs/2512.24601)), which treats the prompt itself as an external programmable environment that the model can decompose, examine, and recursively call itself over. The RLM paper demonstrates that this pattern handles inputs two orders of magnitude beyond model context windows and outperforms vanilla frontier LLMs on long-context tasks at comparable or lower per-query cost. The technical insight is significant in its own right. The infrastructure implication is the part most readers missed.
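
A heavily simplified illustration of the recursive idea: when an input exceeds a context budget, split it, recurse on the pieces, and let each parent call see only its children's short outputs. This is not the RLM paper's method, in which the model itself decides how to inspect and recursively query a prompt held in an external environment; here the split rule is hard-coded, and `call_model` and the word-count budget are hypothetical stand-ins.

```python
from collections import Counter

CALLS: Counter = Counter()  # counts hypothetical model invocations


def call_model(prompt: str) -> str:
    """Hypothetical model call: reports the size of what it was shown."""
    CALLS["model"] += 1
    return f"<{len(prompt.split())} words>"


def recursive_answer(text: str, budget_words: int = 50) -> str:
    """Answer directly if the input fits the budget; otherwise split and recurse."""
    if budget_words < 1:
        raise ValueError("budget_words must be at least 1")
    words = text.split()
    if len(words) <= budget_words:
        return call_model(text)  # fits in context: one direct call
    mid = len(words) // 2
    left = recursive_answer(" ".join(words[:mid]), budget_words)
    right = recursive_answer(" ".join(words[mid:]), budget_words)
    # The parent only sees its children's short outputs, never the raw input.
    return call_model(f"combine {left} {right}")


if __name__ == "__main__":
    long_input = " ".join(["word"] * 200)  # 4x the hypothetical 50-word budget
    print(recursive_answer(long_input), CALLS["model"])
```

A 200-word input against a 50-word budget becomes four leaf calls plus three combine calls, and no single call sees more than 50 words.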

If you build an orchestrator-plus-scout system on reserved GPU, you are reserving capacity for the _peak parallel scout fan-out you might hit_. Which, given the variance in agent workloads, is enormous. You will provision for the worst case and run at single-digit utilization on the median case, and now you've reproduced the 5% problem inside your own architecture.

If you build the same system on pay-per-inference, **the architecture you wrote is the architecture you pay for**. Twenty scouts in parallel costs twenty units. Two scouts in parallel costs two units. The thing nobody told you about pay-per-inference is that it doesn't just change the cost model — it changes what architectures are economically viable. Recursive, fan-out, multi-agent patterns become _cheaper_ on pay-per-inference precisely because the platform absorbs the variance. On reserved capacity, those same patterns are economically punitive, which is why the agent frameworks that emerged in 2024 mostly ran sequentially against a single big model: the architecture was bent to fit the cost surface.
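
The fan-out sizing problem is easy to quantify. A minimal sketch, assuming hourly scout counts as input and that each concurrent scout needs one slot of reserved capacity for the hour; real scheduling is finer-grained, and the workload below is made up for illustration.

```python
from typing import Dict, List


def fanout_costs(fanout_per_hour: List[int]) -> Dict[str, float]:
    """Compare slot-hours reserved for peak fan-out with slot-hours actually used."""
    if not fanout_per_hour:
        raise ValueError("need at least one hour of samples")
    peak = max(fanout_per_hour)
    reserved = peak * len(fanout_per_hour)  # sized for the worst hour, billed every hour
    used = sum(fanout_per_hour)             # what a pay-per-inference bill tracks
    return {
        "reserved_slot_hours": float(reserved),
        "used_slot_hours": float(used),
        "utilization": used / reserved if reserved else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical day: 20 hours at 2 scouts, 3 idle hours, one 20-scout burst.
    day = [2] * 20 + [0] * 3 + [20]
    print(fanout_costs(day))
```

One burst hour sets the reservation for all twenty-four, so the reserved design runs at 12.5% utilization here while a pay-per-use bill tracks only the 60 slot-hours consumed.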

What you need underneath to make this work is three primitives. They exist. They're documented. They run today.

**Stateless inference at the edge.** Workers AI for the inference itself. The platform handles model placement, batching, and hardware. Your code doesn't know what GPU it's running on, and that's the point.

**Stateful coordination in lightweight durable runtimes.** Cloudflare Durable Objects, one per agent instance, each with its own embedded SQL database and hibernation semantics. The agent wakes when something happens, reads its durable state, does work, and hibernates when idle. You don't run servers. You don't reserve capacity. Each Durable Object stays alive as long as it's processing, then goes dormant. The Agent class in the Cloudflare Agents SDK is built directly on this primitive — DurableObject > Server > Agent — and it's the most important infrastructure design choice for multi-agent systems I've seen since Docker.
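
The wake, read, work, hibernate cycle can be simulated locally. A minimal sketch, assuming Python's built-in `sqlite3` as a stand-in for each instance's embedded SQL database; the `AgentInstance` class is hypothetical and is not the Durable Objects or Agents SDK API, which differ in how state is accessed and when hibernation happens.

```python
import sqlite3
from typing import List, Optional


class AgentInstance:
    """Hypothetical stand-in for one agent with its own private SQL store."""

    def __init__(self, db_path: str = ":memory:") -> None:
        self.db = sqlite3.connect(db_path)  # one database per agent instance
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, body TEXT)"
        )
        self.db.commit()
        self.awake = False
        self.cache: Optional[List[str]] = None  # in-memory state, dropped on hibernate

    def handle(self, body: str) -> int:
        """Wake on an event, reload durable state, do work, persist, then hibernate."""
        self.awake = True
        self.cache = [row[0] for row in self.db.execute("SELECT body FROM events ORDER BY id")]
        self.db.execute("INSERT INTO events (body) VALUES (?)", (body,))
        self.db.commit()  # durable before the agent goes dormant
        self.cache.append(body)
        seen = len(self.cache)
        self.hibernate()
        return seen

    def hibernate(self) -> None:
        self.awake = False
        self.cache = None  # nothing held in memory while idle; the database keeps the state


if __name__ == "__main__":
    agent = AgentInstance()
    print(agent.handle("ticket opened"), agent.handle("ticket updated"), agent.awake)
```

The second call returns 2 even though the in-memory cache was discarded after the first, because the count is rebuilt from the agent's own database on every wake.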

**Standardized tool surfaces.** Model Context Protocol (MCP) as the contract between your agent and whatever it's reaching out to — your CRM, your filesystem, your search index, your billing system. MCP shipped in November 2024 and the ecosystem moved fast: thousands of servers, SDKs in every major language, adoption by OpenAI and Google. The clean tool-surface contract is finally making "agent that uses your tools" boring, which is exactly what you want. Boring infrastructure is shippable infrastructure.
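
To make the contract concrete: MCP messages are JSON-RPC 2.0, and tool use centers on the `tools/list` and `tools/call` methods. A minimal sketch of a server-side dispatcher, assuming a hypothetical `lookup_customer` tool. It omits the initialization handshake, capability negotiation, and transport, so a real server should use an official MCP SDK rather than this.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical tool registry: name -> description, JSON Schema for arguments, handler.
TOOLS: Dict[str, Dict[str, Any]] = {
    "lookup_customer": {
        "description": "Look up a customer record by id (hypothetical CRM).",
        "inputSchema": {
            "type": "object",
            "properties": {"customer_id": {"type": "string"}},
            "required": ["customer_id"],
        },
        "handler": lambda args: f"customer {args['customer_id']}: active",
    },
}


def _reply(req_id: Any, *, result: Any = None, error: Any = None) -> str:
    msg: Dict[str, Any] = {"jsonrpc": "2.0", "id": req_id}
    if error is not None:
        msg["error"] = error
    else:
        msg["result"] = result
    return json.dumps(msg)


def handle_request(raw: str) -> str:
    """Dispatch one JSON-RPC request for the two MCP tool methods."""
    req = json.loads(raw)
    req_id, method = req.get("id"), req.get("method")
    params = req.get("params") or {}
    if method == "tools/list":
        tools = [
            {"name": name, "description": t["description"], "inputSchema": t["inputSchema"]}
            for name, t in TOOLS.items()
        ]
        return _reply(req_id, result={"tools": tools})
    if method == "tools/call":
        tool = TOOLS.get(params.get("name"))
        if tool is None:
            return _reply(req_id, error={"code": -32602, "message": "Unknown tool"})
        handler: Callable[[Dict[str, Any]], str] = tool["handler"]
        try:
            text = handler(params.get("arguments") or {})
        except KeyError as exc:
            # Tool-level failures go back as a result flagged isError, not a protocol error.
            return _reply(req_id, result={
                "content": [{"type": "text", "text": f"missing argument {exc}"}],
                "isError": True,
            })
        return _reply(req_id, result={"content": [{"type": "text", "text": text}], "isError": False})
    return _reply(req_id, error={"code": -32601, "message": "Method not found"})
```

The agent discovers tools through `tools/list` and invokes them through `tools/call`; swapping the CRM behind `lookup_customer` changes the handler, not the contract.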

That's the stack. Stateless inference. Stateful coordination. Standard protocol surfaces. Three layers, three primitives, one principal can hold the whole thing in their head.

## Why solo and small teams win this

The 12-person infrastructure org that justified itself in 2023 is now a tax on the P&L for a lot of mid-market companies.

This isn't a knock on infrastructure engineers — I've been one, I've hired them, I've run teams of them. It's an observation about what the work actually looks like in 2026 versus 2023. In 2023, somebody had to set up the Kubernetes cluster, write the GPU operator, tune the vLLM deployment, build the autoscaler, write the observability pipeline, manage the model registry, design the inference gateway, build the queueing layer. That was real, hard, valuable work. It justified a team.

In 2026, on a pay-per-inference + edge-stateful-runtime + MCP stack, most of those layers are either provided by the platform or replaced by a primitive that does the same job in one config file. A principal engineer who knows the platform can stand up a multi-agent system in two weeks that would have taken a team a quarter in 2023. The ratio of engineering output to engineering headcount has moved by a factor most companies haven't internalized yet.

This is what HotCopy is: a managed recursive AI coding CLI, built on the orchestrator-plus-scout pattern, running on a pay-per-inference model layer with stateful coordination in Durable Objects. The infrastructure is invisible to the user, because the user doesn't have any. There's no cluster. There's no queue. There's no node pool. There's a CLI that does work, and behind the CLI is an architecture that pays for what it uses and goes dormant when it doesn't.

The product matters less than the proof-of-concept the build represents. One person can do this now. Not "one person with venture funding and an SRE on retainer." One person.

## The federal angle nobody is talking about

Here is the part that has not yet registered with most federal contractors but is going to in 2026.

On April 3, 2025, the Office of Management and Budget published M-25-21 (Accelerating Federal Use of AI through Innovation, Governance, and Public Trust) and M-25-22 (Driving Efficient Acquisition of Artificial Intelligence in Government). Read together, the two memos describe a federal AI acquisition posture that explicitly prioritizes vendor-portable systems, fixed-scope deliverables, ongoing testing and monitoring rights, and protection against vendor lock-in. M-25-22 directs agencies to consider vendor lock-in at every stage of the AI acquisition lifecycle — initial demonstrations, solicitation provisions, contract awards, ongoing data access — and requires solicitation provisions for knowledge transfer, data and model portability, and licensing and pricing transparency.

If you have read those two memos as a federal AI vendor and you are still pitching reserved-cluster, vendor-specific, single-region AI deployments, you are mispricing your own risk. The acquisition guidance reads like a buying spec for the kind of architecture I just described. Stateless inference that can run anywhere. Standard protocol surfaces (MCP) that don't lock the agency into one vendor's tool ecosystem. Per-request cost transparency. Fixed-scope, deliverable-based engagements rather than reserved-capacity contracts that bill regardless of usage.

The SDVOSB and small-business set-aside lanes are about to be the most interesting place in federal AI procurement, because the small vendors are _structurally_ better positioned to deliver on the new acquisition spec than the integrators that built their federal AI practices on reserved-cluster sales. The integrators will adapt — they always do — but the adaptation cycle is 18 to 24 months, and the small vendors that show up next quarter with the right architecture are going to capture the wedge.

This is the consulting wedge I'm pushing at Truvisory®. Not "we'll help you do AI." Specifically: federal AI modernization that is Cloudflare-native, MCP-first, pay-per-inference, vendor-portable, and structured to satisfy M-25-21/22 from day one. That's a buying spec the federal market is going to be looking for, with a small-business set-aside attached, and most of the existing federal AI vendors are not positioned to bid against it.

## The new arithmetic

The thing I want you to take from this isn't "Cloudflare good, Kubernetes bad." Kubernetes is a good piece of software. The issue isn't with the orchestrator. The issue is with the _acquisition model_ underneath the orchestrator.

The new arithmetic isn't "how do we get GPU efficiency from 5% to 30%?" The new arithmetic is "do we need a GPU on the books at all?" For most teams shipping AI features in 2026, the answer is no. The platform owns the hardware. You own the application logic. You pay for what you use, you sleep when you're not using it, you scale to zero when traffic dies, and you scale to whatever the workload requires when it doesn't.
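To make the new arithmetic concrete, here is a minimal TypeScript sketch. The prices are hypothetical placeholders, not quotes from any provider, and `Pricing`, `effectiveReservedCostPerUsefulHour`, and the other names are illustrative. A reserved GPU bills for every hour whether it works or idles, so its effective cost per useful hour is the hourly rate divided by utilization; pay-per-inference bills only for work done.

```typescript
// Hypothetical prices for illustration only; substitute real quotes.
interface Pricing {
  reservedPerHour: number; // cost of one reserved GPU-hour, busy or idle
  payPerUseHourEquivalent: number; // pay-per-inference cost of one hour of actual work
}

// Reserved capacity bills every hour, so idle time inflates the cost of each useful hour.
function effectiveReservedCostPerUsefulHour(p: Pricing, utilization: number): number {
  if (utilization <= 0 || utilization > 1) {
    throw new RangeError("utilization must be in (0, 1]");
  }
  return p.reservedPerHour / utilization;
}

// Utilization at which reserved and pay-per-use cost the same per useful hour.
function breakEvenUtilization(p: Pricing): number {
  return p.reservedPerHour / p.payPerUseHourEquivalent;
}

// Monthly bill under each model, given hours in the month and measured utilization.
function monthlyBills(p: Pricing, hoursInMonth: number, utilization: number) {
  const usefulHours = hoursInMonth * utilization;
  return {
    reserved: p.reservedPerHour * hoursInMonth, // billed whether used or not
    payPerUse: p.payPerUseHourEquivalent * usefulHours, // billed only for work done
  };
}

const pricing: Pricing = { reservedPerHour: 2.0, payPerUseHourEquivalent: 6.0 };

console.log(effectiveReservedCostPerUsefulHour(pricing, 0.05)); // about 40, i.e. 20x the sticker rate
console.log(breakEvenUtilization(pricing)); // about 0.33
console.log(monthlyBills(pricing, 720, 0.05)); // reserved about 1440, payPerUse about 216
```

At 5% utilization, a reserved GPU effectively costs 20 times its hourly rate per useful hour, so pay-per-use stays cheaper unless its rate is more than 20x the reserved rate. The break-even utilization is simply the ratio of the two rates.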

The teams that internalize this in 2026 will spend the next two years compounding an efficiency advantage their competitors can't match. The teams that hold the reserved-capacity line will spend the same two years explaining to their CFOs why their AI cost per customer keeps going up while their utilization keeps going down. Those conversations are going to get short.

Cast AI's report ends with the observation that workloads change, traffic patterns shift, and the configuration that was accurate six months ago is unlikely to remain accurate today. That's true on Kubernetes, and it's the reason 5% is the average. It's also the reason the architecture that wins isn't a smarter rightsizing pass on the same model. It's a different model entirely, one where the platform absorbs the variance and you absorb the value.

Workers AI on Cloudflare. Durable Objects for coordination. MCP for tool surfaces. RLM-style orchestrator-plus-scout for the workload shape. Pay-per-inference for the bill.

The math, finally, works.

---

Tony Adams is the founder of HotCopy and Truvisory®. He builds Cloudflare-native AI systems for federal and commercial clients. Verified SDVOSB and VOSB, SAM.gov-registered.

### OMB M-25-21 reads like a buying spec for fixed-scope AI. — /insights/omb-buying-spec/

The two federal AI acquisition memos that came out of OMB in April 2025 didn't get the press they deserved. Most of the coverage at the time framed them as a Trump-administration rewrite of Biden-era AI policy — which is technically true and substantively misses the story. Read together, M-25-21 and M-25-22 are the most specific federal AI buying spec the U.S. government has ever published. They tell you, in roughly 60 pages of combined text, exactly what kind of AI vendor a federal agency is supposed to buy from in 2026 and beyond.

If you have read the memos as a federal AI vendor and you still pitch reserved-cluster, vendor-specific, single-region AI deployments wrapped in a level-of-effort contract, you are mispricing your own risk. The buying spec is right there in the text, and most of the established federal AI integrators are not positioned to deliver against it.

Here is what the memos actually require, why the requirements line up with what we have already been pricing on the commercial side, and what the practical SDVOSB / small-business wedge looks like for the next two years.

## What the memos actually say

M-25-21, "Accelerating Federal Use of AI through Innovation, Governance, and Public Trust," establishes the governance framework. Every covered agency designates a Chief AI Officer, publishes a public AI strategy, maintains a public AI use-case inventory, and adopts minimum risk management practices for any AI system designated "high-impact." Agencies have until April 15, 2026 to bring every high-impact system into compliance or shut it down. This is not a draft. It is the operating policy of the executive branch.

M-25-22, "Driving Efficient Acquisition of Artificial Intelligence in Government," is the procurement counterpart. It applies to solicitations issued on or after October 1, 2025 and to contract renewals after that date. The memo's three core policies are: (1) ensure a competitive American AI marketplace, (2) safeguard taxpayer dollars by tracking AI performance and managing risks across the lifecycle, and (3) promote effective acquisition through cross-functional engagement.

Strip the policy language out and translate it into what a contracting officer is going to ask you for. The memo directs agencies to consider vendor lock-in at every stage of the AI acquisition lifecycle — initial demonstrations, solicitation provisions, contract awards, ongoing data access. It directs them to include contract terms for knowledge transfer, data and model portability, and licensing and pricing transparency. It requires ongoing testing and monitoring rights, with the agency able to evaluate performance, risks, and effectiveness throughout the contract period of performance. It bars vendors from training publicly or commercially available models on non-public agency data without explicit consent. It instructs agencies to ensure that contracts clearly delineate IP rights, with a strong default that the government retains rights to code and models produced under the contract.

The "Buy American" framing got most of the press attention. The portability and lifecycle-monitoring requirements are the part that actually changes who can win this work.

## Translating the memos into a buying spec

If you read M-25-22 as a procurement officer trying to write a SOW that will not embarrass the agency in 18 months, the picture comes into focus quickly.

You want a vendor that can demonstrate working AI before award (the memo emphasizes testing-before-purchase rights), then deliver a system the agency can continue to test and monitor throughout the period of performance. You want a vendor whose architecture doesn't lock you into a single cloud, a single model provider, or a single inference vendor, because lock-in is now an enumerated procurement risk you have to mitigate. You want a vendor that delivers fixed-scope, deliverable-based engagements rather than reserved-capacity contracts that bill regardless of usage, because the memo's emphasis on tracking AI performance and managing costs aligns naturally with outcome-based pricing. You want a vendor whose IP terms are clean: the government gets the code, the models, and the data rights it needs to keep operating the system without the vendor, if necessary.

That is a buying spec. It is not vague. It tells you, fairly precisely, what architectural and contractual posture a vendor needs to have to win against it.

Now hold that up against the typical federal AI engagement circa 2023. Long-term reserved capacity in a single vendor's cloud. Proprietary model fine-tunes that don't port. Custom serving infrastructure that requires the vendor to operate it indefinitely. Level-of-effort billing that grows whether the system is working or not. Tool integrations that are bespoke to the vendor's framework rather than built against an open protocol.

The 2023 engagement model is exactly what M-25-22 is written to prevent. The integrators that built their federal AI practices on the 2023 model are going to adapt — they always do — but the adaptation cycle is 18 to 24 months, and the small vendors that show up next quarter with the right architecture are going to capture the wedge.

## Why this matches the commercial pricing model already

The reason this matters is that the architecture and pricing model required by the federal memos are the same architecture and pricing model that survive the math on the commercial side.

I wrote about this last month in the context of the Cast AI 2026 State of Kubernetes Optimization Report — at 5% average GPU utilization across 23,000 production clusters, the reserved-capacity model is structurally underwater for most workloads, and pay-per-inference + edge-stateful-runtime + standardized tool surfaces is the architecture that actually pencils. The commercial market is converging on that architecture because the unit economics demand it.

The federal market is converging on the same architecture because the _acquisition policy_ demands it. Pay-per-inference is portable by construction — you are not reserving capacity in a specific vendor's hardware, you are buying tokens against a standard contract. Stateless inference at the edge is auditable by construction — every request is a discrete billable event that produces a log line, which is exactly the artifact the memo's ongoing-monitoring requirements need. Standardized tool surfaces (MCP) are portable by construction — the contract between the agent and the tool is defined by an open protocol, so an agency can swap one vendor's agent for another's without rewriting the integration layer.
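A minimal sketch of what "every request is a discrete billable event that produces a log line" can look like. The per-million-token prices, the model name, and the field names are hypothetical; real providers price and report usage in their own formats.

```typescript
// Hypothetical per-million-token prices; real prices vary by provider and model.
interface ModelPrice {
  inputPerMillion: number;
  outputPerMillion: number;
}

// One log line per inference request: enough to bill, audit, and monitor.
interface InferenceLogLine {
  requestId: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
  timestamp: string;
}

function costFor(price: ModelPrice, inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * price.inputPerMillion +
    (outputTokens / 1_000_000) * price.outputPerMillion
  );
}

function logInference(
  requestId: string,
  model: string,
  price: ModelPrice,
  inputTokens: number,
  outputTokens: number,
  now: Date = new Date(),
): InferenceLogLine {
  return {
    requestId,
    model,
    inputTokens,
    outputTokens,
    costUsd: costFor(price, inputTokens, outputTokens),
    timestamp: now.toISOString(),
  };
}

// Total spend is a sum over discrete records; there is no idle capacity to amortize.
function totalCost(entries: InferenceLogLine[]): number {
  return entries.reduce((sum, l) => sum + l.costUsd, 0);
}

const price: ModelPrice = { inputPerMillion: 0.5, outputPerMillion: 1.5 };
const lines = [
  logInference("req-1", "example-model", price, 2_000, 500),
  logInference("req-2", "example-model", price, 10_000, 1_000),
];
console.log(lines.map((l) => JSON.stringify(l)).join("\n"));
console.log("total USD:", totalCost(lines));
```

Because spend is a sum over discrete records, the same log answers both the budget question (what did this cost?) and the monitoring question (what did the system do, and when?).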

Three years ago, the commercial and federal AI buying patterns were diverging — federal needed audit trails and security controls that commercial didn't, commercial needed scale that federal didn't. In 2026 they are converging on the same architectural answer, for different reasons, with the same vendors winning on both sides if they're positioned correctly.

## The SDVOSB and small-business wedge

This is where it gets interesting for veteran-owned and small-business AI vendors.

The traditional federal AI vendor profile — large integrator, multi-year staffing contract, reserved cloud capacity, level-of-effort billing — is structurally mismatched to the new acquisition spec. Not because the integrators can't deliver good work. They can. The mismatch is in the _contracting posture_ they are built around. A 500-person federal AI practice does not want to bid fixed-scope, deliverable-based engagements with portability requirements, because that compresses the margin model the practice was sized for.

Small vendors with the right architecture have the opposite problem. They cannot bid 50-person staff augmentation contracts because they don't have 50 people. They _can_ bid fixed-scope architectural engagements that deliver a working, portable, monitored AI system in 90 days with clean IP transfer and a maintenance plan. Which is exactly what M-25-22 is written to favor.

The set-aside lanes — SDVOSB, VOSB, WOSB, 8(a), HUBZone — are about to be the most interesting place in federal AI procurement, for two reasons. First, the acquisition spec rewards architectural posture more than headcount, and small businesses can deliver the right architecture without the contractual gravity of the integrators. Second, agencies are under explicit pressure to meet small-business contracting goals, and a small-business vendor that shows up with a memo-compliant proposal is the path of least resistance for a contracting officer who needs to thread three different requirements at once.

The NAICS codes that matter for AI-specific federal work, 541511 (Custom Computer Programming), 541512 (Computer Systems Design), and 541690 (Other Scientific and Technical Consulting), all have small-business size standards that most legitimate AI specialty shops fall comfortably under. 541511 and 541512 use a $34M revenue standard. 541690 uses $19.5M. The set-aside addressable market for AI work in those NAICS codes is, conservatively, hundreds of millions of dollars in FY26 alone, and growing fast.

## What a memo-compliant proposal actually looks like

Practical translation, if you are writing a federal AI proposal in 2026 and want to map directly to the memos:

Lead with the architecture, not the headcount. The proposal section that explains how the AI will be built, hosted, monitored, and handed off matters more than the org chart. Demonstrate vendor portability explicitly — name the open protocols you build against (MCP for tool surfaces, OpenAI-compatible APIs for inference, standard logging/observability stacks), and explain what the agency can do without you if it ever needs to. Commit to fixed deliverables tied to working, testable AI behavior, not to staffing levels or hours billed. Include a model and data portability plan as a deliverable — what gets handed over, in what format, on what schedule. Specify the audit trail you will provide for high-impact use cases, mapped to the M-25-21 minimum risk management practices (pre-deployment testing, AI impact assessments, ongoing monitoring, human review, end-user feedback). Specify the IP rights the agency receives, defaulting to maximum rights for the agency. Specify what happens to non-public agency data — explicitly bar training on it without consent, and provide a contractual mechanism for the agency to verify this.
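To make "explain what the agency can do without you" concrete: if inference goes through an OpenAI-compatible chat-completions interface, switching providers is a configuration change, not a rewrite. The sketch below only builds the request and does not send it. The base URLs, model IDs, and keys are placeholders, and it assumes both providers accept the common `/chat/completions` request shape.

```typescript
// Provider configuration the agency controls. URLs, model IDs, and keys are placeholders.
interface InferenceProvider {
  baseUrl: string; // an OpenAI-compatible endpoint, e.g. ending in /v1
  model: string;
  apiKey: string;
}

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Build (but do not send) an OpenAI-compatible chat-completions request.
// Application code stays the same when the provider changes; only the config does.
function buildChatRequest(provider: InferenceProvider, messages: ChatMessage[]) {
  const url = `${provider.baseUrl.replace(/\/+$/, "")}/chat/completions`; // strip trailing slashes
  const body = JSON.stringify({ model: provider.model, messages });
  return {
    url,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${provider.apiKey}`,
      },
      body,
    },
  };
}

const vendorA: InferenceProvider = {
  baseUrl: "https://inference.vendor-a.example/v1/",
  model: "vendor-a-model",
  apiKey: "KEY_A",
};
const vendorB: InferenceProvider = {
  baseUrl: "https://inference.vendor-b.example/v1",
  model: "vendor-b-model",
  apiKey: "KEY_B",
};

const chat: ChatMessage[] = [{ role: "user", content: "Summarize this record." }];
const reqA = buildChatRequest(vendorA, chat);
const reqB = buildChatRequest(vendorB, chat);
// In a fetch-capable runtime you would send it with: await fetch(reqA.url, reqA.init)
console.log(reqA.url);
console.log(reqB.url);
```

Whether a given provider honors every field of the request is something to confirm against its own documentation; the portability claim is only as strong as that check.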

Every one of those items is in M-25-22 by name. None of them are optional for new solicitations. A proposal that addresses all of them by name is a proposal that is _demonstrably_ memo-compliant, which is what a contracting officer needs to defend the award.

This is the proposal posture I am taking at Truvisory® for federal AI modernization engagements. Not because the architecture is novel — the architecture is what the commercial math already requires. Because the _language_ of the proposal has to map directly onto the acquisition memo, line by line, so the contracting officer can do their job without rewriting the procurement file.

## The buying spec is hiding in plain sight

The OMB memos are public documents. They have been live since April 2025. The October 1, 2025 effective date for new solicitations has already passed. The April 15, 2026 compliance deadline for high-impact systems is six weeks out as I write this. The federal AI procurement landscape has already shifted under the spec the memos define, and most of the existing federal AI vendor ecosystem is still pitching the 2023 engagement model.

This is the consulting wedge. Not "we'll help you do AI." Specifically: federal AI modernization that is vendor-portable by construction, pay-per-inference for cost transparency, MCP-first for tool portability, and structured to satisfy M-25-21 and M-25-22 from day one. That's a buying spec the federal market is going to be looking for, with small-business set-asides attached, and the small vendors with the right architecture are going to capture the wedge before the integrators finish their adaptation cycle.

The memos are not a constraint. They are a buyer's list. Read them as one.

---

Tony Adams is the founder of HotCopy and Truvisory®. He builds Cloudflare-native AI systems for federal and commercial clients. Verified SDVOSB and VOSB, SAM.gov-registered.

<li>Executive Order 14179, Removing Barriers to American Leadership in AI (Jan 23, 2025)</li>
<li>Companion post: <a href="/insights/gpu-math/">At 5% GPU utilization, the math doesn't work</a></li>

### Durable Objects are the missing primitive for production agents. — /insights/durable-objects/

Most of what gets called "agent infrastructure" in 2026 is plumbing that exists to compensate for the absence of one specific primitive: a stateful, durable, addressable execution context that is cheap enough to run one per agent instance.

If you have built a multi-agent system on Kubernetes or on a generic serverless platform, you have personally written some version of every workaround for the missing primitive. You have built a queue to hand work between agents. You have built a state store to remember what an agent was doing across requests. You have built a scheduler to wake agents up. You have built a routing layer to send a specific request to a specific agent's state. You have built reconnection logic for WebSocket clients to find their way back to the same agent instance after the load balancer rerolls. You have, in other words, spent a substantial percentage of your engineering budget rebuilding what Durable Objects give you for free.

I have shipped three production agent systems on Durable Objects in the last six months. The pattern has held in all three. The data tier doesn't exist as a separate concern. The queue doesn't exist as a separate concern. The scheduler doesn't exist as a separate concern. The infrastructure stack collapses to the application code, and the application code gets dramatically smaller. This essay is the architectural argument for why that happens, and what it looks like in practice.

## The shape of the missing primitive

The agent problem, abstracted, is this: a long-running computational entity with an identity, internal state that persists across calls, the ability to receive events from multiple sources (HTTP requests, WebSocket messages, scheduled triggers, inbound email, other agents), the ability to wake itself up on a schedule, and the ability to go dormant when nothing is happening so you're not paying to keep it warm.

In actor-model terms, this is just an actor. In object-oriented terms, it's a long-lived object instance. In operating-system terms, it's a process with a known PID. None of those abstractions are exotic. The thing that makes them hard at production scale is that you want millions of them, addressed by name, available globally, with state that survives the host machine being rebooted, and you do not want to pay to keep them all warm in memory.

Standard cloud primitives don't give you this. A pod doesn't have durable state by default. A function-as-a-service invocation is stateless and amnesiac. A database row has state but no compute. A queue has compute-on-arrival but no addressable identity. You end up gluing these together — a pod with a sidecar reading from a queue, persisting to a database, registered in a service mesh — and the glue is the work that consumes the engineering team.

A Durable Object is a Cloudflare Worker that uniquely combines compute with storage. Each one has a globally unique name. Each one has its own private, embedded SQLite database (up to 10 GB on Workers Paid). Each one is automatically provisioned geographically close to where it is first requested, starts up quickly when needed, and shuts down when idle. They migrate among healthy servers without the application caring. There is no node pool, no cluster, no scheduler to configure, and no warm-pool capacity to reserve. You write a class, give each real-world thing a stable name, and the platform handles everything else.
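The named-addressing guarantee is easy to model locally. In a Worker, the real calls are `env.NAMESPACE.idFromName(name)` followed by `env.NAMESPACE.get(id)`, where `NAMESPACE` stands for whatever binding you configure. The TypeScript below is an in-memory sketch, not the Cloudflare API: `NamedInstances` and `DeckSession` are hypothetical names, and the sketch shows only the addressing property (same name, same instance, private state).

```typescript
// In-memory model of "one addressable instance per name" (not the Cloudflare API).
// Each instance owns private state; asking for the same name returns the same instance.
class DeckSession {
  private readonly questions: string[] = [];
  constructor(readonly id: string) {}

  addQuestion(q: string): number {
    this.questions.push(q);
    return this.questions.length;
  }

  questionCount(): number {
    return this.questions.length;
  }
}

class NamedInstances<T> {
  private readonly instances = new Map<string, T>();
  constructor(private readonly create: (id: string) => T) {}

  // Get-or-create by stable name, loosely mirroring idFromName(...) + get(...).
  get(id: string): T {
    let instance = this.instances.get(id);
    if (instance === undefined) {
      instance = this.create(id);
      this.instances.set(id, instance);
    }
    return instance;
  }
}

const decks = new NamedInstances((id) => new DeckSession(id));
decks.get("deck-42").addQuestion("What is the pricing model?");
decks.get("deck-42").addQuestion("Is there an SLA?");
decks.get("deck-7").addQuestion("Does it support SSO?");

console.log(decks.get("deck-42").questionCount()); // 2
console.log(decks.get("deck-7").questionCount()); // 1
```

What the sketch cannot model is what makes the real primitive worth using: state that survives restarts, hibernation when idle, and placement near the first caller.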

That is the missing primitive. The reason it matters is that it changes what shape of system one person can build.

## What the Agents SDK adds on top

The raw Durable Object primitive is general-purpose. The Cloudflare Agents SDK is an opinionated wrapper for the agent-shaped subset of what Durable Objects can do, and it is where most production agent code should live.

The class hierarchy is worth understanding because it tells you what's free and what you write. The full hierarchy is `DurableObject > Server > Agent > AIChatAgent`. DurableObject gives you compute-plus-storage and an addressable identity. Server, from the partyserver package, adds URL routing, named-instance addressing, an onStart lifecycle callback, and WebSocket session management. Agent extends Server with scheduling (one-time, cron, and on-delay), an MCP client for outbound tool calls, AsyncLocalStorage-based context propagation, email handling via Cloudflare Email Routing, and a built-in SQL table for managed schedules. AIChatAgent extends Agent with the streaming chat patterns most production agents need.

If you are building a chat agent, you start with AIChatAgent and you are 80% done before you write a line of business logic. If you are building a workflow agent or a background-processing agent, you start with Agent. If you need fully custom behavior, you go down to Server or directly to DurableObject. The descent into lower layers is rare in practice. The default opinions are correct for most agent workloads, which is what you want from an SDK: be right by default, get out of the way when you need it to.

A specific feature of the Agent class is worth calling out because it solves a problem that has cost me weeks on previous platforms. Durable Objects allow only one alarm at a time per object. The Agent class works around this by managing multiple schedules in a SQL table (`cf_agents_schedules`) and using a single Durable Object alarm to fire whichever schedule is next. Cron schedules automatically reschedule themselves after execution. One-time schedules delete themselves. You write `this.schedule(...)` and the entire scheduler is handled. There is no scheduler service to run. There is no scheduler service to monitor. There is no scheduler service to debug at 2 AM.
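The single-alarm scheduling pattern described above can be modeled in a few dozen lines. This is a hypothetical TypeScript sketch of the idea, not the Agents SDK's implementation: `ScheduleTable` is an illustrative name, an array stands in for the `cf_agents_schedules` table, a plain timestamp stands in for the Durable Object alarm, and fixed intervals stand in for cron expressions to keep it short.

```typescript
// Local model of "many schedules, one alarm". Not the Agents SDK source;
// fixed intervals replace cron expressions for brevity.
type Schedule =
  | { id: number; callback: string; runAt: number; kind: "once" }
  | { id: number; callback: string; runAt: number; kind: "every"; intervalMs: number };

class ScheduleTable {
  private rows: Schedule[] = []; // stands in for a SQL table of schedules
  private nextId = 1;
  alarmAt: number | null = null; // the object's single alarm

  scheduleOnce(callback: string, runAt: number): void {
    this.rows.push({ id: this.nextId++, callback, runAt, kind: "once" });
    this.resetAlarm();
  }

  scheduleEvery(callback: string, firstRunAt: number, intervalMs: number): void {
    this.rows.push({ id: this.nextId++, callback, runAt: firstRunAt, kind: "every", intervalMs });
    this.resetAlarm();
  }

  // Point the single alarm at whichever schedule is due next.
  private resetAlarm(): void {
    this.alarmAt = this.rows.length === 0 ? null : Math.min(...this.rows.map((r) => r.runAt));
  }

  // Called when the alarm fires: run everything due, then re-arm.
  onAlarm(now: number): string[] {
    const ran: string[] = [];
    const keep: Schedule[] = [];
    for (const row of this.rows) {
      if (row.runAt > now) {
        keep.push(row); // not due yet
      } else {
        ran.push(row.callback);
        if (row.kind === "every") {
          keep.push({ ...row, runAt: row.runAt + row.intervalMs }); // recurring: reschedule
        } // one-time: dropped after running
      }
    }
    this.rows = keep;
    this.resetAlarm();
    return ran;
  }
}

const table = new ScheduleTable();
table.scheduleEvery("sendWeeklyReport", 1_000, 5_000);
table.scheduleOnce("sendReminder", 3_000);
console.log(table.alarmAt); // 1000
console.log(table.onAlarm(1_000)); // runs sendWeeklyReport, which re-arms for 6000
console.log(table.alarmAt); // 3000
console.log(table.onAlarm(3_000)); // runs sendReminder, which then disappears
console.log(table.alarmAt); // 6000
```

The design choice is that the alarm always points at the earliest pending row, so adding a schedule never needs a second alarm. Recurring rows re-insert themselves and one-time rows drop out, matching the behavior described above.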

That pattern — "the platform absorbs the variance, the application code shrinks" — repeats throughout the SDK. WebSocket hibernation is handled. Reconnection routing is handled. State synchronization to connected clients is handled. The application code is what you write because there's nobody else to write it. Everything else is the platform.

## What I've shipped on this primitive

Three production systems over the last six months, briefly, because the pattern matters more than the specifics:

**Per-deck AI agents for a SaaS product.** Each presentation deck in the platform gets its own Durable Object, addressed by deck ID. The DO holds the per-deck conversation state, embeddings index pointers, RAG namespace assignment, and the audience interaction history for that deck. When a presenter starts the deck, the DO wakes. When audience members SMS in questions, the DO routes them through the appropriate retrieval pipeline. When the presenter ends the session, the DO hibernates. The data isolation guarantee is _architectural_, not policy — a different deck is a different DO with a different SQLite database, full stop. There is no shared multi-tenant database to leak across. The deployment is one Worker and one DO class. There is no separate API service, no Redis, no per-tenant database, no queue. The whole system is the DO and the Workers AI calls it makes.

The chat surface below is an illustration of exactly that shape — one Durable Object, addressed by a deck ID, holding its own SQLite-backed session state. It is a static demo, not a live agent, but the architecture it sketches is the one shipping in production:

**A recursive coding agent CLI.** Each user session is a DO, addressed by session ID. The orchestrator runs in the DO. The scouts are fan-out Workers AI calls that the orchestrator dispatches in parallel. The DO holds the full session transcript, the current task state, the tool authorization tokens, the file modification log, and the rollback points. The user can quit the CLI and resume the session three days later — the DO has been hibernated, comes back online when the CLI reconnects, and the session continues from exactly where it left off, because the SQLite database has been sitting there the entire time. Resumability is not a feature I built. It is a property of the architecture.

**A federal-pilot document-processing agent.** Each document submitted by an end user becomes a DO, addressed by document ID. The DO holds the parsing state, the OCR results, the per-page redaction decisions, the audit log of every model call made against the document, and the human-review status. The audit log is the deliverable — every inference, with prompt, model, latency, and output, written to the DO's SQLite database at the moment it happens. When the agency wants the audit trail for any document, they query the DO. The audit trail is a property of the runtime, not a separate compliance system bolted on the side.
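A sketch of the "audit log is the deliverable" idea: every model call goes through one method that records the prompt, model, latency, and output before returning. `DocumentAuditLog` and `ModelCall` are hypothetical names. An array stands in for the DO's SQLite table, the clock is injectable so the example is deterministic, and the model call is synchronous only to keep the sketch short (real inference calls are asynchronous).

```typescript
// Hypothetical audit record with the fields described above.
interface AuditEntry {
  documentId: string;
  model: string;
  prompt: string;
  output: string;
  latencyMs: number;
  at: string;
}

// Synchronous stand-in for a model call; real inference is async.
type ModelCall = (model: string, prompt: string) => string;

// Per-document audit log. An array stands in for the document's own SQLite table.
class DocumentAuditLog {
  private readonly entries: AuditEntry[] = [];
  constructor(
    readonly documentId: string,
    private readonly clock: () => number = Date.now,
  ) {}

  // Every model call goes through here, so no call can skip the log.
  infer(call: ModelCall, model: string, prompt: string): string {
    const start = this.clock();
    const output = call(model, prompt);
    const end = this.clock();
    this.entries.push({
      documentId: this.documentId,
      model,
      prompt,
      output,
      latencyMs: end - start,
      at: new Date(end).toISOString(),
    });
    return output;
  }

  // The audit trail for this document is a read of its own log.
  trail(): readonly AuditEntry[] {
    return [...this.entries];
  }
}

let t = 0;
const fakeClock = () => (t += 25); // each read advances 25 ms
const fakeModel: ModelCall = (model, prompt) => `${model}:${prompt.toUpperCase()}`;

const doc = new DocumentAuditLog("doc-001", fakeClock);
doc.infer(fakeModel, "ocr-model", "page 1 text");
doc.infer(fakeModel, "redaction-model", "page 1 entities");
console.log(doc.trail().map((e) => `${e.model} ${e.latencyMs}ms ${e.at}`));
```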

Three systems. Same primitive. The infrastructure code I wrote across all three would fit in a hundred lines, generously. The remaining code is application logic.

## Why this collapses the architecture stack

Let me make the architecture argument explicitly, because the "we ship faster" claim is downstream of a specific structural property.

In a traditional production agent architecture, the data tier is a separate system. You run a database. You connect to the database from the agent process. You manage the connection pool. You handle the schema migrations. You worry about transaction isolation. You worry about read replicas. You back the database up. You restore it when it breaks. The database is somebody's job — at scale, it's a team's job.

On Durable Objects, the data tier is the agent. Each DO has its own embedded SQLite database, isolated by construction, transactionally consistent with the agent's own execution, and persisted to durable storage by the platform. You issue a SQL statement against the DO's own `storage.sql` handle and you are querying an instance-private database that nobody else can touch. The connection pool doesn't exist. The migration story is whatever your application does on startup. The transaction isolation is single-writer-per-DO, which is the strongest possible guarantee. The replication is the platform's problem.

The queue tier collapses similarly. In a traditional system, you have a queue between services so that work can be handed off durably. On Durable Objects, work directed at a specific agent is just an RPC call against that DO's name. Each DO runs on a single thread, so requests against it are handled one event at a time rather than racing each other in parallel. You don't need a queue to coordinate "exactly one worker handling this thing at a time": the runtime is already that.

The scheduler tier collapses similarly. The Agent class's built-in schedule management means you write `this.schedule('0 9 * * 1', 'sendWeeklyReport')` and the scheduling is done. There is no cron service. There is no scheduler control plane. There is no missed-trigger investigation.

The session-affinity tier collapses similarly. WebSocket clients connect by Durable Object name. Reconnection finds the same DO because the addressing is the same. There is no sticky-session configuration. There is no Redis-backed session lookup. There is no "which pod is this user on" question.

What's left, when those four layers collapse, is the application code. Which is what should have been the whole job all along.

## The honest tradeoffs

I would not be writing this if Durable Objects were unconditionally the right answer. They are not. The tradeoffs that matter:

**Single-threading per object.** Each DO is single-threaded. If your "agent" is actually a high-throughput aggregator that needs to handle a thousand concurrent requests against shared state, the single-DO model is going to bottleneck. The answer is usually to shard the workload across many DOs and have a coordinator (often another DO) route requests to them. If your mental model is "one big database with a hot row," you need to rethink the data model before you reach for Durable Objects.
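The sharding advice above can be sketched as a stable key-to-name mapping. This assumes a 32-bit FNV-1a hash (any stable hash would do) and a hypothetical `shard-N` naming scheme; a coordinator would pass the resulting name to `idFromName(...)` so that all requests for one key land on one shard while different keys spread across many objects.

```typescript
// 32-bit FNV-1a. FNV-1a is defined over bytes; charCodeAt yields UTF-16 code units,
// which match the bytes for ASCII keys like the ones used here.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, multiplied mod 2^32
  }
  return hash >>> 0; // reinterpret as unsigned
}

// Map a key to a stable shard name; each name would address its own object.
function shardFor(key: string, shardCount: number, prefix = "shard"): string {
  return `${prefix}-${fnv1a32(key) % shardCount}`;
}

// Local stand-in for N shard objects: count how many keys each shard would own.
function countByShard(keys: string[], shardCount: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (const key of keys) {
    const shard = shardFor(key, shardCount);
    counts.set(shard, (counts.get(shard) ?? 0) + 1);
  }
  return counts;
}

const userKeys = Array.from({ length: 1_000 }, (_, i) => `user-${i}`);
const byShard = countByShard(userKeys, 8);
// Instead of 1,000 keys funneled through one object, load spreads across up to 8.
console.log(Array.from(byShard.entries()));
```

One caveat on the design: with a power-of-two shard count, `%` uses only the hash's low bits, so pick a well-mixed hash or a non-power-of-two count if your keys are highly regular.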

**Cold-start latency.** A DO that has been hibernated takes a beat to wake. For most agent workloads this is fine — agents are not latency-sensitive at the millisecond level. For some workloads (live voice agents, interactive editing) the cold-start can be noticeable, and you architect around it by keeping high-traffic DOs warm with light periodic activity.

**Vendor concentration.** Durable Objects run on Cloudflare. The primitive does not have a drop-in equivalent on AWS or GCP today, though the actor model exists in various forms (Orleans, Akka, etc.) elsewhere. If your procurement requirements demand multi-cloud, you will need an abstraction layer above the primitive, which gives back some of what the primitive was buying you. For most commercial workloads this is not a real constraint. For some federal workloads it is, and I write to that constraint explicitly when I'm structuring the procurement (portability of the application logic and the data model, even when the runtime is single-vendor).

**Operational visibility.** "Where is my agent right now and what is it doing" is an operational question that the platform handles for you in normal operation, but when something goes wrong, the debugging story is different from "ssh into the box and look." You lean on logging, on tail-streamed traces, on the platform's observability tools. This is mostly a learning curve, not a real limitation, but it is real and worth budgeting time for.

None of those tradeoffs change the architectural claim. They are the cost of admission to the architectural model, and for production agent workloads in my experience the cost is paid back inside the first month of shipping.

## The pattern, generalized

If I had to compress the architectural lesson of the last six months into a single sentence, it would be: **the right unit of deployment for a production agent is the agent, not the service that hosts the agent.**

Every conventional production architecture has you deploying the _host_ — the API service, the worker pool, the database cluster — and then arranging for the agents to live somewhere inside those hosts. That's backward. The agent is the noun. The host should be invisible. Durable Objects make the host invisible, and that's why the architecture collapses to the application code.

Three systems shipped on this primitive in six months. The data tier doesn't exist as a separate concern. The queue doesn't exist as a separate concern. The scheduler doesn't exist as a separate concern. What's left is the work.

This is what production agent infrastructure looks like in 2026. The teams that have figured this out are going to keep getting faster. The teams that are still standing up Kubernetes clusters to run their agents are going to spend the next two years explaining why their AI cost per feature keeps climbing.

The primitive has been there the whole time. Most of the industry just hasn't picked it up yet.

---

Tony Adams is the founder of HotCopy and Truvisory®. He builds Cloudflare-native AI systems for federal and commercial clients. Verified SDVOSB and VOSB, SAM.gov-registered.

<li>Companion posts: <a href="/insights/gpu-math/">At 5% GPU utilization, the math doesn't work</a> and <a href="/insights/omb-buying-spec/">OMB M-25-21 reads like a buying spec for fixed-scope AI</a></li>

### Why mid-market keeps overpaying for AI strategy. — /insights/mid-market-strategy/

Five sales calls in the last six weeks. Different industries, different revenue bands, same conversation. The COO sits down, opens the laptop, and walks me through three slide decks they've already received from Big 4 advisory firms quoting $200K to $500K for an AI strategy and roadmap. The decks are professionally produced. The framework slides are clean. The case studies are real. The pricing footnotes are vague.

I ask the same question every time. "If you accept this proposal, who ships the system?"

Every time, the answer is some version of: "That's a separate engagement."

That is the arbitrage. The strategy deliverable is decoupled from the system that delivers the strategy's outcomes. The firm that wrote the roadmap is not on the hook for the roadmap working. The firm that builds the system, if there is one, inherits a 60-page document with executive summaries and target architectures and a long list of recommended vendor evaluations, none of which describes a working system in enough detail to actually build it. The mid-market COO has paid a quarter-million dollars for a planning artifact that produces another procurement, which produces another planning artifact, which eventually — if the budget survives — produces a project.

This pattern is so consistent that I have started keeping track. Of the five companies I have sat with this quarter, four had paid for AI strategy decks. Two had paid for two separate AI strategy decks from different firms. One had paid for three. The average spend on strategy work before a single line of code was written was just over $340K. Across the five companies, the clock from "we should do AI" to "we have AI in production" was still running: none of them had production AI yet.

This essay is about why that pattern exists, why it persists, and what the arbitrage looks like for vendors who can both think and ship.

## How the pattern got entrenched

The Big 4 advisory model was built for technology categories that are slow, expensive, and politically charged inside the buying organization. ERP migrations. Cloud adoption. Data warehouse rebuilds. In those categories, a six-month strategy engagement with a roadmap deliverable is a defensible spend, because the implementation will take 18 to 36 months and cost ten times the roadmap, and the strategy work materially de-risks the implementation. You pay for the planning artifact because the implementation downstream is so expensive that getting it 10% more right is worth seven figures.

The advisory firms ported that model to AI. The shape of the deliverable did not change. The shape of the implementation did. AI features in 2026 do not take 18 to 36 months to ship. A competent solo engineer on a modern stack can ship a customer-facing AI feature in two to six weeks. A small team can ship a multi-agent production system in a quarter. The implementation is now smaller than the strategy deliverable, in both time and cost, and the strategy work no longer de-risks the implementation because the implementation is short enough that you can just _try it_ and see what happens.

This is the inversion the advisory firms have not adjusted to. When the implementation is the slow expensive thing, planning is a good investment. When the implementation is the fast cheap thing, planning is just a tax. The mid-market COOs paying $340K for strategy decks are paying the tax without realizing the underlying physics of the project category has changed.

The reason the advisory firms have not adjusted is straightforward. Their cost structure is partners, principals, and a leverage pyramid of analysts. That cost structure requires high-margin advisory engagements to function. A two-week build engagement does not feed the pyramid. The economic incentive of the firm is to keep selling the deliverable shape that matches the cost structure, even when the deliverable shape no longer matches the project category. This is not malice. It is gravity. Firms cannot easily restructure themselves away from the engagement shape that pays for their offices.

## What the mid-market actually needs

The COO calls I've been on are not asking for thought leadership. They are asking for two things, and they often don't have language for the second one because the advisory pitch has trained them to ask for the first.

The thing they ask for: "We need a strategy for AI in our business."

The thing they actually need: a working AI feature, shipped to production, in a part of their business where the value is observable to a non-technical executive inside one quarter, with a clean handover model so they can iterate on it without paying a vendor in perpetuity.

The first ask is what a Big 4 deck answers. The second ask is what nobody is bidding against, because the advisory firms can't, and the integrators are still pricing 12-month staff-augmentation engagements that don't match the new physics either.

The arbitrage is that there is a $50K to $150K engagement shape — fixed scope, working deliverable, four to eight weeks, clean IP transfer — that satisfies the COO's actual need and does not exist in the market as a standard offering. The vendors who could deliver it are mostly small specialty shops that don't have a sales motion to reach the mid-market COO. The mid-market COO doesn't know to ask for it, because the inbound pitches they receive are uniformly the advisory model.

Closing the gap is the work. The firm that builds a repeatable motion for "executive identifies the business outcome, vendor ships the system in six weeks at fixed price, client keeps the code" captures a wedge that the existing advisory market cannot price against without restructuring itself.

## The cost-of-delay nobody calculates

Here is the part of the conversation I have started forcing onto every COO call, because it changes the decision frame in a way most of them haven't internalized.

The $340K average strategy spend is not the real cost. The real cost is the six to nine months the strategy engagement consumes before any AI ships, during which the company is operating without the AI feature that the strategy is recommending. If the recommended AI feature would have produced $100K/month in margin or savings or pipeline acceleration, the company has left $600K to $900K on the table while they paid $340K to plan the thing. The strategy work cost roughly a million dollars in real money ($940K to $1.24M), not $340K, once you account for what the company didn't ship while they were getting strategized.
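The cost-of-delay arithmetic is simple enough to write down as a function a COO can rerun with their own numbers. A minimal sketch: the $340K fee, $100K/month of forgone value, and six-to-nine-month delay are the figures from this section; `costOfDelay` and its field names are illustrative, not part of any tool.

```typescript
// Total economic cost of a strategy engagement: the fee paid plus the value
// forgone while nothing ships. Names are illustrative.
interface DelayInputs {
  strategyFeeUsd: number;       // what the advisory engagement is invoiced at
  monthlyValueUsd: number;      // margin, savings, or pipeline the AI feature would add per month
  monthsBeforeShipping: number; // months the engagement runs before anything reaches production
}

function costOfDelay(i: DelayInputs): { forgoneUsd: number; totalUsd: number } {
  const forgoneUsd = i.monthlyValueUsd * i.monthsBeforeShipping;
  return { forgoneUsd, totalUsd: i.strategyFeeUsd + forgoneUsd };
}

// The six- to nine-month range discussed above.
for (const months of [6, 9]) {
  const { forgoneUsd, totalUsd } = costOfDelay({
    strategyFeeUsd: 340_000,
    monthlyValueUsd: 100_000,
    monthsBeforeShipping: months,
  });
  console.log(`${months} months: forgone $${forgoneUsd}, total $${totalUsd}`);
}
// 6 months: forgone $600000, total $940000
// 9 months: forgone $900000, total $1240000
```

Swapping in a different monthly value is the whole exercise: the fee is fixed, and the forgone value scales linearly with every month the engagement runs before anything ships.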

This argument lands hard with COOs who are graded on quarterly numbers, and it lands soft with COOs who are graded on multi-year transformation initiatives. The mid-market COOs I'm calling on are mostly graded on quarterly numbers. Which is why the pitch I'm running closes more often than you'd expect: I'm pricing against the cost of delay, and the advisory firms are pricing against the planning artifact.

## What this looks like from the buying side

If you are a mid-market executive reading this, the questions to ask any AI advisory vendor pitching you in 2026 are short and direct:

**What working AI feature do you ship as part of this engagement, and to what level of production-readiness?** If the answer is "we don't, we recommend who you should buy from," you are buying a planning artifact, and you should price it as such — meaning, against the planning artifacts you can get from other vendors, not against the value of the system that will eventually exist.

**How much of the engagement budget is spent on partners, principals, and senior advisors versus on the engineers who would actually build the system?** If the ratio is heavily weighted to the advisory side, you are paying for the cost structure, not the deliverable.

**What does the handover look like at the end?** If the answer involves perpetual vendor dependency, ongoing managed services contracts, or "we'll have a team embedded for the next 18 months," the engagement is structured around the vendor's revenue continuity, not around your operational independence.

**What is the cost-of-delay calculation for the recommended AI feature, and how does the engagement timeline compare?** If the vendor cannot calculate the cost of delay, they are not engaged in the conversation that matters to your P&L.

If the answers to those four questions don't satisfy you, the engagement is mispriced for the work you actually need done. Push back. Or move on.

## Where I think this lands

The advisory firms are going to continue selling six-month strategy engagements at $200K to $500K for at least the next 18 months, because the inbound demand is real, the partners are well-compensated, and the COOs paying for the decks are mostly satisfied with the engagement experience even when the deliverable is not a working system. The market will not correct on its own. It corrects when a critical mass of mid-market executives compare their AI strategy spend against the AI features that actually shipped over the same period and start asking harder questions.

That moment is coming. Probably 2027. Possibly sooner if the macroeconomic environment tightens and CFOs start auditing advisory spend more aggressively.

Until then, the arbitrage is open. The vendors that can think and ship — that can sit in a COO conversation as a peer, identify the right first project, scope it tightly, price it as a fixed deliverable, and put working AI in production inside one quarter — are going to capture the mid-market AI work that the advisory firms can't structurally win. The window is the next 18 months. It closes when the advisory firms restructure or the COOs reset their expectations, whichever happens first.

Both of those changes are slower than the engagement cycle. Which is the whole point.

---

Tony Adams is the founder of HotCopy and Truvisory®. He builds Cloudflare-native AI systems for federal and commercial clients. Verified SDVOSB and VOSB, SAM.gov-registered.

- [At 5% GPU utilization, the math doesn't work](/insights/gpu-math/)
- [OMB M-25-21 reads like a buying spec for fixed-scope AI](/insights/omb-buying-spec/)
- [Durable Objects are the missing primitive for production agents](/insights/durable-objects/)

### Recursive Language Models, in production. — /insights/rlm-production/

This is a field report. Not a paper review, not a theoretical argument, not a pitch deck. A specific job ran on a specific evening last week and produced specific numbers, and the architecture underneath those numbers is the same architecture the MIT CSAIL team described in arXiv:2512.24601 eight weeks earlier. The point of writing this is to put real production data against the academic claim and tell you what held, what didn't, and what the implications are for any engineering team thinking about how AI fits into the actual day-to-day of shipping software.

The job: refactor a five-year-old authentication implementation in a mid-sized TypeScript/Node monorepo from a legacy OAuth2 implicit flow with session cookies to OAuth2 authorization code flow with PKCE, refresh token rotation, and an updated token storage model. The codebase is roughly 180k lines across services. Auth touchpoints are scattered across roughly 60 files — middleware, route handlers, client libraries, test suites, fixture builders, and the migration scripts that touch any of the above. A competent engineer who has done this kind of migration before would estimate three to five days of focused work, plus testing, plus the inevitable follow-up week of catching the edge cases that didn't show up in dev.

We ran it on HotCopy. The job completed in **6 minutes and 14 seconds**. Total inference cost: **$1.34**. The output was a working migration whose tests passed on the first run, after one recursion cycle to fix a single cross-scout conflict the orchestrator flagged before declaring done.

The rest of this post (about 14 minutes of reading) is the technical explanation of why those numbers are achievable on a managed RLM stack running on Cloudflare Workers AI, what specifically the orchestrator-plus-scout pattern compresses out of the traditional engineering workflow, and where this approach falls apart, because it does, in specific cases worth naming.

## What the paper actually proposed

The Zhang, Kraska, and Khattab paper from MIT CSAIL ([arXiv:2512.24601](https://arxiv.org/abs/2512.24601), December 2025) does not propose a new model. It proposes an _inference strategy_. The key reframing: treat the prompt as an external programmable environment that the model can decompose, examine, and recursively call itself on, snippet by snippet, rather than as a static input the model has to consume in one forward pass.

The paper's headline empirical claim is that a Recursive Language Model implementation running on a smaller base model can process inputs two orders of magnitude beyond the model's nominal context window, and can outperform vanilla frontier LLMs and common long-context scaffolds on diverse long-context tasks at comparable or lower per-query cost. The specific benchmark numbers in the paper are interesting but not the point I want to make. The point I want to make is structural: the paper validates an architectural pattern that has been kicking around the agent-frameworks community for two years, but treats it as a serious inference paradigm rather than a hack. That's the unlock.

The pattern, in plain language: a more capable "orchestrator" model holds the high-level plan and decomposes the work. Less expensive "scout" models do the parallel sub-tasks — read this file, summarize that function, validate this constraint, draft this implementation. Results return to the orchestrator. The orchestrator synthesizes, validates, and either declares done or recurses on the failing parts. No single model ever has to hold the whole problem at once.
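The control loop just described fits in a few lines. This is a minimal, synchronous illustration of the orchestrator-plus-scout pattern, not HotCopy's implementation: `Orchestrator`, `Scout`, `runRlm`, and the stub behaviors are hypothetical stand-ins for model calls, and a real system would run each wave's scouts concurrently rather than in a `map`.

```typescript
// A subtask the orchestrator hands to a cheaper "scout" model.
interface Mission {
  id: string;
  instructions: string;
}

interface ScoutReport {
  missionId: string;
  ok: boolean;
  output: string;
}

// Hypothetical interfaces standing in for the two model tiers.
interface Orchestrator {
  decompose(task: string): Mission[];
  validate(report: ScoutReport): boolean;
  synthesize(reports: ScoutReport[]): string;
}
type Scout = (mission: Mission, attempt: number) => ScoutReport;

// Fan out, validate each report, and recurse only on missions that failed.
// maxDepth is the number of retry passes allowed after the first wave.
function runRlm(task: string, orch: Orchestrator, scout: Scout, maxDepth = 2): string {
  const accepted: ScoutReport[] = [];
  let pending = orch.decompose(task);
  for (let attempt = 0; attempt <= maxDepth && pending.length > 0; attempt++) {
    const reports = pending.map((m) => scout(m, attempt)); // concurrent in a real system
    const failed = new Set<string>();
    for (const r of reports) {
      if (orch.validate(r)) accepted.push(r);
      else failed.add(r.missionId);
    }
    pending = pending.filter((m) => failed.has(m.id));
  }
  if (pending.length > 0) {
    throw new Error(`unresolved missions: ${pending.map((m) => m.id).join(", ")}`);
  }
  return orch.synthesize(accepted);
}

// Stubs: three missions, one of which fails on the first pass only.
const orch: Orchestrator = {
  decompose: () => ["middleware", "storage", "imports"].map((id) => ({ id, instructions: `update ${id}` })),
  validate: (r) => r.ok,
  synthesize: (rs) => rs.map((r) => r.output).sort().join(" | "),
};
const scout: Scout = (m, attempt) => ({
  missionId: m.id,
  ok: !(m.id === "imports" && attempt === 0),
  output: `diff for ${m.id}`,
});
console.log(runRlm("migrate to PKCE", orch, scout));
// diff for imports | diff for middleware | diff for storage
```

The design point is that the recursion re-dispatches only the failed missions; the accepted work is never recomputed, which is what keeps a corrective pass cheap.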

The reason this matters in production is that the engineering workflows we want to compress with AI are not bounded by model capability anymore. They are bounded by context — the amount of code, history, and surrounding system that has to be understood to make a correct change. The traditional approach of "stuff more into the context window" has been hitting diminishing returns since mid-2024. Context rot is real. Models get worse at recall as the window fills, regardless of what the marketing says about token counts. The RLM pattern routes around the problem by making sure no single model invocation has to look at more context than it can actually use.
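One way to enforce "no invocation sees more than it can use" is to pack files into scout missions under a fixed token budget. A minimal first-fit-decreasing sketch, assuming a rough four-characters-per-token estimate (a heuristic, not a real tokenizer) and a 40k-token budget echoing the per-scout ceiling in the execution log; `packMissions` and the file sizes are hypothetical.

```typescript
interface SourceFile {
  path: string;
  chars: number; // file size in characters
}

// Heuristic only: ~4 characters per token. A real system would use the
// target model's tokenizer.
const estimateTokens = (f: SourceFile): number => Math.ceil(f.chars / 4);

// First-fit decreasing: place each file (largest first) into the first
// mission with room left; open a new mission when none fits.
function packMissions(files: SourceFile[], budgetTokens: number): SourceFile[][] {
  const missions: { files: SourceFile[]; used: number }[] = [];
  const sorted = [...files].sort((a, b) => estimateTokens(b) - estimateTokens(a));
  for (const f of sorted) {
    const t = estimateTokens(f);
    if (t > budgetTokens) {
      throw new Error(`${f.path} alone exceeds the budget; split it before dispatch`);
    }
    const slot = missions.find((m) => m.used + t <= budgetTokens);
    if (slot) {
      slot.files.push(f);
      slot.used += t;
    } else {
      missions.push({ files: [f], used: t });
    }
  }
  return missions.map((m) => m.files);
}

// Hypothetical file sizes for illustration.
const files: SourceFile[] = [
  { path: "auth/middleware.ts", chars: 60_000 },
  { path: "auth/tokens.ts", chars: 90_000 },
  { path: "auth/session.ts", chars: 40_000 },
  { path: "auth/redirects.ts", chars: 20_000 },
];
const batches = packMissions(files, 40_000);
console.log(batches.map((b) => b.map((f) => f.path)));
```

Packing purely by size ignores relevance; in the pattern described here the orchestrator first groups files by mission, and a budget check like this one only guarantees that no single group overflows.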

## The execution, step by step

Here is what actually happened during those 6 minutes and 14 seconds.

**Seconds 0–8: indexing and dispatch.** The CLI shipped the local codebase manifest to the orchestrator — file paths, dependency graph, test surface map. The orchestrator received the task description (the OAuth2 → PKCE migration prompt) and produced an initial dispatch plan. The plan identified 9 parallel scout missions: map all current auth touchpoints, audit token storage patterns, audit session cookie usage, identify all redirect URI configurations, build PKCE code-verifier/challenge generation strategy, identify breaking changes for downstream services, draft the new middleware implementation, identify test surface that needs updating, and identify migration script patterns.

**Seconds 8–94: parallel scout execution, first wave.** The 9 scouts ran simultaneously. Each scout received only the files relevant to its mission — between 4 and 18 files per scout, never more than 40k tokens of code. Each scout produced a structured output back to the orchestrator: a findings document with specific file paths, line ranges, current patterns, and proposed changes. Total wall-clock for this wave: 86 seconds. If you imagine those nine missions run sequentially by a single big model, with each one waiting on the previous, the same work takes between 12 and 18 minutes on a frontier model with a 200k-token context, and the cost is substantially higher because each call carries the full project context whether it needs it or not.
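The parallel-versus-sequential comparison reduces to a max versus a sum: a wave finishes when its slowest scout returns, while a sequential run waits on every scout in turn. A minimal sketch; the nine durations are hypothetical placeholders, not the job's per-scout timings, which aren't broken out here.

```typescript
// Wall-clock seconds for one wave of scouts, given each scout's duration.
function waveWallClock(durationsSec: number[], parallel: boolean): number {
  if (durationsSec.length === 0) return 0;
  return parallel
    ? Math.max(...durationsSec)                     // all start together; wait for the slowest
    : durationsSec.reduce((sum, d) => sum + d, 0); // each waits on the previous one
}

// Nine hypothetical scout durations for a first wave.
const firstWave = [10, 20, 30, 40, 50, 60, 70, 80, 86];
console.log(waveWallClock(firstWave, true));  // 86
console.log(waveWallClock(firstWave, false)); // 446
```

The sum is only a lower bound for the sequential case as the text frames it, since that estimate also assumes a larger model carrying the full project context on every call.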

**Seconds 94–106: synthesis turn one.** The orchestrator received all nine scout reports. It produced a consolidated migration plan: the order of changes, the dependency relationships between them, the rollback strategy, and the test sequencing. This is the part of the work that benefits from a more capable model holding the whole picture: there are nine documents to reconcile, with constraints flowing between them, and the orchestrator has to produce a coherent execution order that respects all of them. The turn took 12 seconds of orchestrator compute, holding maybe 60k tokens of synthesized scout output and no raw code.

**Seconds 106–298: parallel scout execution, second wave.** Twelve scouts dispatched to actually write the changes. Each scout received the migration plan section relevant to it, the current contents of the files it would modify, and the relevant test fixtures. Each scout produced a unified diff for its files. Some scouts were trivially fast (single-file middleware update, 14 seconds). Some scouts were slower (the migration script generator took 92 seconds because it had to produce SQL that respected the existing schema and the new token rotation model). The wave completed when the last scout returned, at 298 seconds elapsed total.

**Seconds 298–342: synthesis turn two and validation.** The orchestrator received twelve unified diffs. It validated them against the migration plan: did every planned change get made, are there any conflicts in the diffs that would prevent clean application, are there any imports or references that one scout introduced that another scout removed. This is the part of the work that prevents the multi-agent system from producing the classic failure mode — six agents making locally correct changes that globally don't compose. The orchestrator caught one such conflict: the new middleware implementation referenced a utility function that the token storage refactor had renamed. It dispatched a tiny corrective scout (the recursion the paper talks about) to fix the import, completed at 342 seconds.
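The cross-scout check that caught the renamed utility can be approximated as a symbol-level comparison: collect every rename declared across the diffs, then flag any diff that still references an old name. A minimal sketch; `ScoutDiff`, its fields, and the symbol names are hypothetical, and a production validator would derive renames and references by parsing the diffs rather than trusting scout-declared metadata.

```typescript
interface ScoutDiff {
  scout: string;
  renames: Record<string, string>; // oldName -> newName introduced by this diff
  references: string[];            // symbols this diff's new code calls or imports
}

interface Conflict {
  scout: string;     // the diff that needs a corrective pass
  symbol: string;    // stale name it still references
  renamedBy: string; // the diff that renamed it
  newName: string;
}

function findRenameConflicts(diffs: ScoutDiff[]): Conflict[] {
  // Index every rename across all diffs by its old name.
  const renamed = new Map<string, { by: string; to: string }>();
  for (const d of diffs) {
    for (const [from, to] of Object.entries(d.renames)) renamed.set(from, { by: d.scout, to });
  }
  // Any reference to an old name is stale, whichever scout made it.
  const conflicts: Conflict[] = [];
  for (const d of diffs) {
    for (const sym of d.references) {
      const r = renamed.get(sym);
      if (r) conflicts.push({ scout: d.scout, symbol: sym, renamedBy: r.by, newName: r.to });
    }
  }
  return conflicts;
}

const conflicts = findRenameConflicts([
  { scout: "token-storage", renames: { loadToken: "readStoredToken" }, references: [] },
  { scout: "middleware", renames: {}, references: ["loadToken", "verifyPkce"] },
]);
console.log(conflicts);
```

Here the function returns a single conflict telling the `middleware` diff to use `readStoredToken`, which is exactly the shape of work a tiny corrective scout can take on.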

**Seconds 342–374: test run and final check.** The orchestrator applied the synthesized changes to a sandbox copy of the codebase and ran the existing test suite plus the new tests that the test-surface scout had drafted in the first wave. All tests passed. The orchestrator declared the job complete and surfaced the unified diff for review. Total wall-clock: 6 minutes 14 seconds. Total scout invocations: 22 (9 + 12 + 1 corrective). Total orchestrator turns: 3 (initial dispatch, synthesis one, synthesis two with validation).

The numbers, broken out:

- Scout invocations, 22 total: $0.92
- Orchestrator turns, 3 total: $0.42
- Total inference: **$1.34**

For a refactor that would consume between three and five engineer-days at a fully-loaded cost of somewhere between $4,000 and $8,000.
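Worked in integer cents (to keep the arithmetic exact), the breakdown also gives per-call averages and the ratio against the engineer-day estimate. The inputs are the figures above; the averages and ratios are plain arithmetic on them, not separately measured numbers.

```typescript
// All amounts in cents so the sums are exact integers.
const scoutCents = 92;        // 22 scout invocations
const orchestratorCents = 42; // 3 orchestrator turns
const totalCents = scoutCents + orchestratorCents;

const perScoutCents = scoutCents / 22;
const perTurnCents = orchestratorCents / 3;

// Fully loaded manual estimate from the text: $4,000 to $8,000.
const manualLowCents = 4_000 * 100;
const manualHighCents = 8_000 * 100;

console.log(`total: $${(totalCents / 100).toFixed(2)}`);                        // total: $1.34
console.log(`avg per scout: $${(perScoutCents / 100).toFixed(3)}`);             // avg per scout: $0.042
console.log(`avg per orchestrator turn: $${(perTurnCents / 100).toFixed(2)}`);  // avg per orchestrator turn: $0.14
console.log(
  `manual / inference: ${Math.round(manualLowCents / totalCents)}x to ${Math.round(manualHighCents / totalCents)}x`,
); // manual / inference: 2985x to 5970x
```

The per-turn average shows why the orchestrator tier can afford to be the more capable model: three turns account for under a third of the bill.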

## Why Workers AI matters specifically

The architectural argument for running this pattern on Cloudflare Workers AI rather than on traditional cloud inference is more subtle than the GPU-utilization argument I made [in last month's post](/insights/gpu-math/), and worth being specific about.

The orchestrator-plus-scout pattern is _bursty_. The first wave fanned out to nine parallel scouts for 86 seconds. The second fanned out to twelve, and concurrency fell off as the fast scouts returned. Between and after the waves, only the orchestrator or the test sandbox was working. If you provision GPU capacity for the peak fan-out (and you have to, because if you don't, your scouts queue and your wall-clock blows up), you provision for 12-way parallelism and leave most of that capacity idle for most of the job. Multiply this across thousands of jobs running asynchronously across many users, and the average utilization lands right around the 5% Cast AI documented as the production-cluster average.

Pay-per-inference inverts this. We pay for the 22 scout invocations and the 3 orchestrator turns that actually ran. We do not pay for the capacity those invocations _could have used_ but didn't. The fan-out is free as long as the work fits — twelve scouts in parallel costs twelve scouts in parallel, not "a reserved 12-GPU cluster sitting idle when the scouts complete." For a workload shape this bursty, the pricing model is the difference between $1.34 in inference and $50 in amortized cluster cost for the same job.
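The two pricing models can be compared as cost functions over the same busy time. Every price and duration below is a hypothetical placeholder chosen only to show the shape of the calculation (and to land on the 5% utilization figure); they are not Workers AI or cloud GPU list prices, and the $50 amortized-cluster figure above is not derived from them.

```typescript
interface Workload {
  peakConcurrency: number; // GPUs needed so the widest fan-out doesn't queue
  busyGpuSeconds: number;  // GPU-seconds of actual inference in the window
  windowSeconds: number;   // how long the reserved capacity is held
}

// Share of reserved capacity that did useful work.
const utilization = (w: Workload): number =>
  w.busyGpuSeconds / (w.peakConcurrency * w.windowSeconds);

// Reserved: pay for every GPU for the whole window, busy or not.
const reservedCost = (w: Workload, usdPerGpuHour: number): number =>
  ((w.peakConcurrency * w.windowSeconds) / 3600) * usdPerGpuHour;

// Pay-per-inference: pay only for busy time, typically at a higher unit rate.
const perInferenceCost = (w: Workload, usdPerBusyGpuSecond: number): number =>
  w.busyGpuSeconds * usdPerBusyGpuSecond;

// Hypothetical: 12-way peak, held for one hour, 2,160 busy GPU-seconds.
const w: Workload = { peakConcurrency: 12, busyGpuSeconds: 2_160, windowSeconds: 3_600 };
console.log(utilization(w));             // 0.05
console.log(reservedCost(w, 2.5));       // 30
console.log(perInferenceCost(w, 0.002)); // ≈ 4.32
```

At 5% utilization the reserved model bills twenty GPU-seconds of capacity for every one that does work, so the per-inference model wins even at a unit rate several times higher.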

The second reason Workers AI matters specifically is placement. The CLI is shipping diffs back to the engineer's machine in real time. The orchestrator state is in a Durable Object close to wherever the developer is; Cloudflare runs in 330+ cities, and the latency to the nearest one is typically single-digit milliseconds. The scouts run in regional inference centers near the orchestrator. The data path from "engineer presses enter" to "first scout returns" is short by construction, because the entire stack is co-located on the same edge network. On a traditional setup, the CLI calls an API gateway in us-east-1, which calls an orchestrator service, which calls an inference cluster, which routes to a GPU, which returns up the same chain. That round-trip overhead is paid on every dispatch and every return, so it compounds across the two dozen model calls in a job like this one.

The third reason — and this is the one most engineers underestimate until they've run it — is that the orchestrator itself is a Durable Object. The full transcript of the job, every scout dispatch, every scout return, every synthesis turn, every recursion decision, is recorded in the DO's embedded SQL database at the moment it happens. When the developer comes back to the CLI three hours later to inspect what got changed and why, the entire reasoning trace is still there. No separate logging service. No "where did the orchestrator state go" question. The same primitive that runs the orchestration also records the orchestration, by virtue of being the same Durable Object. [I wrote about this primitive specifically last quarter](/insights/durable-objects/); the application to RLM orchestration is one of the cleanest payoffs of the DO-based architecture I've found.
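The record-as-you-go property amounts to an append-only event log per job. A minimal in-memory sketch: `JobTranscript`, the event kinds, and the field names are hypothetical, and the timestamps are taken from the timeline above. Inside a SQLite-backed Durable Object the `append` body would instead run a parameterized `INSERT` through `this.ctx.storage.sql.exec(...)`; that call appears only as a comment because it needs the Workers runtime.

```typescript
type EventKind = "dispatch" | "scout_return" | "synthesis" | "recursion" | "done";

interface TranscriptEvent {
  seq: number;  // monotonically increasing within a job
  atMs: number; // elapsed milliseconds since job start
  kind: EventKind;
  detail: string;
}

class JobTranscript {
  private events: TranscriptEvent[] = [];

  // Record an event at the moment it happens. In a Durable Object this would
  // be roughly: this.ctx.storage.sql.exec(
  //   "INSERT INTO transcript (seq, at_ms, kind, detail) VALUES (?, ?, ?, ?)",
  //   seq, atMs, kind, detail);
  append(atMs: number, kind: EventKind, detail: string): TranscriptEvent {
    const event: TranscriptEvent = { seq: this.events.length + 1, atMs, kind, detail };
    this.events.push(event);
    return event;
  }

  // Replay all events (or one kind) in recorded order, e.g. hours later.
  replay(kind?: EventKind): readonly TranscriptEvent[] {
    return kind ? this.events.filter((e) => e.kind === kind) : [...this.events];
  }
}

const log = new JobTranscript();
log.append(8_000, "dispatch", "9 first-wave scout missions");
log.append(94_000, "scout_return", "first wave complete");
log.append(342_000, "recursion", "corrective scout returned: renamed import fixed");
log.append(374_000, "done", "tests passed; diff surfaced for review");
console.log(log.replay("recursion").map((e) => e.detail));
```

Because the log is written in the same step as the orchestration decision, there is no second system to fall out of sync with what the orchestrator actually did.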

## What the speed actually means

Six minutes for a three-day refactor is the headline number. It is also, on its own, the wrong way to think about what changed.

The interesting thing about compressing a multi-day workflow into a six-minute job is not that the work happens faster. It is that the _cost of attempting the work_ drops to nearly zero. An engineer evaluating whether to do a three-day refactor goes through a cost-benefit calculation: is this worth my next three days, given everything else on my plate, given the political work of getting buy-in from stakeholders, given the risk that I'll discover halfway through that the refactor is harder than I thought and I have to either ship something half-done or abandon a week of work. The cost-benefit math kills a lot of refactors that should happen. Technical debt accumulates not because engineers are lazy, but because the activation energy of paying it down is too high relative to the perceived benefit.

When the same refactor costs six minutes and $1.34 to _attempt_, the calculation changes entirely. The engineer dispatches the job, goes to get coffee, comes back to a unified diff that either solves the problem or surfaces exactly what's hard about solving the problem. If it solves it, great, ship it. If it doesn't, the engineer has spent six minutes and a dollar to develop a much more detailed understanding of why the refactor is hard, which informs the manual approach that follows. The exploratory cost has collapsed.

This is the part of the AI-for-engineering story that the productivity-multiplier framing misses. The metric that matters is not "how much faster does each individual task go." The metric is "how many tasks that previously didn't get done at all now get attempted, because the cost of attempting collapsed." Refactors that engineers used to defer indefinitely. Audits that nobody had time for. Test coverage backfilling that always lost out to feature work. Documentation generation that was nominally a priority and actually never happened. The work that was being deferred is the work that gets unlocked.

I have specific evidence of this in my own development workflow over the last six months. The number of refactors I've shipped is up roughly 4× year-over-year. The number of new features is up modestly, maybe 30%. The improvement is concentrated in codebase quality, not feature velocity. Which, frankly, is the right place for it to land: features are downstream of a codebase you can reason about.

## Where this falls apart

Honest about the limits, because they exist and they matter.

**The pattern works best when the task decomposes cleanly into parallel subtasks.** Refactoring is one of the cleanest cases — different parts of the codebase can be analyzed and modified independently, and the orchestrator's job is mostly to keep them coherent. Tasks that don't decompose cleanly — fundamentally sequential reasoning chains, where each step depends on the previous step's output — don't benefit from the fan-out and don't see the same compression. Debugging a complex production incident is often more sequential than parallel; the RLM pattern helps less there.

**The pattern requires the underlying codebase to be amenable to local reasoning.** If your codebase has implicit state everywhere, undocumented coupling between modules, or behavior that depends on environment-specific configuration not visible to the agents, the scouts will produce locally correct changes that globally don't work. The pattern is not a substitute for code quality. It works dramatically better on a codebase where files have meaningful boundaries and dependencies are mostly explicit. Which, to be fair, is a property you should want for human engineers anyway, but the AI pattern surfaces the absence of that property faster.

**The orchestrator can be wrong.** When the orchestrator makes a bad dispatch decision (decomposing the work in a way that doesn't actually parallelize, or sending scouts after the wrong subgoals), the entire job degrades, sometimes silently. The recursion mechanism helps with this (the orchestrator can notice it didn't get a clean synthesis and re-dispatch), but the failure mode of "orchestrator confidently produces a coherent-looking but wrong plan" is real. The mitigation is to run the orchestrator at a higher-capability tier than the scouts, which is what the pattern recommends anyway. The cost is acceptable because the orchestrator is only a few turns of the total work.

**The pattern doesn't help with the work the engineer should be doing.** Architectural decisions, product judgment, deciding what the right thing to refactor _toward_ — these are not tasks you dispatch to a scout. The RLM pattern compresses execution. It does not compress judgment. An engineering org that uses this pattern well will find their engineers spending more time on the work that requires actual thought, because the execution overhead of the work that didn't require thought has dropped. An org that misuses it will produce a lot of automated technical debt very efficiently.

**Audit and verification still matter.** The orchestrator validates that the scouts didn't conflict with each other and that the test suite passes. It does not validate that the test suite is sufficient. It does not validate that the refactor was the right refactor to do. The engineer reviewing the diff is still doing the work of "is this actually what I wanted." If you skip that step, you are running the pattern wrong, and you will eventually ship something subtly broken at high speed. The compression is in the execution. The judgment is still yours.

## What this implies for 2026

I'll close with the structural implication, because the field-report data is interesting on its own but the question I keep getting from engineering leaders is "what does this mean for how I should staff and structure my team in 2026."

The RLM-on-edge-inference pattern compresses execution time for code-heavy tasks by something like 50× relative to manual work, at a cost reduction in the same order of magnitude. This is not a productivity improvement of the kind that requires reorganization. It is a change in the unit economics of certain categories of engineering work, and the categories where it applies most strongly (refactors, audits, test backfilling, documentation, migrations) are categories that most engineering orgs have historically been chronically under-resourced on.

The first-order effect, which I think most orgs will internalize in 2026, is that one engineer becomes capable of operating at the throughput of a small team for these categories of work. The second-order effect, which most orgs will internalize in 2027, is that the _bottleneck_ shifts from execution to direction-setting. When ten engineers can each do five times as much execution as they used to, the constraint becomes "what should they be doing." That is a leadership problem, not an engineering problem, and most engineering leaders are not yet structured to operate at the increased decision throughput their teams now require of them.

The third-order effect, which I'd bet on for 2028 but won't insist on, is that the optimal team size for many software products gets meaningfully smaller. Not because engineers are being replaced, but because the coordination overhead of a large team starts to exceed the marginal output it produces when each engineer's individual throughput has multiplied. The mid-sized engineering org of 2023 — 40 to 80 engineers, multiple layers of management, dedicated platform and infrastructure teams — is not obviously the right shape for a 2028 product company. I would not bet against 15-person companies shipping product surface that today requires 80.

The architecture under all of this — orchestrator-plus-scout running on pay-per-inference edge compute, with stateful coordination in durable lightweight runtimes — is the architecture that makes the unit economics work. Recursive Language Models are the inference pattern that makes the work tractable. The combination is what's new. The implications are still being worked out by everyone, including me.

This is the field report. The numbers are real. The pattern works. The implications are interesting. The work continues.

---

Tony Adams is the founder of HotCopy and Truvisory®. He builds Cloudflare-native AI systems for federal and commercial clients. Verified SDVOSB and VOSB, SAM.gov-registered. HotCopy is in private beta at <a href="https://hotcopy.ai" target="_blank" rel="noopener">hotcopy.ai</a>.

Companion posts: [At 5% GPU utilization, the math doesn't work](/insights/gpu-math/), [Durable Objects are the missing primitive for production agents](/insights/durable-objects/)

### Region-pinning, audit logging, and the FedRAMP-aware edge stack. — /insights/fedramp-aware/

There is a specific class of federal AI engagement that does not fit cleanly into either of the two postures most vendors take. It is not full-ATO production work on a FedRAMP High environment, which requires a multi-year compliance investment and a contract value to match. And it is not pure commercial-stack work, which a federal contracting officer cannot accept for anything touching non-public agency data or production workloads. The class I'm describing — pilots, SBIR Phase I and II builds, prototypes, modernization R&D, sandboxed evaluations of new AI capabilities — sits in the middle. The agency wants real working software. The agency wants the data handled defensibly. The agency does not want, and cannot procure, a 36-month ATO process to evaluate a six-month build.

What that class needs is a _FedRAMP-aware_ commercial edge stack. Not authorized. Aware. Architected against the controls a future ATO would inherit, instrumented for audit, region-constrained where the data sensitivity requires it, and structured so that the eventual lift to a FedRAMP-authorized environment is an exercise in re-hosting rather than re-architecting. This is the engagement shape that the OMB AI memos (M-25-21 on federal use, M-25-22 on acquisition) implicitly favor, the shape that small-business set-aside lanes are sized for, and the shape that the Cloudflare developer platform supports natively if you architect to it deliberately.

This essay is the reference architecture I run for that engagement class. Five components, with the specific reasons each one is there. None of it is novel. All of it is precise about what the commercial Cloudflare stack does and does not do, where the FedRAMP boundary actually sits, and what an agency contracting officer should expect to see in a SOW.

## What "FedRAMP-aware" actually means

Before the architecture, a definition, because the term gets abused.

FedRAMP — Federal Risk and Authorization Management Program — is the government-wide program that standardizes security assessment, authorization, and continuous monitoring for cloud products and services used by federal agencies. A product is FedRAMP-authorized at one of three impact levels (Low, Moderate, High) only if it has completed the formal authorization process and is listed on the FedRAMP Marketplace. Cloudflare has FedRAMP-authorized offerings under the Cloudflare Government brand. The commercial Cloudflare developer platform — the one most engineers and most product builds use — is not FedRAMP-authorized. These are different SKUs, different control environments, and different contractual surfaces.

"FedRAMP-aware" is the term I use, and that I'd argue the industry should standardize on, for a commercial deployment that is architected against the controls a Moderate or High authorization would require, even though the deployment itself is not authorized. The agency cannot use a FedRAMP-aware deployment for production handling of FISMA-controlled data. The agency can use it for pilots, evaluations, prototypes, R&D, and any workload where the data sensitivity does not require an authorized environment. The point of building the pilot FedRAMP-aware is that when the pilot succeeds and the agency wants to move it to production, the lift is to a FedRAMP-authorized environment running the same architecture — not a re-architecture from scratch.

That is the engagement class this reference architecture serves. The rest of the essay is what the architecture looks like.

## Component 1: AI Gateway as the policy boundary

Every model call in the architecture goes through Cloudflare AI Gateway. Not most. Every.

AI Gateway is a policy-enforcement point that sits between your application code and the inference layer (Workers AI, OpenAI, Anthropic, or any provider you route through it). It gives you four things that matter for federal-aware deployments. First, it logs every inference request (model, prompt, response, latency, cost, status), and those logs can be exported with Logpush to storage you control. Second, it lets you enforce rate limits, retries, model fallback, and caching at the policy layer rather than baking those policies into application code. Third, it lets you route to different providers based on policy (e.g., never send agency data to a provider that doesn't sign a BAA-equivalent agreement). Fourth, it gives you a single chokepoint to revoke or rotate inference access if something goes sideways.

The architectural point is not that AI Gateway does anything magical. The point is that _every model call goes through one named, configurable, observable hop_. That is the property the federal audit posture requires, and most agentic systems do not have it by default because most agent frameworks let the agent call the model provider directly. Routing everything through AI Gateway is typically a base-URL change in each provider client, not an application rewrite. The audit posture it enables is what the agency contracting officer is going to ask about.
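The single-chokepoint rule is easiest to enforce when exactly one function in the codebase can build an inference URL. A minimal sketch, assuming AI Gateway's provider URL scheme (`https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_id}/{provider}/...`). The `ALLOWED_PROVIDERS` set and the `gatewayUrl` / `callModel` helpers are hypothetical names for illustration, and which providers belong on the allowlist is an engagement decision, not a Cloudflare default.

```typescript
// Hypothetical allowlist: providers cleared (by contract) to receive agency data.
const ALLOWED_PROVIDERS = new Set(["workers-ai", "openai", "anthropic"]);

interface GatewayConfig {
  accountId: string;
  gatewayId: string;
}

// Build the AI Gateway endpoint for a provider. Application code never holds a
// provider's own base URL, so every call crosses the same named hop.
function gatewayUrl(cfg: GatewayConfig, provider: string, path: string): string {
  if (!ALLOWED_PROVIDERS.has(provider)) {
    // Fail closed: an unapproved provider is a policy violation, not a fallback.
    throw new Error(`provider not approved for this workload: ${provider}`);
  }
  const cleanPath = path.replace(/^\/+/, ""); // tolerate a leading slash
  return `https://gateway.ai.cloudflare.com/v1/${cfg.accountId}/${cfg.gatewayId}/${provider}/${cleanPath}`;
}

// The only function allowed to issue inference calls. Provider-specific headers
// (auth scheme, API version) are passed in by the caller via `init`.
async function callModel(
  cfg: GatewayConfig,
  provider: string,
  path: string,
  init: RequestInit,
): Promise<Response> {
  return fetch(gatewayUrl(cfg, provider, path), init);
}
```

The convention only becomes the "every call" property when it is checked: a lint rule or review gate that rejects provider hostnames anywhere outside this module is the enforcement.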

The thing to be precise about: AI Gateway as deployed on the commercial Cloudflare stack is not a FedRAMP-authorized service. What it gives you is the _architectural shape_ that a FedRAMP-authorized inference deployment would have — single policy chokepoint, comprehensive logging, provider-routing controls — implemented on the commercial stack for the engagement class that doesn't require authorization.

## Component 2: R2 as immutable audit storage

Every AI Gateway log entry, every Durable Object state change you care about auditing, every document the system processes, every human-in-the-loop decision: all of it is written to R2, in a bucket with a bucket lock rule (R2's object-lock mechanism) enabled, so the application's runtime identity can write new objects but cannot delete or overwrite locked ones.

R2 is Cloudflare's S3-compatible object storage with zero egress fees. The features that matter for the audit-storage role are: bucket lock rules that give write-once-read-many retention (preventing the application from rewriting history if it gets compromised or misconfigured), object lifecycle rules for retention scheduling, API tokens that can be scoped to specific buckets and to read-only or read-write access, and location hints and jurisdictional restrictions that control where a bucket's data is stored.

The architectural pattern is straightforward but worth being explicit about. The application Worker has a binding to write to the audit bucket. A bucket lock rule on that bucket rejects deletes and overwrites of locked objects from any caller while the rule is in force, the application included. A separate retention-management role (held by a human operator with account-level permission to edit the bucket's lock and lifecycle rules, not by the application) is the only identity that can change retention periods. The audit trail is therefore _append-only by construction_: the application cannot tamper with its own history even if it is fully compromised. This is the property a federal audit posture requires, and it is achievable with bucket lock rules and permission discipline rather than with custom infrastructure.
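The bucket lock is the control that holds under compromise. The application can still add a cheap second layer by refusing to write to a key that already exists, so a bug surfaces as an error instead of a silent overwrite attempt. A minimal sketch, assuming a hypothetical `ObjectStore` interface that mirrors the subset of the R2 binding used here (`head` and `put`); `MemoryStore` and `AppendOnlyAuditLog` are illustrative names. The head-then-put check is not atomic, so it catches mistakes, not attackers.

```typescript
// Subset of the R2 binding's surface this sketch relies on.
interface ObjectStore {
  head(key: string): Promise<unknown | null>;
  put(key: string, value: string): Promise<unknown>;
}

// In-memory stand-in for tests and local development.
class MemoryStore implements ObjectStore {
  private objects = new Map<string, string>();

  async head(key: string) {
    return this.objects.has(key) ? { key } : null;
  }

  async put(key: string, value: string) {
    this.objects.set(key, value);
    return { key };
  }

  get size(): number {
    return this.objects.size;
  }
}

// Application-side guard: exposes append only (no delete), and never overwrites.
// Defense in depth; the bucket lock rule is the real enforcement.
class AppendOnlyAuditLog {
  constructor(private store: ObjectStore) {}

  async append(key: string, record: object): Promise<void> {
    if ((await this.store.head(key)) !== null) {
      throw new Error(`audit key already exists, refusing to overwrite: ${key}`);
    }
    await this.store.put(key, JSON.stringify(record));
  }
}
```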

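R2 itself is not FedRAMP-authorized on the commercial stack. The architectural pattern, however, ports directly to S3 on AWS GovCloud or to a FedRAMP-authorized object store at the lift moment. If the application reaches R2 only through its S3-compatible API, the endpoint and credentials are the only R2-specific details it holds, and re-hosting the audit storage is a configuration change, not a code change. A Worker that writes through the native R2 binding instead should keep those calls behind one small storage interface, so the swap stays contained to a single adapter. This is the property that makes "FedRAMP-aware" a real promise rather than a marketing line.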
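For the binding route, the containment looks roughly like this. A sketch under stated assumptions: `AuditSink`, `R2BucketLike`, `SinkConfig`, `makeSink`, and `recordDecision` are all hypothetical names, and `R2BucketLike` declares only the one binding method the adapter uses.

```typescript
// The only storage contract application code depends on.
interface AuditSink {
  write(key: string, body: string): Promise<void>;
}

// Minimal slice of the Workers R2 binding used by the pilot adapter.
interface R2BucketLike {
  put(key: string, value: string): Promise<unknown>;
}

// Pilot backend: R2 via the Worker binding.
function r2Sink(bucket: R2BucketLike): AuditSink {
  return {
    async write(key, body) {
      await bucket.put(key, body);
    },
  };
}

// Local and test backend.
function memorySink(store: Map<string, string>): AuditSink {
  return {
    async write(key, body) {
      store.set(key, body);
    },
  };
}

// Backend chosen in one place by deployment config, never at call sites.
type SinkConfig =
  | { backend: "r2"; bucket: R2BucketLike }
  | { backend: "memory"; store: Map<string, string> };

function makeSink(cfg: SinkConfig): AuditSink {
  switch (cfg.backend) {
    case "r2":
      return r2Sink(cfg.bucket);
    case "memory":
      return memorySink(cfg.store);
  }
}

// Application code: unaware of which backend sits underneath.
async function recordDecision(sink: AuditSink, id: string, decision: string): Promise<void> {
  await sink.write(`audit/decisions/${id}.json`, JSON.stringify({ id, decision }));
}
```

At the lift, a GovCloud-backed adapter would join `SinkConfig` as one more case; `recordDecision` and every other call site stay as written.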

## Component 3: Region-pinned Workers with explicit data flow

This is the component where the language matters most, because "region-pinning on Cloudflare" means something specific that is different from how the term is used elsewhere.

Cloudflare Workers run at the edge by default: your code is provisioned to all of Cloudflare's 330+ cities of presence, and requests are routed to the nearest one. For most commercial workloads this is the entire point. For federal-aware workloads, you sometimes need to constrain _where_ a specific Worker executes, typically to meet data residency or jurisdictional requirements. Cloudflare provides several mechanisms for this: Regional Services (part of the Data Localization Suite), which restricts the set of data centers that process traffic for a hostname; Smart Placement, which moves a Worker's execution closer to its back end (a latency optimization, not a residency guarantee); and Durable Object jurisdictions, which constrain where an object's state is stored and where it runs.

The architectural pattern I run for federal-aware deployments is this. Public-facing Workers (request ingress, static asset serving, rate limiting) run at the global edge: there is no benefit to pinning them and significant cost to it. Compute Workers that handle non-public agency data run in a constrained set of regions, typically US-only, configured explicitly and recorded in the deployment manifest. Durable Objects holding agency data are placed in a US-only jurisdiction, which the Durable Objects API sets when an object's ID is created. The data flow is documented as part of the architecture deliverable: every component, what region it runs in, what data crosses what boundary, what jurisdiction governs each store.
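The placement rules above are mechanical enough to check in CI. A minimal sketch, assuming a hypothetical component manifest kept in the repository next to the Wrangler config; the `ComponentSpec` / `DataFlow` shapes and the two data classes are illustrative, not a Cloudflare format.

```typescript
type DataClass = "public" | "agency-nonpublic";

// Hypothetical manifest entry: one per deployed component.
interface ComponentSpec {
  name: string;
  kind: "worker" | "durable-object" | "r2-bucket";
  placement: "global" | "us-only";
  handles: DataClass[];
}

// One declared edge in the data-flow diagram.
interface DataFlow {
  from: string;
  to: string;
  data: DataClass;
}

// Return every rule violation; an empty list means the manifest is consistent.
function checkManifest(components: ComponentSpec[], flows: DataFlow[]): string[] {
  const byName = new Map(components.map((c) => [c.name, c] as const));
  const problems: string[] = [];

  // Rule 1: anything touching non-public agency data must be US-only.
  for (const c of components) {
    if (c.handles.includes("agency-nonpublic") && c.placement !== "us-only") {
      problems.push(`${c.name} handles non-public agency data but is not US-only`);
    }
  }

  // Rule 2: every flow connects declared components, and the receiver
  // declares the data class it is being sent.
  for (const f of flows) {
    const to = byName.get(f.to);
    if (!byName.has(f.from) || !to) {
      problems.push(`flow ${f.from} -> ${f.to} references an undeclared component`);
      continue;
    }
    if (!to.handles.includes(f.data)) {
      problems.push(`${f.to} receives ${f.data} data it does not declare`);
    }
  }
  return problems;
}
```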

The documentation is the deliverable as much as the running system, because the agency's authorization-equivalent review is going to ask for a data-flow diagram, and the diagram has to match the running configuration. The mistake I see new federal-pilot vendors make is treating the configuration and the documentation as separate artifacts. They should be generated from the same source — typically the Wrangler config and a small documentation tool that reads it.
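Generating the diagram's source table from the same manifest is what keeps the two from drifting. A minimal sketch, assuming the same hypothetical manifest shape as above; `renderComponentTable` is an illustrative name, and the output is a Markdown table for the data-flow appendix.

```typescript
interface ComponentSpec {
  name: string;
  kind: string;
  placement: "global" | "us-only";
  handles: string[];
}

// Render the component inventory as a Markdown table.
// Rows are sorted by name so the generated doc diffs cleanly in review.
function renderComponentTable(components: ComponentSpec[]): string {
  const header = "| Component | Kind | Placement | Data handled |\n|---|---|---|---|";
  const rows = [...components]
    .sort((a, b) => a.name.localeCompare(b.name))
    .map((c) => `| ${c.name} | ${c.kind} | ${c.placement} | ${c.handles.join(", ")} |`);
  return [header, ...rows].join("\n");
}
```

Running this in CI and failing the build when the committed document differs from the generated one turns "the diagram matches the running configuration" into a checked property rather than a promise.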

The honest tradeoff: pinning Workers to specific regions reduces the global edge benefit. Latency goes up for users outside the pinned region. For most federal pilots this is acceptable because the user population is also US-based. For agencies with international user populations (State, USAID, DoD components with overseas operations) this needs explicit attention in the architecture.

## Component 4: Role-based access via Cloudflare Access (Zero Trust)

Every operator-facing surface — admin UIs, dashboards, configuration endpoints, log viewers, the R2 retention-management role from Component 2 — sits behind Cloudflare Access with SSO integration to the agency's identity provider, MFA enforcement, and role-based policies.

Cloudflare Access is part of Cloudflare One (the Zero Trust platform) and provides identity-aware proxy access to internal applications. The federal-aware deployment pattern uses Access for three layers of control. First, the agency's existing SSO (Okta, Microsoft Entra ID (formerly Azure AD), or an agency-specific IdP) is the identity source, so the application keeps no separate user accounts. Second, MFA is enforced at the Access layer rather than per-application, so the policy is consistent across every operator surface. Third, role-based access policies map agency users to specific application permissions (read-only auditor, retention manager, application administrator, end user) at the Access policy level, with the application reading the asserted identity from the signed JWT that Access forwards with each request.
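Access forwards its signed token to the origin in the `Cf-Access-Jwt-Assertion` header, and the application must verify that token's signature, audience, and expiry against the team's published keys before trusting any claim. The sketch below starts after that step. It assumes the verified payload carries a `groups` claim from the IdP (whether it does depends on how the identity provider and the Access application are configured), and the group names, `rolesFor`, and `can` are hypothetical.

```typescript
type Role = "auditor" | "retention-manager" | "app-admin" | "end-user";

// Claims from an Access JWT that has ALREADY been cryptographically verified.
interface VerifiedClaims {
  email: string;
  groups?: string[]; // assumption: IdP groups are passed through as a claim
}

// Hypothetical mapping from agency IdP groups to application roles.
// Maps (not object literals) so names like "constructor" can't match by accident.
const GROUP_ROLES = new Map<string, Role>([
  ["pilot-auditors", "auditor"],
  ["pilot-retention", "retention-manager"],
  ["pilot-admins", "app-admin"],
  ["pilot-users", "end-user"],
]);

// Which roles may perform which action; anything unlisted is denied.
const PERMISSIONS = new Map<string, Role[]>([
  ["audit:read", ["auditor", "app-admin"]],
  ["retention:update", ["retention-manager"]],
  ["app:configure", ["app-admin"]],
]);

// Unknown groups grant nothing: access is deny-by-default.
function rolesFor(claims: VerifiedClaims): Set<Role> {
  const roles = new Set<Role>();
  for (const g of claims.groups ?? []) {
    const role = GROUP_ROLES.get(g);
    if (role) roles.add(role);
  }
  return roles;
}

function can(claims: VerifiedClaims, action: string): boolean {
  const allowed = PERMISSIONS.get(action) ?? [];
  const roles = rolesFor(claims);
  return allowed.some((r) => roles.has(r));
}
```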

The architectural point is the same as Component 1: a single named, configurable, observable hop. Every operator action authenticates through Access, and every authentication event is logged. The agency's security team has one place to look when they want to know who did what.

The Cloudflare One components offered through Cloudflare for Government hold FedRAMP Moderate authorization; commercial Cloudflare One does not. The architectural pattern is portable, the SKU at the authorization moment is different, and the SOW language should be precise about which one is in scope for the pilot.

## Component 5: Continuous audit emission as a first-class output

The fifth component is not infrastructure. It is a discipline that has to be designed in from day one, or it is very hard to retrofit.

Every action the system takes that an auditor would want to know about emits a structured audit event at the moment of action. Model call: prompt hash, model, parameters, response hash, latency, cost, request identity. Document processing: document hash, operation, operator identity, before-state, after-state. Human-in-the-loop decision: decision identifier, operator identity, decision, reasoning text, timestamp. Permission change: subject, object, before-permission, after-permission, operator identity.
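Writing the schema down as a type is one way to make it a reviewable deliverable. A sketch in TypeScript with hypothetical field names that mirror the fields listed above; the shared `at` timestamp on every variant is an added assumption (the list above names a timestamp only for human-in-the-loop decisions).

```typescript
// Hypothetical audit event schema: one variant per auditable action.
type AuditEvent =
  | {
      type: "model_call";
      at: string; // ISO 8601 timestamp
      requestIdentity: string;
      model: string;
      parameters: Record<string, unknown>;
      promptHash: string;
      responseHash: string;
      latencyMs: number;
      costUsd: number;
    }
  | {
      type: "document_processed";
      at: string;
      operatorIdentity: string;
      documentHash: string;
      operation: string;
      beforeState: string;
      afterState: string;
    }
  | {
      type: "hitl_decision";
      at: string;
      decisionId: string;
      operatorIdentity: string;
      decision: string;
      reasoning: string;
    }
  | {
      type: "permission_change";
      at: string;
      operatorIdentity: string;
      subject: string;
      object: string;
      beforePermission: string;
      afterPermission: string;
    };

// Every variant must name who (or what request) acted.
function actorOf(e: AuditEvent): string {
  return e.type === "model_call" ? e.requestIdentity : e.operatorIdentity;
}

// Reject events an auditor could not use: no actor, or no usable timestamp.
function validateEvent(e: AuditEvent): string[] {
  const problems: string[] = [];
  if (actorOf(e).trim() === "") problems.push("missing actor identity");
  if (Number.isNaN(Date.parse(e.at))) problems.push("unparseable timestamp");
  return problems;
}
```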

The events are structured JSON, written to R2 (Component 2) via the append-only path, with content-addressed identifiers so that any event can be referenced from anywhere else in the system without ambiguity. The audit log is queryable — not just stored — via a separate Worker that runs in the same region constraints as the application and exposes a read-only query interface behind Cloudflare Access.
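Content addressing needs a canonical serialization, or two logically identical events with different key order get different identifiers. A minimal sketch, assuming SHA-256 over recursively key-sorted JSON, with `node:crypto` (available in Workers under the `nodejs_compat` flag; `crypto.subtle.digest` is the async Web Crypto alternative). `canonicalJson`, `eventId`, and the date-prefixed key layout in `auditKey` are hypothetical choices.

```typescript
import { createHash } from "node:crypto";

// Deterministic JSON: object keys sorted recursively so equal events hash equally.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((v) => canonicalJson(v)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .filter((k) => obj[k] !== undefined) // match JSON.stringify, which drops undefined fields
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalJson(obj[k])}`);
    return `{${entries.join(",")}}`;
  }
  // Primitives; undefined/functions serialize as null, as inside JSON arrays.
  return JSON.stringify(value) ?? "null";
}

// Content address: hex-encoded SHA-256 of the canonical form.
function eventId(event: object): string {
  return createHash("sha256").update(canonicalJson(event)).digest("hex");
}

// Hypothetical key layout: date prefix for range listing, hash for identity.
function auditKey(event: { at: string }): string {
  return `audit/${event.at.slice(0, 10)}/${eventId(event)}.json`;
}
```

The date prefix lets the read-only query Worker list one day's events by prefix, while the hash lets any other record cite an event unambiguously.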

The reason this is a discipline rather than infrastructure is that it requires the application to emit the events. The platform can give you durable storage and immutability and access control. The application has to decide what is worth auditing and emit the events at the right moment with the right fields. Federal-pilot SOWs should specify the audit-event schema as a deliverable, because the agency will ask for it, and retrofitting it later is dramatically harder than designing it in.

This is the component that most distinguishes a "we built an AI thing on Cloudflare" pilot from a federal-aware deployment. The first one has the architecture. The second one has the architecture _and the trace evidence that the architecture is operating as designed_. The agency cannot accept the first one for anything beyond the most casual evaluation. The agency can accept the second one for a real pilot with real data, because the audit trail is the substitute for the formal authorization that the engagement isn't large enough to support.

## What the lift to FedRAMP-authorized actually looks like

When a pilot succeeds and the agency wants to move it to a FedRAMP-authorized environment, what changes?

The application code does not change. The audit schema does not change. The data flow does not change. The role model does not change.

What changes is the _SKU layer underneath_. AI Gateway moves from commercial Cloudflare to a FedRAMP-authorized inference path — either Cloudflare for Government's offerings (where authorized for the workload's impact level) or a different authorized provider routed via the same gateway pattern. R2 moves to a FedRAMP-authorized object store (S3 on GovCloud, or a Cloudflare for Government offering where available) with the same access semantics. Workers move to a FedRAMP-authorized execution environment with the same region constraints. Cloudflare Access moves to Cloudflare for Government's authorized SSO/ZTNA stack.
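One way to keep that split honest is to model the two environments as profiles that share an application contract and differ only in platform bindings. A sketch with hypothetical type names; the binding strings are placeholders taken from the list above, not configuration values.

```typescript
// What the application depends on: must be identical before and after the lift.
interface AppContract {
  auditSchemaVersion: string;
  roles: string[];
  regionPolicy: "us-only";
}

// What the lift is allowed to change: the SKU behind each component.
interface PlatformBindings {
  inferenceGateway: string;
  auditStore: string;
  execution: string;
  identityProxy: string;
}

interface Profile {
  contract: AppContract;
  platform: PlatformBindings;
}

const contract: AppContract = {
  auditSchemaVersion: "1",
  roles: ["auditor", "retention-manager", "app-admin", "end-user"],
  regionPolicy: "us-only",
};

const pilot: Profile = {
  contract,
  platform: {
    inferenceGateway: "AI Gateway (commercial)",
    auditStore: "R2 (commercial)",
    execution: "Workers (commercial, US-constrained)",
    identityProxy: "Cloudflare Access (commercial)",
  },
};

const production: Profile = {
  contract,
  platform: {
    inferenceGateway: "FedRAMP-authorized inference path",
    auditStore: "S3 on GovCloud",
    execution: "FedRAMP-authorized execution environment",
    identityProxy: "Cloudflare for Government Access",
  },
};

// A lift may change bindings only; a changed contract means re-architecture.
function liftChanges(from: Profile, to: Profile): (keyof PlatformBindings)[] {
  if (JSON.stringify(from.contract) !== JSON.stringify(to.contract)) {
    throw new Error("application contract changed: that is re-architecture, not a lift");
  }
  const keys = Object.keys(from.platform) as (keyof PlatformBindings)[];
  return keys.filter((k) => from.platform[k] !== to.platform[k]);
}
```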

The lift is real work — typically a quarter of focused effort for a mid-sized pilot — but it is migration work, not architecture work. The team that built the pilot can do the lift. The agency does not need to procure a separate vendor to "do it properly" for production. The IP transfer the agency negotiated under M-25-22 is meaningful because the architecture is portable by construction.

The reason this matters is that the alternative — building the pilot directly on a FedRAMP-authorized stack from day one — is what most agencies have been asked to procure historically, and it does not work at pilot scale. The cost of the authorized environment is too high to justify against an unproven feature, the development velocity is too slow to validate the feature in time, and the procurement timeline to even start work is measured in quarters. The FedRAMP-aware commercial pilot is the practical answer to the question "how do we evaluate an AI capability without committing to a multi-year ATO program before we know if the capability is worth it?"

## What goes in the SOW

The contracting officer reading the SOW for a FedRAMP-aware pilot is going to look for specific language, and the language should be precise.

- **Scope:** the workload, the data classification, and an explicit statement that the engagement is a pilot or prototype on a FedRAMP-aware commercial stack, not a production deployment on a FedRAMP-authorized environment.
- **Data handling:** what data the system processes, what data leaves the agency, what data is stored and where, the retention schedule, and the destruction protocol at engagement end.
- **Audit posture:** the audit event schema, the immutability guarantee, the query interface, and the access control model for the audit trail.
- **Region constraints:** which components run in which regions, with the relevant Wrangler or deployment-manifest excerpts referenced in the SOW.
- **Access controls:** the SSO integration, the MFA enforcement, the role model, and the policy language.
- **IP transfer:** what the agency receives at the end of the engagement: the code, the deployment manifests, the audit schema, and the operator runbook.
- **Lift path:** an explicit statement of what the lift to a FedRAMP-authorized environment would entail if the agency chooses to proceed to production.

That is what a memo-compliant federal-aware pilot SOW looks like. None of it is optional. All of it is what the OMB memos asked for, applied to the specific engagement class where commercial-stack agility is the only way to actually evaluate the capability before committing to an authorization investment.

## The reference architecture, one paragraph

AI Gateway as the single policy boundary for every model call, with comprehensive logging. R2 as immutable audit storage, append-only by construction, lifecycle-managed. Workers and Durable Objects region-pinned to US jurisdictions for any component handling non-public agency data, documented in deployment manifests. Cloudflare Access as the identity boundary for every operator surface, integrated with the agency SSO. Continuous structured audit emission as a first-class application output, designed in from day one. Architecture portable to FedRAMP-authorized SKUs at the lift moment without code changes.

Same primitives the commercial side runs on. Hardened, region-constrained, audit-emitting, identity-integrated, and contractually portable. The federal-aware engagement class has a reference architecture. This is mine.

---

Tony Adams is the founder of HotCopy and Truvisory®. He builds Cloudflare-native AI systems for federal and commercial clients. Verified SDVOSB and VOSB, SAM.gov-registered.

// Note on terminology: This essay uses "FedRAMP-aware" to describe a commercial deployment architected against the controls a future FedRAMP authorization would require. It does not represent authorized status. Agencies considering production workloads on this architecture should consult their authorizing official regarding the appropriate SKU and authorization path.

Companion posts: [OMB M-25-21 reads like a buying spec for fixed-scope AI](/insights/omb-buying-spec/), [Durable Objects are the missing primitive for production agents](/insights/durable-objects/)

### SDVOSB is leverage. Use it. — /insights/sdvosb-leverage/

(Older post — see permalink for full content.)

### A 30-minute AI Audit, scripted. — /insights/ai-audit-scripted/

(Older post — see permalink for full content.)

### Designing every agent against MCP first. — /insights/mcp-first/

(Older post — see permalink for full content.)

### Hyperdrive is the graceful exit from your legacy database. — /insights/hyperdrive/

(Older post — see permalink for full content.)