Why Federal AI Pilots Stall — and How to Scope One That Actually Ships
Your AI pilot is not going to stall because the model underperforms. It is going to stall because nobody named the production owner, the data was synthetic, and the ATO clock never started. In a federal environment, the model is the part that works. Everything around it is the part that doesn’t.
If you’re a program manager (PM), you’ve probably already lived a version of this. A vendor demo dazzles. A pilot gets funded with end-of-year money. The model works on a clean sample. Users like it. Then, twelve to eighteen months on, the pilot hasn’t shipped, the contractor has rolled off, and next year’s budget is using the last pilot’s slides to justify a new one.
This isn’t another retelling of the 95% statistic. The headline enterprise numbers — MIT NANDA’s finding that 95% of generative-AI pilots produce no measurable return, RAND’s finding that more than 80% of AI projects fail — are the on-ramp, not the destination. Federal AI fails for federal reasons, and a PM is uniquely positioned to prevent it, if the pilot is scoped right before the first contract action. Below are the six mechanics that actually kill federal pilots, the data behind each, and a diagnostic you can run before you sign. (For the broader modernization picture, see the federal AI pillar; for getting the work on contract fast, see how the FAR Part 19 overhaul lets you direct-award AI and the $5M SDVOSB sole-source how-to.)
What the federal data actually shows
OMB’s 2024 consolidated federal AI use-case inventory tags every entry with a lifecycle stage — Initiated, Acquisition/Development, Implementation and Assessment, Operation and Maintenance, or Retired. OMB hasn’t published an aggregate stage breakdown across the full inventory, but the largest reporting agency has. HHS disclosed that of its 271 use cases, only 104 (38.4%) had reached operations and maintenance; 59 were still merely initiated, and 16 were already retired. VA’s record is harsher: of 227 use cases reported in 2024, 72 were marked retired a year later — including an EHR-modernization surveillance tool that was listed as “initiated” in 2024 and killed before it ever reached a user.
That is the number you won’t find in the vendor deck: at the agencies doing the most AI, the majority of use cases are not in production, and a meaningful share die on the vine. Federal AI use cases nearly doubled from 571 in 2023 to 1,110 in 2024 across the civilian agencies GAO reviewed, with generative-AI cases growing nine-fold. Volume is not the problem. Transition is.
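The shares behind those claims can be recomputed directly from the figures quoted above. This is a minimal sketch that uses only the numbers already in the text; the variable and function names are illustrative.

```python
# Figures quoted in the text above; no new data is introduced here.
HHS_TOTAL = 271
HHS_IN_OPERATIONS = 104
VA_TOTAL_2024 = 227
VA_RETIRED = 72
CASES_2023 = 571
CASES_2024 = 1110


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


hhs_in_production = share(HHS_IN_OPERATIONS, HHS_TOTAL)
hhs_not_in_production = share(HHS_TOTAL - HHS_IN_OPERATIONS, HHS_TOTAL)
va_retired = share(VA_RETIRED, VA_TOTAL_2024)
inventory_growth = round(CASES_2024 / CASES_2023, 2)

print(f"HHS in operations and maintenance: {hhs_in_production}%")
print(f"HHS not yet in production:         {hhs_not_in_production}%")
print(f"VA use cases retired within a year: {va_retired}%")
print(f"Inventory growth 2023 to 2024:      {inventory_growth}x")
```

Run as written, this puts HHS at 38.4% in production and 61.6% not, VA retirements at 31.7% of the 2024 inventory, and inventory growth at about 1.94x.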
Why does a federal AI pilot actually stall?
Six mechanics, each with a diagnostic question you can ask before you commit.
1. The ATO cliff
A pilot built in a sandbox on synthetic data inherits none of the security accreditation a production system needs. A typical agency Authority to Operate (ATO) path runs 12 to 36 months and costs $500K to $4M; in FedRAMP’s entire 13-year history, only about 502 cloud services have ever been authorized, an average of roughly 39 a year. The pilot that “works” in month three hits a wall the moment it needs to touch a production environment, because the authorizing official has to accept real risk on real data. This is exactly why DoD has pushed continuous ATO (cATO): the 2022 DoD CIO cATO memo, the 2024 DevSecOps continuous-authorization guide, and the Army’s 2024 cATO pilots all aim to replace point-in-time approvals with continuous monitoring of an approved software factory.
Ask: Does the target environment have an ATO today, and can the AI workload inherit it — or are we starting the clock from zero?
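To see what "starting the clock from zero" costs, here is a rough sketch of the schedule arithmetic. It uses the 12-to-36-month range quoted above; the function name is hypothetical, and treating an inherited ATO as adding zero months is a deliberately optimistic assumption, not a rule.

```python
# Typical agency ATO duration in months, as quoted in the text above.
ATO_MONTHS = (12, 36)


def earliest_production_month(ato_start_month: int,
                              inherits_ato: bool = False) -> tuple[int, int]:
    """Return the (best case, worst case) month production use becomes possible.

    ato_start_month: month of the effort in which ATO work begins (0 = kickoff).
    inherits_ato: True if the workload can inherit an existing ATO. This sketch
    assumes inheritance adds no extra months, which is optimistic.
    """
    if ato_start_month < 0:
        raise ValueError("ato_start_month must be zero or greater")
    if inherits_ato:
        return (ato_start_month, ato_start_month)
    low, high = ATO_MONTHS
    return (ato_start_month + low, ato_start_month + high)


started_at_kickoff = earliest_production_month(0)
started_after_demo = earliest_production_month(3)
inherited = earliest_production_month(3, inherits_ato=True)

print("ATO started at kickoff:      months", started_at_kickoff)
print("ATO started after the demo:  months", started_after_demo)
print("Existing ATO inherited:      months", inherited)
```

Under these assumptions, waiting until the month-three demo to start the ATO pushes the production window from months 12 to 36 out to months 15 to 39.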
2. CUI scope discovery
The pilot ran on a clean synthetic extract. Production needs controlled unclassified information — and the moment real CUI, PII, or PHI enters scope, the system boundary expands to NIST SP 800-171 (and CMMC Level 2 if it touches DoD CUI), the Privacy Act, and FISMA. GAO found that officials at 10 of 12 selected agencies said existing data-privacy policy could be an obstacle to adopting generative AI. The failure is rarely dramatic; it’s a pilot that quietly cannot graduate because the controls it was built under don’t cover the data it now has to process. The VA OIG flagged a live version of this in early 2026: VHA had authorized generative-AI chat tools for clinical use without coordinating through its patient-safety center — pilot-stage governance that didn’t scale to production.
Ask: Have we classified the production data correctly, and does the vendor’s control set actually cover it? (For the FedRAMP-side answer, see how to verify an AI contractor’s FedRAMP posture and the FedRAMP-aware edge stack.)
3. Legacy data and integration
The model works in the demo because the demo data was hand-cleaned. Production has to read from a decades-old system of record. GAO reported in July 2025 that the federal government spends over $100 billion a year on IT, and that about $83 billion — 79% — of planned FY2025 IT spending across the 24 major agencies goes to operating and maintaining existing systems, not building new ones. Treasury still runs core systems on COBOL and Assembly. VA’s Program Integrity Tool, paused in 2023 over data and code-logic problems, has left an estimated $665.5 million in revenue uncollected and a backlog of roughly 40 million community-care claims. An AI pilot that assumes clean API access to a system of record is, in practice, a bet on a separate modernization program GAO has already flagged as high-risk.
Ask: Can we get production-representative data — not a curated sample — into the pilot in the first 30 days?
4. No funded production path — “pilot purgatory”
This is the federal version of the trap. One industry survey found 41% of agencies running AI pilots but only 8% at scaled, measurable deployment.
The gap is structural: a pilot scoped as an experiment, with no named production owner and no operations-and-maintenance funding line, has nowhere to go when it succeeds. The policy response — GSA’s USAi platform and FedRAMP 20x — is explicitly trying to make the pilot-to-production path the easy one. But no platform changes the fact that a pilot without a named production owner is an experiment, not an investment.
Ask: Who is the named production owner on the receiving side, and what is their O&M funding line next budget cycle?
5. Color of money and procurement timing
A pilot funded with end-of-year research dollars (RDT&E, “3600”) cannot legally become an operational system without a separate procurement appropriation (3010/3080) and a sustained O&M line (3400): different colors, different obligation windows, different lawful purposes. DoD created Budget Activity 8, the software pilot color, precisely because Congress and the Defense Innovation Board concluded the traditional separation routinely dooms software projects. DoD funds against its broader transition gap, the “valley of death,” with programs like APFIT, which Congress resourced at $400 million in FY25 and made permanent in the FY25 NDAA. The lesson for a PM is timing: the contracting vehicle and the production funding have to be lined up before the pilot starts, not discovered after the model works. The vehicle mechanics themselves, including sole-source and direct-award paths that move in weeks, are covered in the FAR Part 19 direct-award piece and the $5M sole-source how-to. For year-end timing specifically, see the “year-end funds, fast” piece.
Ask: Do we have a line of sight to procurement and O&M dollars for production in next year’s budget, and is our contracting officer at the table now?
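One way to make that question concrete is to list the color each phase needs and check which ones have no identified funding line. The sketch below is a simplified illustration of the phase-to-color mapping described above, not fiscal-law guidance; the phase names and the `funding_gaps` helper are hypothetical.

```python
# Phase-to-color mapping as described in the text (simplified illustration).
REQUIRED_COLOR = {
    "pilot": "RDT&E",             # "3600" in the text
    "production": "Procurement",  # 3010/3080 in the text
    "sustainment": "O&M",         # 3400 in the text
}


def funding_gaps(funded_colors: set[str]) -> list[str]:
    """Return the phases whose required color has no identified funding line."""
    return [phase for phase, color in REQUIRED_COLOR.items()
            if color not in funded_colors]


# A pilot paid for with year-end research dollars and nothing else lined up.
year_end_pilot = funding_gaps({"RDT&E"})

# A pilot whose production and sustainment dollars are already identified.
fully_planned = funding_gaps({"RDT&E", "Procurement", "O&M"})

print("Unfunded phases (year-end pilot):", year_end_pilot)
print("Unfunded phases (fully planned): ", fully_planned)
```

Any phase the first call returns is a phase the pilot cannot lawfully enter on its current money. Budget Activity 8 and other bridge authorities are not modeled here.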
6. Workforce, adoption, and governance overhead
The system can be perfect and still fail if no one operates or adopts it. The IRS began 2025 with about 102,000 employees and ended with roughly 74,000, a 27% reduction. GAO found that the agency’s analytics group lost 63 staff working on AI, that the agency now lists 126 active AI use cases with no plan to address the skills gap, and that more than a quarter of those use cases lack any information on expected benefits. Meanwhile OMB M-25-21 mandates Chief AI Officers, governance boards, public AI strategies, annual inventories, and minimum risk-management practices for “high-impact AI” that come due in 2026. Governance done well is the scaffolding that lets a pilot ship; treated as a separate workstream that delivers nothing the user sees, it’s just another way for the clock to run out.
Ask: Who owns adoption — not deployment — on day one of production, and how is their effort funded?
The pre-launch diagnostic: eight questions before you sign
Run this before the solicitation. If you can’t answer “yes” to most of these, you’re scoping a science experiment, not a production pilot:
- Named production owner. Is there a specific program office and named senior official who will own the system after the pilot, with budget authority?
- ATO path. Does the target environment have an ATO or a credible cATO posture? If not, what’s the realistic timeline (assume 12–36 months for a new agency ATO) and who pays?
- Production data on day 30. Is there a signed data-use agreement letting the pilot touch production-representative data in the first month?
- Color of money. Are the pilot and production dollars in compatible colors, or is there a known bridge?
- Mission-outcome metric. Is success a mission outcome — claims processed, time-to-decision, backlog reduced — not a technical metric like model accuracy?
- 30–90 day increments. Is the work a series of short, bounded deliveries that each end in working software for a real user — not a 12-month “discovery” engagement?
- High-impact assessment. If this is or could become high-impact AI under M-25-21, have you started the minimum-practices documentation now, not at the end?
- Vendor production track record. Has the vendor shipped to a federal or comparably regulated production environment — or is this their first ATO?
Three or more weak answers, and the pilot will stall. That’s not a prediction; it’s what two years of federal inventory data and OIG reports have been saying.
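The checklist is simple enough to score mechanically. Here is a minimal sketch, assuming each question gets a plain yes/no answer and using the three-weak-answers threshold from the text; the question keys and the `assess` function are illustrative names, not part of any official tool.

```python
# The eight diagnostic questions, in the order listed above (keys are illustrative).
QUESTIONS = (
    "named_production_owner",
    "ato_path",
    "production_data_on_day_30",
    "color_of_money",
    "mission_outcome_metric",
    "short_increments",
    "high_impact_assessment",
    "vendor_production_track_record",
)

# Threshold taken from the text: three or more weak answers.
WEAK_THRESHOLD = 3


def assess(answers: dict[str, bool]) -> dict:
    """Score a pilot: True means a solid 'yes', False means a weak answer."""
    missing = [q for q in QUESTIONS if q not in answers]
    if missing:
        raise ValueError(f"unanswered questions: {missing}")
    weak = [q for q in QUESTIONS if not answers[q]]
    return {"weak": weak, "likely_to_stall": len(weak) >= WEAK_THRESHOLD}


# Example: a pilot with no production owner, no ATO path, and no funding bridge.
example = {q: True for q in QUESTIONS}
example.update(named_production_owner=False, ato_path=False, color_of_money=False)
result = assess(example)

print("Weak answers:", result["weak"])
print("Likely to stall:", result["likely_to_stall"])
```

Forcing a yes/no on every question is the point: an unanswered question raises an error instead of quietly counting as a pass.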
What pilots that ship look like
The pattern in the federal AI efforts that actually reach users is the same one: production-path-first scoping. DoD’s CJADC2 minimum viable capability, fielded across multiple combatant commands, runs on 90-day experimentation cycles with named end users and a funded transition path. DOE’s 2026 work with its national labs produced a 208-page nuclear reactor license application in a single day — work that normally takes a team of experts four to six weeks — because it was a small, bounded use case with a clear owner.
The shared logic: define the production target, the ATO route, the data, the user, and the O&M owner before the pilot starts. Deliver in 30–90 day increments that touch real users on production-representative data. Build to the security requirements from day one instead of retrofitting them at the end. And pick a vendor that has shipped into a federal or comparably regulated production environment, with a fixed scope and a definition of done that maps to a mission outcome, not a demo. The policy backdrop for this approach is covered in the piece “OMB M-25-21 reads like a buying spec.”
Most stalled federal pilots stall because nobody was willing to say “no” early. The fix is to scope so the production decision is built in from the start.
Frequently asked
Why do federal AI pilots fail more than commercial ones?
What's the single biggest killer?
How do I know if my pilot is heading for purgatory?
Does the new policy environment help?
Working with Truvisory
The pilots that ship are fixed-scope, security-aware, and built for the production endpoint from day one. That’s the model Truvisory is built around: working software in federal-grade environments, delivered in 30-to-90-day increments — FedRAMP-aware architecture, shipped on Cloudflare’s edge.
If you’re scoping a pilot you actually need to ship, start with a scoping call — and see the FAR Part 19 direct-award and $5M sole-source pieces for getting it on contract in weeks, plus the federal AI modernization pillar for the full path.