VA RAG Policy Assistants: Grounded, Cited Answers — Not a Chatbot

Tony Adams 7 min read

The VA’s own inspector general put the problem in writing in January 2026: VA GPT and the department’s other internal chat tools “do not have access to web search; therefore, the chat tools’ knowledge base is not current.” A VA rater or call-center rep can ask the in-house chatbot a policy question and get a fluent, confident, possibly wrong answer with no way to check it. The fix isn’t a better chatbot — it’s a different kind of tool: a grounded, cited retrieval-augmented generation (RAG) assistant that answers a policy question by pulling the exact M21-1 section and 38 CFR cite and linking them back to the live source. It suggests with a citation; the human verifies and decides. That distinction is the whole pitch.

This is the second capability spoke under the VA AI modernization pillar, and the inverse of its sibling: document automation processes the VA’s inbound documents, while a policy assistant retrieves answers from the VA’s own body of rules. It’s written for a VA program owner who runs a “nobody can find the right answer” problem, and for an SDVOSB deciding what capability to lead with.

What is a VA RAG policy assistant, exactly?

It’s a system that answers a question by retrieving the relevant policy text first, then generating an answer constrained to what it retrieved — with the citation shown inline. Ask “Does the anti-pyramiding rule at 38 CFR 4.14 apply to this combination of evaluations?” and it returns a short answer plus a link to the live §4.14 text and the governing M21-1 section, so the rater can verify in one click. It is not fine-tuning (which bakes knowledge into model weights and can’t cite) and it is not a raw LLM (which makes things up). The pipeline has two halves: an indexing side that ingests the corpus, chunks it, embeds it, and stores it in a vector database kept in sync with policy updates; and a query side that rewrites the question, retrieves and reranks the best passages, generates a cited answer, and abstains — says “I don’t have a confident source” — when retrieval comes back weak.

That abstention behavior is not a nicety. It’s the difference between a tool a risk-averse VA office will deploy and one it won’t.
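The query-side behavior described above can be sketched in a few lines. This is a minimal illustration under loud assumptions, not a production design: the word-count cosine similarity is a toy stand-in for a real embedding model and reranker, the `min_score` threshold of 0.2 is arbitrary, and `Passage` and `retrieve_or_abstain` are hypothetical names. The generation step is left out; the function returns only the citations and evidence a generator would be constrained to.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    citation: str  # label shown to the user, e.g. "38 CFR 4.14"
    text: str      # passage text the answer must be grounded in


def _vector(text: str) -> Counter:
    # Toy stand-in for an embedding: lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_or_abstain(question: str, corpus: list[Passage],
                        min_score: float = 0.2, top_k: int = 2) -> dict:
    """Return cited evidence, or abstain when no passage scores well enough."""
    q = _vector(question)
    scored = sorted(
        ((_cosine(q, _vector(p.text)), p) for p in corpus),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Keep only the top passages that clear the (assumed) confidence threshold.
    strong = [(s, p) for s, p in scored[:top_k] if s >= min_score]
    if not strong:
        return {
            "abstained": True,
            "message": "I don't have a confident source.",
            "citations": [],
        }
    return {
        "abstained": False,
        "citations": [p.citation for _, p in strong],
        "evidence": [p.text for _, p in strong],
    }
```

The design choice worth noting is that abstention is decided before any text is generated: if retrieval is weak, there is nothing for a model to paraphrase into a confident-sounding answer.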

Why is “grounded and cited” the whole game?

Because the alternative is dangerous, and the data says so. A Stanford study found general-purpose LLMs hallucinated on legal questions between 58% and 82% of the time — and, more soberingly, that commercial RAG-equipped legal research tools still hallucinated in roughly one of six benchmarking queries despite vendor claims of near-zero error. For VA policy, an answer without a verifiable citation should never reach an employee. But the lesson runs deeper: citation alone isn’t enough, because a system can cite a real source and still misstate it. That’s why a grounded VA assistant needs a measured evaluation harness — faithfulness, citation-validity, and retrieval-recall metrics run against a labeled question set — as a first-class deliverable, not an afterthought. The honest pitch to a VA buyer is never “our RAG doesn’t hallucinate.” It’s “we ground every answer in a cited source and we measure how often the citation actually supports the answer, and we show you the number.”

1 in 6
Hallucination rate on commercial RAG-equipped legal research tools (Lexis+ AI, Westlaw AI-Assisted Research). Citation alone is insufficient; a citation-validity harness is mandatory.
Source: Stanford HAI / RegLab, May 2024
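Two of the harness metrics named above can be computed with very little code once a labeled question set exists. The sketch below is one plausible set of definitions, not a standard: `EvalRecord`, its field names, and the micro-averaged recall are all assumptions, and the `supported` flag stands for a reviewer's judgment (human or otherwise) that the cited text actually backs the answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRecord:
    expected: frozenset[str]   # gold sources from the labeled question set
    retrieved: frozenset[str]  # sources the retriever returned
    cited: frozenset[str]      # sources the answer actually cited
    supported: bool            # reviewer label: does the citation support the answer?


def retrieval_recall(records: list[EvalRecord]) -> float:
    """Share of gold sources that retrieval surfaced, pooled over all questions."""
    found = sum(len(r.expected & r.retrieved) for r in records)
    total = sum(len(r.expected) for r in records)
    return found / total if total else 0.0


def citation_validity(records: list[EvalRecord]) -> float:
    """Share of cited answers whose citation was judged to support the answer."""
    cited = [r for r in records if r.cited]
    if not cited:
        return 0.0
    return sum(r.supported for r in cited) / len(cited)
```

Abstentions (records with no citation) are excluded from citation validity here, so the number answers the narrow question a buyer cares about: when the tool does cite, how often does the citation hold up?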

Where’s the demand — and who feels it?

In three places, all documented.

VBA adjudication and training: raters work against 38 CFR (including the Schedule for Rating Disabilities at Part 4) and the 14-part M21-1 Adjudication Procedures Manual, a continuously updated internal reference the VBA itself has been reorganizing to make “more consumable and navigable.” New Veterans Service Representatives sit through a 6-week class; new Rating VSRs train for 11 weeks before rating a claim, with 80 hours of continuing training required each year. The cost of getting policy wrong is concrete: one OIG report attributed an estimated $72.5 million in improper payments to “inadequate procedural guidance,” and April 2025 congressional testimony described a single rater whose errors cost $2.2 million.

VHA clinical and IT policy: staff sit atop a publications database of roughly 437 active directives, handbooks, and notices, plus the 6500-series security handbooks.

Contact centers: the VA answers more than 50 million calls a year across its call centers, with VHA clinical contact centers alone handling about 41 million in a recent year. That is a consistency-of-answer problem at massive scale.

This is also where VA GPT stops short. The internal tool has 95,000-plus users and saves them an estimated 2–3 hours a week on drafting and summarizing — but, per the OIG, its knowledge base isn’t current and it doesn’t retrieve from a versioned policy corpus. A specialized, cited policy assistant doesn’t compete with VA GPT; it fills the slot VA GPT explicitly leaves open.

What’s already been contracted?

The market is buying conversational and retrieval AI at scale, mostly through SPRUCE. The Oddcore JV (Oddball + Wilcore, both SDVOSBs) won a ~$15M AI-enabled VA.gov chatbot task (NLU/NLP-driven) and a separate ~$13.7M VA VoiceBot task, both on the SPRUCE SDVOSB-set-aside IDIQ. VA GPT runs through OCTO and the National AI Institute. And the AI Tech Sprint pathway — which turned an Abridge pilot into a ~$5.37M sole-source follow-on — is a proven on-ramp for a demonstrated capability. The best-fit NAICS is 541512 (computer systems design, $34M size standard), with 541519 (the code Oddcore won the chatbot under) and 541511 as alternates.

What about compliance — and is this “high-impact AI”?

Here’s where a policy assistant has a real, honest edge. Most of the source corpus — 38 CFR, M21-1, VHA directives — is public, which means a pilot scoped to public policy text carries a lower data-sensitivity bar than a claims-document pipeline full of PHI. On the architecture: a Cloudflare-native build maps cleanly (AutoRAG for the managed pipeline, Vectorize for retrieval, Workers AI for embeddings and generation, R2 for the corpus, D1 for citations). But the honest caveat, the same one that governs the rest of this cluster: Cloudflare for Government is FedRAMP Moderate today, with FedRAMP High targeted for its AI tooling in 2026, while the VA Enterprise Cloud requires FedRAMP High for sensitive data. So the compliant pattern is either (a) scope the assistant to public policy on the Moderate boundary, or (b) run it inside VAEC’s FedRAMP High enclave (AWS GovCloud or Azure Government). Truvisory builds to FedRAMP control families and VA Handbook 6500/6517 — FedRAMP-aware, and not CMMC-certified, because CMMC is a DoD program that doesn’t apply to VA work (more here).

On M-25-21: a policy assistant is probably not “high-impact AI” if it’s operated as decision support, meaning it suggests cited text, a human verifies against the source, and the human decides. If its output instead becomes the principal basis for a benefits decision without verification, it tips into high-impact and the full M-25-21 minimum practices attach (independent testing, monitoring, the ability to pause, waiver tracking). The citation-and-verification design is precisely what keeps the tool appropriately bounded, and a vendor who walks in with a draft high-impact analysis already written is far more credible than one who hasn’t thought about it.

What does a 90-day pilot look like?

Bounded to one corpus, one outcome. Pick a single body of policy — M21-1 for a VBA rater/training assistant is the natural first choice, since it’s public and the pain is acute. Weeks 1–2: corpus selection and ATO scoping (the VA’s 60-day accelerated AI ATO is the hook that makes this fit). Weeks 3–6: ingestion, indexing, and standing up the evaluation harness against a labeled question set drawn from real training questions. Weeks 7–10: a user pilot with 20–50 raters, measuring four numbers — citation-coverage rate, lookup-time reduction versus the current search, deferral rate, and a user-trust score. Weeks 11–13: hardening and handoff. If the three targets land (citation coverage ≥95%, lookup time down ≥30%, trust ≥4/5), you scale to a second corpus; if not, the buyer spent low-to-mid six figures, not eight, and learned where the corpus itself needs fixing.
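The scale-or-stop decision at the end of the pilot is simple enough to write down as a check. The thresholds below come straight from the three targets above; everything else (the `PilotResults` type, the field names, and expressing rates as fractions) is an illustrative assumption. The deferral rate is carried along as a measured number with no pass/fail threshold, since none is stated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotResults:
    citation_coverage: float      # fraction of answers carrying a citation, 0..1
    lookup_time_reduction: float  # fraction saved versus the current search, 0..1
    trust_score: float            # user-trust rating on a 1-5 scale
    deferral_rate: float          # measured and reported; no target is applied here


def targets_met(r: PilotResults) -> dict[str, bool]:
    """Evaluate each of the three stated pilot targets."""
    return {
        "citation_coverage": r.citation_coverage >= 0.95,
        "lookup_time_reduction": r.lookup_time_reduction >= 0.30,
        "trust_score": r.trust_score >= 4.0,
    }


def should_scale(r: PilotResults) -> bool:
    """Scale to a second corpus only if all three targets land."""
    return all(targets_met(r).values())
```

Returning the per-target breakdown, rather than a single boolean, keeps the "where did it fall short" answer available for the handoff report.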

Vehicle, in order of speed: the $5M Veterans First sole-source under VAAR 819.7008 (no Rule-of-Two finding required); the Tech Sprint Notice-of-Intent pathway; or similarly situated subcontracting on SPRUCE or T4NG2. The Veterans First framing is in the Rule-of-Two guide.

Frequently asked

How is this different from VA GPT?
VA GPT is a general chat tool whose knowledge base, per the OIG, isn't current. A RAG policy assistant retrieves from a versioned, named corpus and cites every answer.
Does it decide claims?
No. It surfaces cited policy; a human verifies and decides. Operated that way, it most likely stays out of M-25-21 high-impact territory.
What about hallucinations?
Real, even with RAG. The mitigation is grounding plus a measured citation-validity harness — and showing the buyer the number.
Why is this lower-risk than other AI pilots?
Most VA policy is public, so a pilot scoped to public text avoids the PHI and FedRAMP-High burden a document pipeline carries.
What happens when the policy updates?
Versioned corpus snapshots and a change-detection pipeline — non-optional, because M21-1 changes often.
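As a rough illustration of the change-detection idea in the last answer, a snapshot can be as simple as a content hash per policy section, compared between crawls. This is a sketch under assumptions: the section identifiers, the `snapshot` and `diff_snapshots` names, and the idea that the corpus arrives as a mapping of section ID to text are all hypothetical, and a real pipeline would also need to handle renumbered or moved sections.

```python
import hashlib


def snapshot(sections: dict[str, str]) -> dict[str, str]:
    """Map each section ID to a SHA-256 digest of its text."""
    return {
        section_id: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for section_id, text in sections.items()
    }


def diff_snapshots(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """List section IDs that were added, removed, or changed between snapshots."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

Only the sections listed under "added" or "changed" would need re-chunking and re-embedding, and the stored snapshot doubles as the version record an answer's citation can point back to.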

Working with Truvisory

Truvisory is an SBA-verified SDVOSB founded by a combat veteran, building grounded, cited RAG assistants — inline citations, abstention, and a measurable hallucination-evaluation harness — on a Cloudflare-native, FedRAMP-aware architecture, fixed-scope, in 90 days.

If you’re a VBA, VHA, or contact-center program owner with a “nobody can find the right answer” problem, we can stand up a single-corpus pilot with citation-coverage, lookup-time, and trust metrics as the deliverables — and an evaluation harness you can audit. Book a scoping call. For the procurement path, see the $5M sole-source guide and the VA AI modernization pillar; for the inbound-document inverse of this work, see document automation.
