Skip to main content
Truvisory
Cloudflare-Native

Multi-Provider AI Routing on Cloudflare: Fallback, Retries, and BYOK That Keep Your App Up

Tony Adams 10 min read

If your AI feature calls one model from one provider, you have a single point of failure with a spinner attached. When that provider has a bad afternoon, so do you. One gateway changes that: route requests across many providers behind a single endpoint, fall back automatically when one fails, split traffic to test a new model, route by user tier or budget, and store provider keys centrally instead of scattering them through your code. A single-provider dependency becomes a routed, resilient setup — and you confirm which model actually served each request from the gateway’s own per-provider breakdowns.

This is the routing and reliability spoke of our AI observability, cost, and evaluation cluster. A fair note on where it fits: routing isn’t the cluster’s named third axis — it’s the enabling mechanism underneath the cost and latency levers (the cost spoke decides you want a cheaper model; the latency spoke decides you want a faster one; this spoke is how you actually send the request there), plus a reliability dimension of its own. This is where those two spokes route their mechanics.

Why route across providers at all

The foundation is abstraction. AI Gateway sits as a proxy in front of more than twenty providers — Workers AI, OpenAI, Anthropic, Google’s AI Studio and Vertex, Groq, Cerebras, Mistral, Cohere, DeepSeek, Amazon Bedrock, and a couple dozen others — reachable through one interface. You integrate it with a one-line base-URL change, and from there switching or combining models is largely a matter of changing a provider-prefixed model string (openai/gpt-4.1, anthropic/..., @cf/...) or pointing at a route. That portability is the anti-lock-in argument in concrete form: when one provider raises prices, degrades, or goes down, moving traffic elsewhere is a configuration change rather than an engineering project.

20+ providers
reachable behind one AI Gateway endpoint — so switching or combining models is largely a model-string change, not a rewrite. The provider count changes monthly; re-verify. — Cloudflare AI Gateway supported-providers docs

One thing to get right, because Cloudflare is mid-transition on it: there’s a newer REST API on Cloudflare’s own API host that’s recommended for new integrations, an OpenAI-compatible endpoint that still works but now carries a deprecation banner, and the older Universal Endpoint — the array-of-providers approach — which is deprecated in favor of Dynamic Routing for anything involving fallbacks, retries, or conditional routing. As of this writing, the way you invoke a named dynamic route is still through the compatibility endpoint (model: "dynamic/<route-name>"), even though it’s marked deprecated, because the REST API docs don’t yet cover route invocation — a transitional gap worth re-checking before you build against it.

Dynamic Routing: the routing spine

Dynamic Routing is the current, authoritative way to route. You compose a small flow — in a visual builder or as JSON — that evaluates conditions, enforces quotas, and chooses models with fallbacks, then publish it as a named, versioned route you can roll back instantly, all without redeploying your application. It is labeled Beta, so treat its specifics as subject to change.

A route is built from a handful of element types, and the whole model is easy to hold in your head once you see the pieces:

  • Start and End bracket the flow; End returns the last successful model response, or an error if nothing succeeded.
  • Conditional branches true/false on the request body, headers, or metadata — for example, route on whether a request’s metadata marks the user as paid or free.
  • Percentage splits traffic across up to five weighted outputs (plus an optional remainder), for A/B tests and gradual rollouts.
  • Rate Limit and Budget Limit cap traffic by request count or by cost over an interval, with a success output and an optional fallback output when the limit is hit.
  • Model names a provider and model and carries its own timeout and retry count, with a success output and an optional fallback used when the model fails after its retries or times out.

Rules branch on metadata — arbitrary key-value context (a user ID, an org, a plan tier) you attach to the request — and the response comes back with headers naming the model and provider that actually served it, so you can see which branch fired. Two prerequisites apply to every route: the gateway needs authentication enabled, and provider keys stored via BYOK.

A coordination note on the Budget Limit element, because it straddles two spokes: the rationale for capping spend, and the Unified Billing spend-limit mechanics, belong to the cost spoke. This spoke owns the routing flow that contains the Budget Limit node and the fallback path it triggers when the cap is reached. The cap is the cost spoke’s; the flow around it is here.

Fallback and the reliability patterns

Fallback in Dynamic Routing is per-node, which is more flexible than the old approach of listing providers in an array and trying them in order. Each Model node has its own fallback output that fires when that model errors or times out after exhausting its retries — so you build a chain by pointing a primary model’s fallback at a second model on a different provider. That difference matters: failing over from one OpenAI model to another OpenAI model doesn’t help when OpenAI itself is the problem; failing over to Anthropic or a Workers AI model does.

From those elements, a handful of operator patterns cover most needs. A primary-to-fallback chain routes around a provider outage. A percentage split distributes load or runs a canary — ninety percent to the proven model, ten percent to the candidate. Conditional routing sends paid users to a premium model and free users to a cheaper one, or separates environments and request types. A/B testing splits traffic between two models and picks the winner — but you pick it on measured quality, not on a hunch, which is the evals spoke’s job. And graceful degradation uses a budget or rate-limit node’s fallback to drop to a cheaper, smaller, or faster model under pressure rather than failing outright — the cost reasoning lives in the cost spoke, the latency reasoning in the latency spoke, and the wiring lives here.

Retries and timeouts

Before a fallback ever fires, retries handle the transient failures — the momentary 500, the brief timeout. Gateway-level automatic retries shipped in April 2026: you configure up to five attempts, a delay between them from a hundred milliseconds to five seconds, and a backoff strategy of constant, linear, or exponential, applied across the whole gateway and overridable per request. The per-request controls are headers — cf-aig-max-attempts, cf-aig-retry-delay, and cf-aig-backoff — plus cf-aig-request-timeout in milliseconds, which is measured against when the first part of the response arrives, so a streaming response isn’t cut off once it has started. Inside a dynamic route, the same timeout and retry settings live on each Model node.

The order is worth internalizing: retries are attempted first, and only when they’re exhausted does the fallback trigger. The dedicated request-handling configuration page now points to Dynamic Routing as the preferred home for this behavior, though the headers and the gateway-level toggle still work. The honest framing is that retries buy reliability at the cost of latency on the failure path — a retried-then-failed-over request is slower than a clean one — and aggressive retries can amplify load against a provider that’s already struggling. Tune them; don’t max them out by reflex.

Storing provider keys: BYOK

By default you pass a provider’s API key on every request. BYOK — branded “Store Keys,” and also Beta — lets you store those keys in Cloudflare instead, so your requests carry only a gateway authorization header and the provider key is injected at runtime. The keys are held in Cloudflare’s Secrets Store with encryption at rest, configured once in the dashboard, and you can store more than one key per provider and select among them with a cf-aig-byok-alias header — useful for separating development from production or staging a key rotation. The operational benefits are real: keys stay out of client code, rotation becomes a dashboard change with no redeploy, and stored keys are what enable the rate- and budget-limit nodes in a dynamic route.

It’s worth being precise about how three related features differ, because they’re easy to conflate. BYOK means your provider keys, stored with Cloudflare. Unified Billing is the alternative where Cloudflare provides the credentials and bills you through prepaid credits — convenient, but it carries a five-percent fee on purchased credits and currently covers a handful of providers, with Workers AI billed separately. And Zero Data Retention is a gateway setting that routes traffic through provider endpoints that don’t retain prompts and responses — but it applies only to Unified Billing requests using Cloudflare-managed credentials, not to BYOK, and it doesn’t control the gateway’s own logging. The deeper governance question — whether storing your keys with Cloudflare fits your compliance posture, and how the audit trail works — belongs to the audit and logging spoke; here the point is the operational mechanics.

Concrete patterns

Invoke a named dynamic route by passing it as the model — "model": "dynamic/support" — through the compatibility endpoint with a cf-aig-authorization header, since that’s still the documented path. Build a primary-to-fallback chain by giving a Model node a fallback output that points at a second Model node on a different provider, each with its own timeout and retries. Branch on tier with a Conditional node testing metadata.plan and routing the two outputs to different models. Override reliability behavior per request with cf-aig-max-attempts, cf-aig-backoff, and cf-aig-request-timeout. And once you’ve stored a key with BYOK, drop the provider’s own authorization header entirely — keep cf-aig-authorization, and add cf-aig-byok-alias only when you want a non-default stored key. Keep the route flows small and legible, fail over across providers rather than within one, and let the per-request model and provider headers confirm what actually happened.

The honest trade-offs

  • The routing and key features are Beta. Dynamic Routing and BYOK are both labeled Beta, so behavior, element types, and configuration can change — pin your understanding to a date and re-verify.
  • Fallback adds latency on the failure path. Retries plus a second provider’s call execute before a successful fallback returns; reliability isn’t free of time cost.
  • Provider differences are real. The compatibility layer normalizes request and response shape, but models differ in capabilities, token limits, and behavior, so a fallback model can answer differently from the primary. One specific thing to verify yourself: Amazon Bedrock requires AWS request signing, and sources disagree on how cleanly it works through the gateway — Cloudflare’s own docs indicate BYOK now handles the signing, while at least one third-party integration still lists Bedrock as unsupported for exactly that reason.
  • A fallback model may differ in quality. Decide fallback and A/B winners on measured quality, not convenience — that’s the evals spoke’s job.
  • BYOK means your keys live with Cloudflare. That’s a reasonable trade for most teams and a governance question for some; weigh it deliberately.
  • The gateway is a dependency in front of everything. It adds cross-provider redundancy, but it’s also a concentration point, and its routing heuristics are a control plane you don’t see inside — which matters for teams under strict audit.
  • Cloudflare’s routing is newer than the incumbents. Dedicated gateways like LiteLLM, Portkey, and OpenRouter have more routing mileage; some comparisons still describe Cloudflare as observability-and-caching without routing, which is now outdated but signals that its routing is less battle-tested for complex multi-provider setups.
  • Retries amplify load, and conditional routing adds complexity. More attempts mean more pressure on a struggling provider, and a branching route is more to reason about and debug than a single call.

Frequently asked

How do I fail over from one AI provider to another?
Build a Dynamic Routing flow where a primary Model node's fallback output points at a second Model node on a different provider; it fires when the primary errors or times out after its retries. Failing over to a different provider is the point — a same-provider fallback won't help during that provider's outage.
Is fallback automatic, or do I configure it?
You configure it, per node. Each Model node has its own optional fallback path, and gateway-level automatic retries handle transient failures first — retries are exhausted before a fallback triggers.
What's the difference between BYOK and Unified Billing?
BYOK stores your provider keys with Cloudflare and injects them at runtime; Unified Billing uses Cloudflare-managed credentials and prepaid credits (with a five-percent credit fee, covering a handful of providers). Zero Data Retention is separate again and applies only to Unified Billing traffic, not BYOK.
Can I A/B test two models?
Yes — a Percentage node splits traffic across weighted outputs to different models. Choose the winner on measured quality rather than impression, which the evals spoke covers.
Does routing through the gateway add latency or risk?
Retries and fallbacks add latency on failure paths, and the gateway is a dependency in front of all your providers even as it adds redundancy across them. Both are real; weigh the cross-provider resilience against the single-control-plane concentration.

Working with Truvisory

If a provider outage would take your AI feature down with it, see how we build routed, resilient AI systems on Cloudflare — with automatic fallback across providers, tuned retries, and centrally managed keys, so one provider’s bad day isn’t yours.

Truvisory is a Denver-based AI and automation consultancy run by a senior operator — a combat veteran and former PE-backed operating executive — who ships working software, not strategy decks. Cloudflare-native by default, for both AI delivery and the back-office automation where the ROI lives.