LLM Fallback Patterns: Keeping AI Features Up When the Model Provider Goes Down

Sandor Farkas - Founder & Lead Developer at Wolf-Tech


A logistics customer called us on a Tuesday afternoon because their AI-powered shipment classifier had stopped responding. Nothing in their code had changed. Their model provider was having a regional incident, requests were timing out after thirty seconds, and the feature that sorted incoming orders had simply gone dark. Support tickets piled up while an engineer refreshed a status page. The fix was not a better model or a bigger budget. It was a set of LLM fallback patterns that the feature should have shipped with on day one, so that a provider having a bad afternoon degrades into a slower experience rather than a dead one.

This is the uncomfortable truth about building on top of large language models: you have taken a hard dependency on an external system that will, at some point, return errors, throttle you, or answer so slowly that the answer is useless. Treating that as an edge case is how you end up explaining an outage you did not cause. Treating it as a design constraint from the start is how you build AI features that stay usable when the thing underneath them wobbles.

Why LLM fallback patterns are not optional

Traditional APIs fail in ways most teams already handle. A payment gateway times out, you retry, you show a friendly error. LLM providers add failure modes that are easy to underestimate. Rate limits arrive in bursts when your traffic spikes, which is exactly when you least want to drop requests. Latency is variable by design, so a call that usually returns in two seconds can take twenty under load. Context windows and token limits reject requests that worked fine yesterday because today's input is longer. And whole regions can go offline, or model versions can be deprecated, with little notice.

The blast radius is also different. When an AI feature is core to a workflow, its failure is not cosmetic. The shipment classifier above was not a nice-to-have summary box in a corner of the screen. It sat in the critical path of order intake, so its outage became an operational outage. The more central the feature, the more its reliability has to be engineered deliberately rather than assumed. That engineering work is what separates a demo from a production system, and it is the same discipline we bring to any custom software development project where an external dependency sits on the critical path.

The fallback chain: primary, secondary, degraded

The core pattern is a chain of options ordered from best to acceptable. Your code attempts the first, and on failure moves to the next, until it reaches an option that always works because it does not depend on the provider at all.

A typical chain has three tiers. The primary is your preferred model, chosen for quality. The secondary is a different model, ideally from a different provider, so that a single vendor incident does not take out both links. The final tier is a non-AI fallback: a cached previous answer, a simpler rules-based result, or an honest message that hands control back to the user. The point of the last tier is that it has no external dependency, so the chain can always terminate in something the user can act on.
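The three tiers can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a finished implementation: `primary`, `secondary`, and `manual_queue` are hypothetical stand-ins for real provider calls and a real degraded path, and the broad `except Exception` would be narrowed to specific error types in production.

```python
from typing import Callable, Sequence, Tuple

# A tier is a (name, handler) pair. Handlers are hypothetical stand-ins for
# real provider calls: each takes the input text and returns a label or raises.
Tier = Tuple[str, Callable[[str], str]]


def run_chain(
    text: str,
    tiers: Sequence[Tier],
    degraded: Callable[[str], str],
) -> Tuple[str, str]:
    """Try each tier in order, then fall through to the degraded tier.

    Returns (tier_name, result). The degraded handler must not depend on
    any external provider, so the chain can always terminate.
    """
    for name, handler in tiers:
        try:
            return name, handler(text)
        except Exception:
            # Simplification: real code would log the failure and catch
            # narrower exception types than Exception.
            continue
    return "degraded", degraded(text)


def primary(text: str) -> str:
    raise TimeoutError("primary provider timed out")  # simulated incident


def secondary(text: str) -> str:
    return "standard-freight"  # simulated answer from a second provider


def manual_queue(text: str) -> str:
    return "needs-manual-review"  # no external dependency


tier, label = run_chain(
    "2 pallets, Hamburg to Lyon",
    [("primary", primary), ("secondary", secondary)],
    manual_queue,
)
print(tier, label)  # prints "secondary standard-freight"
```

Returning the tier name alongside the result is a deliberate choice: it lets the caller tell the user, and your metrics, which link of the chain actually answered.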

Provider diversity is what makes the middle tier worth having. If your primary and secondary models are both hosted by the same vendor, a regional outage takes both down together and your chain collapses to the degraded tier immediately. Routing the secondary through an independent provider, even one with a slightly weaker model, means most incidents are invisible to users because the secondary quietly absorbs them. Choosing which providers to pair, and how to keep prompts portable across them, is a tech stack strategy decision worth making before you are in the middle of an incident.

One caution: prompts are not perfectly portable. A prompt tuned for one model's quirks can produce weaker output on another. Keep a provider-neutral version of each prompt and test it against every model in the chain, so the secondary actually produces usable results rather than confidently wrong ones. Validate the structure of what comes back at each tier, because a fallback that returns malformed JSON is not really a fallback.
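Structure validation at each tier can be as small as the sketch below. The expected shape (a `label` from a fixed set plus a `confidence` between 0 and 1) and the label names are assumptions made for illustration; your own schema will differ.

```python
import json
from typing import Optional

# Hypothetical label set for a shipment classifier.
ALLOWED_LABELS = {"standard", "express", "hazardous"}


def parse_classification(raw: str) -> Optional[dict]:
    """Return the parsed result if it has the expected shape, else None.

    A None result should be treated like any other tier failure, so the
    chain moves on instead of passing malformed output downstream.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    label = data.get("label")
    confidence = data.get("confidence")

    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return {"label": label, "confidence": float(confidence)}
```

Running the same validator against every model in the chain is also a cheap way to test prompt portability: a secondary model that fails validation often on your test set is not yet a usable fallback.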

Retry logic that helps instead of hurts

Retrying is the first instinct when a call fails, and done badly it makes outages worse. Immediate retries against a provider that is already overloaded add load to a struggling system and can extend the incident. The discipline that prevents this is exponential backoff with jitter: wait a short interval before the first retry, double it for each subsequent attempt, and add a small random offset so that thousands of clients do not retry in lockstep and create a synchronised thundering herd.
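The delay calculation itself is short. Here is a sketch, assuming a half-second base delay, a thirty-second cap, and jitter of up to 25 percent of the computed delay; all three numbers are illustrative defaults rather than recommendations.

```python
import random


def backoff_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 30.0,
    jitter: float = 0.25,
) -> float:
    """Seconds to wait before retry number `attempt` (0 is the first retry).

    The delay doubles with each attempt up to `cap`, then a random offset
    of up to `jitter` times the delay is added so clients spread out.
    """
    exponential = min(cap, base * (2 ** attempt))
    return exponential + random.uniform(0, jitter * exponential)


for attempt in range(4):
    print(f"retry {attempt}: wait about {backoff_delay(attempt):.2f}s")
```

The cap matters as much as the doubling: without it, a handful of attempts pushes the wait well past anything a user will tolerate.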

Equally important is knowing what not to retry. A rate-limit response (typically HTTP 429) or a transient timeout is worth retrying because the next attempt may well succeed. A malformed-request error or an authentication failure (typically HTTP 400 or 401) is not, because the same input will fail the same way every time, and retrying just wastes the user's time and your budget. Read the error, classify it as transient or permanent, and only loop on the transient class. Cap the number of attempts and the total time you are willing to spend, because a retry loop that runs longer than the user will wait has failed even if it eventually succeeds.
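A retry loop that follows these rules might look like the sketch below. `TransientError` and `PermanentError` are hypothetical exception classes; in practice you would map your provider SDK's errors onto them. The attempt cap and time budget are illustrative values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical: rate limit or timeout, worth retrying."""


class PermanentError(Exception):
    """Hypothetical: malformed request or auth failure, never retried."""


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    budget_seconds: float = 10.0,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Retry transient failures with backoff, within attempt and time caps."""
    deadline = clock() + budget_seconds
    for attempt in range(max_attempts):
        try:
            return fn()
        except PermanentError:
            raise  # the same input will fail the same way: give up now
        except TransientError:
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            out_of_attempts = attempt == max_attempts - 1
            out_of_time = clock() + delay > deadline
            if out_of_attempts or out_of_time:
                raise  # let the caller move to the next tier
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

When the loop gives up it re-raises the last error rather than returning a sentinel, so the surrounding fallback chain can treat an exhausted retry exactly like any other tier failure.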

A circuit breaker sits above all of this. When a provider has failed repeatedly in a short window, stop sending it traffic for a cooldown period and route straight to the next tier of the chain. This protects both sides. It stops you from hammering a downed provider, and it stops your users from waiting through a doomed retry sequence on every request. After the cooldown, let a small number of trial requests through to test whether the provider has recovered before you reopen the floodgates.
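A minimal breaker is sketched below, assuming a threshold of five failures within thirty seconds and a sixty-second cooldown (illustrative numbers). It simplifies one thing: after the cooldown it allows requests through until the next failure is recorded, rather than limiting the number of trial requests, and it is not thread-safe.

```python
import time
from typing import Callable, List, Optional


class CircuitBreaker:
    """Stops traffic to a provider after repeated failures in a short window."""

    def __init__(
        self,
        failure_threshold: int = 5,
        window_seconds: float = 30.0,
        cooldown_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures: List[float] = []  # timestamps of recent failures
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """True if the provider may be called; False means skip to the next tier."""
        if self._opened_at is None:
            return True
        # Once the cooldown has passed, let trial requests through.
        return self._clock() - self._opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        """A success closes the breaker and clears the failure history."""
        self._failures.clear()
        self._opened_at = None

    def record_failure(self) -> None:
        now = self._clock()
        if self._opened_at is not None:
            # A failed trial request restarts the cooldown.
            self._opened_at = now
            return
        # Keep only failures inside the window, then add this one.
        self._failures = [t for t in self._failures if now - t <= self.window_seconds]
        self._failures.append(now)
        if len(self._failures) >= self.failure_threshold:
            self._opened_at = now
```

In use, the caller checks `allow_request()` before each provider call, goes straight to the next tier when it returns False, and reports the outcome of every call it does make through `record_success()` or `record_failure()`.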

Graceful degradation is a product decision

The most important part of resilience is deciding, before an incident, what "still working" means for each feature when the AI layer is unavailable. This is a product question as much as an engineering one, and it deserves a real answer rather than a generic error page.

Graceful degradation usually takes one of a few shapes. A cached or recent result can stand in when freshness is not critical, so a user sees yesterday's summary rather than nothing. A simpler deterministic path can replace the model entirely, such as keyword rules where an LLM normally classifies, accepting lower accuracy in exchange for guaranteed availability. Or the feature can step aside cleanly and let the user do the task manually, which is far better than a spinner that never resolves. The classifier customer landed on a manual queue: when the model is unavailable, new orders drop into a list a human can sort, and the system tells the operator exactly that. Throughput drops, but the business keeps running.
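Those three shapes can be combined into a single degraded path. The sketch below is an illustration under assumptions, not the customer's actual implementation: the keyword rules, the label names, and the in-memory `cache` and `manual_queue` are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical keyword rules; a real rule set would come from the people
# who know the domain, and would accept lower accuracy than the model.
KEYWORD_RULES: Dict[str, List[str]] = {
    "hazardous": ["flammable", "corrosive", "lithium"],
    "express": ["overnight", "urgent", "same-day"],
}


def degraded_classify(
    order_id: str,
    text: str,
    cache: Dict[str, str],
    manual_queue: List[str],
) -> Tuple[Optional[str], str]:
    """Classify without any model call. Returns (label, source).

    Order of preference: a cached earlier answer, then keyword rules,
    then the manual queue, where label is None and a human takes over.
    """
    if order_id in cache:
        return cache[order_id], "cache"

    lowered = text.lower()
    for label, keywords in KEYWORD_RULES.items():
        if any(keyword in lowered for keyword in keywords):
            return label, "rules"

    manual_queue.append(order_id)
    return None, "manual"
```

The `source` value is what lets the interface be honest: a result marked `rules` or `manual` can be shown with a note that the AI assist is temporarily unavailable.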

Whatever shape you choose, be honest in the interface. A short, specific message that the AI assist is temporarily unavailable, paired with a working manual path, preserves trust. Silent failures and infinite spinners destroy it. The teams that handle this well decide the degraded experience deliberately and design for it, the same way they would design any other web application state, rather than discovering it live during an outage.

Make failure observable before it happens

You cannot manage failures you cannot see. Instrument every call to the provider with its outcome, latency, which tier of the chain served the response, and the token counts involved. This turns a vague sense that "the AI feels slow today" into a concrete signal you can alert on. When fallbacks start firing more often than usual, that is your early warning that a provider is degrading, often before their own status page admits it.
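One way to capture those fields is a thin wrapper around every provider call. This is a sketch using only the standard library; the convention that the wrapped function returns `(result, prompt_tokens, completion_tokens)` is an assumption made for the example, as are the field names.

```python
import json
import logging
import time
from dataclasses import asdict, dataclass
from typing import Any, Callable, Tuple

logger = logging.getLogger("llm_calls")


@dataclass
class CallRecord:
    tier: str               # which link of the chain handled the call
    outcome: str            # "ok" or the exception class name
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int


def log_record(record: CallRecord) -> None:
    """Default sink: one JSON line per call, easy to parse and alert on."""
    logger.info(json.dumps(asdict(record)))


def instrumented_call(
    tier: str,
    fn: Callable[[], Tuple[Any, int, int]],
    emit: Callable[[CallRecord], None] = log_record,
) -> Any:
    """Run fn and emit a CallRecord whether it succeeds or raises.

    Assumes fn returns (result, prompt_tokens, completion_tokens).
    """
    start = time.perf_counter()
    outcome, prompt_tokens, completion_tokens = "error", 0, 0
    try:
        result, prompt_tokens, completion_tokens = fn()
        outcome = "ok"
        return result
    except Exception as exc:
        outcome = type(exc).__name__
        raise
    finally:
        latency_ms = (time.perf_counter() - start) * 1000
        emit(CallRecord(tier, outcome, latency_ms, prompt_tokens, completion_tokens))
```

Emitting from the `finally` block means failed calls are recorded with the same fields as successful ones, which is what makes error rates and fallback frequency comparable over time.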

Set thresholds that trigger alerts on elevated error rates, rising latency, and unusual fallback frequency, and log enough context to reconstruct what happened after the fact. A short runbook that names who responds, how to force traffic onto the secondary provider, and how to verify recovery turns a scramble into a procedure. Building that observability and the resilience patterns underneath it is exactly the kind of work a code quality consulting engagement surfaces, because the gaps are rarely in the happy path and almost always in the error handling that no one tested under real failure.
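As one example of such a threshold, the sketch below flags unusual fallback frequency over a sliding window of recent requests. The window size, the ten percent threshold, and the minimum sample count are illustrative assumptions; most teams would express the same rule in their monitoring system rather than in application code.

```python
from collections import deque
from typing import Deque


class FallbackRateMonitor:
    """Tracks how often recent requests were served by a non-primary tier."""

    def __init__(
        self,
        window_size: int = 200,
        threshold: float = 0.10,
        min_samples: int = 50,
    ) -> None:
        self.threshold = threshold
        self.min_samples = min_samples
        # True means the request was served by a fallback tier.
        self._served: Deque[bool] = deque(maxlen=window_size)

    def record(self, tier: str) -> bool:
        """Record which tier served a request; return True if an alert should fire."""
        self._served.append(tier != "primary")
        if len(self._served) < self.min_samples:
            return False  # too little data to judge
        fallback_rate = sum(self._served) / len(self._served)
        return fallback_rate > self.threshold
```

The minimum sample count keeps a single fallback on a quiet night from paging anyone, while the bounded window means the alert clears on its own once the primary recovers.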

Where to start

If your AI feature has no fallback today, you do not need to build all of this at once. Start with the single highest-impact step: add a non-AI degraded tier so the feature can never fully die, even if that tier is just a clear message and a manual path. Then add retry logic with backoff for transient errors, since that alone absorbs a surprising share of real-world blips. Add a second provider once the feature is important enough to justify the cost of keeping prompts portable. Layer in circuit breaking and observability as traffic grows. Each step is independently valuable, and the order keeps you protected against the worst outcome first.

The goal is not a perfect system that never sees a provider fail. Providers will fail, and that is outside your control. The goal is that when they do, your users experience a slower or simpler version of the feature instead of an outage, and your team gets a calm alert instead of a support flood.

If you are building AI features into a product and want a second set of eyes on the resilience design, or you are already living through the kind of afternoon described at the top of this post, we are happy to help. Reach out at hello@wolf-tech.io or read more about how we work at wolf-tech.io.