LLM Cost Control for SaaS: Token Budgets, Caching, and Graceful Fallbacks
A B2B note-taking SaaS I helped untangle in early 2026 had built a beautiful AI summarisation feature on top of a frontier-class model and priced it at €19 per seat. Two months after launch, finance noticed that gross margin on AI plans had fallen to 38%. One enterprise account was net-negative on every invoice. The founder's question was the right one: "How is LLM cost optimization not a solved problem yet?"
It is a solved problem. It is just one that most engineering teams discover after they ship, when usage spikes and cost reports start arriving in shapes finance has never seen before. By 2026, every B2B product with a meaningful AI feature has a token budgeting story, a prompt caching layer, a model routing policy, and a fallback runbook for when the provider has a bad afternoon. Teams that skipped any of those pieces are the ones doing emergency replatforms a quarter later.
This post is a practical guide to keeping LLM unit economics under control: how to design token budgets per tier without breaking the user experience, how to actually realise the savings prompt caching promises, how to route requests across cheap and expensive models, and how to fail gracefully when costs spike or providers go down.
The actual cost shape of an LLM feature
Before designing budgets, it is worth being honest about where the money goes. The bill for a typical RAG-backed SaaS feature has four layers.
Direct inference cost dominates: input tokens for the system prompt, retrieved context, and conversation history; output tokens for the model's response. In 2026, frontier-class models sit in the €10–25 per million output tokens range, while mid-tier models (GPT-4.1-mini, Claude Sonnet 4.5, Gemini 2.5 Flash and their peers) run €0.50–3 per million output tokens. The 10x–30x gap between tiers is the single most exploitable cost lever in your stack.
Embedding cost is usually rounded to zero by tutorials, but it becomes a steady drip in production — €0.02–0.10 per million input tokens at scale, paid on every reindex of every document, every time you change the chunking strategy.
Retry and tool-call cost is the real silent killer. A single chat turn that triggers two tool calls, a retry on a malformed JSON response, and an automatic context refresh can multiply spend by 4–6x over a clean turn. Most provider billing dashboards do not surface this. Your application's own observability has to.
Storage and infra — the vector database, caches, queues, the observability stack itself — round out the bill. It is small as a percentage of total spend in the early stages, but it is the part that should survive a cost optimisation pass intact, because it is what makes everything else cheap.
If your engineering team cannot break down last week's LLM spend across these four buckets per feature and per customer tier, every cost decision after that is a guess. A focused code quality audit on the AI layer usually finds two or three structural issues that explain 60% of unexplained spend within a day.
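A concrete version of that test, as a sketch: assuming the AI layer writes every call into an llm_calls log table on settlement (the table and column names here are illustrative, not a standard schema), the breakdown is one query.
// Last week's spend by bucket, feature, and tier, pulled from a hypothetical
// llm_calls table that the AI layer writes on every settled call.
$rows = $db->fetchAllAssociative(<<<'SQL'
    SELECT feature,
           customer_tier,
           cost_bucket,                 -- inference | embedding | retry_tool | infra
           SUM(cost_cents) / 100.0 AS eur
      FROM llm_calls
     WHERE called_at >= now() - INTERVAL '7 days'
     GROUP BY feature, customer_tier, cost_bucket
     ORDER BY eur DESC
SQL);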
Token budgets per user tier
The first instinct most teams have is to cap raw token spend with a hard cutoff. That works for protecting the platform, but it ruins the user experience. A better model treats budgets as a resource the product surfaces, like API quota or storage — visible, predictable, and tier-shaped.
A workable design uses budgets per user that are computed from billing tier and refreshed on a fixed cadence. Two design rules are non-negotiable. First, account in money, not tokens: token costs change every quarter, prices vary by model, and the only number finance cares about is euros per user. Second, charge before the call and refund the unused portion on completion: accounting after the fact lets a buggy retry storm blow through a budget while the meter is still catching up.
A minimal Symfony implementation:
// src/Service/LlmBudget.php
namespace App\Service;

use Doctrine\DBAL\Connection;
use Psr\Clock\ClockInterface as Clock;

// EurAmount, Reservation, and BudgetExhaustedException are app-level value
// objects; all amounts are stored as integer cents.
final class LlmBudget
{
    public function __construct(
        private readonly Connection $db,
        private readonly Clock $clock,
    ) {}

    public function reserve(string $userId, EurAmount $estimated): Reservation
    {
        // currentPeriodFor() resolves the user's current billing period (omitted).
        $period = $this->currentPeriodFor($userId);

        return $this->db->transactional(function (Connection $tx) use ($userId, $period, $estimated) {
            // Lock the budget row so concurrent reservations serialise.
            $row = $tx->fetchAssociative(
                'SELECT remaining_cents FROM llm_budgets
                  WHERE user_id = ? AND period_id = ? FOR UPDATE',
                [$userId, $period],
            );

            // Refuse when the estimate does not fit in what remains.
            if ($row === false || $row['remaining_cents'] < $estimated->cents()) {
                throw new BudgetExhaustedException($userId);
            }

            $tx->executeStatement(
                'UPDATE llm_budgets SET reserved_cents = reserved_cents + ?,
                        remaining_cents = remaining_cents - ?
                  WHERE user_id = ? AND period_id = ?',
                [$estimated->cents(), $estimated->cents(), $userId, $period],
            );

            return new Reservation($userId, $period, $estimated, $this->clock->now());
        });
    }

    public function settle(Reservation $r, EurAmount $actual): void
    {
        // Refund the difference between reserved and actual on completion.
        $delta = $r->estimated->cents() - $actual->cents();

        $this->db->executeStatement(
            'UPDATE llm_budgets SET reserved_cents = reserved_cents - ?,
                    spent_cents = spent_cents + ?,
                    remaining_cents = remaining_cents + ?
              WHERE user_id = ? AND period_id = ?',
            [$r->estimated->cents(), $actual->cents(), $delta, $r->userId, $r->period],
        );
    }
}
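Wiring the budget around a call then looks like this — estimateCostFor(), the $llm client, and EurAmount::zero() are placeholders for your own estimation and client code:
// Charge before the call; settle after, refunding the unspent difference.
$reservation = $budget->reserve($userId, $this->estimateCostFor($prompt));

try {
    $result = $this->llm->complete($model, $prompt);
    $budget->settle($reservation, $result->actualCost());
} catch (\Throwable $e) {
    // The call never completed: refund the full reservation.
    $budget->settle($reservation, EurAmount::zero());
    throw $e;
}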
Tier shape matters too. The pattern that scales: free tier gets a small monthly budget (€0.50) sufficient for a daily-active demo experience. Standard gets enough budget for the documented use case plus 50% headroom (typically €5–15 per seat per month at SaaS pricing). Power tier doubles that and surfaces an upgrade prompt instead of a hard block. Enterprise customers get a usage-based meter with an alerting threshold rather than a cap. Each tier is engineered to make the next tier obviously better at the moment of friction, not punitive at any point.
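As a sketch, those tier shapes can live in one policy class — the Tier enum and the amounts are illustrative, tune them to your own pricing:
// src/Service/TierBudgetPolicy.php
enum Tier: string
{
    case Free = 'free';
    case Standard = 'standard';
    case Power = 'power';
    case Enterprise = 'enterprise';
}

final class TierBudgetPolicy
{
    /** Monthly budget in cents; null means a usage meter with alerting, no hard cap. */
    public function monthlyBudgetCents(Tier $tier): ?int
    {
        return match ($tier) {
            Tier::Free => 50,          // €0.50: enough for a daily-active demo
            Tier::Standard => 1_000,   // €10: documented use case plus headroom
            Tier::Power => 2_000,      // €20: upgrade prompt instead of a hard block
            Tier::Enterprise => null,  // metered with an alerting threshold
        };
    }
}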
Prompt caching that actually saves money
Most teams enable prompt caching by adding a cache flag to a system prompt and assume they have done the work. Real savings come from designing the prompt structure for caching, not flipping a switch.
Three principles separate teams that get 70%+ savings on cached calls from those that get 5%.
Stable prefixes, dynamic suffixes. The cache key is effectively a prefix hash. Anything that varies — user id, timestamp, session token, retrieved chunks — must come after the stable bulk of the prompt (system instructions, role definitions, few-shot examples, glossary); a sketch of this ordering follows the three principles below. A common anti-pattern is interleaving user context with system instructions; that defeats the cache for every request and makes the bill linear in token volume again.
Long enough to matter. Most providers cache only when the cached portion crosses a minimum token threshold (around 1024 tokens at the major providers in 2026). Teams with short, terse system prompts often pay full price even with caching enabled. Padding the prompt with high-quality instructions and few-shot examples — which improves quality anyway — is usually the right call.
TTL alignment. Caches expire on a 5–15 minute idle window with most providers. Workloads with bursty access patterns benefit dramatically; sparse, long-tail workloads benefit far less. If your traffic shape is sparse, batch: group N independent requests into a small worker that processes them within a single TTL window. The infra cost of the queue is trivial against the inference savings.
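A minimal sketch of the first principle in practice. The cache_control marker follows Anthropic's prompt-caching request shape; OpenAI and most peers cache implicitly on prefix match, so the ordering discipline is the same even where the marker is not needed:
// Cache-friendly prompt assembly: stable bulk first, per-request data last.
function buildCacheFriendlyPayload(
    string $systemPrompt,     // instructions, roles, glossary — stable
    array $fewShotExamples,   // stable, and pads the prefix past the caching threshold
    array $retrievedChunks,   // varies per request
    string $userQuestion,     // varies per request
): array {
    return [
        'system' => [
            [
                'type' => 'text',
                'text' => $systemPrompt . "\n\n" . implode("\n\n", $fewShotExamples),
                // Anthropic-style marker; the whole block above it is cacheable.
                'cache_control' => ['type' => 'ephemeral'],
            ],
        ],
        'messages' => [
            [
                'role' => 'user',
                // Everything that varies comes after the stable prefix.
                'content' => implode("\n\n", $retrievedChunks) . "\n\n" . $userQuestion,
            ],
        ],
    ];
}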
The practical impact, in numbers I have measured on real workloads this year: a well-structured RAG prompt with 3,000 tokens of system + retrieval context, called by 50 users in a 10-minute window, drops from roughly €0.12 per call to €0.018 per call once the cache warms. That is not a rounding error — it is the difference between a viable feature and a feature that ships off-roadmap a quarter after launch.
Model routing: cheap first, escalate on failure
The single largest cost saving in most AI features is the simplest pattern: try the cheap model first, escalate to the expensive one only when the cheap one is not good enough.
The implementation has three pieces. A router that picks the model. A quality probe that detects when the cheap model failed. An escalation policy that decides when to retry, when to escalate, and when to give up.
A minimal router for a Symfony Messenger handler:
// src/MessageHandler/LlmRoutingHandler.php
namespace App\MessageHandler;

use Symfony\Component\Messenger\Attribute\AsMessageHandler;
use Symfony\Component\Messenger\MessageBusInterface;

#[AsMessageHandler]
final class LlmRoutingHandler
{
    public function __construct(
        private readonly LlmClient $llm,          // app-level provider abstraction
        private readonly QualityProbe $qualityProbe,
        private readonly MessageBusInterface $bus,
    ) {}

    public function __invoke(GenerateSummary $msg): void
    {
        $ladder = ['gpt-4.1-mini', 'gpt-5']; // cheap → expensive
        $attempts = [];

        foreach ($ladder as $model) {
            try {
                $result = $this->llm->complete($model, $msg->prompt);
            } catch (ProviderException $e) {
                // Provider error: record it and climb to the next rung.
                $attempts[] = ['model' => $model, 'error' => $e->getMessage()];
                continue;
            }

            $quality = $this->qualityProbe->score($result, $msg->expectations);
            $attempts[] = ['model' => $model, 'quality' => $quality->value()];

            if ($quality->isAcceptable()) {
                $this->bus->dispatch(new SummaryReady($msg->id, $result, $model, $attempts));
                return;
            }
        }

        // Every rung failed or scored below threshold: surface a typed failure.
        $this->bus->dispatch(new SummaryFailed($msg->id, $attempts));
    }
}
The quality probe is the load-bearing piece. For structured outputs, it is a JSON-schema validator plus a length sanity check. For free-form generations, it is usually a small classifier — a fine-tuned embedding model with a calibrated threshold runs at near-zero cost and catches the most common cheap-model failure modes (refusal, hallucinated structure, language drift, truncation). For agentic workflows, the probe checks whether the requested tool calls were made and whether they returned successful results.
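For the structured-output case, the probe can be small enough to read in one screen. A sketch — QualityScore::accept()/reject() and the default key list are illustrative app-level conventions, not a library API:
// src/Service/StructuredQualityProbe.php
final class StructuredQualityProbe
{
    public function __construct(
        private readonly array $defaultRequiredKeys = ['title', 'summary'],
        private readonly int $minChars = 50,
        private readonly int $maxChars = 8_000,
    ) {}

    public function score(string $raw, array $expectations): QualityScore
    {
        $decoded = json_decode($raw, true);

        // Malformed JSON is the most common cheap-model failure mode.
        if (!is_array($decoded)) {
            return QualityScore::reject('malformed_json');
        }

        $required = $expectations['required_keys'] ?? $this->defaultRequiredKeys;
        foreach ($required as $key) {
            if (!array_key_exists($key, $decoded)) {
                return QualityScore::reject('missing_key:' . $key);
            }
        }

        // Length sanity check: catches truncation and runaway generations.
        $len = mb_strlen($raw);
        if ($len < $this->minChars || $len > $this->maxChars) {
            return QualityScore::reject('length_out_of_bounds');
        }

        return QualityScore::accept();
    }
}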
A useful rule of thumb from production data: in well-instrumented SaaS, 70–85% of requests are handled adequately by the mid-tier model. Routing at that hit rate against a 10x price gap is a 60–70% reduction in inference spend with no perceptible quality loss for most users. The remaining 15–30% of requests get escalated to the expensive model, and because the escalation is data-driven rather than blanket-applied, the cost stays bounded even under traffic spikes.
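The arithmetic behind that claim, spelled out with illustrative numbers — escalated requests pay for the failed cheap attempt plus the expensive retry:
// Blended cost per request when routing cheap-first.
$cheap = 1.0;      // cost units per request on the mid-tier model
$expensive = 10.0; // the 10x price gap
$hitRate = 0.80;   // share handled adequately by the cheap model

$blended = $hitRate * $cheap + (1 - $hitRate) * ($cheap + $expensive);
// 0.8 * 1 + 0.2 * 11 = 3.0, versus 10.0 for sending everything to the
// expensive model: a 70% reduction. At a 70% hit rate the same sum
// gives 4.0, i.e. a 60% reduction.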
Graceful fallbacks for the bad days
Cost spikes and provider outages arrive together more often than you would think. The fallback layer is what keeps the product useful — and the bill bounded — when either happens.
Three patterns earn their keep.
Provider failover. Every serious production stack in 2026 routes across at least two providers — typically OpenAI plus Anthropic, plus a self-hosted option for sensitive payloads. The router treats them as a ladder for availability, not just for cost. A seven-minute Anthropic outage stops being a customer-impacting incident if the router has already shifted traffic to OpenAI within the first 30 seconds of elevated error rates. The cost of the second integration is paid back the first time the primary provider has a bad hour.
Static fallbacks for high-traffic prompts. Some product surfaces — onboarding tours, default summaries, empty-state suggestions, sample outputs in marketing pages — do not actually need a fresh inference per request. Pre-generating a small library of variations and serving them from cache when the budget is tight or the provider is degraded keeps the UX intact at zero marginal cost (sketched below). Most teams that ship this discover that 20–30% of their LLM calls were always going to land on the same set of inputs anyway.
Graceful degradation messages. When a budget is exhausted and there is no static fallback available, the product needs to tell the user what happened in a way that does not look like a bug. "AI summarisation is paused for the rest of this billing period — upgrade for unlimited, or it resets in three days" is a much better customer experience than a spinning loader, and it converts. The error path is a product surface, not a 500.
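A sketch of the static fallback path — the surface keys and in-memory library are illustrative; in production the variations usually live in Redis or the database:
// src/Service/StaticFallbackLibrary.php
final class StaticFallbackLibrary
{
    /** @param array<string, list<string>> $library surface key => pre-generated variations */
    public function __construct(private readonly array $library) {}

    public function serve(string $surface): ?string
    {
        $variations = $this->library[$surface] ?? [];
        if ($variations === []) {
            return null; // nothing canned for this surface: show the degradation message instead
        }

        // Pick a random variation so repeat visitors still see variety,
        // at zero marginal inference cost.
        return $variations[array_rand($variations)];
    }
}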
The fallback policy belongs in code, not a runbook. A small state machine — provider_healthy, provider_degraded, budget_warning, budget_exhausted, static_only — driven by health probes and budget reads gives the front-end a predictable contract to render against, and gives oncall a single graph to look at when something goes sideways.
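A sketch of that state machine as a PHP enum. The ProviderHealth and BudgetStatus inputs are assumptions about what your health probes and budget reads expose:
// src/Service/FallbackState.php
enum FallbackState: string
{
    case ProviderHealthy = 'provider_healthy';
    case ProviderDegraded = 'provider_degraded';
    case BudgetWarning = 'budget_warning';
    case BudgetExhausted = 'budget_exhausted';
    case StaticOnly = 'static_only';

    public static function fromProbes(ProviderHealth $health, BudgetStatus $budget): self
    {
        // Availability dominates budget: with no provider reachable,
        // only the static library can serve.
        return match (true) {
            !$health->anyProviderUp() => self::StaticOnly,
            $budget->isExhausted() => self::BudgetExhausted,
            $health->primaryDegraded() => self::ProviderDegraded,
            $budget->belowWarningThreshold() => self::BudgetWarning,
            default => self::ProviderHealthy,
        };
    }
}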
What to ship first
For a SaaS team that just discovered last quarter's LLM bill is uncomfortable, the work sequences cleanly. Instrument first — per-feature, per-user, per-model attribution into your warehouse. Then add money-denominated budgets per tier, with reservations and refunds, not just spend logging. Then ship a model router with a quality probe, because that is the single biggest cost saving and it does not require any prompt redesign. Then move to prompt caching once the routing data tells you which prompts are stable enough to cache. Then add provider failover. Each piece compounds the previous; skipping the instrumentation step makes everything that follows a guess.
If the AI feature was bolted onto a pre-existing PHP or Symfony product without a clean abstraction layer, this is also the moment to design one — a single LlmGateway that owns budgets, routing, caching, retries, and observability, with the rest of the product calling it through one interface. That kind of architectural cleanup pairs well with custom software development work, because the AI gateway is exactly the surface where business logic, infrastructure, and pricing meet.
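A sketch of that surface — the names are illustrative; the point is the single choke point, not the exact signatures:
// src/Llm/LlmGateway.php
interface LlmGateway
{
    /**
     * Runs a completion under the caller's budget: reserves before the call,
     * routes cheap-first, applies the fallback policy, settles on completion.
     *
     * @throws BudgetExhaustedException when the tier budget cannot cover the call
     */
    public function complete(LlmRequest $request, string $userId): LlmResult;

    /** Current fallback state, so the UI can render degraded modes honestly. */
    public function state(string $userId): FallbackState;
}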
The strategic frame is straightforward. AI features that ship without a cost story are loans the engineering team is taking out against next quarter's margin. The teams that quietly win in 2026 are the ones who treat LLM unit economics as a first-class product concern from the first commit, not a fire to be put out at the end of the runway.
Wolf-Tech helps European SaaS teams ship AI features with predictable gross margin — token budgeting, prompt caching, model routing, fallback architectures, and observability across PHP/Symfony backends and Next.js frontends. Contact us at hello@wolf-tech.io or visit wolf-tech.io for a free consultation.