Prompt Caching in Production: Cutting LLM Latency and Cost Without Stale Answers

Sandor Farkas - Founder & Lead Developer at Wolf-Tech

Prompt caching is the cheapest performance win in most AI features, and the easiest one to get wrong. Done well, it can cut the time-to-first-token on a repeated request from around two seconds to under two hundred milliseconds and drop the input-token bill on that request by 80% or more. Done carelessly, it serves a user last week's answer to this week's question, leaks one tenant's data into another tenant's response, or quietly caches an error and replays it a thousand times. The difference is not the caching library. It is the discipline around what you key on, what you are allowed to reuse, and how you invalidate.

This post is a practical guide to running prompt caching in production: the two distinct things people mean by the term, where reuse is genuinely safe, how to construct a cache key that will not betray you, and how to invalidate without throwing away the savings you set out to capture.

Two different things people call prompt caching

The phrase covers two mechanisms with very different risk profiles, and conflating them is the root of most incidents.

The first is provider-side prefix caching. Anthropic, OpenAI, and Google all offer it in some form: Anthropic has you mark the end of a stable prefix with an explicit cache breakpoint, OpenAI detects repeated prefixes automatically, and Google's Gemini API supports both implicit and explicit context caching. In each case the provider keeps the computed attention state for the stable prefix warm, typically the system instructions plus retrieved context, and bills the repeated prefix at a steep discount. You still make an API call, the model still runs, and the output is still freshly generated. This is almost pure upside: you are not reusing an answer, only the work of reading the same preamble. The only real constraint is ordering. The cached content has to sit at the front of the prompt and stay byte-for-byte identical, so anything volatile, a timestamp, a request ID, the user's latest message, belongs after the cached prefix (after the cache breakpoint, where the provider uses one), never inside it.
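For a provider that uses explicit breakpoints, the ordering rule can be sketched as a request builder. This is a minimal sketch, not a client: the payload follows the shape of Anthropic's Messages API, where a `cache_control` marker on a system block ends the cacheable prefix, but the function name, the `max_tokens` value, and any model ID you pass in are illustrative assumptions, and no HTTP call is made.

```php
<?php
declare(strict_types=1);

/**
 * Builds a request body with a stable, cacheable prefix and volatile content
 * placed after it. Hypothetical helper; only the array shape mirrors the API.
 *
 * @return array<string, mixed>
 */
function buildCachedPrefixRequest(
    string $modelId,
    string $systemInstructions,
    string $referenceContext,
    string $userMessage,
    string $currentTime,
): array {
    return [
        'model' => $modelId,
        'max_tokens' => 1024, // arbitrary value for the sketch
        // Stable prefix: must be byte-for-byte identical on every call.
        'system' => [
            ['type' => 'text', 'text' => $systemInstructions],
            [
                'type' => 'text',
                'text' => $referenceContext,
                // Breakpoint: the prefix up to and including this block is cacheable.
                'cache_control' => ['type' => 'ephemeral'],
            ],
        ],
        // Volatile content (timestamp, latest user message) goes after the breakpoint.
        'messages' => [
            [
                'role' => 'user',
                'content' => "Current time: {$currentTime}\n\n{$userMessage}",
            ],
        ],
    ];
}
```

Two calls with different user messages and timestamps produce an identical `system` array, which is the property the provider needs in order to reuse the prefix.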

The second is application-side response caching: you store the model's full output and return it directly on a later request without calling the model at all. This is where the latency and cost wins are largest, because you skip inference entirely, and it is also where every correctness and security failure lives. The rest of this post is mostly about doing the second kind safely, because the first kind rarely hurts anyone.

Where response caching is actually safe

The honest test for whether a response is cacheable is a single question: if two requests produce the same key, is it always correct to give them the same answer? If you cannot say yes without caveats, you do not have a cache, you have a bug with a hit rate.

Three categories pass the test cleanly. Deterministic transformations are the safest: classification, extraction, translation, and reformatting of a fixed input, run at temperature zero, will produce the same useful output every time, so caching them is just memoisation. (Temperature zero does not guarantee bit-identical text from most hosted models, but any of the outputs is an equally valid answer to serve, which is what matters for a cache.) Expensive idempotent enrichment is next: embeddings for a document chunk, a summary of an immutable PDF, or a generated description of a product that has not changed. These cost real money to compute and do not need to change until the source, or the model that produced them, does. Finally, shared reference answers that do not depend on the individual user, the kind of "explain this concept" or "what does this error mean" content that is identical for everyone, can be cached across your whole user base.

Two categories almost always fail the test. Personalised generation that folds in user history, account state, or permissions must never share a cache entry across users, and usually not across sessions, because the inputs that make it correct are exactly the inputs that make it unshareable. And anything time-sensitive, a request whose right answer changes as underlying data changes, is only cacheable if your invalidation is tied to that data, which is the hard part covered below.
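The bucketing above can be written down as code so that every AI call has to declare where it falls. The following is a minimal sketch under assumptions: `CacheScope`, `CallProfile`, and `classifyCall` are hypothetical names, and the three flags are a simplification of the categories described in the text.

```php
<?php
declare(strict_types=1);

enum CacheScope: string
{
    case Shared = 'shared';                     // same key always means same correct answer
    case VersionedByData = 'versioned_by_data'; // cacheable only with the data version in the key
    case None = 'none';                         // never served from a response cache
}

/** Hypothetical description of one AI call type. */
final class CallProfile
{
    public function __construct(
        public readonly bool $usesUserState,        // history, account state, permissions
        public readonly bool $timeSensitive,        // right answer changes as data changes
        public readonly bool $hasInvalidationEvent, // a data-change event we can tie entries to
    ) {
    }
}

function classifyCall(CallProfile $profile): CacheScope
{
    // Personalised generation: the inputs that make it correct make it unshareable.
    if ($profile->usesUserState) {
        return CacheScope::None;
    }

    // Time-sensitive answers are cacheable only if invalidation is tied to the data.
    if ($profile->timeSensitive) {
        return $profile->hasInvalidationEvent
            ? CacheScope::VersionedByData
            : CacheScope::None;
    }

    // Deterministic transformations, idempotent enrichment, shared reference answers.
    return CacheScope::Shared;
}
```

A temperature-zero classifier would be `classifyCall(new CallProfile(false, false, false))`, which yields `CacheScope::Shared`; a call that folds in user history yields `CacheScope::None` regardless of the other flags.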

If your team cannot articulate which bucket each AI call falls into, that is the first thing to fix, and it is usually where a focused code quality audit of the AI layer pays for itself within a day. The audit question is blunt: show me the cache key for each cached call, and prove two different users cannot collide on it.

Building a cache key that will not betray you

A cache key is a promise: everything that can change the correct answer is in the key, and nothing that should be private is shared by omission. Most cache incidents are a field that should have been in the key and was not.

A complete key for a response cache has four parts. The semantic input is the actual content: the normalised user query or the document hash, not the raw string, because trailing whitespace and casing should not fragment your hit rate. The model identity matters because GPT-4.1 and its successor do not produce interchangeable output, so the model name and version belong in the key, otherwise a model upgrade silently serves you yesterday's model's answers. The behaviour parameters, temperature, max tokens, the system-prompt version, and any tool configuration, change the output and so change the key. And the tenancy and permission scope, the tenant ID and a role or permission fingerprint, is the field people forget and the one that turns a cache into a data leak.

A defensive helper in Symfony makes the contract explicit:

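<?php
// src/Service/PromptCacheKey.php
declare(strict_types=1);

final class PromptCacheKey
{
    public function build(CacheableRequest $r): string
    {
        $parts = [
            'v3',                          // bump to invalidate everything
            $r->tenantId,                  // never share across tenants
            $r->permissionFingerprint,     // role / scope hash
            $r->modelId,                   // includes model version
            (string) $r->temperature,
            (string) $r->maxTokens,        // behaviour parameter (property name assumed)
            $r->systemPromptVersion,
            $r->toolConfigHash,            // hash of the tool configuration (property name assumed)
            hash('sha256', $this->normalise($r->input)),
        ];

        // JSON-encode instead of implode('|') so a '|' inside one field
        // cannot make two different part lists produce the same string.
        return 'llm:' . hash('sha256', json_encode($parts, JSON_THROW_ON_ERROR));
    }

    private function normalise(string $input): string
    {
        // Lowercase and collapse whitespace runs so trivial variants share a key.
        $collapsed = preg_replace('/\s+/', ' ', mb_strtolower($input));

        // preg_replace() returns null on failure; fall back to the raw input.
        return trim($collapsed ?? $input);
    }
}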

The leading v3 is deliberate. A global version segment in the key is the cheapest invalidation lever you own: when you change a prompt in a way that should not reuse old answers, you bump it and the entire cache ages out without a flush. Make the tenant ID a required argument of the CacheableRequest constructor so it is impossible to build a key that omits it. A cross-tenant leak is not a bug you want to discover from a support ticket.
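The post does not show `CacheableRequest`, so the following is only one plausible shape for it, sketched under assumptions: the property names mirror the ones the key builder reads, `maxTokens` and `toolConfigHash` are added because the text lists them as behaviour parameters, and the empty-string checks are an illustrative guard rather than a complete validation policy.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical value object for a cacheable LLM call. Tenant ID and
 * permission fingerprint are required and must be non-empty, so a key
 * cannot be built without them.
 */
final class CacheableRequest
{
    public function __construct(
        public readonly string $tenantId,
        public readonly string $permissionFingerprint,
        public readonly string $modelId,
        public readonly float $temperature,
        public readonly int $maxTokens,
        public readonly string $systemPromptVersion,
        public readonly string $input,
        public readonly string $toolConfigHash = '', // '' when the call uses no tools
    ) {
        if (trim($tenantId) === '') {
            throw new InvalidArgumentException('tenantId must not be empty.');
        }
        if (trim($permissionFingerprint) === '') {
            throw new InvalidArgumentException('permissionFingerprint must not be empty.');
        }
    }
}
```

Because the properties are `readonly` and the constructor rejects an empty tenant, a request that reaches the key builder always carries its tenancy scope.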

Invalidation without losing the savings

Caching is easy. Invalidation is the engineering. The instinct to set a short time-to-live on everything destroys most of the benefit, because a five-minute TTL on a document summary that never changes means you pay for inference again every five minutes for an answer that was already correct. The better model is to invalidate on events, not on a clock.

Tie each cache entry to a version of its underlying data. When a document is reindexed, increment a version counter for that document and include it in the key, so the old entries become unreachable the instant the source changes and are reclaimed lazily without a flush. Use a TTL only as a backstop against entries you forgot to invalidate, not as your primary correctness mechanism, and set it generously, hours or days, for genuinely immutable content. For the rare answer that is correct now but will not be later and has no clean event to hang off, prefer a short TTL plus a stale-while-revalidate pattern: serve the cached answer immediately, refresh it in the background, so the user gets speed and the next user gets freshness.
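The version-counter idea can be sketched in a few lines. This is an illustration under assumptions: `DocumentVersions` and `summaryCacheKey` are hypothetical names, and the counter lives in memory here, whereas in production it would sit in the database or next to the cache store. The stale-while-revalidate path is not covered.

```php
<?php
declare(strict_types=1);

/** Hypothetical per-document version registry (in-memory for the sketch). */
final class DocumentVersions
{
    /** @var array<string, int> */
    private array $versions = [];

    public function current(string $documentId): int
    {
        return $this->versions[$documentId] ?? 1;
    }

    /** Call this from the reindex handler when the source document changes. */
    public function bump(string $documentId): int
    {
        return $this->versions[$documentId] = $this->current($documentId) + 1;
    }
}

/** Builds a key that becomes unreachable as soon as the document version changes. */
function summaryCacheKey(string $tenantId, string $documentId, int $version): string
{
    $parts = [$tenantId, $documentId, $version];

    return 'llm:summary:' . hash('sha256', json_encode($parts, JSON_THROW_ON_ERROR));
}

// Usage: the key is stable until the document is reindexed.
$versions = new DocumentVersions();
$keyBefore = summaryCacheKey('tenant-a', 'doc-42', $versions->current('doc-42'));
$versions->bump('doc-42');
$keyAfter = summaryCacheKey('tenant-a', 'doc-42', $versions->current('doc-42'));
```

After the bump, `$keyAfter` differs from `$keyBefore`, so the old summary is never read again and can be left for the backstop TTL or normal eviction to reclaim.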

Two failures deserve explicit handling. Never cache errors or empty responses, or a single provider hiccup gets amplified into a thousand replayed failures; cache only validated, complete outputs. And protect against the stampede where a popular entry expires and a hundred concurrent requests all miss and all hit the model at once, which you defuse with a short lock so the first miss computes while the rest wait briefly or serve stale. These patterns are standard in HTTP caching and translate directly; teams already running a mature web application usually have the primitives in their stack and only need to point them at the LLM layer.
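The first of those two rules, never storing errors or empty responses, can be sketched as a small get-or-compute wrapper. It is a simplified illustration: `ValidatedResponseCache` is a hypothetical class, the store is an in-memory array with no TTL, and the stampede lock is left out because it depends on the cache store and lock primitive you already run.

```php
<?php
declare(strict_types=1);

/** Hypothetical response cache that only stores validated, complete outputs. */
final class ValidatedResponseCache
{
    /** @var array<string, string> */
    private array $store = [];

    /**
     * @param callable(): string     $compute calls the model; may throw on provider errors
     * @param callable(string): bool $isValid domain check for a complete output
     */
    public function getOrCompute(string $key, callable $compute, callable $isValid): string
    {
        if (isset($this->store[$key])) {
            return $this->store[$key];
        }

        // An exception from the provider propagates to the caller; nothing is stored.
        $output = $compute();

        // Empty or invalid outputs are returned to the caller but never cached.
        if (trim($output) !== '' && $isValid($output)) {
            $this->store[$key] = $output;
        }

        return $output;
    }

    public function has(string $key): bool
    {
        return isset($this->store[$key]);
    }
}
```

The validity check is passed in by the caller because "complete" means something different for a JSON classification than for a free-text summary.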

A pragmatic rollout

Resist caching everything on day one. Start with the calls that are unambiguously safe and expensive: embeddings, summaries of immutable documents, and temperature-zero classification. Instrument hit rate, latency saved, and cost saved per call type before you expand, because a cache with a 4% hit rate is adding complexity for nothing and should be removed, not tuned. Add personalised or time-sensitive caching only once the simple wins are measured and the invalidation story for that data is written down, not improvised.
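Measuring hit rate per call type needs very little code. The sketch below is an assumption about how that could look, with a hypothetical `CacheStats` class that counts in memory; a real deployment would more likely emit these counts to whatever metrics system is already in place, alongside latency and cost saved.

```php
<?php
declare(strict_types=1);

/** Hypothetical in-memory hit/miss counter, grouped by call type. */
final class CacheStats
{
    /** @var array<string, array{hits: int, misses: int}> */
    private array $counts = [];

    public function record(string $callType, bool $hit): void
    {
        $this->counts[$callType] ??= ['hits' => 0, 'misses' => 0];
        $this->counts[$callType][$hit ? 'hits' : 'misses']++;
    }

    /** Returns the hit rate in the range 0.0 to 1.0, or 0.0 if nothing was recorded. */
    public function hitRate(string $callType): float
    {
        $c = $this->counts[$callType] ?? ['hits' => 0, 'misses' => 0];
        $total = $c['hits'] + $c['misses'];

        return $total === 0 ? 0.0 : $c['hits'] / $total;
    }
}

// Usage: one hit out of four lookups for the 'summary' call type.
$stats = new CacheStats();
$stats->record('summary', false);
$stats->record('summary', true);
$stats->record('summary', false);
$stats->record('summary', false);
$summaryHitRate = $stats->hitRate('summary'); // 0.25
```

Keeping the counts per call type is what makes the "remove, do not tune" decision possible: a low overall rate can hide one call type that caches well and another that never hits.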

The teams that get durable value from prompt caching treat it as a correctness feature that happens to save money, not a cost feature that happens to risk correctness. The key construction and the invalidation rules are the product; the cache store is an implementation detail. If you are building or hardening an AI feature and want a second pair of eyes on where caching is safe and where it is quietly serving stale answers, that is exactly the kind of review we do as part of custom software development work. Reach us at hello@wolf-tech.io or at wolf-tech.io, and bring your cache keys; they tell us most of what we need to know.