LLM Observability on a Budget: The Minimal Stack for Debugging AI Features in Production
Your APM tool is telling you everything is fine. P99 latency is healthy, error rate is flat, the database is not sweating. Meanwhile, a customer just opened a ticket because your AI assistant confidently generated a refund amount that was €800 too high — and you have absolutely no idea why.
This is the central problem with LLM observability: the failure modes are semantic, not systemic. Standard monitoring tools were built to detect crashes, slow queries, and memory leaks. They are not built to detect a model that started hallucinating after a prompt template change or a context window that silently truncated the most important part of the user's input. If you ship AI features without a purpose-built LLM observability stack, you are flying blind.
The good news is that you do not need a six-figure Datadog contract or a dedicated ML platform team to get meaningful visibility. This post covers the minimum viable LLM observability stack for a small engineering team — what to capture, how to structure your traces, where to host it affordably, and which alerting rules actually predict user-visible incidents before your support queue fills up.
Why Standard APM Is Not Enough for LLM Features
When a traditional API endpoint breaks, the signal is sharp: an exception is thrown, a status code changes, latency spikes above a threshold. The debugging path is deterministic — stack trace, log line, offending query.
LLM failures do not work that way. A prompt that your application silently truncated to fit the model's context window still returns a 200 OK with a plausible-looking response. A model that started over-refusing user requests after a system prompt edit shows no change in error rate at all. Token costs that tripled because a templating bug started including the entire user history on every call look fine in your APM dashboard until the provider invoice arrives.
What you actually need to observe is:
- The exact prompt payload sent to the model, including the rendered system prompt and any injected context
- The raw model output, before any post-processing
- Which model version, temperature, and parameters were used
- Token counts broken down by input, output, and — if applicable — cached tokens
- Which tenant or user triggered the call, so you can attribute cost and spot abuse
- The full trace of a multi-step agentic chain, including which tool calls were made and in what order
- Output quality signals: did the response parse successfully, did guardrails fire, did the user immediately retry
None of this appears in a conventional APM tool out of the box.
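As a rough illustration, the list above can be collapsed into a single record type that every LLM call fills in. This is a minimal sketch: the class name `LLMCallRecord` and all of its field names are assumptions made for this example, not a schema from Langfuse or OpenTelemetry.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class LLMCallRecord:
    """One captured LLM call. Field names are illustrative, not a standard schema."""
    rendered_prompt: str             # full payload sent to the model, system prompt included
    raw_output: str                  # model output before any post-processing
    model: str                       # exact model version string
    temperature: float
    input_tokens: int
    output_tokens: int
    cached_tokens: int = 0           # stays 0 when the provider reports no cache hits
    tenant_id: Optional[str] = None  # for cost attribution and abuse detection
    trace_id: Optional[str] = None   # links the steps of a multi-step agent run
    tool_calls: list[str] = field(default_factory=list)  # tool names, in call order
    parsed_ok: Optional[bool] = None  # did post-processing succeed?
    guardrail_fired: bool = False
    user_retried: bool = False


record = LLMCallRecord(
    rendered_prompt="You are a refund assistant.\nUser: where is my refund?",
    raw_output='{"refund_eur": 20}',
    model="example-model-2024-01-01",  # placeholder model name
    temperature=0.0,
    input_tokens=18,
    output_tokens=9,
    tenant_id="tenant-a",
    tool_calls=["lookup_order"],
    parsed_ok=True,
)
print(asdict(record)["tool_calls"])
```

Keeping the quality signals (`parsed_ok`, `guardrail_fired`, `user_retried`) on the same record as the payload means a bad output can be traced back to its exact prompt without a join.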
Choosing a Self-Hosted Observability Backend
For small teams, the two strongest options for self-hosted LLM observability are Langfuse and OpenLLMetry.
Langfuse is purpose-built for LLM tracing. It gives you a clean UI for browsing traces, a prompt management system with version history, and a dataset/evaluation workflow for running offline evals against captured traces. The self-hosted version runs as a single Docker Compose stack. Older v2 releases need only PostgreSQL and the application server; v3 adds ClickHouse, Redis, an S3-compatible blob store, and a separate worker container, so check the requirements for the version you deploy and size the machine accordingly. On a small VPS (around €20/month at Hetzner) you should be able to handle tens of thousands of traces per day. For a small engineering team shipping one or two AI features, this is more than sufficient.
OpenLLMetry takes a different approach: it is an OpenTelemetry-based SDK that instruments your LLM calls at the library level (OpenAI SDK, Anthropic SDK, LangChain, and others) and emits standard OTEL spans. This means your LLM traces land in whatever backend you are already using — Jaeger, Grafana Tempo, Honeycomb, or even a self-hosted OpenObserve instance. If you already have an OTEL pipeline, OpenLLMetry is often the lowest-friction path because you are not adding another data store.
If you are starting from scratch and want the richest LLM-specific UI, go with Langfuse. If you have existing OTEL infrastructure and want LLM traces to live alongside your other traces, go with OpenLLMetry.
OpenTelemetry Span Conventions for LLM Calls
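Whether you use Langfuse, OpenLLMetry, or a custom instrumentation layer, your LLM spans should follow the OpenTelemetry GenAI semantic conventions. These conventions are still evolving and attribute names have changed between releases, so pin the version you target and check the names against it. The key attributes to capture on every span: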
gen_ai.system # "openai", "anthropic", "mistral", etc.
gen_ai.request.model # exact model string including version
gen_ai.request.max_tokens
gen_ai.request.temperature
gen_ai.response.model # model actually used (may differ from requested)
gen_ai.usage.input_tokens
gen_ai.usage.output_tokens
gen_ai.usage.cache_read_input_tokens # for Anthropic prompt caching
gen_ai.response.finish_reasons # array, e.g. ["stop"], ["length"], ["tool_calls"], ["content_filter"]
Beyond the standard attributes, you will want to add custom attributes for your own context:
app.tenant_id # for cost attribution per tenant
app.feature_name # which AI feature triggered the call
app.prompt_version # hash or version identifier of the prompt template
app.user_id # if relevant and permitted by your privacy policy
The app.feature_name and app.prompt_version attributes are the ones that will save you the most debugging time. When you see a spike in bad outputs, you can immediately filter by feature and prompt version to determine whether a template change caused the regression.
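One way to make `app.prompt_version` cheap to maintain is to derive it from the template text itself, so that any edit produces a new version without anyone remembering to bump a number. The sketch below assumes that approach; the helper names `prompt_version` and `build_span_attributes` are hypothetical, and the result is a plain dict that you would pass to whatever span API you use.

```python
import hashlib


def prompt_version(template: str) -> str:
    """Short, stable identifier derived from the template text."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def build_span_attributes(*, system, model, temperature, input_tokens,
                          output_tokens, tenant_id, feature_name, template):
    """Return a flat attribute dict suitable for attaching to a span."""
    return {
        "gen_ai.system": system,
        "gen_ai.request.model": model,
        "gen_ai.request.temperature": temperature,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "app.tenant_id": tenant_id,
        "app.feature_name": feature_name,
        "app.prompt_version": prompt_version(template),
    }


attrs = build_span_attributes(
    system="anthropic",
    model="example-model",  # placeholder model name
    temperature=0.2,
    input_tokens=120,
    output_tokens=40,
    tenant_id="tenant-a",
    feature_name="refund_assistant",
    template="You are a refund assistant. Answer in JSON.",
)
print(attrs["app.feature_name"], attrs["app.prompt_version"])
```

Hashing the template rather than the rendered prompt keeps the version stable across users; only a change to the template itself moves it.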
For a Symfony application using the Anthropic PHP SDK or an OpenAI-compatible client, you can add a middleware or event listener that wraps every LLM call in an OTEL span and populates these attributes automatically. This keeps instrumentation out of your business logic and makes it trivially easy to add new LLM calls without remembering to instrument them.
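The same wrapping idea can be sketched in a language-neutral way with a decorator. This is not Symfony or OpenTelemetry code: `RECORDED_SPANS` stands in for a real span exporter, `fake_completion` is a hypothetical client, and the response keys (`model`, `input_tokens`, `output_tokens`) are assumptions about what that client returns.

```python
import functools
import time

RECORDED_SPANS = []  # stand-in for a real span exporter


def traced_llm_call(feature_name):
    """Decorator that records one span-like dict per call to the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            span = {"app.feature_name": feature_name, "error": None}
            try:
                response = func(*args, **kwargs)
                # Assumes the client returns a dict with these keys.
                span["gen_ai.response.model"] = response["model"]
                span["gen_ai.usage.input_tokens"] = response["input_tokens"]
                span["gen_ai.usage.output_tokens"] = response["output_tokens"]
                return response
            except Exception as exc:
                span["error"] = repr(exc)
                raise
            finally:
                # Runs on success and failure, so every call leaves a span behind.
                span["duration_s"] = time.perf_counter() - start
                RECORDED_SPANS.append(span)
        return wrapper
    return decorator


@traced_llm_call("refund_assistant")
def fake_completion(prompt):
    """Hypothetical client call that returns a canned response."""
    return {"model": "example-model", "input_tokens": len(prompt.split()),
            "output_tokens": 3, "text": "ok"}


fake_completion("how much is my refund")
print(RECORDED_SPANS[0]["app.feature_name"])
```

Because the business code only adds a decorator, a new LLM call cannot ship uninstrumented unless someone deliberately bypasses the wrapper.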
Prompt Payload Capture and PII Scrubbing
Capturing prompt payloads is the single highest-value action in your entire observability setup. Without it, debugging a bad output means guessing. With it, you can replay the exact interaction that caused the failure and patch it in isolation.
The problem is that prompts frequently contain personal data. A customer support AI receives messages that include names, order numbers, account details, and sometimes payment information. A document analysis feature processes contracts containing personally identifiable information. Capturing payloads verbatim and shipping them to a logging backend without appropriate controls creates a GDPR compliance problem.
The right pattern is to scrub at the boundary, before the payload leaves your application:
First, run a PII detection pass over the prompt before attaching it to the span. For structured fields (email addresses, phone numbers, credit card patterns), regex-based detection is fast and reliable. For unstructured text, a lightweight local named-entity recognition model, such as the one that ships with spaCy, can catch names and addresses. Treat that pass as best-effort: statistical recognizers miss some entities, so it reduces exposure rather than guaranteeing a clean payload.
Second, replace detected PII with typed placeholders: [EMAIL], [PHONE], [PERSON_NAME]. This preserves the semantic structure of the prompt — which is what you need for debugging — without storing the actual values.
Third, store the raw payload separately, encrypted at rest, with a short retention period (7–30 days), accessible only to engineering with an explicit audit log. This is your break-glass option for incidents that require the real content.
This two-tier approach gives you meaningful observability data in your trace UI while keeping your long-term storage clean and compliant.
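The structured-field part of the scrubbing step might look like the sketch below. The patterns are deliberately simple assumptions, not production-grade validators: they will miss unusual formats and can mislabel long digit runs, and they do not attempt names or addresses, which need the entity-recognition pass described above.

```python
import re

# Order matters: card numbers are replaced before phone numbers so a 16-digit
# run is labelled as a card instead of being consumed by the phone pattern.
PII_PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("CARD", re.compile(r"\b\d(?:[ -]?\d){12,18}\b")),   # 13 to 19 digits
    ("PHONE", re.compile(r"\+?\d[\d ()-]{7,}\d")),
]


def scrub(text: str) -> str:
    """Replace each detected value with a typed placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


message = ("Contact jane.doe@example.com or +49 170 1234567 "
           "about card 4111 1111 1111 1111.")
print(scrub(message))
```

The placeholders keep the sentence shape intact, which is usually enough to see why the model answered the way it did.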
Cost Attribution Per Tenant
Token cost attribution is frequently the second thing teams wish they had instrumented from the start (the first being prompt payload capture). Without it, you know your AI costs are €3,000 per month, but you have no idea whether that is driven by three power users hammering your assistant feature, a single tenant running automated bulk requests, or your own internal staging environment leaking into production traffic.
The mechanics are straightforward. On every LLM span, record the tenant ID. In a post-processing step — either a Langfuse annotation or a custom aggregation query — multiply token counts by the model's per-token price and sum by tenant. Expose this as a metric in your internal dashboards.
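The aggregation itself is a few lines. In this sketch the model names and prices in `PRICE_PER_MILLION` are invented placeholders, the span dict keys are assumptions, and cached-token discounts are ignored for brevity.

```python
from collections import defaultdict

# Hypothetical prices in EUR per 1M tokens; substitute your provider's current rates.
PRICE_PER_MILLION = {
    "example-small": {"input": 0.50, "output": 1.50},
    "example-large": {"input": 5.00, "output": 15.00},
}


def span_cost(span):
    """Cost of one LLM call: tokens times the per-token price for its model."""
    price = PRICE_PER_MILLION[span["model"]]
    return (span["input_tokens"] * price["input"]
            + span["output_tokens"] * price["output"]) / 1_000_000


def cost_by_tenant(spans):
    """Sum span costs per tenant ID."""
    totals = defaultdict(float)
    for span in spans:
        totals[span["tenant_id"]] += span_cost(span)
    return dict(totals)


spans = [
    {"tenant_id": "tenant-a", "model": "example-large",
     "input_tokens": 200_000, "output_tokens": 20_000},
    {"tenant_id": "tenant-a", "model": "example-small",
     "input_tokens": 100_000, "output_tokens": 0},
    {"tenant_id": "tenant-b", "model": "example-small",
     "input_tokens": 1_000_000, "output_tokens": 100_000},
]
totals = cost_by_tenant(spans)
print({tenant: round(cost, 2) for tenant, cost in totals.items()})
```

With the placeholder prices, tenant-a comes to about €1.35 and tenant-b to about €0.65, even though tenant-b sent far more tokens: the model choice dominates.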
For Symfony applications, the cleanest implementation is a request-scoped service that holds the current tenant context. Your LLM middleware reads from this service and stamps every span automatically. If you are working on a multi-tenant SaaS platform, you can find more detail on tenant isolation patterns in our guide to multi-tenant SaaS architecture.
Two threshold alerts worth setting immediately: alert when a single tenant's daily spend exceeds 3× their 7-day average, and alert when total daily spend exceeds your monthly budget divided by 28. Note that dividing by 28 rather than 30 puts the second threshold about 7% above an even daily burn in a 30-day month. The first alert catches abuse or runaway automation; the second catches billing surprises before they become CFO conversations.
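Both rules reduce to one comparison each. The function names below are hypothetical, and the choice to stay silent for tenants with a zero baseline is an assumption you may want to reverse for brand-new accounts.

```python
from statistics import mean


def tenant_spend_alert(today, previous_7_days, multiplier=3.0):
    """True when today's spend exceeds `multiplier` times the trailing 7-day mean."""
    baseline = mean(previous_7_days)
    # A zero baseline (new or idle tenant) never fires in this sketch.
    return baseline > 0 and today > multiplier * baseline


def total_spend_alert(today_total, monthly_budget, divisor=28):
    """True when one day's total spend exceeds the monthly budget divided by `divisor`."""
    return today_total > monthly_budget / divisor


history = [2.0, 3.0, 2.0, 3.0, 2.0, 3.0, 2.5]   # mean is 2.5, so the threshold is 7.5
print(tenant_spend_alert(10.0, history))         # above 7.5
print(total_spend_alert(110.0, 3000.0))          # 3000 / 28 is roughly 107.14
```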
Tracing Multi-Step Agentic Chains
Single LLM calls are relatively simple to trace. Agentic features — where the model decides which tools to call, interprets the results, and calls additional tools or models — require nested trace structures to be understandable.
The OpenTelemetry model maps well here: a root span for the entire agent invocation, child spans for each LLM call, and further child spans for each tool call the model triggers. This gives you a complete picture of what the agent decided to do and in what order, which is essential for debugging failures in multi-step workflows.
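To make the nesting concrete, here is a toy span tree for one agent run, rendered as indented text. The `Span` class is a stand-in written for this example and is not the OpenTelemetry SDK type; the span names are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    kind: str                                   # "agent", "llm", or "tool"
    children: list["Span"] = field(default_factory=list)

    def child(self, name, kind):
        """Create a child span, attach it, and return it."""
        span = Span(name, kind)
        self.children.append(span)
        return span


def render(span, depth=0):
    """Return the tree as indented lines, children in call order."""
    lines = ["  " * depth + f"{span.kind}: {span.name}"]
    for c in span.children:
        lines.extend(render(c, depth + 1))
    return lines


root = Span("answer_refund_question", "agent")   # root span: whole agent invocation
plan = root.child("plan", "llm")                 # first LLM call
plan.child("lookup_order", "tool")               # tool call triggered by that LLM call
root.child("compose_answer", "llm")              # second LLM call
print("\n".join(render(root)))
```

Reading the rendered tree top to bottom gives the order of decisions, which is usually the first thing you need when an agent run goes wrong.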
The attributes that matter most at the tool call level are the tool name, the tool call ID that the model's response assigns to the call (so you can correlate the request with its result), the size in tokens of the result returned to the model, and how long the tool call took. Tool result size is particularly important. If your agent is calling a retrieval tool and the retrieved document is 8,000 tokens, that content goes back into the context window and is priced as input tokens on the next LLM call, and on every later call that keeps it in the conversation history. Tracking result size per tool call lets you spot context bloat before it drives your costs up and your output quality down.
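A small aggregation over tool spans is enough to surface the worst offenders. The dict keys and the 4,000-token threshold below are assumptions chosen for illustration.

```python
from collections import defaultdict


def result_tokens_by_tool(tool_spans):
    """Total tokens returned to the model, grouped by tool name."""
    totals = defaultdict(int)
    for span in tool_spans:
        totals[span["tool_name"]] += span["result_tokens"]
    return dict(totals)


def oversized_results(tool_spans, max_tokens=4000):
    """IDs of tool calls whose result exceeded `max_tokens`."""
    return [s["tool_call_id"] for s in tool_spans if s["result_tokens"] > max_tokens]


tool_spans = [
    {"tool_call_id": "call_1", "tool_name": "search_docs",
     "result_tokens": 8000, "duration_ms": 420},
    {"tool_call_id": "call_2", "tool_name": "lookup_order",
     "result_tokens": 150, "duration_ms": 35},
    {"tool_call_id": "call_3", "tool_name": "search_docs",
     "result_tokens": 2500, "duration_ms": 390},
]
print(result_tokens_by_tool(tool_spans))
print(oversized_results(tool_spans))
```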
Alerting Rules That Actually Predict Incidents
Most teams start with obvious alerts: model API errors above a threshold, LLM call latency above some percentile. These are table stakes but they catch only the most obvious failures. The alerts that actually prevent user-visible incidents are subtler.
Finish reason drift: Alert when the proportion of calls finishing with length (truncated output) increases significantly compared to the 7-day baseline for that feature. Truncation means the model hit its output token limit, or ran out of remaining context, before completing its response. It is a silent quality failure, and a rising rate often indicates that your context is growing unbounded due to a bug in conversation history management.
Content filter rate: Alert when the proportion of calls returning a content_filter finish reason increases. A spike here usually means a prompt template change inadvertently triggered the model's safety classifier, which users experience as the feature silently refusing to help.
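Finish reason drift and content filter rate are the same check with a different reason string. In the sketch below, "significantly" is pinned to an assumed 2× increase over the baseline rate, with a minimum sample size and a rate floor so that tiny baselines do not fire constantly; tune all three to your traffic.

```python
def reason_rate(finish_reasons, reason):
    """Fraction of calls that ended with the given finish reason."""
    if not finish_reasons:
        return 0.0
    return finish_reasons.count(reason) / len(finish_reasons)


def reason_drift(today, baseline, reason, min_calls=50, ratio=2.0, floor=0.01):
    """True when today's rate for `reason` exceeds `ratio` times the baseline rate."""
    if len(today) < min_calls:
        return False  # too few calls to say anything meaningful
    base = max(reason_rate(baseline, reason), floor)
    return reason_rate(today, reason) > ratio * base


baseline = ["stop"] * 980 + ["length"] * 20   # 2% truncated over the past 7 days
today = ["stop"] * 90 + ["length"] * 10       # 10% truncated today
print(reason_drift(today, baseline, "length"))
print(reason_drift(today, baseline, "content_filter"))
```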
Output parse failure rate: If your code expects structured output (JSON mode, tool use, or regex-extracted fields) and post-processing fails, log a span event for each failure. Alert when the parse failure rate for a given feature exceeds 5%. This is almost always a signal that the model's output format drifted, which often happens after a model version change.
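For JSON output, the failure signal can come straight from the parser. This sketch assumes the feature expects the raw model output to be a bare JSON document; the 5% threshold is the one named above.

```python
import json


def try_parse(raw_output):
    """Return (parsed_value, ok). ok is False when the output is not valid JSON."""
    try:
        return json.loads(raw_output), True
    except json.JSONDecodeError:
        return None, False


def parse_failure_rate(raw_outputs):
    """Fraction of outputs that failed to parse."""
    if not raw_outputs:
        return 0.0
    failures = sum(1 for raw in raw_outputs if not try_parse(raw)[1])
    return failures / len(raw_outputs)


outputs = (['{"refund_eur": 20}'] * 18
           + ['Sure! Here is the JSON: {"refund_eur": 20}'] * 2)
rate = parse_failure_rate(outputs)
print(rate > 0.05)   # 2 of 20 failed, so the alert condition holds
```

The second output style, valid JSON wrapped in a friendly sentence, is a typical example of format drift that returns 200 OK and still breaks the feature.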
Per-tenant call volume anomalies: A 3× spike over the 7-day average for a single tenant is a reliable abuse or runaway-automation indicator.
Token efficiency ratio: Track the ratio of output tokens to input tokens over time per feature. A sudden drop in this ratio means your prompts grew without a corresponding increase in output length, which is a leading indicator of over-stuffed context windows.
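The ratio is easiest to compute over aggregated token totals rather than per call, so that a few very short calls do not skew it. The 50% drop threshold here is an assumption, as are the span dict keys.

```python
def efficiency_ratio(spans):
    """Total output tokens divided by total input tokens."""
    total_in = sum(s["input_tokens"] for s in spans)
    total_out = sum(s["output_tokens"] for s in spans)
    return total_out / total_in if total_in else 0.0


def ratio_dropped(today_spans, baseline_spans, threshold_fraction=0.5):
    """True when today's ratio is below `threshold_fraction` of the baseline ratio."""
    baseline = efficiency_ratio(baseline_spans)
    return baseline > 0 and efficiency_ratio(today_spans) < baseline * threshold_fraction


baseline_spans = [{"input_tokens": 1000, "output_tokens": 200}]   # ratio 0.2
today_spans = [{"input_tokens": 4000, "output_tokens": 220}]      # ratio 0.055
print(ratio_dropped(today_spans, baseline_spans))
```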
Putting It Together: A Practical Starting Point
For a Symfony or Next.js application shipping its first AI features, here is the order of operations that delivers the most value fastest.
Start with Langfuse self-hosted on a small VPS. Get prompt payload capture working in your staging environment first — you need to verify your PII scrubbing before it touches production data. Add the tenant ID and feature name attributes to every span. Set up the finish reason drift and output parse failure alerts. This is 80% of the value in two or three days of work.
Once that baseline is running, add cost attribution dashboards. This usually surfaces a surprise within the first week — a feature that looked inexpensive in testing turns out to be significantly more expensive under real usage patterns, often because production inputs are longer or more varied than your test data.
The final layer is output quality signals: guardrail fire rates, user retry rates (a user resubmitting the same question is a strong implicit quality signal), and any explicit feedback mechanisms you have in your UI. These take longer to calibrate but become your primary input for deciding when to switch model versions, revise prompt templates, or invest in fine-tuning.
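User retries can be approximated without any UI changes by looking for the same user sending the same normalized text twice in a short window. This is a crude sketch: the two-minute window and the exact-match comparison are assumptions, and paraphrased retries will slip through.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial edits still count as a repeat."""
    return " ".join(text.lower().split())


def count_retries(messages, window_s=120):
    """Count immediate resubmissions.

    `messages` is a list of (user_id, timestamp_seconds, text), sorted by timestamp.
    """
    last_seen = {}   # user_id -> (timestamp, normalized text) of their previous message
    retries = 0
    for user_id, ts, text in messages:
        key = normalize(text)
        prev = last_seen.get(user_id)
        if prev and prev[1] == key and ts - prev[0] <= window_s:
            retries += 1
        last_seen[user_id] = (ts, key)
    return retries


messages = [
    ("u1", 0, "Where is my refund?"),
    ("u1", 30, "where is my  refund?"),   # same question 30 seconds later: a retry
    ("u2", 40, "Hello"),
    ("u1", 400, "where is my refund?"),   # outside the window: not counted
]
print(count_retries(messages))
```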
Getting the Stack Right From the Start
Building an LLM observability stack is not optional if you are shipping AI features to paying customers. The failure modes are too subtle and the debugging experience without it is too painful. The minimum viable version — Langfuse self-hosted, OpenTelemetry spans with proper attributes, PII-scrubbed payload capture, and a handful of anomaly alerts — is achievable in a focused sprint and costs almost nothing to run.
If you are integrating AI features into an existing Symfony or Next.js application and want to get the observability layer right before you go to production, get in touch at hello@wolf-tech.io or visit wolf-tech.io. Getting instrumentation right from the start is dramatically cheaper than reverse-engineering it after a production incident.

