Structured LLM Outputs in Production: JSON Mode, Tool Use, and Validation Patterns

Sandor Farkas - Founder & Lead Developer at Wolf-Tech

Structured LLM outputs are where most production AI features quietly break. A B2B onboarding flow I reviewed this spring used an LLM to extract company information from free text and write it directly into a typed database row. It worked beautifully in staging. Twelve hours after launch, the operations team found 4% of new accounts in a half-broken state — missing fields, integers parsed as strings, an industry column that occasionally contained the word "various" wrapped in two layers of unmatched quotes. The model had not regressed. The traffic shape had changed, and the application was treating LLM output the way it treated a typed RPC response.

Reliable structured output is the load-bearing wall behind every serious LLM feature in 2026: data extraction, classification, function calling, agent tool use, form autocompletion, RAG citations, every workflow where the response leaves the model and goes straight into application code. Most teams ship the happy path, discover the failure modes the hard way, and bolt on validation under pressure. The patterns that hold up are not exotic — they are the same defensive engineering that keeps any untrusted input from corrupting state. The trick is treating the model as untrusted input, even after the provider tells you JSON mode is "guaranteed".

This post is the production playbook: when to use JSON mode versus tool use, how to design schemas that the model can actually fill, the retry and validation patterns that recover gracefully, how to stream partial outputs without losing safety, and the observability that lets you catch quality regressions before customers do.

JSON mode versus tool use: pick the right primitive

Every major LLM provider offers two distinct ways to coerce a model into structured output, and they are not interchangeable. Picking the wrong one is the most common architectural mistake in early LLM features.

JSON mode (also called "JSON schema mode" or "structured outputs") instructs the model to emit a single JSON object that conforms to a schema. The model produces one response. You parse and validate it. Use this when the task is fundamentally a transformation — extracting fields from text, classifying into a fixed taxonomy, summarising into a structured report, generating typed configuration. The shape of the output is known in advance and there is exactly one of it.

Tool use (also called "function calling") presents the model with one or more named functions, each with a typed parameter schema, and the model decides which to call and with what arguments. Use this when the task is fundamentally an action — the LLM is a controller deciding what to do next. Tool use is also the right primitive when the response is a list of operations rather than a single payload, when the model needs to refuse to act ("I do not have enough information"), or when the agent might need multiple turns.

The mistake most teams make is using tool use for what is really an extraction problem. They define a single function called extractCompany and force the model to "call" it. The result is more tokens, more latency, and a more brittle integration than JSON mode would have given them. The opposite mistake — packing branching logic into a JSON schema with a command discriminator field instead of giving the model real tools — is just as common and even harder to debug.

A simple rule that holds up: if the answer is "produce one shape", use JSON mode. If the answer is "decide what to do, possibly more than once", use tool use. The cost of getting this right at design time is zero; the cost of changing it after launch is a refactor through every consumer of the response.
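
For concreteness, here are the two call shapes side by side. This is a sketch against the OpenAI chat-completions API; other providers expose the same two primitives under different names, and companyJsonSchema, orderLookupJsonSchema, and freeTextAboutACompany are placeholders.

import OpenAI from 'openai'

const client = new OpenAI()

// JSON mode: one known shape, one payload back.
const extraction = await client.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [{ role: 'user', content: freeTextAboutACompany }],
  response_format: {
    type: 'json_schema',
    json_schema: { name: 'company_extraction', schema: companyJsonSchema, strict: true },
  },
})

// Tool use: the model decides whether to act, and with what arguments.
const action = await client.chat.completions.create({
  model: 'gpt-4o-mini',
  messages: [{ role: 'user', content: 'Where is order #4912?' }],
  tools: [
    { type: 'function', function: { name: 'lookupOrder', parameters: orderLookupJsonSchema } },
  ],
})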

Design schemas that the model can actually fill

Schema quality is the single biggest predictor of structured-output reliability, and it has very little to do with what the schema validator accepts. The model fills schemas in roughly the order it sees the fields, and it is biased toward producing something for every required field, even when the input does not support a confident answer. Schemas designed without that constraint in mind generate hallucinated content under load.

Four design rules earn their keep.

Make optionality explicit. Every field that might be absent in the input should be nullable or have a documented default, with the description telling the model when to use it. A required industry field with no null option will be filled — sometimes with "unknown", sometimes with a hallucinated guess, sometimes with the word "various". A nullable industry with the description "the company's primary industry; null if not stated in the input" produces clean nulls about 90% of the time on real data.

Constrain enums, do not suggest them. A field whose description says "one of: B2B, B2C, marketplace, internal" will receive any of those four — and "B2B/B2C", "marketplace (B2B)", and "Saas (B2B)" on a meaningful percentage of inputs. A field whose schema is enum: [B2B, B2C, marketplace, internal] produces exactly one of the four. Use the validator, not the description, to enforce closed sets.

Order fields like a form. Models perform better when later fields can reference earlier ones. Put the discriminating fields first (entity type, language, primary key) and the dependent fields after them. A schema that asks for address before country produces worse address parsing than one that asks for country first.

Add a confidence escape hatch. Add a notes or confidence field, even if your application does not need it. The model uses it as a relief valve for ambiguous inputs and produces cleaner data in the structured fields as a result. Log the field. Throw it away if you must, but include it.

A worked example for the company-extraction case:

// Zod schema, converted to JSON Schema (e.g. with zod-to-json-schema) for the provider call.
import { z } from 'zod'

const CompanyExtraction = z.object({
  legal_name: z.string().min(1),
  trading_name: z.string().nullable(),
  country: z.enum(['DE', 'AT', 'CH', 'FR', 'NL', 'GB', 'US', 'OTHER']),
  industry: z.enum([
    'fintech', 'healthtech', 'ecommerce', 'real_estate',
    'education', 'travel', 'other',
  ]),
  employee_band: z.enum(['1-10', '11-50', '51-200', '201-1000', '1000+']).nullable(),
  source_quote: z.string().describe(
    'Verbatim sentence from the input that supports this extraction.'
  ),
  confidence: z.enum(['high', 'medium', 'low']),
  notes: z.string().nullable(),
})

The source_quote field is doing real work. Forcing the model to cite the input sentence it used reduces hallucination measurably, because the model has to commit to an answer that is checkable against the source. Validation at the application layer can compare the quote to the input and reject the response if the quote does not appear, catching a class of hallucinations that no schema validator alone will catch.

Validation, retries, and the safety net

Even with a clean schema and JSON mode enabled, expect a malformed-output rate between 0.1% and 5% in production, depending on the model, the prompt, and the input distribution. The job of the application is to make that rate invisible to the rest of the system.

The pattern that holds up is a three-layer net.

async function extract(input: string): Promise<Company> {
  // Feedback from a failed attempt is folded into the next prompt so the model can self-correct.
  let feedback: string | null = null

  for (let attempt = 0; attempt < 3; attempt++) {
    const raw = await llm.json(prompt(input, feedback), CompanyJsonSchema)

    // Layer one: the JSON parses and conforms to the declared types.
    const parsed = CompanyExtraction.safeParse(raw)
    if (!parsed.success) {
      logger.warn('schema_validation_failed', { attempt, issues: parsed.error.issues })
      feedback = parsed.error.issues
        .map((i) => `${i.path.join('.')}: ${i.message}`)
        .join('; ')
      continue
    }

    // Layer two: semantic checks the schema cannot express.
    if (!parsed.data.source_quote || !input.includes(parsed.data.source_quote)) {
      logger.warn('source_quote_check_failed', { attempt })
      feedback = 'source_quote must appear verbatim in the input.'
      continue
    }

    return toDomain(parsed.data)
  }

  // Layer three: the retry bound is exhausted; fail with a typed exception.
  throw new ExtractionFailed(input)
}

Layer one is schema validation — the JSON parses and conforms to the declared types. Layer two is semantic validation — application-specific checks that the schema cannot express, like the source-quote check above, business rules ("an enterprise plan implies more than 50 employees"), or cross-field constraints. Layer three is bounded retry — at most two or three attempts, with the failed output included in the retry prompt so the model can self-correct. After the bound, the request fails cleanly with a typed exception that the calling code knows how to handle.

Two things matter beyond the structure. The retry prompt should be specific — "the previous response failed validation: industry was 'fintech-saas' which is not in the allowed enum. Return only one of: fintech, healthtech, ecommerce, …" — because vague retries waste tokens and rarely succeed. And the failure path must be a real product surface, not a 500. A graceful fallback that says "we could not extract this automatically, please confirm the fields manually" is a better customer experience than an opaque error and a stuck workflow. A serious code quality audit of an LLM feature usually finds at least one place where the failure path is missing or wrong.

Streaming partial outputs without losing safety

Streaming is where structured-output safety usually breaks. The user wants to see the response appearing as it is generated; the application wants to validate the response before acting on it. Both are reasonable; reconciling them takes some care.

The pattern that works is render streaming, validate on completion. The frontend receives the streamed tokens and renders a progressive view (typewriter effect, partial JSON pretty-printer, skeleton form filling out field by field). The backend buffers the entire response, validates it once, and only then commits any side effect — database write, downstream API call, payment authorisation. The user sees the streaming UI; the system never acts on a half-finished response.
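
A minimal backend sketch of that pattern; the transport and persistence helpers (modelStream, sse, db) stand in for whatever the application actually uses:

let buffer = ''
for await (const chunk of modelStream) {
  buffer += chunk        // keep the full response for validation
  sse.send(chunk)        // forward raw tokens to the client, for display only
}

// Validate once, on the complete response, before any side effect.
const parsed = CompanyExtraction.safeParse(JSON.parse(buffer))
if (!parsed.success) throw new ExtractionFailed(buffer)
await db.companies.insert(toDomain(parsed.data))   // the commit happens here, never earlier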

For the JSON-shaped case specifically, partial parsing libraries (partial-json, jsonrepair, streaming-json-parser) let the frontend render an incomplete object as it streams. They are display-only — never feed their output into business logic. The validated, completed response from the backend is the source of truth for state changes.
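
On the frontend, a display-only progressive render might look like the sketch below, assuming the npm partial-json package's parse helper; tokenStream and render are placeholders:

import { parse } from 'partial-json'

let buffer = ''
for await (const chunk of tokenStream) {
  buffer += chunk
  try {
    render(parse(buffer))  // tolerant parse of the incomplete object, UI only
  } catch {
    // not enough text yet to render anything; wait for the next chunk
  }
}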

Tool use streams differently. The provider streams the function name first, then the arguments. A reasonable pattern is to render an "assistant is calling lookupOrder…" indicator as soon as the name arrives, then render the result of the actual tool execution only after the arguments have arrived, been parsed, and been validated. The user sees responsiveness; the system sees only fully-formed, validated tool calls.
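
A sketch of that handling against an OpenAI-style streaming response, where tool-call deltas carry the function name before the argument tokens; showIndicator, executeTool, and ArgsSchema are placeholders:

let name = ''
let args = ''
for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.tool_calls?.[0]
  if (!delta) continue
  if (delta.function?.name) {
    name = delta.function.name
    showIndicator(`calling ${name}…`)    // responsiveness: shown immediately
  }
  if (delta.function?.arguments) {
    args += delta.function.arguments     // buffered, never executed partially
  }
}

// Only a fully-formed, validated call reaches the tool itself.
const parsedArgs = ArgsSchema.safeParse(JSON.parse(args))
if (parsedArgs.success) await executeTool(name, parsedArgs.data)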

Observability you will actually use

Structured-output reliability silently regresses when prompts change, when models are updated by the provider, or when input distribution drifts. The teams that catch regressions before customers do are the ones treating output validation as an observable signal.

Three metrics, per feature and per model, cover most of what oncall needs. First-attempt validation rate — the share of responses that pass schema and semantic validation on the first try. A drop here usually means a prompt change went wrong. Final success rate — the share of requests that ultimately succeed within the retry bound. A drop here means the retry prompt is no longer recovering. Field-level null rate — the share of responses where each optional field came back null. A spike in nulls on a field that used to be populated often points to model drift or a prompt regression.

Plus one product metric — manual correction rate — the share of extractions a human reviewer overrides. That is the ground truth on quality, and the only number that matters for telling product whether the AI feature is actually doing its job.
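
Emitting these is a few lines wherever the validation net already runs; a sketch assuming a StatsD-style counter client (metrics, feature, and model are placeholders):

metrics.increment('extract.first_attempt_ok', { feature, model })        // passed on attempt 0
metrics.increment('extract.final_ok', { feature, model })                // succeeded within the retry bound
metrics.increment('extract.field_null', { feature, field: 'industry' })  // one counter per optional field
metrics.increment('extract.manual_correction', { feature })              // a reviewer overrode the extraction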

A small evaluation set in CI does the rest. Two hundred labelled examples covering the failure modes you have already seen, run on every prompt change, with a hard floor on first-attempt validation rate that blocks merges. That kind of guard is what keeps a production AI feature from regressing one prompt rewrite at a time.
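
A sketch of that gate, assuming vitest and a labelled fixture file; the 0.97 floor is an example, not a recommendation:

import { it, expect } from 'vitest'
import cases from './eval-cases.json'   // labelled examples: [{ input: string, … }]

it('first-attempt validation rate stays above the floor', async () => {
  let passed = 0
  for (const c of cases) {
    const raw = await llm.json(prompt(c.input, null), CompanyJsonSchema)
    if (CompanyExtraction.safeParse(raw).success) passed++
  }
  expect(passed / cases.length).toBeGreaterThanOrEqual(0.97)  // hard floor; a drop blocks the merge
})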

What to ship first

For a team that just discovered last week's onboarding flow has an extraction-failure problem, the work sequences cleanly. Pick the right primitive — JSON mode for transformations, tool use for actions. Tighten the schema with explicit nulls, real enums, sensible field order, a source_quote and a confidence. Add the three-layer validation net with bounded retries and a graceful fallback path. Buffer streamed responses on the backend; render incrementally on the frontend; never commit on partial output. Instrument first-attempt and final success rates; run a small eval set in CI. Each piece compounds the previous, and none of them require a model change to deliver.

The strategic frame is the same one that applies to every other surface where untrusted input meets typed application code: validate at the boundary, fail loudly, recover gracefully, and never assume the data is what the contract says it is. The teams that ship reliable LLM features in 2026 are the ones treating model output that way from the first commit, not after the first incident.

If you are integrating LLM features into a Symfony or Next.js product and want a second set of eyes on the schema design, validation strategy, or observability before things get to production, that is exactly the kind of work our custom software development practice is built for. Contact us at hello@wolf-tech.io or visit wolf-tech.io — eighteen years of European software work, including substantial AI-integration engagements across PHP and TypeScript stacks, sits behind every recommendation we make.