Structured Output Engineering
- Choose the correct structured output mechanism for Claude, OpenAI, and Gemini based on failure rate requirements and schema complexity
- Configure Claude tool use with forced tool choice to prevent natural language fallback
- Implement constrained decoding for self-hosted models to achieve near-100% schema compliance
- Design a three-tier error recovery pipeline (retry, schema repair, prose fallback) for production structured output systems
Why Structured Output Is a Distinct Engineering Problem
Asking a model to "respond in JSON" is not structured output engineering. It is wishful thinking. Models trained to produce natural language do not have a built-in JSON mode — they generate tokens probabilistically, and unless something constrains those tokens to valid JSON, they will occasionally produce JSON that is almost right but not parseable. In a pipeline that expects clean structured data, "almost right" is a failure.
Structured output engineering is the discipline of guaranteeing that model outputs conform to a schema — reliably enough to build production systems on. The state of this field in 2025 is excellent: all major providers now offer mechanisms that achieve <0.3% structural failure rates, and constrained decoding can achieve <0.1% for self-hosted models. The challenge is knowing which mechanism to use, on which platform, for which use case.
Provider Comparison
The three major providers use different approaches, with different failure rates and trade-offs:
OpenAI (Structured Outputs, strict JSON mode): Available on gpt-4o-2024-08-06 and later. Using the response_format parameter with type: "json_schema" and strict: true, the model is constrained at the token level to only output tokens that produce valid JSON matching your schema. This is grammar-constrained generation — the model physically cannot generate malformed JSON or tokens that violate the schema. Failure rate: <0.1%. Best for: APIs and data pipelines where 100% format compliance is non-negotiable.
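A minimal sketch of what such a request payload looks like, assuming an illustrative product-extraction schema (the field names here are examples, not part of any provider contract). The dict maps directly onto the keyword arguments of `client.chat.completions.create(...)` in the OpenAI Python SDK:

```python
import json

# Sketch of an OpenAI Structured Outputs request (strict JSON mode).
# Schema and prompt are illustrative; note that strict mode requires
# "additionalProperties": false on every object in the schema.
request = {
    "model": "gpt-4o-2024-08-06",
    "messages": [
        {"role": "user", "content": "Extract product info: USB-C cable, $9.99"}
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "product_info",
            "strict": True,  # enables grammar-constrained token generation
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "number"},
                },
                "required": ["name", "price"],
                "additionalProperties": False,
            },
        },
    },
}

# The payload is plain JSON-serializable data:
assert json.dumps(request)
```

With `strict: true`, the decoded `message.content` is guaranteed to parse as JSON matching this schema, so downstream code can call `json.loads` without a try/except for this case.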
Anthropic / Claude (Tool Use): Claude does not have a native JSON mode — its structured output mechanism is tool use (function calling). You define a tool with an input_schema in JSON Schema format, and Claude returns structured data as a tool_use block. Failure rate: <0.2% with Claude 3.5+. Key difference from OpenAI: Claude's tool use validates structure but does not enforce the same hard constraints as strict JSON mode. The upside: Claude's tool use is more flexible for complex schemas with conditional fields and nested structures.
Gemini (responseSchema): The response_schema parameter in the generation config accepts an OpenAPI-compatible schema. Gemini's structured output implementation is generally reliable (<0.3% failure rate) and works well for complex nested structures. Particularly useful when combined with Gemini's real-time search capability — you can extract structured data from current web information in a single call.
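For comparison, a sketch of the equivalent Gemini generation config, using REST-style field names (`responseMimeType`, `responseSchema`); the schema content is the same illustrative product example, and the uppercase type names follow Gemini's OpenAPI-derived type enum. In the Python SDK the same settings are passed via the `generation_config` argument:

```python
# Sketch of a Gemini generationConfig block (REST-style field names).
# The schema is illustrative, not a required shape.
generation_config = {
    "responseMimeType": "application/json",
    "responseSchema": {
        "type": "OBJECT",
        "properties": {
            "name": {"type": "STRING"},
            "price": {"type": "NUMBER"},
            "category": {
                "type": "STRING",
                "enum": ["electronics", "clothing", "food"],
            },
        },
        "required": ["name", "price", "category"],
    },
}
```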
Working With Claude Tool Use
Claude's tool use pattern deserves detailed attention because it is the primary structured output mechanism for Claude and differs meaningfully from OpenAI's function calling.
The key structural difference: Claude uses input_schema where OpenAI uses parameters. Claude's tool use blocks include a tool_use_id that must be echoed back in a subsequent tool result. A minimal Claude structured output call:
tools: [{
  name: "extract_product_info",
  description: "Extract product details from a description",
  input_schema: {
    type: "object",
    properties: {
      name: { type: "string" },
      price: { type: "number" },
      category: { type: "string", enum: ["electronics", "clothing", "food"] }
    },
    required: ["name", "price", "category"]
  }
}],
tool_choice: { type: "tool", name: "extract_product_info" }
The tool_choice parameter forces Claude to use the specific tool rather than deciding whether to call it. Without this, Claude may choose to respond in natural language instead of calling the tool — a common failure mode when not explicitly forced.
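On the receiving side, the structured data lives in a `tool_use` block inside the response's content list. The sketch below works on dict-shaped blocks for illustration; the actual Anthropic SDK returns typed objects with `.type`, `.name`, and `.input` attributes, so a real pipeline would adapt the attribute access accordingly:

```python
def extract_tool_input(content_blocks, tool_name):
    """Return the input of the first tool_use block for the given tool.

    content_blocks is a list of dicts here for illustration; the anthropic
    SDK returns typed content-block objects instead of dicts.
    """
    for block in content_blocks:
        if block.get("type") == "tool_use" and block.get("name") == tool_name:
            return block["input"]
    raise ValueError(f"no tool_use block for {tool_name!r} in response")

# Simulated content list, shaped like a Claude response to a forced tool call.
blocks = [
    {"type": "tool_use", "id": "toolu_01", "name": "extract_product_info",
     "input": {"name": "USB-C cable", "price": 9.99, "category": "electronics"}},
]
info = extract_tool_input(blocks, "extract_product_info")
```

Note that with a forced `tool_choice`, the response should always contain such a block, but defensive code (the `raise` above) still pays off when the call is not forced.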
Constrained Decoding for Self-Hosted Models
For self-hosted models (Llama, Mistral, Mixtral) deployed via vLLM, Ollama, or llama.cpp, constrained decoding is available through libraries like Guidance, Outlines, and lm-format-enforcer. These tools work by maintaining a grammar state machine alongside the model's generation — at each token step, they mask all tokens that would produce invalid output, forcing the model to only generate tokens that keep the output schema-valid.
The result is 100% schema compliance in theory — the model literally cannot generate invalid output. Modern implementations (including IterGen, ICLR 2025) have eliminated most of the latency penalty through KV-cache reuse, making constrained decoding as fast as unconstrained generation for most schemas. For teams running their own model infrastructure, this is the gold standard approach.
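The token-masking mechanism can be illustrated with a toy example. A real engine (Outlines, Guidance, lm-format-enforcer) compiles the schema into a grammar state machine over the model's actual tokenizer; here the "grammar" is simply "the output must remain a prefix of a valid completion," the vocabulary is seven hand-picked tokens, and the "model" is a preference function that keeps trying to emit an invalid token:

```python
# Toy illustration of token masking for constrained decoding.
VALID_OUTPUTS = ['{"ok": true}', '{"ok": false}']
VOCAB = ['{', '"ok"', ': ', 'maybe', 'true', 'false', '}']

def allowed(prefix, token):
    """A token is allowed iff prefix+token is still a prefix of a valid output."""
    return any(v.startswith(prefix + token) for v in VALID_OUTPUTS)

def generate(model_preference):
    out = ""
    while out not in VALID_OUTPUTS:
        # Mask: drop every token that would leave the grammar.
        legal = [t for t in VOCAB if allowed(out, t)]
        # The model's top-ranked token is taken only if it survives the mask.
        out += next(t for t in model_preference(out, VOCAB) if t in legal)
    return out

# A "model" that always prefers the invalid token 'maybe' first — the mask
# silently redirects it onto the schema-valid path.
result = generate(lambda prefix, vocab: ['maybe', '{', '"ok"', ': ', 'true', '}', 'false'])
# result == '{"ok": true}'
```

Production implementations do the same thing at the logits level (setting masked token probabilities to negative infinity before sampling), which is why the guarantee holds regardless of what the model "wants" to say.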
Error Recovery in Structured Output Pipelines
Even with mechanisms that achieve <0.3% failure rates, production systems at scale will encounter failures. Plan for them:
- Parse-and-retry: Catch JSON parse errors, pass the malformed output back to the model with a repair prompt ("This JSON is malformed. Here is the error: [error]. Please output only valid JSON."). Simple retries resolve most transient failures.
- Schema validation layer: Validate parsed JSON against your schema before using it. A response that is valid JSON but violates your schema (missing required field, wrong type) needs a different retry strategy — explicitly listing the schema violation rather than just the parse error.
- Fallback to prose extraction: For resilient production systems, maintain a fallback that extracts data from the model's natural language output when structured output fails. This is slower and less precise but ensures the system degrades gracefully rather than hard-failing.
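The three tiers above can be sketched as a single pipeline. Everything here is an assumption of the sketch: `call_model` stands in for any provider call, `schema_required` is a flat list of required top-level fields rather than a full JSON Schema, and the tier-3 regex scrape is deliberately crude:

```python
import json
import re

def structured_call(call_model, schema_required, max_retries=2):
    """Three-tier recovery: parse-and-retry, schema repair, prose fallback."""
    prompt = "Output only valid JSON."
    raw = ""
    for _ in range(max_retries + 1):
        raw = call_model(prompt)
        # Tier 1: parse-and-retry — feed the parse error back to the model.
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as err:
            prompt = (f"This JSON is malformed. Here is the error: {err}. "
                      "Please output only valid JSON.")
            continue
        # Tier 2: schema repair — valid JSON, but name the schema violation.
        missing = [k for k in schema_required if k not in data]
        if missing:
            prompt = (f"Valid JSON, but missing required fields {missing}. "
                      "Re-emit the full object.")
            continue
        return data
    # Tier 3: prose fallback — scrape key/value pairs from the last raw
    # output so the system degrades gracefully instead of hard-failing.
    found = dict(re.findall(r'"?(\w+)"?\s*[:=]\s*"?([\w.]+)"?', raw))
    return {k: found.get(k) for k in schema_required}

# Usage: a stub model that returns truncated JSON once, then valid JSON.
responses = iter(['{"name": "cable", "price": 9.99',
                  '{"name": "cable", "price": 9.99}'])
result = structured_call(lambda prompt: next(responses), ["name", "price"])
```

A real tier 2 would validate against the full schema (types, enums, nesting) rather than just required keys, but the control flow — distinct repair prompts per failure class, bounded retries, then fallback — is the part that matters.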
XML Structure as an Alternative to JSON
For Claude in particular, XML-tagged output is often more reliable than JSON for complex nested structures, because Claude is trained extensively on XML-delimited content. Defining an XML template for your output and asking Claude to populate it can produce lower failure rates than JSON for deeply nested or conditional schemas:
Respond using this exact XML structure:
<analysis>
  <sentiment>positive|negative|neutral</sentiment>
  <confidence>0.0–1.0</confidence>
  <key_themes>
    <theme>[theme]</theme>
  </key_themes>
</analysis>
XML output requires a different parsing step (XML parser rather than JSON parser), but for Claude-specific pipelines, it is worth benchmarking against tool use to see which produces lower failure rates for your specific schema.
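That parsing step is standard-library work in Python. The sketch below parses a simulated completion shaped like the template above; in practice you would first slice the `<analysis>...</analysis>` span out of the raw completion (models sometimes wrap it in commentary) before handing it to the parser:

```python
import xml.etree.ElementTree as ET

# Simulated Claude output following the XML template from the prompt.
raw = """<analysis>
  <sentiment>positive</sentiment>
  <confidence>0.92</confidence>
  <key_themes>
    <theme>battery life</theme>
    <theme>build quality</theme>
  </key_themes>
</analysis>"""

root = ET.fromstring(raw)
result = {
    "sentiment": root.findtext("sentiment"),
    "confidence": float(root.findtext("confidence")),
    "themes": [t.text for t in root.find("key_themes")],
}
# result == {"sentiment": "positive", "confidence": 0.92,
#            "themes": ["battery life", "build quality"]}
```

One practical caveat: `ET.fromstring` raises `ParseError` on malformed XML just as `json.loads` raises on malformed JSON, so the same tiered error recovery described above applies to XML pipelines too.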
- OpenAI strict JSON mode achieves <0.1% failure via grammar-constrained token generation; Claude tool use achieves <0.2%; Gemini responseSchema <0.3%
- Claude tool use requires a forced tool_choice ({ type: "tool", name: ... }) — without it, Claude may choose to respond in natural language rather than call the tool
- Constrained decoding (Guidance, Outlines) achieves near-100% schema compliance for self-hosted models with no meaningful latency penalty in modern implementations
- For Claude-specific pipelines, XML-tagged output often outperforms JSON for deeply nested schemas — benchmark both for your specific use case
- Production structured output pipelines need three-tier error recovery: parse-and-retry → schema repair → prose fallback, so systems degrade gracefully