Prompt Chaining and Pipeline Design
By the end of this lesson, you should be able to:
- Identify which of the four chaining patterns is appropriate for a given complex task
- Design validation gates between chain steps to catch format and semantic failures before they propagate
- Implement failure handling with retry logic, quality thresholds, and checkpointing for production chains
- Apply the principle of model selection per step to optimise cost without sacrificing quality at critical steps
Why Single Prompts Have Limits
A single prompt asks one model to simultaneously understand a complex problem, reason through it, format the output correctly, apply all relevant constraints, and produce a coherent response — all in one pass. For simple tasks, this is fine. For complex, multi-stage tasks, it is asking too much. The model is trying to hold too many competing considerations in attention at once, and some get dropped.
Prompt chaining solves this by decomposing a complex task into a series of simpler, focused prompts. Each prompt has one clear job. The output of step N becomes part of the input for step N+1. This is how complex AI workflows are built in production — not with a single heroic prompt, but with a pipeline of coordinated, single-purpose steps.
A 2024 survey found that 43% of enterprise AI deployments use graph-based chaining workflows. LangChain reported that the average number of steps per AI workflow trace rose from 2.8 to 7.7 between 2023 and 2024 — nearly a threefold increase. Research comparing chaining to monolithic prompts on equivalent tasks showed 15.6% better accuracy for chained approaches on complex multi-step work.
The Four Chaining Patterns
Pattern 1: Sequential chaining
The simplest and most common pattern. Output from Step 1 flows directly to Step 2, which flows to Step 3, and so on. Use this for multi-stage tasks where each stage depends on the previous one: research → outline → draft → edit → format.
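The sequential pattern can be sketched in a few lines. This is a minimal illustration, not a production implementation: `call_model` is a placeholder stub standing in for a real LLM API call, and the prompt templates are hypothetical.

```python
# Placeholder for a real LLM API call; returns a canned string for illustration.
def call_model(prompt: str) -> str:
    return f"output for: {prompt[:30]}"

def run_sequential(steps: list[str], task: str) -> str:
    """Run each step's prompt template on the previous step's output."""
    current = task
    for template in steps:
        # The output of step N becomes part of the input for step N+1.
        current = call_model(template.format(input=current))
    return current

result = run_sequential(
    ["Research: {input}", "Outline: {input}", "Draft: {input}"],
    "prompt chaining",
)
```

Each step has one job, and the pipeline is just a loop over templates, which makes adding or reordering stages trivial.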
Pattern 2: Branching chaining
A single output fans out to multiple parallel sub-chains. Use this when different aspects of a task can be handled independently and then merged. Example: a document analysis that simultaneously extracts key claims, identifies gaps, and checks citations, with results merged into a final report.
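A branching chain can fan out with ordinary thread-based concurrency, since the sub-chains are independent. Again `call_model` is a stub, and the branch prompts are invented for the document-analysis example above.

```python
from concurrent.futures import ThreadPoolExecutor

def call_model(prompt: str) -> str:  # stub for a real LLM call
    return f"[{prompt}]"

def run_branching(document: str, branch_prompts: dict[str, str]) -> dict[str, str]:
    """Fan the same input out to independent sub-chains, then collect results."""
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(call_model, f"{prompt}\n\n{document}")
            for name, prompt in branch_prompts.items()
        }
        return {name: f.result() for name, f in futures.items()}

branches = {
    "claims": "Extract the key claims from this document:",
    "gaps": "Identify gaps in the argument:",
    "citations": "Check the citations in this document:",
}
results = run_branching("Some document text.", branches)
# Merge step: combine the parallel results into a final report.
merged = call_model("Merge into a final report:\n" + "\n".join(results.values()))
```

Because the branches never read each other's output, they can run fully in parallel, so total latency is roughly one branch plus the merge, not the sum of all branches.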
Pattern 3: Iterative chaining
A prompt runs repeatedly until a condition is met. Use this for refinement loops — generate, evaluate, revise — where you do not know in advance how many iterations are needed. Set a maximum iteration count to prevent infinite loops.
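The generate-evaluate-revise loop, with its hard iteration cap, looks like this. The `revise` and `score` functions are deterministic stubs standing in for LLM calls, so the sketch runs without an API.

```python
MAX_ITERATIONS = 5  # hard cap so the loop cannot run forever

def revise(draft: str) -> str:
    """Stub for an LLM 'generate/revise' call."""
    return draft + "x"

def score(draft: str) -> float:
    """Stub for an LLM 'evaluate' call returning a quality score in [0, 1]."""
    return min(len(draft) / 3, 1.0)

def run_iterative(threshold: float = 1.0) -> tuple[str, int]:
    draft, iterations = "", 0
    while iterations < MAX_ITERATIONS:
        draft = revise(draft)
        iterations += 1
        if score(draft) >= threshold:
            break  # stopping condition met before the cap
    return draft, iterations
```

The loop exits on whichever comes first: the quality condition or the iteration cap, so an evaluator that never returns a passing score cannot hang the pipeline.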
Pattern 4: Hierarchical chaining
A high-level orchestrator prompt decomposes a task and delegates sub-tasks to specialised sub-chains. The orchestrator collects and integrates results. This is the foundation of multi-agent architectures and is explored further in Lesson 16.
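In skeleton form, the orchestrator decomposes, delegates, and integrates. In a real system the decomposition would itself come from an orchestrator prompt; here it is hard-coded, and `call_model` remains a stub.

```python
def call_model(prompt: str) -> str:  # stub for a real LLM call
    return f"result({prompt})"

def orchestrate(task: str) -> str:
    # A real orchestrator prompt would produce this decomposition;
    # it is hard-coded here for illustration.
    subtasks = [f"{task}: part {i}" for i in (1, 2, 3)]
    partials = [call_model(s) for s in subtasks]            # delegate to sub-chains
    return call_model("Integrate:\n" + "\n".join(partials))  # collect and integrate
```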
Validation Gates
The power of chaining comes with a risk: if Step 1 produces a flawed output, every subsequent step builds on that flaw. Without validation between steps, errors compound. Validation gates are checkpoints between steps that verify the output before it is passed forward.
Three types of validation gates:
- Format validation: Check that the output has the required structure before passing it forward. A step that should return JSON should be validated as parseable JSON before the next step tries to use it. Tools like Pydantic make this easy in Python pipelines.
- Semantic validation: Ask a separate prompt to evaluate whether the output meets the quality requirements. "Does this outline contain at least 5 distinct main points, each with a clear argumentative claim?" This adds a call but catches content failures early.
- Confidence thresholds: For classification or extraction steps, include a confidence score in the output schema and only proceed if confidence exceeds a threshold. Below threshold, trigger a retry or an escalation path.
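A format gate and a confidence gate can be sketched with the standard library alone (in a production Python pipeline, Pydantic models would replace the hand-rolled key check). The threshold value and key names here are illustrative.

```python
import json

CONFIDENCE_THRESHOLD = 0.8  # illustrative value; tune per task

def format_gate(raw: str, required_keys: set[str]) -> dict:
    """Format validation: output must be parseable JSON with the expected keys."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    missing = required_keys - data.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data

def confidence_gate(data: dict) -> dict:
    """Confidence threshold: only proceed if the self-reported score is high enough."""
    if data["confidence"] < CONFIDENCE_THRESHOLD:
        raise ValueError(f"confidence {data['confidence']} below threshold")
    return data

step_output = '{"label": "invoice", "confidence": 0.93}'
validated = confidence_gate(format_gate(step_output, {"label", "confidence"}))
```

Raising on failure rather than returning a flag forces the caller to decide explicitly what happens next: retry, escalate, or halt the chain.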
Failure Handling and Rollback
Production chains need failure handling. The three failure modes to plan for:
- Transient failures: A step fails due to a temporary API error or timeout. Solution: retry with exponential backoff. Implement automatic retries for steps that produce errors, with a maximum retry count.
- Quality failures: A step produces output that passes format validation but fails semantic validation. Solution: retry the specific step with a modified prompt (adding clarification or examples), then continue the chain if the retry succeeds.
- Cascade failures: A flawed output that passes validation propagates through the chain and corrupts later steps. Solution: checkpointing — save the output of each step so that if a later step fails, you can restart from the last valid checkpoint rather than restarting the entire chain.
For long chains (more than 4–5 steps), checkpointing is not optional — it is essential. A 10-step chain with no checkpoints means a failure at step 9 requires re-running steps 1–8, wasting all their compute and cost.
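Retry-with-backoff and checkpointing fit together in a small runner. This is a sketch under simplifying assumptions: steps are plain callables, `RuntimeError` stands in for transient API errors, and checkpoints live in a dict rather than durable storage.

```python
import time

def run_with_retries(step, payload, max_retries=3, base_delay=1.0):
    """Retry a step on transient errors with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return step(payload)
        except RuntimeError:               # stand-in for API errors / timeouts
            if attempt == max_retries - 1:
                raise                      # give up after the final attempt
            time.sleep(base_delay * 2 ** attempt)

def run_chain(steps, task, checkpoints=None):
    """Resume from the last valid checkpoint instead of restarting at step 1."""
    checkpoints = dict(checkpoints or {})
    current = task
    for i, step in enumerate(steps):
        if i in checkpoints:
            current = checkpoints[i]       # skip already-completed work
            continue
        current = run_with_retries(step, current)
        checkpoints[i] = current           # save after each successful step
    return current, checkpoints
```

Passing the saved `checkpoints` back into `run_chain` after a failure at step 9 means steps 1-8 are skipped rather than re-run, which is the whole point of checkpointing.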
Practical Design Principles
Several design principles emerge from working with production prompt chains:
- One job per step: Each step in the chain should have exactly one responsibility. Steps that try to do two things simultaneously are the most common source of chain failures.
- Explicit output contracts: Define the exact format each step must produce. Both the step that generates and the step that consumes should reference the same schema. Ambiguous interfaces between steps create hard-to-debug failures.
- Model selection per step: Not every step needs the strongest model. Classification steps, format-checking steps, and extraction steps often work well with faster, cheaper models (Claude Haiku, GPT-4o-mini). Reserve expensive models for the steps that require them.
- Avoid chaining more than 5 steps without a summarisation step: In very long chains, early-step context can drift out of effective recall by later steps. A summarisation step every 4–5 steps — compressing earlier outputs into a concise context block — maintains coherence through long pipelines.
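Per-step model selection can be as simple as a routing table. The model names below are hypothetical placeholders, not real model identifiers.

```python
# Hypothetical model names; routing cheap steps to a small model cuts cost.
CHEAP_MODEL = "small-fast-model"
STRONG_MODEL = "large-capable-model"

STEP_MODELS = {
    "classify": CHEAP_MODEL,
    "extract": CHEAP_MODEL,
    "format_check": CHEAP_MODEL,
    "draft": STRONG_MODEL,
    "edit": STRONG_MODEL,
}

def model_for(step_name: str) -> str:
    # Default to the strong model when a step is not explicitly routed,
    # so an unrouted step fails toward quality rather than toward cost.
    return STEP_MODELS.get(step_name, STRONG_MODEL)
```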
When Not to Chain
Chaining adds latency, cost, and complexity. Before building a chain, check whether a well-designed single prompt with careful few-shot examples achieves the same result. If the task has clearly separable stages that each require focused attention, chaining is justified. If the task is actually one coherent judgment call — even a complex one — a single prompt with a strong system prompt and good examples may outperform a chain, without the operational overhead.
Key Takeaways
- Chaining achieves 15.6% better accuracy than monolithic prompts on complex tasks — 43% of enterprise AI deployments now use multi-step graph workflows
- Four patterns: sequential (most common), branching (parallel), iterative (refinement loops), hierarchical (orchestrator/delegate)
- Validation gates between steps — format, semantic, and confidence-threshold checks — prevent errors from cascading through the pipeline
- Checkpointing at each step is essential for chains longer than 4–5 steps — failures near the end of a long chain should not require restarting from step 1
- Each step should have one job; use cheaper models for classification and format-checking steps; reserve expensive models for steps that genuinely need them