Chain-of-Thought Mastery
By the end of this lesson, you should be able to:
- Describe the foundational 2022 research (Wei et al., Kojima et al.) and what it established about chain-of-thought
- Identify task types where CoT improves performance vs. task types where it degrades performance
- Apply the correct CoT variant (zero-shot, few-shot, or structured) to a given problem type
- Explain why CoT should be removed for reasoning models like o1, o3, and Claude Extended Thinking
From Track 2 to Mastery
If you completed Prompting 101, you have seen chain-of-thought introduced as "ask the model to think step by step." That description is accurate but incomplete. At the advanced level, chain-of-thought is a tool with a specific mechanism, a defined set of conditions where it helps, and a defined set of conditions where it actively hurts. Mastering it means understanding all three.
The Original Research
Chain-of-thought prompting was formalised by Wei et al. in a 2022 paper showing that worked examples of step-by-step reasoning produce striking gains on mathematical and logical reasoning benchmarks. A follow-up paper the same year, Kojima et al.'s "Large Language Models are Zero-Shot Reasoners", supplied the headline numbers: appending a single reasoning trigger raised accuracy on the MultiArith dataset from 17.7% to 78.7%, more than a fourfold improvement from one prompting change, and on the GSM8K math word problem benchmark from 10.4% to 40.7%. The technique itself was not new; practitioners had been doing something similar informally. What these papers did was quantify the effect at scale and establish chain-of-thought as a rigorous discipline.
Two variants emerged from this line of work:
- Few-shot CoT: You include examples in your prompt that demonstrate step-by-step reasoning before the actual question. The model learns the reasoning pattern from the demonstrations and applies it to the new problem.
- Zero-shot CoT: You simply append "Let's think step by step" to your prompt. Kojima et al. showed that this alone elicits reasoning behaviour, no examples required, a finding that surprised the research community. Both variants are sketched below.
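To make the contrast concrete, here is a minimal sketch of both variants as raw prompt strings. The worked example and question are the canonical ones from Wei et al.'s paper; everything else (variable names, formatting) is illustrative:

```python
question = (
    "The cafeteria had 23 apples. If they used 20 to make lunch "
    "and bought 6 more, how many apples do they have?"
)

# Zero-shot CoT: the trigger phrase alone elicits step-by-step reasoning.
zero_shot_prompt = f"{question}\n\nLet's think step by step."

# Few-shot CoT: a worked demonstration precedes the real question, so the
# model imitates the reasoning pattern (one shot shown here; 2-3 is typical).
few_shot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
    f"Q: {question}\nA:"
)
```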
The Mechanism: Why It Works
As explained in Lesson 1, each generated token is a computation step. CoT works by externalising intermediate reasoning into the token stream, where each step can be verified (by the model continuing from it) and where errors in early steps produce visible signals that the model can potentially correct. A model that writes "17 × 20 = 340, 17 × 3 = 51, 340 + 51 = 391" has three steps to get right. If the first one is wrong, the numbers downstream will not add up, and a capable model will often catch and correct the discrepancy.
This is fundamentally different from what happens in a single-token answer: the model has one shot, with no mechanism for detecting and correcting errors.
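The checkability of externalised steps can even be demonstrated mechanically. The sketch below parses each intermediate equation from a chain like the one above and flags any step that does not hold; the one-equation-per-step format it assumes is an illustration, not a general-purpose parser:

```python
import re

# Each step in the chain is an equation the model (or a checker) can test.
# The "a op b = c" format is an assumption made for this illustration.
STEP = re.compile(r"(\d+)\s*([×x*+])\s*(\d+)\s*=\s*(\d+)")

def verify_chain(chain: str) -> list[tuple[str, bool]]:
    """Parse every 'a op b = c' step and report whether it holds."""
    results = []
    for a, op, b, c in STEP.findall(chain):
        value = int(a) * int(b) if op in "×x*" else int(a) + int(b)
        results.append((f"{a} {op} {b} = {c}", value == int(c)))
    return results

# The chain from the text: every step checks out, so 391 can be trusted.
print(verify_chain("17 × 20 = 340, 17 × 3 = 51, 340 + 51 = 391"))

# A wrong first step produces a visible signal that a single-token
# answer never could.
print(verify_chain("17 × 20 = 350, 17 × 3 = 51, 350 + 51 = 401"))
```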
When Chain-of-Thought Hurts
The most important insight for advanced practitioners is that CoT is not universally beneficial. A 2024 arXiv study tested CoT on tasks where human performance is known to degrade under explicit reasoning: implicit statistical learning, pattern recognition, and certain kinds of intuitive judgment. Across GPT-4o, Claude 3 Opus, Gemini 2.x, and Llama 3.1, CoT degraded performance by 6–23% on these task types.
The analogy to human cognition is instructive: asking someone to verbally explain why they recognize a face or how they balance a bicycle makes them worse at both tasks. Some competencies are harmed by explicit articulation. The same is true for language models on tasks that depend on pattern recognition across the full input rather than sequential logical deduction.
Before applying CoT to a new task type, ask: does this problem benefit from step-by-step deduction, or does it require holistic pattern matching? Mathematical reasoning, logical puzzles, and multi-step calculation benefit from CoT; sentiment classification, grammatical judgment, and many other NLP tasks may not. When in doubt, measure, as in the sketch below.
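A minimal A/B harness for that measurement might look like the following. The `call_model` helper is a hypothetical stand-in for whatever API client you use, and the substring grader is deliberately crude; both are placeholders to adapt to your task:

```python
COT_SUFFIX = "\n\nLet's think through this step by step."

def call_model(prompt: str) -> str:
    """Hypothetical stand-in: replace with a real API call (OpenAI, Anthropic, ...)."""
    raise NotImplementedError

def accuracy(examples: list[tuple[str, str]], use_cot: bool) -> float:
    """Score one arm: each example is a (task_text, expected_label) pair."""
    correct = 0
    for text, label in examples:
        answer = call_model(text + (COT_SUFFIX if use_cot else ""))
        correct += label.lower() in answer.lower()  # crude grading; adapt
    return correct / len(examples)

# Run both arms on the same held-out sample before committing to CoT:
# print("direct:  ", accuracy(sample, use_cot=False))
# print("with CoT:", accuracy(sample, use_cot=True))
```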
Model-Specific CoT Effectiveness
Research from Wharton Generative AI Labs (2025) measured CoT improvement across models on identical tasks. The results vary significantly:
- Gemini Flash 2.0: +13.5% improvement with CoT prompting
- Claude Sonnet 3.5/3.7: +11.7% improvement
- GPT-4o-mini: +4.4% improvement (not statistically significant in many task types)
The critical finding for reasoning models like o1, o3, and Claude with Extended Thinking enabled: CoT adds only 2–3% improvement at the cost of 20–80% more tokens and time. These models have built-in reasoning mechanisms that CoT merely duplicates — you are essentially asking the model to do work it has already done internally. For reasoning models, the correct strategy is to remove explicit CoT instructions and let the model manage its own reasoning budget. This is explored in depth in Lesson 5.
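For illustration, here is what "removing CoT" looks like in practice: state the task plainly and let the model's built-in mechanism do the reasoning. This sketch assumes the Anthropic Python SDK; the `thinking` parameter shape follows Anthropic's extended-thinking documentation at the time of writing, and the model name is illustrative, so check current docs before relying on it:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# State the task plainly: no "think step by step", no worked examples.
# With extended thinking enabled, the model allocates its own reasoning
# budget before answering; explicit CoT instructions would duplicate it.
response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # illustrative model name
    max_tokens=16000,                  # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Compute 17 × 23 and state the result."}],
)
print(response.content[-1].text)  # final text block follows the thinking block
```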
Writing Effective CoT Prompts
For non-reasoning models on appropriate task types, here are the structural patterns that produce the best results:
Zero-shot CoT (simplest): Append one of these phrases to your prompt:
- "Let's think through this step by step."
- "Work through this carefully before giving your answer."
- "Think out loud as you solve this, then give your final answer."
Few-shot CoT (highest reliability): Include 2–3 examples that demonstrate the reasoning pattern you want, positioned immediately before the actual problem. Each example should show the full reasoning chain, not just the answer. When writing examples for your domain, prefer examples that cover different reasoning sub-patterns over examples that are all similar: diversity in examples generalises better than similarity, as in the builder below.
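A small builder makes the assembly explicit. The two examples here are illustrative and deliberately cover different sub-patterns (distributive multiplication and a remainder problem) rather than repeating one pattern; in practice you would draw them from your own domain:

```python
# Illustrative few-shot CoT examples covering different reasoning
# sub-patterns; replace with worked examples from your own domain.
EXAMPLES = [
    ("What is 6 × 45?",
     "6 × 40 = 240 and 6 × 5 = 30. 240 + 30 = 270. The answer is 270."),
    ("17 cookies are shared equally among 5 children. How many are left over?",
     "5 × 3 = 15 cookies are handed out, leaving 17 - 15 = 2. The answer is 2."),
]

def build_few_shot_prompt(question: str) -> str:
    """Assemble demonstrations followed by the real question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in EXAMPLES)
    return f"{shots}\n\nQ: {question}\nA:"

print(build_few_shot_prompt("What is 17 × 23?"))
```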
Structured CoT (for complex multi-step problems):
Problem: [state the problem]
Let me work through this systematically:
Step 1: [first sub-problem or constraint]
Step 2: [next step, building on step 1]
...
Final answer: [conclusion drawn from steps]
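Applied to the multiplication example from the mechanism section, the filled-in template would read:

Problem: Compute 17 × 23.
Let me work through this systematically:
Step 1: 17 × 20 = 340
Step 2: 17 × 3 = 51
Step 3: 340 + 51 = 391
Final answer: 391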
One practical note on model differences: Claude models respond well to instructions framed as part of the task ("Work through this step by step before answering"). GPT-4o often benefits from role framing alongside CoT ("As a careful mathematician, solve this step by step"). Gemini 2.5 Pro shows strong CoT performance with minimal prompting — the zero-shot variant often matches few-shot quality.
Key Takeaways
- Zero-shot CoT lifted MultiArith accuracy from 17.7% to 78.7% (Kojima et al. 2022), but only on tasks requiring sequential deduction, not pattern recognition
- CoT degrades performance 6–23% on tasks where explicit reasoning hurts humans too: sentiment, grammar, and intuitive pattern recognition
- Model-specific CoT gains vary widely: Gemini Flash +13.5%, Claude Sonnet +11.7%, GPT-4o-mini +4.4% (often not significant)
- For reasoning models (o1, o3, Claude Extended Thinking): skip CoT entirely — it adds 20–80% more tokens for only 2–3% gain
- Few-shot CoT examples should be diverse across reasoning sub-patterns, not similar — diversity generalises better to new problems