
Extended Thinking and Reasoning Models

Advanced · 24 min · Lesson 5 of 16
What you'll learn
  • Explain how Extended Thinking and reasoning models differ architecturally from standard models
  • Configure Claude Extended Thinking with appropriate budget and effort settings for a given task
  • Identify which task types benefit from extended thinking and which are degraded by it
  • Apply the three rules for prompting reasoning models differently from standard models

A New Class of Model

In 2024 and 2025, a new category of AI model emerged alongside the standard "chat" models. OpenAI launched o1 and o3. Anthropic added Extended Thinking to Claude Sonnet and Opus. Google released Gemini 2.5 with Deep Think mode. These "reasoning models" share a fundamental characteristic: before producing their visible output, they perform an internal reasoning process — a hidden chain of thought — that allocates additional computation to the problem.

This changes prompt engineering in a specific and important way: the techniques from Lessons 2–4 (CoT, self-consistency, ToT) were designed to help standard models reason better. Reasoning models already reason. Applying those techniques on top of built-in reasoning often adds cost without adding quality — and sometimes makes things worse.

How Claude Extended Thinking Works

When you enable Extended Thinking in the Claude API, the model generates a "thinking block" — a hidden scratchpad where it reasons through the problem — before writing its visible response. The thinking block is available via the API but not shown to end users by default. You set a token budget for the thinking process, which controls how much reasoning time the model allocates.

The API call uses a thinking parameter:

thinking: {
  type: "enabled",
  budget_tokens: 10000
}
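
For orientation, here is roughly how that parameter fits into a complete Messages API call with Anthropic's TypeScript SDK. This is a minimal sketch: the model id is illustrative (substitute a current Extended Thinking-capable model), and max_tokens must be larger than budget_tokens, since thinking counts against it.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await client.messages.create({
  model: "claude-sonnet-4-20250514", // illustrative model id
  max_tokens: 16000, // must exceed budget_tokens; thinking counts against it
  thinking: { type: "enabled", budget_tokens: 10000 },
  messages: [
    { role: "user", content: "Prove that every integer n > 1 has a prime factor." },
  ],
});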

The minimum budget is 1,024 tokens. Anthropic also offers "adaptive" thinking, where you set an effort level ("low", "medium", "high", "xhigh") and the model decides its own token budget — useful when you do not know in advance how hard a problem is. For Claude Opus and Sonnet 4.x, adaptive thinking at high or xhigh effort covers the majority of use cases without manual budget tuning.

Pricing note: you are charged for the thinking tokens the model consumes, not just the visible output, and thinking tokens are billed as output tokens. A call that uses 15,000 thinking tokens and 300 visible output tokens is billed for approximately 15,300 output tokens. Budget accordingly for production use.
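
To see what a given call actually consumed, the response's usage field reports billed token counts, and thinking tokens show up in output_tokens. A small sketch, continuing from the call above:

// output_tokens includes thinking tokens, so the gap between it and the
// length of the visible answer is roughly your thinking spend
const visibleText = response.content
  .flatMap((block) => (block.type === "text" ? [block.text] : []))
  .join("");

console.log(`output tokens billed: ${response.usage.output_tokens}`);
console.log(`visible answer length: ${visibleText.length} characters`);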

OpenAI o1 and o3

OpenAI's reasoning models (o1, o3, o4-mini) differ from Claude's approach in one important way: the internal reasoning is completely hidden and cannot be accessed via the API. OpenAI explicitly prohibits users from attempting to reveal o1/o3's chain of thought. The model produces only its final answer.

o3 (announced December 2024) was shown to make 20% fewer major errors than o1 on complex tasks, with particular improvements in coding, business reasoning, and creative problem-solving. These models benchmark extremely well on graduate-level reasoning (GPQA-Diamond) and mathematical olympiad problems — tasks where the quality of internal reasoning directly determines output correctness.

For prompting: o1 and o3 respond to clear problem statements more than to elaborate prompting scaffolds. You state what you need solved. The model manages its own reasoning. Elaborate CoT instructions, step-by-step breakdowns, and "think carefully" phrases are redundant — the model is already thinking carefully.
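
In practice, an o-series call looks almost bare. A sketch with OpenAI's TypeScript SDK, assuming the Chat Completions API and its reasoning_effort setting; the problem text is a placeholder:

import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const completion = await openai.chat.completions.create({
  model: "o3",
  reasoning_effort: "high", // raise this knob instead of adding CoT instructions
  messages: [
    {
      role: "user",
      // a plain, complete problem statement; no step-by-step scaffolding
      content: "State the problem here, with constraints and the expected output format.",
    },
  ],
});

console.log(completion.choices[0].message.content);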

Gemini 2.5 Deep Think

Google's Gemini 2.5 Pro with "thinking" mode enabled uses a similar architecture to Claude Extended Thinking. The generation config accepts a thinkingConfig object with an optional thinkingBudget field; setting thinkingBudget to -1 enables dynamic thinking, where the model adapts its reasoning depth to the problem's complexity. Gemini 2.5 Pro leads several benchmarks, including GPQA-Diamond (94.3% as of early 2026), and is particularly strong on scientific reasoning tasks.
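
A minimal sketch with Google's @google/genai TypeScript SDK; the model id and prompt are illustrative:

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const result = await ai.models.generateContent({
  model: "gemini-2.5-pro",
  contents: "Why does this reaction proceed despite the positive enthalpy change?",
  config: {
    thinkingConfig: { thinkingBudget: -1 }, // -1 lets the model set its own depth
  },
});

console.log(result.text);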

When Extended Thinking Helps — and When It Hurts

Extended thinking is not universally beneficial. Research shows it can degrade performance by up to 36% on tasks that benefit from direct, intuitive response — the same effect seen with CoT in Lesson 2, amplified. The task types where extended thinking adds no value or hurts:

  • Creative writing where spontaneity matters
  • Conversational responses where naturalness is the goal
  • Pattern recognition tasks (classification, sentiment)
  • Simple factual lookups
  • Tasks requiring rapid responses in latency-sensitive applications

The task types where extended thinking consistently helps:

  • Complex mathematical reasoning and proof-checking
  • Difficult coding problems with multiple valid approaches
  • Multi-step logical analysis
  • Planning tasks with many interacting constraints
  • Research-level questions requiring synthesis across many considerations
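
If your application handles mixed traffic, one practical consequence of these two lists is to decide per request whether to enable thinking at all. A sketch; the task labels are hypothetical, not part of any SDK:

// hypothetical task taxonomy for routing; adapt to your own categories
type TaskKind = "math" | "coding" | "analysis" | "chat" | "classification";

const DEEP_TASKS: TaskKind[] = ["math", "coding", "analysis"];

// return a thinking config for tasks that benefit from it, and undefined
// (thinking omitted entirely) for intuitive or latency-sensitive tasks
function thinkingFor(kind: TaskKind) {
  return DEEP_TASKS.includes(kind)
    ? { type: "enabled" as const, budget_tokens: 10000 }
    : undefined;
}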

How to Prompt Reasoning Models Differently

Anthropic's own documentation for Extended Thinking advises removing "step-by-step" instructions when Extended Thinking is enabled. The model manages its own reasoning budget — explicit CoT instructions tell it to surface reasoning it is already performing internally, adding visible tokens without improving accuracy.

Three rules for reasoning model prompts:

  1. Remove CoT scaffolding: No "think step by step," no "let's work through this carefully." The model already does this.
  2. State the problem clearly and completely: Reasoning models respond to the quality of the problem specification. A well-defined problem with clear constraints, inputs, and expected output format gives the model's internal reasoning the structure it needs.
  3. Raise effort level rather than prompting around it: If you are not getting the quality you need, increase the thinking budget or effort level — do not add more instructions. More reasoning time beats more instruction text for these models.
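
A hypothetical before/after illustrates all three rules at once. The first prompt reflects standard-model habits; the second gives a reasoning model what it actually needs:

Before: "Think step by step. First, carefully list every constraint. Then evaluate each possible schedule one by one, showing your reasoning..."

After: "Schedule these 12 tasks across 3 machines. Constraints: no machine exceeds 8 hours; tasks 3 and 7 share a machine. Output: a table mapping each task to a machine."

If the second prompt still underperforms, raise budget_tokens or the effort level rather than reintroducing the first prompt's scaffolding.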

On Claude, read the thinking output and use it to improve your prompts: if the model's thinking block takes a wrong turn early, look at what context it was missing, and add that to your system prompt or problem statement. The thinking block is a diagnostic tool, not just an output — one of the most powerful debugging capabilities available for prompt engineering.
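
Reading the thinking block takes only a few lines with the TypeScript SDK. Continuing from the earlier call, thinking arrives as content blocks of type "thinking" alongside the normal text blocks:

for (const block of response.content) {
  if (block.type === "thinking") {
    console.log("--- model reasoning ---");
    console.log(block.thinking); // the hidden scratchpad, surfaced for debugging
  } else if (block.type === "text") {
    console.log("--- visible answer ---");
    console.log(block.text);
  }
}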

Key takeaways
  • Reasoning models (Claude Extended Thinking, o1/o3, Gemini Deep Think) perform hidden internal reasoning before producing output — CoT prompts are redundant and often counterproductive
  • Claude Extended Thinking requires a minimum 1,024-token budget; adaptive thinking at high/xhigh effort covers most use cases without manual tuning
  • Extended thinking can degrade performance by up to 36% on intuitive/creative tasks — only apply it to complex reasoning, coding, and analysis
  • Remove step-by-step instructions for reasoning models; raise the thinking budget instead of adding more instruction text
  • Claude's thinking block is a diagnostic tool — read it to understand what context the model lacked, then add that to your prompt