How Language Models Actually Think
- Explain how token prediction mechanics underlie every prompting technique
- Describe why allocating more output tokens improves complex reasoning performance
- Identify the lost-in-the-middle effect and its practical implications for context structure
- Articulate why prompting is sensitive to surface features that humans would treat as equivalent
Why This Matters Before Anything Else
Most prompt engineering guides hand you templates without explaining the mechanism behind them. That works until it stops working — and then you have no framework for diagnosing the problem. This lesson builds that framework. Understanding why chain-of-thought works, why position in context matters, and why longer outputs cost more is not academic trivia. It is the foundation every subsequent technique in this course is built on.
The Core Mechanism: Token Prediction
A language model does one thing: it predicts the next token given everything that came before it. A token is roughly a word fragment; the word "prompting" might be two tokens ("prompt" and "ing"), while "a" is one token. At each step, the model assigns a probability to every token in its vocabulary (tens of thousands of options) and selects one, either the single most likely token or a sample from the distribution. It then repeats this for the token after that, and so on until it stops.
This means everything about a model's response (its reasoning, its format, its accuracy) emerges from this single repeated operation. There is no separate "reasoning module" or "fact lookup system." The model cannot consult a knowledge base mid-response. Every word it writes influences every word that follows, because each generated token becomes part of the context for the next prediction.
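To make the loop concrete, here is a minimal sketch of greedy decoding, using GPT-2 through the Hugging Face transformers library as a small stand-in for any autoregressive model (an assumption of convenience; production systems add smarter sampling, but the loop is the same):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Prompt engineering is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(10):  # one forward pass per generated token
        logits = model(input_ids).logits   # a score for every token in the vocabulary
        next_id = logits[0, -1].argmax()   # greedy pick: take the most likely token
        # The chosen token joins the context and shapes the next prediction.
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```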
This has a critical implication for prompt engineering: the words you put in a prompt shift the probability distribution over what comes next. When you say "You are a senior engineer reviewing this code," the words "senior," "engineer," and "reviewing" activate statistical patterns from the model's training data: millions of examples of senior engineers reviewing code. The model's next tokens are drawn from that distribution, not from a blank slate.
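You can watch this shift directly by comparing next-token distributions under two framings. A rough sketch, again using GPT-2 as a stand-in (the exact probabilities differ by model, but the shift is the point):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def top_next_tokens(prompt, k=5):
    """Return the k most likely next tokens and their probabilities."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
    top = probs.topk(k)
    return [(tokenizer.decode(int(tok)), round(p.item(), 4))
            for p, tok in zip(top.values, top.indices)]

# The same continuation point, with and without the role framing.
print(top_next_tokens("The code is"))
print(top_next_tokens("You are a senior engineer reviewing this code. The code is"))
```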
Computation Budget and Why Chain-of-Thought Works
Here is an insight that explains several techniques at once: each token the model generates represents computation. More output tokens mean more processing steps, more opportunities to correct errors in earlier tokens, and more working memory for complex problems.
When you ask a model to "think step by step," you are not just asking it to show its work for your benefit. You are allocating more computation to the problem. A model answering "What is 17 × 23?" in a single token has one shot to get it right. A model that writes out "17 × 23 = 17 × 20 + 17 × 3 = 340 + 51 = 391" has multiple opportunities to catch arithmetic errors across multiple generation steps. The intermediate tokens literally serve as working memory.
This is why chain-of-thought only helps on tasks that require multi-step reasoning. For a simple factual question, adding reasoning steps adds tokens without adding accuracy. The technique is not magic — it is computation allocation.
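A concrete way to see the contrast is to give the same question a tiny token budget and a generous one. The sketch below uses the Anthropic Python SDK; the model ID is a placeholder assumption, so substitute whichever model you actually use.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

MODEL = "claude-sonnet-4-20250514"  # placeholder; any current model ID works
question = "What is 17 × 23?"

# Direct prompt: a tiny token budget forces the model to commit immediately.
direct = client.messages.create(
    model=MODEL,
    max_tokens=8,
    messages=[{"role": "user", "content": f"{question} Answer with only the number."}],
)

# Step-by-step prompt: the intermediate tokens serve as working memory.
stepwise = client.messages.create(
    model=MODEL,
    max_tokens=300,
    messages=[{"role": "user", "content": f"{question} Think step by step, then state the answer."}],
)

print("Direct:  ", direct.content[0].text)
print("Stepwise:", stepwise.content[0].text)
```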
Emergence and Scale
Not all models benefit equally from reasoning prompts. Research on chain-of-thought prompting (Wei et al., 2022) found meaningful gains only in models above approximately 100 billion parameters. Below that threshold, asking a smaller model to "think step by step" often produces plausible-sounding but incorrect reasoning, because the model does not have sufficient capacity to reason reliably across steps. Larger models have developed what researchers call "emergent" capabilities: behaviors that appear suddenly at scale rather than improving gradually.
In practical terms: the prompting techniques in this course work best with frontier models — Claude Sonnet and Opus, GPT-4o, Gemini 2.5 Pro. Applying advanced techniques to smaller or older models may produce worse results than a simpler direct prompt. Know your model before investing in sophisticated prompting.
Context Position and the Lost-in-the-Middle Effect
Research published in 2023 (Liu et al.) identified a consistent weakness across the major language models tested: information in the middle of a long context is recalled less reliably than information at the beginning or end. In their multi-document QA experiments, performance drops by more than 30% when the relevant document moves from position 1 to position 10 in a 20-document context.
The practical implication: when you are building prompts with multiple documents, examples, or reference materials, position your most critical content either early (the system prompt and opening context) or late (immediately before the final instruction or question). The middle of a long prompt is the least reliable real estate.
This effect persists even in models with million-token context windows. Whether in Claude or Gemini, the advertised context size is a ceiling, not a guarantee of uniform attention across it. A useful heuristic: treat 60–70% of the advertised maximum as the reliably effective context, and remember that strategic positioning within that window matters more than raw capacity.
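The positioning advice above is mechanical enough to encode. Below is a minimal sketch in plain Python; assemble_context is a hypothetical helper, and the relevance scores are assumed to come from whatever retrieval or ranking step you already have.

```python
def assemble_context(documents, scores):
    """Order documents so the highest-scoring ones sit at the edges of the
    prompt (where recall is most reliable) and the weakest fall in the middle."""
    ranked = [doc for _, doc in sorted(zip(scores, documents), reverse=True)]
    front, back = [], []
    for i, doc in enumerate(ranked):
        # Alternate placement: best first, second-best last, working inward.
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

docs = ["most relevant", "somewhat relevant", "background", "marginal"]
scores = [0.9, 0.7, 0.4, 0.2]
print(assemble_context(docs, scores))
# ['most relevant', 'background', 'marginal', 'somewhat relevant']
```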
What "Understanding" Actually Means Here
Language models do not understand text the way humans do. They do not form internal representations of meaning and then translate those into words. They predict statistically plausible continuations of token sequences, and the patterns they have learned from training data are rich enough that this prediction often produces outputs that look like understanding, reasoning, and knowledge.
This distinction matters for prompt engineering in a specific way: models are sensitive to surface features that humans would ignore. The exact wording of an instruction, whether you use a numbered list or bullet points, whether you say "do not" or "avoid" — these are not stylistic preferences. They shift the token distributions that follow. Expert prompt engineers think at this level. They test variations, measure differences, and treat prompts as code that needs to be debugged and optimized rather than natural language that "means the same thing either way."
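A minimal sketch of that workflow: two instruction phrasings a human would call equivalent, run against the same labeled cases and scored. The model ID and the toy test cases are illustrative assumptions; a real harness would use far more cases.

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-20250514"  # placeholder model ID

# Two phrasings a human would treat as identical.
VARIANTS = {
    "negation": "Do not include any explanation. Output only the label.",
    "direct":   "Output only the label, with no explanation.",
}

# Toy labeled cases for illustration only.
CASES = [
    ("The battery died after two days.", "negative"),
    ("Setup took thirty seconds. Flawless.", "positive"),
]

def run(instruction: str, text: str) -> str:
    msg = client.messages.create(
        model=MODEL,
        max_tokens=8,
        messages=[{
            "role": "user",
            "content": f"{instruction}\n\nClassify this review as positive or negative: {text}",
        }],
    )
    return msg.content[0].text.strip().lower()

# Score each variant; even small wording changes can move these numbers.
for name, instruction in VARIANTS.items():
    hits = sum(run(instruction, text) == label for text, label in CASES)
    print(f"{name}: {hits}/{len(CASES)}")
```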
- Every token a model generates is a prediction; more output tokens mean more computation and better performance on multi-step reasoning tasks
- Chain-of-thought works by allocating computation to a problem through intermediate tokens, not by accessing a separate reasoning system
- Information in the middle of long contexts is recalled 30%+ less reliably than information at the start or end — position critical content strategically
- Advanced prompting techniques work best on frontier models (100B+ parameters); applying them to smaller models can degrade performance
- Models are sensitive to surface features humans would treat as equivalent — exact wording, order, and formatting all shift token probability distributions