
Self-Consistency and Sampling Strategy

What you'll learn
  • Explain why greedy decoding is insufficient for complex reasoning tasks and how self-consistency addresses it
  • Implement self-consistency with the correct temperature range and sample count for a given task
  • Calculate the cost-accuracy trade-off of self-consistency and decide when it is worth applying
  • Distinguish between majority voting and weighted voting and choose appropriately

The Problem With Greedy Decoding

By default, language models use greedy decoding: at each step, they select the single most probable next token. This is fast and deterministic, but it has a structural weakness. If the model makes a slightly wrong turn early in a reasoning chain, every subsequent step builds on that error. There is no backtracking, no checking, no recovery. The model commits to its first path and follows it to the end.

Self-consistency, introduced by Wang et al. in 2022, addresses this by running the same prompt multiple times with temperature greater than zero — so the model takes different paths each time — and then selecting the answer that appears most frequently across all runs. The intuition: a correct answer to a well-posed problem will tend to emerge from many different valid reasoning paths, while an incorrect answer is more likely to appear only when the model makes a specific wrong turn.

How It Works in Practice

The implementation is straightforward:

  1. Write your CoT prompt as normal.
  2. Run it N times — typically 5 to 10 — with temperature set between 0.5 and 0.8. This introduces enough randomness to produce diverse reasoning paths without degrading coherence.
  3. Extract the final answer from each run (not the reasoning — just the conclusion).
  4. Select the answer that appears most often. On a tie, either pick arbitrarily or defer to the run with the longest supporting reasoning chain. (A minimal code sketch of the whole loop follows this list.)
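
Here is a minimal sketch of that loop in Python. `sample_completion` is a placeholder for whatever API client you use, and the `Answer:` extraction regex is an assumption about your prompt's output format; both would need adapting to your own stack.

```python
# Minimal self-consistency loop. `sample_completion` is a placeholder for
# your provider's chat-completion call; the extraction regex assumes each
# run ends with a line like "Answer: 42". Adapt both to your own setup.
import re
from collections import Counter

def sample_completion(prompt: str, temperature: float) -> str:
    """Placeholder: one API call returning the model's full CoT output."""
    raise NotImplementedError("wire this to your provider's client")

def extract_answer(completion: str) -> str | None:
    # Step 3: keep only the final conclusion, not the reasoning.
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None

def self_consistency(prompt: str, n: int = 5, temperature: float = 0.7) -> str | None:
    answers = []
    for _ in range(n):  # Step 2: N independent samples at temperature > 0
        answer = extract_answer(sample_completion(prompt, temperature))
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    # Step 4: majority vote; most_common breaks count ties by first appearance.
    return Counter(answers).most_common(1)[0][0]
```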

Wang et al. demonstrated that accuracy on GSM8K math problems improved from 51.7% with standard greedy CoT to 68% with self-consistency using just 5 samples at temperature 1.0. That is a 16-point absolute improvement from a technique that requires no prompt changes — only repeated calls and a voting step.

Temperature Calibration for Reasoning Tasks

Temperature is the parameter that controls randomness in sampling. At temperature 0, the model always selects the most likely token (greedy decoding). At higher temperatures, the output distribution flattens, so lower-probability tokens are sampled more often. For self-consistency, you need diversity across runs — but not so much randomness that the model loses coherence entirely.
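
To make the mechanics concrete, here is a toy illustration of temperature-scaled softmax sampling — a simplified model, not any provider's exact sampler. Dividing the logits by the temperature before normalising sharpens the distribution when T < 1 and flattens it when T > 1.

```python
# Toy illustration of temperature scaling: dividing logits by T before
# the softmax sharpens the distribution for T < 1 and flattens it for T > 1.
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    if temperature == 0:  # greedy: all probability mass on the argmax
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # toy next-token logits
for t in (0.0, 0.5, 0.7, 1.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```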

Research shows that temperatures between 0.5 and 0.8 produce the best accuracy gains for reasoning tasks using self-consistency. Below 0.5, runs are too similar to each other to provide meaningful diversity. Above 1.0, some reasoning paths become incoherent and drag down voting accuracy. A temperature of 0.7 is a reliable default for most self-consistency applications.

One model-specific note: Claude models tend to produce reliable diversity at slightly lower temperatures (0.5–0.7) than GPT-4o models, which often benefit from 0.7–0.9 for the same level of path diversity. Gemini 2.5 Pro shows strong consistency even at temperature 1.0 due to its training approach. Test on your specific task to find the optimal setting.
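
If you want to codify those starting points in a pipeline, a small lookup table is enough. The keys and values below are illustrative assumptions distilled from the ranges above, not vendor recommendations:

```python
# Illustrative per-model starting temperatures, taken from the ranges above.
# These are assumptions to validate on your own task, not vendor guidance.
TEMPERATURE_DEFAULTS = {
    "claude-sonnet": 0.6,    # midpoint of the 0.5–0.7 range
    "gpt-4o": 0.8,           # midpoint of the 0.7–0.9 range
    "gemini-2.5-pro": 1.0,   # reportedly coherent even at 1.0
}
```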

Cost Trade-offs and When to Use Self-Consistency

Self-consistency multiplies your API cost by N — running 5 samples costs 5 times more than a single call. This makes it unsuitable for high-volume, low-stakes applications. The calculus changes for high-stakes, low-volume tasks where getting the right answer is worth the cost (a quick break-even check follows the list):

  • Worth it: Medical diagnosis assistance, legal reasoning, financial calculations, code correctness on critical paths, scientific analysis
  • Not worth it: Chatbot responses, content generation at scale, classification tasks, tasks where the model is already reliable (>90% accuracy)
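
One way to formalise the decision is a back-of-the-envelope expected-value check: apply self-consistency when the expected benefit of the extra accuracy exceeds the marginal API cost. Every input below is a hypothetical figure you estimate for your own task:

```python
# A back-of-the-envelope check for when self-consistency pays off.
# Every input is a hypothetical figure you estimate for your own task.
def worth_it(per_call_cost: float, n_samples: int,
             accuracy_gain: float, value_of_correct_answer: float) -> bool:
    extra_cost = per_call_cost * (n_samples - 1)    # N calls instead of 1
    expected_benefit = accuracy_gain * value_of_correct_answer
    return expected_benefit > extra_cost

# Example: $0.02 per call, 5 samples, +10 points accuracy, $5 per correct answer.
print(worth_it(0.02, 5, 0.10, 5.0))  # True: $0.50 benefit vs $0.08 extra cost
```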

A practical optimisation: use 3 samples instead of 5 or 10. The research shows that the accuracy gains from self-consistency are front-loaded: going from 1 to 3 samples captures most of the available improvement, while going from 3 to 10 adds cost with diminishing returns. For most applications, 3 is the pragmatic sweet spot.

Majority Voting vs. Weighted Voting

Standard self-consistency uses majority voting: the most frequent answer wins. A more sophisticated variant is weighted voting, where you weight each answer by the model's stated confidence or by the length and coherence of the supporting reasoning chain. This requires more post-processing but can improve accuracy further, particularly when the model produces one very detailed correct reasoning path alongside several shorter incorrect ones.

A simple weighted voting heuristic: if one answer appears with a complete multi-step reasoning chain and another appears with a brief or broken chain, prefer the well-reasoned answer even if the vote count is tied. For automated pipelines, you can implement this by scoring reasoning chain completeness (does it include numbered steps, does it reach an explicit conclusion) and using the score as a tiebreaker.
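
Here is a sketch of that tiebreaker for automated pipelines. The two scoring criteria (numbered steps, explicit conclusion) come from the heuristic above; the specific regexes and the one-point-per-criterion weighting are illustrative assumptions:

```python
# Chain-completeness tiebreaker: majority vote first, with a crude
# reasoning-quality score used only to break ties between answers.
import re
from collections import defaultdict

def chain_score(completion: str) -> float:
    score = 0.0
    if re.search(r"^\s*\d+[.)]\s", completion, re.MULTILINE):  # numbered steps?
        score += 1.0
    if re.search(r"\b(Answer|Therefore|Conclusion)\b", completion):  # explicit conclusion?
        score += 1.0
    return score

def weighted_vote(runs: list[tuple[str, str]]) -> str:
    """runs: (extracted_answer, full_completion) pairs, one per sample."""
    counts: dict[str, int] = defaultdict(int)
    scores: dict[str, float] = defaultdict(float)
    for answer, completion in runs:
        counts[answer] += 1
        scores[answer] += chain_score(completion)
    # Vote count dominates; chain-completeness score only breaks ties.
    return max(counts, key=lambda a: (counts[a], scores[a]))
```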

Self-Consistency and Reasoning Models

Like CoT, self-consistency is most valuable for standard (non-reasoning) models. Models like o1, o3, and Claude with Extended Thinking enabled already perform internal multi-path exploration before producing output. Running self-consistency on top of a reasoning model adds cost without proportional benefit — the model has effectively already voted internally. Reserve self-consistency for Claude Sonnet (non-extended-thinking), GPT-4o, and Gemini 2.5 Flash on tasks where you need maximum accuracy and can absorb the cost.

Key takeaways
  • Self-consistency runs the same CoT prompt N times with temperature > 0 and selects the most frequent answer — no prompt changes required
  • Accuracy on GSM8K improved from 51.7% to 68% with just 5 samples at temperature 1.0 (Wang et al. 2022)
  • Temperature 0.5–0.8 is optimal for reasoning diversity — too low gives redundant runs, too high gives incoherent ones
  • 3 samples capture most of the available accuracy improvement; going to 10 adds cost with diminishing returns
  • Self-consistency adds no meaningful value over reasoning models (o1, o3, Claude Extended Thinking) which already explore multiple paths internally