Advanced Few-Shot Engineering
- Determine the appropriate shot count for a given task type and context budget
- Apply diversity-based selection to choose examples that generalise better than similarity-based selection
- Structure example ordering to maximise the recency effect and minimise ordering sensitivity
- Construct contrastive example pairs that teach the model the boundary between correct and incorrect output
Beyond "Add a Few Examples"
If you completed Prompting 101, you know that including examples in a prompt — showing the model what you want before asking for it — reliably improves output quality. At the advanced level, few-shot engineering is a systematic discipline: what shot count maximises performance, which examples to choose, in what order to arrange them, and how the expanded context windows of modern models change the calculus entirely.
Optimal Shot Count
The research on optimal shot count shows consistent patterns. Performance improves rapidly up to roughly 10 examples, with each additional shot providing meaningful gain. Beyond that point, returns diminish but do not stop: 2024 research demonstrated that moving from few-shot (3–5 examples) to many-shot (50–100+ examples) produces meaningful additional gains across diverse tasks, particularly for tasks requiring fine-grained output style or domain-specific formatting.
The catch: many-shot was only practically possible once context windows expanded to 100k+ tokens. With Claude's 200k context, GPT-4o's 128k, and Gemini's 1M context, including 50–200 examples is now feasible for applications where the performance gain justifies the token cost. For most tasks, 3–5 diverse examples is the pragmatic default; for high-value, high-volume applications where even 2–3% accuracy gains matter, testing the many-shot regime is worthwhile.
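Whether many-shot is even feasible is an arithmetic question about your context budget. A minimal sketch of that arithmetic, assuming a rough average token count per example (the function name and the specific numbers are illustrative, not from any library):

```python
def max_affordable_shots(context_window: int, example_tokens: int,
                         system_tokens: int, input_tokens: int,
                         output_reserve: int = 1024) -> int:
    """Rough upper bound on how many few-shot examples fit in the
    context window after accounting for the system prompt, the actual
    input, and tokens reserved for the model's response."""
    budget = context_window - system_tokens - input_tokens - output_reserve
    return max(0, budget // example_tokens)

# Example: a 200k-token window, ~150-token examples, a 500-token system
# prompt, a 300-token input, and 1024 tokens reserved for the response.
shots = max_affordable_shots(200_000, 150, 500, 300)
```

Even with conservative estimates, a 200k window leaves room for far more examples than the many-shot regime requires; the binding constraint is usually cost per call, not capacity.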
Shot Selection: Diversity Over Similarity
Intuition suggests choosing examples similar to the target — if you are asking about restaurants, include restaurant examples. Research shows this is wrong. Similarity-based selection introduces topical bias, making the selected examples too similar to each other and to the query. When all your examples are about restaurants, the model overfits to restaurant-specific patterns rather than learning the general pattern you want.
The superior strategy is diversity-based selection: choose examples that represent different sub-types of the task. If your task is "classify customer feedback as positive, negative, or neutral," select examples that include short feedback and long feedback, emotional and matter-of-fact language, product complaints and service complaints, and edge cases that are genuinely ambiguous. The model learns the general pattern, not a narrow slice of it.
For systematic diversity at scale, clustering-based selection works well: cluster your example library by semantic similarity, then pick one representative example from each cluster. This guarantees coverage without manually curating every example.
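The clustering step can be approximated with a simple greedy max-min heuristic rather than full k-means: repeatedly pick the example farthest from everything already chosen. A sketch assuming you already have one embedding vector per example (from any embedding model); this is a stand-in for the clustering approach described above, not a specific library's API:

```python
import math

def cosine_dist(a, b):
    """Cosine distance between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def diverse_subset(embeddings, k):
    """Greedy max-min selection: start from the first example, then
    repeatedly add the example whose nearest chosen neighbour is
    farthest away, so each pick covers new semantic ground."""
    chosen = [0]  # seed with the first example
    while len(chosen) < k:
        best_i, best_d = None, -1.0
        for i in range(len(embeddings)):
            if i in chosen:
                continue
            d = min(cosine_dist(embeddings[i], embeddings[j]) for j in chosen)
            if d > best_d:
                best_i, best_d = i, d
        chosen.append(best_i)
    return chosen
```

Given two near-duplicate restaurant examples and two unrelated ones, this picks one from each semantic neighbourhood rather than two near-duplicates — the same coverage guarantee the clustering approach provides.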
The Ordering Effect
Example ordering has a larger impact on accuracy than most practitioners realise. Research (ACL 2024) shows that given the same three examples arranged in different orders, accuracy can shift by more than 40% — purely from ordering, with no other changes. The model is highly sensitive to recency: examples placed later in the prompt have stronger influence on the response than examples placed earlier.
Two practical implications:
- Put your most representative example last, immediately before the actual input. The model will pattern-match most strongly against the most recent example it processed.
- Do not put confusing or edge-case examples last. Difficult examples are useful for calibration, but placing them at the end of your example sequence causes the model to pattern-match against your hardest case rather than your clearest one.
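Both rules reduce to one sorting step, assuming you attach a rough representativeness score to each example (a hypothetical label you assign by hand or via similarity to a typical input — not part of any API):

```python
def order_examples(examples):
    """Arrange few-shot examples so the clearest, most representative
    one sits last, immediately before the real input. Edge cases land
    early, where they calibrate the model without dominating the
    pattern match. Each example is (text, representativeness),
    higher meaning clearer / more typical."""
    return [text for text, score in sorted(examples, key=lambda e: e[1])]
```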
Batch Calibration (BC) is a 2024 technique that reduces ordering sensitivity by running multiple orderings and aggregating results. If you are finding that your few-shot prompt is inconsistent across different orderings, applying BC — similar in concept to self-consistency but varying example order rather than temperature — can stabilise performance.
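A minimal sketch of the aggregation idea described above — run the classifier under several example orderings and take a majority vote. Here `classify` is a placeholder for your actual model call, and this is the aggregation concept in miniature rather than the published BC algorithm verbatim:

```python
import itertools
from collections import Counter

def vote_over_orderings(examples, query, classify, max_orderings=6):
    """Stabilise a few-shot classifier against ordering sensitivity:
    run it under up to `max_orderings` permutations of the examples
    and return the majority answer."""
    perms = itertools.islice(itertools.permutations(examples), max_orderings)
    votes = Counter(classify(list(p), query) for p in perms)
    return votes.most_common(1)[0][0]
```

In practice you would sample a handful of random orderings rather than enumerate permutations, and trade the extra model calls against the consistency gain.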
Contrastive Example Pairs
A powerful but underused few-shot technique: instead of including only positive examples (showing what you want), include contrastive pairs — one example of the right output alongside one example of the wrong output, with explanation of the difference. This teaches the model the boundary between correct and incorrect rather than just the shape of the correct output.
Example structure for a writing task:
Input: Write a one-sentence product description for a noise-cancelling headphone.
Good output: "Block out the world and lose yourself in pure sound with up to 30 hours of battery life and adaptive noise cancellation that responds to your environment."
Why this works: Specific feature (30 hours), clear benefit (block out world), sensory language (pure sound).
Bad output: "These headphones are very good and have great sound quality and noise cancellation."
Why this fails: Vague adjectives, no specific features, no benefit language, no sensory appeal.
Contrastive examples are particularly effective for tasks where quality is subjective or where the wrong output is easy to produce — writing, tone matching, formatting, and complex classification tasks with ambiguous category boundaries.
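The structure above is mechanical enough to template. A sketch of a helper that formats one contrastive pair into a prompt segment (the function and field names are illustrative):

```python
def contrastive_block(task, good, good_why, bad, bad_why):
    """Format one contrastive pair (good and bad output, each with an
    explanation) as a prompt segment that teaches the boundary between
    correct and incorrect, not just the shape of the correct output."""
    return (
        f"Input: {task}\n"
        f"Good output: {good}\n"
        f"Why this works: {good_why}\n"
        f"Bad output: {bad}\n"
        f"Why this fails: {bad_why}\n"
    )
```

Concatenating two or three such blocks before the real input reproduces the headphone example above at scale.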
Model-Specific Behaviour Differences
Not all models respond equally to few-shot prompting. Research from 2025 comparing major models on identical tasks with varying shot counts found:
- Gemini Flash 2.0: Shows the strongest improvement from few-shot examples (+13.5% with CoT, comparable gains with pure few-shot). Particularly benefits from 3–5 examples even on straightforward tasks.
- Claude Sonnet/Opus: Strong few-shot performance; responds well to well-structured examples with explicit reasoning chains.
- GPT-4o: Good few-shot performance; less sensitive to ordering effects than smaller models.
- Mixtral-8×7B: A notable exception — actually shows decreased performance with few-shot examples compared to zero-shot. If you are working with Mixtral or similar mixture-of-experts models, test zero-shot first before adding examples.
The general principle: verify your few-shot strategy on your specific model before assuming it helps. Most frontier models benefit, but the gain varies by model, task, and example quality.
- Performance improves meaningfully up to ~10 examples, then continues more slowly through the many-shot regime (50–100+) enabled by large context windows
- Diversity-based selection outperforms similarity-based selection — choose examples that represent different sub-types, not just examples similar to the target
- Example ordering alone can shift accuracy by more than 40% — put your most representative example last, immediately before the actual input
- Contrastive pairs (good + bad examples with explanations) teach model boundaries more effectively than positive examples alone
- Mixtral and some MoE models show decreased performance with few-shot examples — test zero-shot first on non-standard model architectures