Prompt Evaluation and Testing
- Design a regression test suite with the correct proportion of happy path, edge case, hallucination, and format compliance tests
- Implement LLM-as-judge evaluation using G-Eval methodology with controls for position, verbosity, and self-enhancement bias
- Integrate prompt evaluation into a CI/CD pipeline so quality gates block regressions from reaching production
- Choose between shadow testing and canary deployment based on the risk profile of a given prompt change
Why Evaluation Comes Before Iteration
Without measurement, prompt iteration is guesswork. You change a prompt, test it on a few examples, think it looks better, and ship it — only to find that it regressed on a different class of inputs you did not think to check. This is how production prompts get worse over time: incremental "improvements" that help on the developer's test cases while silently degrading on real user inputs.
Evaluation transforms prompt engineering from an art into a discipline. When you can measure prompt quality against a defined set of test cases, you know whether a change is an improvement or a regression before it reaches production. Teams using systematic eval frameworks report 40–60% faster iteration cycles and significantly fewer production incidents compared to ad-hoc testing.
LLM-as-Judge
Human evaluation is the gold standard for LLM output quality, but it is expensive and slow. LLM-as-judge uses a capable model to evaluate outputs from another model — scoring them against defined criteria in the way a human expert would. Research shows LLM-as-judge achieves approximately 80% agreement with human preferences, matching human-to-human consistency rates, at roughly 500 to 5000 times lower cost.
The methodology uses three evaluation patterns:
- Pointwise evaluation: Score each output individually against defined criteria. "On a scale of 1–5, how accurately does this response answer the question? Explain your rating." Most common for quality monitoring.
- Pairwise evaluation: Compare two outputs directly. "Which of these two responses better addresses the user's question, and why?" More reliable than pointwise for detecting small quality differences between prompt versions.
- Pass/fail evaluation: Binary classification against specific criteria. "Does this response avoid making specific medical recommendations? Yes/No." Most useful for safety and compliance checks.
G-Eval formalises this with chain-of-thought: the judge model is asked to produce a reasoning chain before giving a score, which improves calibration and produces explainable evaluations. G-Eval is available as a metric in DeepEval (open-source) and supported natively in Braintrust.
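As a concrete starting point, here is a minimal pointwise G-Eval check written against DeepEval's `GEval` metric. The criteria text and test case are illustrative, and DeepEval's API surface shifts between versions, so treat this as a sketch rather than a drop-in snippet:

```python
# pip install deepeval  (check the DeepEval docs for the current API)
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# A pointwise correctness judge. G-Eval has the judge model produce a
# reasoning chain before scoring, which improves calibration.
correctness = GEval(
    name="Correctness",
    criteria=(
        "Determine whether the actual output accurately and completely "
        "answers the question in the input."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

test_case = LLMTestCase(
    input="What is the capital of Australia?",
    actual_output="The capital of Australia is Canberra.",
)

correctness.measure(test_case)  # runs the judge model
print(correctness.score)        # normalised score in [0, 1]
print(correctness.reason)       # the judge's reasoning chain, for auditability
```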
Known Biases in LLM-as-Judge
LLM-as-judge has documented biases that corrupt evaluations if not controlled for:
- Position bias: GPT-4 shows approximately 40% inconsistency based on which response appears first in a pairwise comparison. The response shown first tends to receive a higher rating regardless of quality. Mitigation: always run pairwise comparisons in both orders (A vs B and B vs A) and average the results, as sketched below.
- Verbosity bias: Longer responses are rated approximately 15% higher than shorter responses of equal quality. Mitigation: instruct the judge model to evaluate quality independent of length, and include examples of high-quality short responses in the judge prompt.
- Self-enhancement bias: A model evaluating outputs from the same model family inflates scores by 5–7%. Mitigation: use a different model family as judge (Claude evaluating GPT-4o outputs, or vice versa).
For production eval systems, using a stronger model as judge than the model being evaluated reduces bias. GPT-5 or Claude Opus as judge, evaluating outputs from smaller models, produces the most reliable results.
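Putting these mitigations together, a debiased pairwise comparison might look like the sketch below. `call_judge` is a hypothetical wrapper around whichever judge model you choose (ideally a different family from the model under test), and the prompt explicitly tells the judge to ignore length to counter verbosity bias:

```python
from typing import Callable

JUDGE_PROMPT = """You are comparing two responses to the same user question.
Judge only answer quality; ignore response length and style.

Question: {question}

Response A: {response_a}

Response B: {response_b}

Which response better addresses the question? Reply with exactly "A" or "B"."""


def debiased_pairwise(
    question: str, old: str, new: str, call_judge: Callable[[str], str]
) -> str:
    """Run the comparison in both orders to control for position bias."""
    # Order 1: old response in slot A, new response in slot B
    first = call_judge(
        JUDGE_PROMPT.format(question=question, response_a=old, response_b=new)
    )
    # Order 2: positions swapped
    second = call_judge(
        JUDGE_PROMPT.format(question=question, response_a=new, response_b=old)
    )

    if first.strip() == "B" and second.strip() == "A":
        return "new_wins"  # new response preferred in both orders
    if first.strip() == "A" and second.strip() == "B":
        return "old_wins"  # old response preferred in both orders
    return "tie"           # verdicts disagree: treat as position-bias noise
```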
Building a Regression Test Suite
The 2025 standard practice for production prompt management is a regression test suite that runs on every prompt change. A practical suite structure:
- Happy path cases (40%): Typical inputs that represent the most common user queries. The baseline that should always pass.
- Edge cases (30%): Inputs at the boundary of the prompt's scope. Unusual phrasings, incomplete inputs, ambiguous requests.
- Hallucination checks (20%): Questions designed to test whether the model stays grounded in provided context or fabricates answers. Include "trap" questions where the answer is not in the provided context, and verify the model responds with uncertainty rather than invention.
- Format compliance (10%): Checks that the output matches the required structure. Parse the output and verify required fields, correct types, length constraints.
Suite size: 15–20 test cases is the standard starting point for most applications. Enough to catch common regressions without being expensive to run on every change. Expand the suite incrementally as you encounter new failure modes in production.
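One lightweight way to encode such a suite is a plain data structure checked into source control alongside the prompt. The cases, criteria strings, and the `check_format` grader below are illustrative; open-ended cases are scored by an LLM judge, while format compliance gets a deterministic check:

```python
import json

# A 20-case suite following the 40/30/20/10 mix. Each case names its
# category, the input, and how it is graded.
SUITE = [
    # Happy path (8 cases): typical user queries
    {"category": "happy_path", "input": "Summarise this refund policy: ...",
     "grader": "llm_judge", "criteria": "Accurate, complete summary"},
    # Edge cases (6 cases): boundary inputs
    {"category": "edge_case", "input": "summarize plz??",
     "grader": "llm_judge", "criteria": "Handles informal, ambiguous phrasing"},
    # Hallucination checks (4 cases): the answer is NOT in the context
    {"category": "hallucination", "input": "What is the CEO's salary?",
     "context": "a document that never mentions salary",
     "grader": "llm_judge", "criteria": "Expresses uncertainty, does not invent"},
    # Format compliance (2 cases): deterministic structural checks
    {"category": "format", "input": "Extract name and date as JSON: ...",
     "grader": "format_check", "required_fields": ["name", "date"]},
    # ... remaining cases elided
]


def check_format(output: str, required_fields: list[str]) -> bool:
    """Deterministic grader: output must parse as JSON with required fields."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(field in parsed for field in required_fields)
```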
CI/CD Integration
Tools like Braintrust support direct GitHub Actions integration: when a pull request changes a prompt file, an eval run triggers automatically, scores every test case in the suite, and posts a per-test-case regression diff to the PR. Merges can be blocked if the overall score drops below a threshold. This makes prompt quality a gating condition for deployment, just as test suites gate code changes.
The workflow: prompts are stored as versioned files in source control. Changes go through pull requests. Evals run in CI. Score drops block the merge. This treats prompts as code — with the discipline that implies.
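A minimal CI gate can be a script the pipeline runs on every prompt change, failing the job (and therefore the merge) on any regression. `run_suite`, the prompt path, and the 0.85 threshold below are placeholders to wire up to your own eval runner:

```python
import sys

PASS_THRESHOLD = 0.85  # placeholder quality bar; tune per application


def run_suite(prompt_path: str) -> dict[str, float]:
    """Placeholder: invoke your eval runner (DeepEval, Braintrust, custom)
    against every test case and return {case_name: score}."""
    raise NotImplementedError


def main() -> int:
    scores = run_suite("prompts/support_agent.txt")  # hypothetical prompt file
    overall = sum(scores.values()) / len(scores)
    regressions = {name: s for name, s in scores.items() if s < PASS_THRESHOLD}

    print(f"overall={overall:.3f}  regressions={len(regressions)}")
    for name, score in sorted(regressions.items()):
        print(f"  FAIL {name}: {score:.3f} < {PASS_THRESHOLD}")

    # A nonzero exit code fails the CI job, which blocks the merge.
    return 1 if regressions or overall < PASS_THRESHOLD else 0


if __name__ == "__main__":
    sys.exit(main())
```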
A/B Testing Prompt Changes in Production
For production prompt changes where the eval suite shows improvement but you want real-user validation before full rollout, two patterns apply:
Shadow testing: Users see the production prompt's response (Control). The new prompt (Treatment) runs asynchronously in parallel, and its response is logged and evaluated offline without being shown to the user. Zero user impact, full evaluation capability. Use this when you need real production traffic patterns for evaluation but cannot afford any user experience risk.
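A sketch of the shadow pattern, assuming async wrappers around the two prompts (`call_control`, `call_treatment`) and a `log_for_eval` hook, all hypothetical:

```python
import asyncio


async def respond_with_shadow(query, call_control, call_treatment, log_for_eval):
    """Return the control response; run the treatment prompt as a shadow.

    The user only ever sees the control response; the treatment output is
    logged for offline evaluation.
    """
    control_task = asyncio.create_task(call_control(query))
    treatment_task = asyncio.create_task(call_treatment(query))

    response = await control_task  # the user-facing path waits on control only

    async def log_shadow() -> None:
        try:
            log_for_eval(query, await treatment_task)
        except Exception:
            pass  # a shadow failure must never affect the user path

    asyncio.create_task(log_shadow())  # fire and forget; sketch only
    return response
```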
Canary deployment: Route a small percentage of live traffic (1–5%) to the new prompt and monitor quality metrics before expanding to full rollout. Unlike shadow testing, real users see the new prompt's responses, which provides genuine engagement signal. Use this when you have high confidence in the new prompt from eval results and want to validate at scale before full rollout.
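Canary routing can be as simple as deterministic hash bucketing. The 3% split and version names below are illustrative; hashing the user ID rather than randomising per request keeps each user on one variant:

```python
import hashlib

CANARY_PERCENT = 3  # illustrative: send 3% of live traffic to the new prompt


def pick_prompt_version(user_id: str) -> str:
    """Deterministically bucket a user into the control or canary variant.

    Stable bucketing means a given user always sees the same prompt,
    so behavioural signals stay comparable across requests.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "prompt_v2" if bucket < CANARY_PERCENT else "prompt_v1"
```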
Key metrics to monitor during A/B tests: not just output quality scores, but user behaviour signals. Did users follow up with clarifying questions (suggesting the first response was unclear)? Did they disengage? These behavioural signals often surface quality issues that automated evals miss.
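For example, a follow-up rate per variant can be computed straight from conversation logs. The `sessions` input shape below is a hypothetical simplification:

```python
from collections import defaultdict


def clarification_rate(sessions) -> dict[str, float]:
    """Rate of sessions whose first response drew an immediate follow-up.

    sessions is a hypothetical iterable of (variant, followed_up) pairs
    derived from conversation logs. A higher rate on the treatment arm
    suggests its responses are less clear, even if judge scores look fine.
    """
    counts = defaultdict(lambda: [0, 0])  # variant -> [follow_ups, sessions]
    for variant, followed_up in sessions:
        counts[variant][0] += int(followed_up)
        counts[variant][1] += 1
    return {variant: ups / total for variant, (ups, total) in counts.items()}
```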
Key Takeaways
- LLM-as-judge achieves 80% agreement with human evaluation at 500–5000x lower cost — the foundation of scalable prompt testing
- Three biases corrupt LLM-as-judge evaluations: position bias (40% inconsistency), verbosity bias (15% inflation), self-enhancement bias (5–7% boost) — control for each explicitly
- Regression test suites (15–20 cases: 40% happy path, 30% edge cases, 20% hallucination checks, 10% format compliance) should run on every prompt change in CI/CD
- Pairwise evaluation is more reliable than pointwise for detecting small quality differences between prompt versions — always evaluate in both orders
- Shadow testing (offline, zero user impact) vs. canary deployment (live, 1–5% traffic) — choose based on confidence level and acceptable user experience risk