
LLM Benchmark Design for Your Use Case

Design a custom benchmark to objectively compare AI models on your specific tasks before committing to one.

The Prompt

# LLM Benchmark Design for Your Use Case

You are an AI evaluation engineer. Design a custom benchmark that objectively tests and compares multiple large language models on my specific use case, so that model selection rests on data rather than on generic leaderboards that may not reflect my real-world needs.

## Use Case

**Task I need an LLM for:** [TASK_DESCRIPTION]
**Volume:** approximately [VOLUME] tasks per [PERIOD]
**Quality requirements:** [QUALITY_STANDARD]
**Budget constraint:** maximum [COST_LIMIT] per [UNIT]
**Models to compare:** [MODEL_LIST]

## Benchmark Design

### Test Set Construction
Build a test set of [TEST_SIZE] representative tasks. The test set must include:
- [PCT_EASY]% straightforward cases (typical inputs)
- [PCT_MEDIUM]% moderately complex cases
- [PCT_HARD]% edge cases and adversarial inputs
- [PCT_REAL]% real examples from my actual use (anonymized)

### Evaluation Metrics

Define [METRIC_COUNT] measurable metrics for this task:

| Metric | How Measured | Weight | Acceptable Threshold |
|--------|-------------|--------|---------------------|
| [METRIC_1] | [MEASUREMENT_METHOD] | [WEIGHT]% | [THRESHOLD] |
| [METRIC_2] | [MEASUREMENT_METHOD] | [WEIGHT]% | [THRESHOLD] |
| [METRIC_3] | [MEASUREMENT_METHOD] | [WEIGHT]% | [THRESHOLD] |

### Scoring Protocol
- Human evaluation for: [SUBJECTIVE_METRICS]
- Automated evaluation for: [AUTOMATED_METRICS]
- Blind scoring: evaluators should not know which model produced which output

## Results Template

Present results as:
| Model | [METRIC_1] | [METRIC_2] | [METRIC_3] | Cost/Task | Overall Score |
|-------|-----------|-----------|-----------|----------|--------------|

## Decision Rule

Select the model that: [DECISION_RULE — e.g., maximizes overall score while staying under cost limit, scores highest on METRIC_1 above all else].

📝 Fill in the blanks

Replace these placeholders with your own content:

[TASK_DESCRIPTION]
[VOLUME]
[PERIOD]
[QUALITY_STANDARD]
[COST_LIMIT]
[UNIT]
[MODEL_LIST]
[TEST_SIZE]
[PCT_EASY]
[PCT_MEDIUM]
[PCT_HARD]
[PCT_REAL]
[METRIC_COUNT]
[METRIC_1]
[MEASUREMENT_METHOD]
[WEIGHT]
[THRESHOLD]
[METRIC_2]
[METRIC_3]
[SUBJECTIVE_METRICS]
[AUTOMATED_METRICS]
[DECISION_RULE — e.g., maximizes overall score while staying under cost limit, scores highest on METRIC_1 above all else]
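Filling the placeholders by hand works, but the substitution can also be scripted. The following is a minimal Python sketch, not part of the original page: the function name `fill_prompt` and the sample values are hypothetical. It replaces each `[NAME]` token with a supplied value, ignores any hint text that follows the name inside the brackets, and reports placeholders that were left unfilled.

```python
import re

# Matches [NAME] or [NAME plus hint text]; only the leading upper-case NAME is the key.
PLACEHOLDER = re.compile(r"\[([A-Z][A-Z0-9_]*)\b[^\]]*\]")


def fill_prompt(template: str, values: dict[str, str]) -> tuple[str, list[str]]:
    """Return the filled prompt and the sorted names of placeholders left unfilled."""
    missing: set[str] = set()

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name in values:
            return values[name]
        missing.add(name)
        return match.group(0)  # leave unknown placeholders untouched

    return PLACEHOLDER.sub(substitute, template), sorted(missing)


# Hypothetical two-sentence excerpt standing in for the full prompt text.
example = (
    "Build a test set of [TEST_SIZE] tasks. "
    "Select the model that: [DECISION_RULE, e.g. maximizes overall score]."
)
filled, missing = fill_prompt(example, {"TEST_SIZE": "200"})
print(filled)
print("Still to fill:", missing)
```

One caveat with this approach: `[MEASUREMENT_METHOD]`, `[WEIGHT]`, and `[THRESHOLD]` each appear once per metric row, so a name-based substitution gives all three rows the same value. Those rows are probably easier to edit by hand.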

How to use this prompt

1. Copy the prompt

Copy the full prompt text above to your clipboard.

2. Replace the placeholders

Swap out anything in [BRACKETS] with your specific details.

3. Paste into GPT-4o

Open GPT-4o, or your preferred AI assistant, and paste the prompt to get started.
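Once the benchmark has been run, the Overall Score column and the example decision rule from the prompt (maximize overall score while staying under the cost limit) can be computed along these lines. This is a sketch under stated assumptions, not a prescribed method: it assumes every metric is scored on a 0 to 1 scale where higher is better, that weights are percentages, and that a model falling below any metric's threshold is excluded. The model names, metric names, and numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metric:
    name: str
    weight: float     # percent; normalised below, so weights need not sum to exactly 100
    threshold: float  # minimum acceptable score on the assumed 0-1 scale


def overall_score(scores: dict[str, float], metrics: list[Metric]) -> float:
    """Weighted average of per-metric scores."""
    total_weight = sum(m.weight for m in metrics)
    return sum(scores[m.name] * m.weight for m in metrics) / total_weight


def select_model(results: dict[str, dict], metrics: list[Metric],
                 cost_limit: float) -> Optional[str]:
    """Pick the highest-scoring model that meets every threshold and the cost limit."""
    eligible: dict[str, float] = {}
    for model, result in results.items():
        if result["cost_per_task"] > cost_limit:
            continue  # over budget
        if any(result["scores"][m.name] < m.threshold for m in metrics):
            continue  # fails at least one acceptable threshold
        eligible[model] = overall_score(result["scores"], metrics)
    if not eligible:
        return None  # no model satisfies the constraints
    return max(eligible, key=eligible.get)


# Hypothetical metrics and results, for illustration only.
metrics = [Metric("accuracy", 60, 0.80), Metric("format", 40, 0.90)]
results = {
    "model_a": {"scores": {"accuracy": 0.92, "format": 0.95}, "cost_per_task": 0.030},
    "model_b": {"scores": {"accuracy": 0.88, "format": 0.99}, "cost_per_task": 0.010},
    "model_c": {"scores": {"accuracy": 0.95, "format": 0.85}, "cost_per_task": 0.005},
}
best = select_model(results, metrics, cost_limit=0.02)
print(best)
```

Treating thresholds and the cost limit as hard filters before ranking keeps a model with a high weighted average from winning while still failing a single requirement. If a different decision rule was chosen, such as ranking on one metric above all else, only `select_model` needs to change.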