GPT-4o ⚙️ Technical Advanced

LLM Benchmark Design for Your Use Case

Design a custom benchmark to objectively compare AI models on your specific tasks before committing to one.


The Prompt

# LLM Benchmark Design for Your Use Case

You are an AI evaluation engineer. Design a custom benchmark that objectively tests and compares multiple large language models on my specific use case, so that model selection is driven by data rather than by generic leaderboards that may not reflect my real-world needs.

## Use Case

**Task I need an LLM for:** [TASK_DESCRIPTION]
**Volume:** approximately [VOLUME] tasks per [PERIOD]
**Quality requirements:** [QUALITY_STANDARD]
**Budget constraint:** maximum [COST_LIMIT] per [UNIT]
**Models to compare:** [MODEL_LIST]

## Benchmark Design

### Test Set Construction
Build a test set of [TEST_SIZE] representative tasks. Split it by difficulty (the three shares should sum to 100%):
- [PCT_EASY]% straightforward cases (typical inputs)
- [PCT_MEDIUM]% moderately complex cases
- [PCT_HARD]% edge cases and adversarial inputs

Across all difficulty levels, at least [PCT_REAL]% of the tasks should be real examples from my actual use (anonymized).
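
The composition above can be sketched in code. This is a minimal illustration, not a prescribed method: it assumes the three difficulty shares sum to 100 and that candidate tasks have already been sorted into per-difficulty pools. The function name `build_test_set` and the pool layout are hypothetical, and the share of real examples is left for you to track separately.

```python
import random


def build_test_set(pools, shares, test_size, seed=0):
    """Sample a test set from per-difficulty pools by percentage share.

    pools:  dict mapping a difficulty label to a list of candidate tasks
    shares: dict mapping the same labels to percentages that sum to 100
    """
    if sum(shares.values()) != 100:
        raise ValueError("difficulty shares must sum to 100")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    test_set = []
    for label, pct in shares.items():
        # Rounding can make the total drift from test_size by a task or two
        count = round(test_size * pct / 100)
        if count > len(pools[label]):
            raise ValueError(f"not enough {label} tasks: need {count}")
        for task in rng.sample(pools[label], count):
            test_set.append({"difficulty": label, "task": task})
    rng.shuffle(test_set)  # avoid presenting tasks grouped by difficulty
    return test_set
```

Keeping the seed fixed means every model is run against the same sampled tasks, which is what makes the later comparison fair.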

### Evaluation Metrics

Define [METRIC_COUNT] measurable metrics for this task, with weights that sum to 100%:

| Metric | How Measured | Weight | Acceptable Threshold |
|--------|-------------|--------|---------------------|
| [METRIC_1] | [MEASUREMENT_METHOD] | [WEIGHT]% | [THRESHOLD] |
| [METRIC_2] | [MEASUREMENT_METHOD] | [WEIGHT]% | [THRESHOLD] |
| [METRIC_3] | [MEASUREMENT_METHOD] | [WEIGHT]% | [THRESHOLD] |
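
One way to turn the table into a single number is a weighted average with a separate threshold check. The sketch below assumes every metric is normalized to a 0 to 1 scale where higher is better and that weights are percentages summing to 100. The function name `overall_score` and the metric names in the example are hypothetical.

```python
def overall_score(scores, weights, thresholds):
    """Return (weighted score, metrics that fall below their threshold).

    scores:     metric name -> value on a 0..1 scale (higher is better)
    weights:    metric name -> percentage weight; weights must sum to 100
    thresholds: metric name -> minimum acceptable value on the same scale
    """
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    total = sum(scores[m] * weights[m] for m in weights) / 100
    failed = [m for m in weights if scores[m] < thresholds[m]]
    return total, failed


# Illustrative values only; these are not measured results.
example = overall_score(
    scores={"accuracy": 0.9, "format": 0.8, "tone": 0.5},
    weights={"accuracy": 50, "format": 30, "tone": 20},
    thresholds={"accuracy": 0.85, "format": 0.7, "tone": 0.6},
)
```

Reporting the failed thresholds alongside the score keeps a strong average from hiding a metric that is below the acceptable level.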

### Scoring Protocol
- Human evaluation for: [SUBJECTIVE_METRICS]
- Automated evaluation for: [AUTOMATED_METRICS]
- Blind scoring: evaluators should not know which model produced which output
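
Blind scoring can be set up by shuffling outputs and replacing model names with opaque IDs before evaluators see them. The following is a minimal sketch under assumed data shapes: the `model`, `task_id`, and `text` field names and the function `blind_outputs` are hypothetical.

```python
import random


def blind_outputs(outputs, seed=0):
    """Strip model identity from outputs before human evaluation.

    outputs: list of dicts with 'model', 'task_id', and 'text' keys
    Returns (blinded items for evaluators, key mapping blind ID -> model).
    """
    rng = random.Random(seed)
    shuffled = list(outputs)
    rng.shuffle(shuffled)  # ordering must not reveal which model came first
    blinded, key = [], {}
    for i, item in enumerate(shuffled):
        blind_id = f"sample-{i:04d}"
        blinded.append(
            {"blind_id": blind_id, "task_id": item["task_id"], "text": item["text"]}
        )
        key[blind_id] = item["model"]  # keep this mapping away from evaluators
    return blinded, key
```

Evaluators receive only the blinded list; the key is used afterwards to join their scores back to models.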

## Results Template

Present results as:
| Model | [METRIC_1] | [METRIC_2] | [METRIC_3] | Cost/Task | Overall Score |
|-------|-----------|-----------|-----------|----------|--------------|

## Decision Rule

Select the model that: [DECISION_RULE — e.g., maximizes overall score while staying under cost limit, scores highest on METRIC_1 above all else].
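
As an illustration of the first example rule (maximize overall score while staying under the cost limit), a selection step might look like the sketch below. The result fields, the function name `select_model`, and the tie-break on lower cost are assumptions made for this example, not part of the prompt.

```python
def select_model(results, cost_limit):
    """Pick the highest-scoring model that fits the budget.

    results: list of dicts with 'model', 'overall', 'cost_per_task',
             and 'passed_thresholds' (bool) keys
    Returns the chosen model name, or None if no model qualifies.
    """
    eligible = [
        r for r in results
        if r["cost_per_task"] <= cost_limit and r["passed_thresholds"]
    ]
    if not eligible:
        return None
    # Highest overall score wins; on a tie, prefer the cheaper model
    best = max(eligible, key=lambda r: (r["overall"], -r["cost_per_task"]))
    return best["model"]
```

Returning `None` when nothing qualifies makes it explicit that the budget or the thresholds need revisiting, rather than silently picking a model that violates one of them.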

📝 Fill in the blanks

Replace these placeholders with your own content:

[TASK_DESCRIPTION]

[VOLUME]

[PERIOD]

[QUALITY_STANDARD]

[COST_LIMIT]

[UNIT]

[MODEL_LIST]

[TEST_SIZE]

[PCT_EASY]

[PCT_MEDIUM]

[PCT_HARD]

[PCT_REAL]

[METRIC_COUNT]

[METRIC_1]

[MEASUREMENT_METHOD]

[WEIGHT]

[THRESHOLD]

[METRIC_2]

[METRIC_3]

[SUBJECTIVE_METRICS]

[AUTOMATED_METRICS]

[DECISION_RULE — e.g., maximizes overall score while staying under cost limit, scores highest on METRIC_1 above all else]

How to use this prompt

Copy the prompt

Click "Copy Prompt" above to copy the full prompt text to your clipboard.

Replace the placeholders

Replace anything in [BRACKETS] with your specific details.

Paste into GPT-4o

Open GPT-4o (or your preferred AI assistant) and paste the prompt to get started.

Model GPT-4o

Category ⚙️ Technical

Difficulty Advanced


Added May 27, 2026
