GPT-4o
⚙️ Technical
Advanced
LLM Benchmark Design for Your Use Case
Design a custom benchmark to objectively compare AI models on your specific tasks before committing to one.
The Prompt
# LLM Benchmark Design for Your Use Case

You are an AI evaluation engineer. Design a custom benchmark that objectively tests and compares multiple large language models on your specific use case, producing data-driven model selection instead of relying on generic leaderboards that may not reflect your real-world needs.

## Use Case

**Task I need an LLM for:** [TASK_DESCRIPTION]
**Volume:** approximately [VOLUME] tasks per [PERIOD]
**Quality requirements:** [QUALITY_STANDARD]
**Budget constraint:** maximum [COST_LIMIT] per [UNIT]
**Models to compare:** [MODEL_LIST]

## Benchmark Design

### Test Set Construction

Build a test set of [TEST_SIZE] representative tasks. The test set must include:

- [PCT_EASY]% straightforward cases (typical inputs)
- [PCT_MEDIUM]% moderately complex cases
- [PCT_HARD]% edge cases and adversarial inputs
- [PCT_REAL]% real examples from your actual use (anonymized)

### Evaluation Metrics

Define [METRIC_COUNT] measurable metrics for this task:

| Metric | How Measured | Weight | Acceptable Threshold |
|--------|-------------|--------|---------------------|
| [METRIC_1] | [MEASUREMENT_METHOD] | [WEIGHT]% | [THRESHOLD] |
| [METRIC_2] | [MEASUREMENT_METHOD] | [WEIGHT]% | [THRESHOLD] |
| [METRIC_3] | [MEASUREMENT_METHOD] | [WEIGHT]% | [THRESHOLD] |

### Scoring Protocol

- Human evaluation for: [SUBJECTIVE_METRICS]
- Automated evaluation for: [AUTOMATED_METRICS]
- Blind scoring: evaluators should not know which model produced which output

## Results Template

Present results as:

| Model | [METRIC_1] | [METRIC_2] | [METRIC_3] | Cost/Task | Overall Score |
|-------|-----------|-----------|-----------|----------|--------------|

## Decision Rule

Select the model that: [DECISION_RULE — e.g., maximizes overall score while staying under cost limit, scores highest on METRIC_1 above all else].
📝 Fill in the blanks
Replace these placeholders with your own content:
[TASK_DESCRIPTION]
[VOLUME]
[PERIOD]
[QUALITY_STANDARD]
[COST_LIMIT]
[UNIT]
[MODEL_LIST]
[TEST_SIZE]
[PCT_EASY]
[PCT_MEDIUM]
[PCT_HARD]
[PCT_REAL]
[METRIC_COUNT]
[METRIC_1]
[MEASUREMENT_METHOD]
[WEIGHT]
[THRESHOLD]
[METRIC_2]
[METRIC_3]
[SUBJECTIVE_METRICS]
[AUTOMATED_METRICS]
[DECISION_RULE — e.g., maximizes overall score while staying under cost limit, scores highest on METRIC_1 above all else]
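The four [PCT_*] placeholders have to produce whole numbers of test cases that add up to [TEST_SIZE], which is easy to get wrong by hand. The sketch below is one way to check the split before pasting the prompt. It assumes the four percentages are non-overlapping buckets that sum to 100; the prompt does not say whether [PCT_REAL] overlaps the difficulty tiers, so adjust if it does. The function name `bucket_counts` and the bucket labels are hypothetical, chosen only for illustration.

```python
def bucket_counts(test_size: int, percentages: dict[str, float]) -> dict[str, int]:
    """Split test_size across buckets using largest-remainder rounding.

    Assumption: the percentages describe disjoint buckets and sum to 100.
    """
    total = sum(percentages.values())
    if abs(total - 100) > 1e-9:
        raise ValueError(f"percentages sum to {total}, expected 100")

    # Exact (possibly fractional) share of the test set for each bucket.
    exact = {name: test_size * pct / 100 for name, pct in percentages.items()}
    # Round every bucket down first.
    counts = {name: int(value) for name, value in exact.items()}

    # Hand the leftover items to the buckets with the largest fractional parts.
    leftover = test_size - sum(counts.values())
    by_fraction = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_fraction[:leftover]:
        counts[name] += 1
    return counts


# Illustrative values only; substitute your own [TEST_SIZE] and [PCT_*] numbers.
split = bucket_counts(50, {"easy": 40, "medium": 30, "hard": 20, "real": 10})
print(split)  # {'easy': 20, 'medium': 15, 'hard': 10, 'real': 5}
```

Largest-remainder rounding is used here so the bucket sizes always add up to exactly the requested test size, which plain rounding does not guarantee.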
How to use this prompt
1. Copy the prompt
Click "Copy Prompt" above to copy the full prompt text to your clipboard.
2. Replace the placeholders
Swap out anything in [BRACKETS] with your specific details.
3. Paste into GPT-4o
Open GPT-4o (or your preferred AI assistant) and paste the prompt to get started.
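Once the benchmark has been run and the results table is filled in, the Overall Score column and the decision rule reduce to a small calculation. The following is a minimal sketch of the example rule from the prompt ("maximizes overall score while staying under cost limit"), assuming every metric is scored on a 0-100 scale and the weights sum to 100. The names `ModelResult`, `overall_score`, and `select_model` are hypothetical, and treating each Acceptable Threshold as a hard minimum is an assumption, not something the prompt requires.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelResult:
    name: str
    scores: dict[str, float]  # metric name -> score, assumed to be on a 0-100 scale
    cost_per_task: float


def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of metric scores; weights are percentages summing to 100."""
    if abs(sum(weights.values()) - 100) > 1e-9:
        raise ValueError("weights must sum to 100")
    return sum(scores[metric] * weight for metric, weight in weights.items()) / 100


def select_model(
    results: list[ModelResult],
    weights: dict[str, float],
    thresholds: dict[str, float],
    cost_limit: float,
) -> Optional[ModelResult]:
    """Return the highest-scoring model that meets every threshold and the cost limit."""
    eligible = [
        r for r in results
        if r.cost_per_task <= cost_limit
        and all(r.scores[metric] >= minimum for metric, minimum in thresholds.items())
    ]
    if not eligible:
        return None  # no model satisfies the constraints
    return max(eligible, key=lambda r: overall_score(r.scores, weights))
```

Call `select_model` with one `ModelResult` per row of your results table. A return value of `None` means no model cleared both the thresholds and the budget, which is a signal to revisit the constraints rather than to pick the least bad option.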