GPT-4o
⚙️ Technical
Advanced
AI Agent Evaluation Framework
Design a rigorous framework for evaluating AI agent performance, reliability, and safety before production deployment.
The Prompt
# AI Agent Evaluation Framework

You are an AI systems reliability engineer. Design a comprehensive evaluation framework that rigorously tests an AI agent's performance, reliability, and safety before it goes live in production.

## Agent Under Evaluation

**Agent name / purpose:** [AGENT_NAME] — [AGENT_DESCRIPTION]
**Deployment environment:** [ENVIRONMENT — e.g., customer-facing chatbot, internal automation, autonomous data processor]
**Risk level:** [RISK — Low / Medium / High / Critical]
**Stakeholders who must sign off:** [APPROVERS]

## Evaluation Dimensions

### 1. Task Performance

Define [TEST_COUNT] representative tasks the agent must complete correctly.

- For each task: provide input, specify correct output, set pass/fail criteria
- Required pass rate: [PASS_RATE]% to proceed to next evaluation phase

### 2. Edge Case Handling

Test the agent against [EDGE_CASE_COUNT] adversarial or unusual inputs:

- Empty or null inputs
- Inputs in unexpected languages or formats
- Extremely long or extremely short inputs
- Inputs designed to elicit [UNSAFE_BEHAVIORS]

Document expected behavior for each edge case.

### 3. Reliability Under Load

Run [CONCURRENT_REQUEST_COUNT] simultaneous requests and measure:

- Response time P50, P95, P99
- Error rate
- Output consistency (does the agent give the same answer to the same question?)
- Acceptable thresholds: [THRESHOLDS]

### 4. Safety and Alignment

For agents with [RISK] risk level, verify:

- Agent refuses to perform [FORBIDDEN_ACTION_LIST]
- Agent escalates to human when confidence < [CONFIDENCE_THRESHOLD]
- Agent never takes irreversible actions without [CONFIRMATION_MECHANISM]

## Go / No-Go Decision Matrix

Define the criteria that must ALL be met before production deployment:

[GO_CRITERIA_LIST]
📝 Fill in the blanks
Replace these placeholders with your own content:
[AGENT_NAME]
[AGENT_DESCRIPTION]
[ENVIRONMENT — e.g., customer-facing chatbot, internal automation, autonomous data processor]
[RISK — Low / Medium / High / Critical]
[APPROVERS]
[TEST_COUNT]
[PASS_RATE]
[EDGE_CASE_COUNT]
[UNSAFE_BEHAVIORS]
[CONCURRENT_REQUEST_COUNT]
[THRESHOLDS]
[RISK]
[FORBIDDEN_ACTION_LIST]
[CONFIDENCE_THRESHOLD]
[CONFIRMATION_MECHANISM]
[GO_CRITERIA_LIST]
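To make placeholders such as [PASS_RATE], [THRESHOLDS], and [GO_CRITERIA_LIST] more concrete, here is a minimal Python sketch of an all-criteria-must-pass gate over a task pass rate, P95 latency, and error rate. It is an illustration only: the `Thresholds` class, its field names and default values, and the `percentile` and `go_no_go` helpers are all assumptions made for this example, not something the prompt prescribes or that GPT-4o will necessarily produce.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Thresholds:
    # Hypothetical stand-ins for [PASS_RATE] and [THRESHOLDS]; choose your own values.
    min_pass_rate: float = 0.90
    max_p95_ms: float = 2000.0
    max_error_rate: float = 0.01


def percentile(samples, pct):
    """Return the pct-th percentile (1 to 99) of samples; needs at least two samples."""
    cuts = quantiles(samples, n=100, method="inclusive")
    return cuts[pct - 1]


def go_no_go(task_results, latencies_ms, error_count, total_requests, thresholds):
    """Return (decision, per-criterion results). The decision is True only if ALL criteria hold."""
    pass_rate = sum(task_results) / len(task_results)
    error_rate = error_count / total_requests
    checks = {
        "pass_rate": pass_rate >= thresholds.min_pass_rate,
        "p95_latency": percentile(latencies_ms, 95) <= thresholds.max_p95_ms,
        "error_rate": error_rate <= thresholds.max_error_rate,
    }
    return all(checks.values()), checks
```

Calling `go_no_go([True] * 19 + [False], latencies, 0, 100, Thresholds())` returns the overall decision together with a dictionary showing which criterion, if any, blocked the release. Safety criteria such as refusals and escalation would need their own checks and are not modeled here.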
How to use this prompt
1. Copy the prompt. Click "Copy Prompt" above to copy the full prompt text to your clipboard.
2. Replace the placeholders. Swap out anything in [BRACKETS] with your specific details.
3. Paste into GPT-4o. Open your preferred AI assistant and paste the prompt to get started.
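If you reuse the prompt often, the placeholder replacement step can be scripted. The sketch below is one possible approach, assuming placeholders always start with an uppercase name inside square brackets, optionally followed by a hint (as in the [RISK] and [ENVIRONMENT] entries). The names `PLACEHOLDER`, `fill_prompt`, and `unfilled` are hypothetical helpers invented for this example.

```python
import re

# Matches bracketed placeholders such as [PASS_RATE], including ones that
# carry a hint after the name (the hint is everything after the first space).
PLACEHOLDER = re.compile(r"\[([A-Z_]+)(?:\s[^\]]*)?\]")


def fill_prompt(template, values):
    """Replace each placeholder with values[NAME]; leave unknown ones untouched."""
    def substitute(match):
        name = match.group(1)
        return str(values.get(name, match.group(0)))
    return PLACEHOLDER.sub(substitute, template)


def unfilled(text):
    """Return the sorted names of placeholders still present in text."""
    return sorted({m.group(1) for m in PLACEHOLDER.finditer(text)})
```

Running `unfilled(fill_prompt(template, values))` before pasting is a quick way to catch a placeholder you forgot, including the second occurrence of [RISK] in the safety section.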