Testing AI agents is fundamentally different from testing traditional software. Agents produce non-deterministic outputs, interact with tools in unpredictable ways, and may give correct answers in unexpected formats. This chapter covers Anthropic’s 6 evaluation best practices for building reliable agent test suites.
- Purpose: Build robust evaluation systems for AI agents and LLM-powered features
- Source: Anthropic “Demystifying Evals for AI Agents” + evaluator_optimizer.ipynb
- Difficulty: Intermediate
- Time: 2-4 hours for initial implementation
Traditional test: assert output === expected – binary, deterministic.
Agent test: The agent might query a database, call 3 tools, format the answer in Hebrew, and return the right number wrapped in a sentence. A strict string match fails despite a correct answer.
Key insight: Evaluate OUTCOMES, not exact outputs.
Not all test queries are equal. Separate them into two categories:
Regression queries: queries that must always pass. These are well-understood, stable capabilities. A failure here is a bug.
const REGRESSION = [
"כמה עובדים?", // Employee count -- always works
"שלום", // Greeting -- always works
"מה שעות העבודה?", // Work hours -- always works
];
Capability queries: queries that may fail sometimes. These test emerging or complex capabilities. A failure here is informational, not a blocker.
const CAPABILITY = [
"פרט את העלויות לפי מחלקה", // Department breakdown -- complex
"השווה הכנסות בין סניפים", // Branch comparison -- complex
"למה ירדה הרווחיות?", // Why analysis -- very complex
];
Without tagging, you treat all failures equally: a greeting failure (critical bug) gets the same priority as a complex comparison failure (known limitation). Tagging lets you block releases on regression failures while logging capability failures as informational signals, as in the sketch below.
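A minimal sketch of how the tags could drive a CI gate; the runQuery helper and its { pass } result shape are assumptions, not part of the source:

async function runTaggedSuite(runQuery) {
  const suites = [
    { name: "regression", queries: REGRESSION, blocking: true },
    { name: "capability", queries: CAPABILITY, blocking: false },
  ];
  let releaseBlocked = false;
  for (const suite of suites) {
    for (const query of suite.queries) {
      const result = await runQuery(query); // assumed to return { pass: boolean }
      if (result.pass) continue;
      if (suite.blocking) {
        releaseBlocked = true; // regression failure = bug, block the release
      } else {
        console.warn(`Capability gap (non-blocking): ${query}`); // informational only
      }
    }
  }
  return { releaseBlocked };
}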
Instead of matching exact strings, verify the outcome is correct.
Query: "כמה עובדים?" (How many employees?)
Expected: "129 עובדים" (129 employees)
Actual: "יש 129 עובדים פעילים" (There are 129 active employees)
Result: FAIL (string mismatch) ← Wrong! The answer is correct.
Extract the key value and validate it against the real data source:
async function validateEmployeeCount(response, pool) {
// Get ground truth from database
const { rows } = await pool.query("SELECT COUNT(*) FROM employees");
const expected = parseInt(rows[0].count);
// Extract number from response (any format)
const actual = extractNumber(response);
// Outcome check: is the number correct?
return { pass: Math.abs(actual - expected) <= 1, expected, actual };
}
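The extractNumber helper referenced above is not defined in the source; a plausible regex-based sketch:

function extractNumber(text) {
  // Strip thousands separators, then take the first number regardless of sentence format
  const match = String(text).replace(/,/g, "").match(/-?\d+(\.\d+)?/);
  return match ? parseFloat(match[0]) : NaN; // NaN fails the outcome check above
}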
| Strategy | When to Use | Example |
|---|---|---|
| Exact match | Structured outputs (JSON, SQL) | API response format validation |
| Number extraction | Numerical answers | Employee count, revenue totals |
| Keyword presence | Qualitative answers | “Contains Hebrew greeting” |
| Database-backed | Data accuracy | Compare response number to DB count |
| LLM-as-judge (see #4) | Complex quality assessment | “Is this a natural Hebrew response?” |
A query that passes 4 out of 5 times is unreliable for users. Measure consistency, not just capability.
With a 97% per-trial pass rate:
pass@5 = 1 - (0.03)^5 ≈ 100% (almost certain to work at least once in 5 tries)
pass^5 = (0.97)^5 ≈ 85.9% (only an ~86% chance that ALL 5 tries work)
A 97% pass rate sounds great, but across 5 attempts users face a ~14% chance of hitting at least one failure.
async function runConsistency(queryFn, query, trials = 5) {
const results = [];
for (let i = 0; i < trials; i++) {
results.push(await queryFn(query));
}
const passes = results.filter((r) => r.pass).length;
return {
pass_at_k: passes > 0, // Did it work at least once?
pass_k: passes === trials, // Did it work EVERY time?
consistency_rate: passes / trials,
flaky: passes > 0 && passes < trials, // Sometimes yes, sometimes no
};
}
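A usage sketch: run the check over the regression set and surface flaky queries (queryFn is whatever calls your agent and returns { pass }):

async function findFlakyQueries(queryFn, queries = REGRESSION) {
  const report = [];
  for (const query of queries) {
    report.push({ query, ...(await runConsistency(queryFn, query)) });
  }
  // Flaky queries pass sometimes but not always -- the worst experience for users
  return report.filter((r) => r.flaky);
}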
For nuanced quality assessment beyond pass/fail, use an LLM to grade responses on a scale.
| Criterion | 1 (Poor) | 5 (Excellent) |
|---|---|---|
| Factual Accuracy | Wrong data | Verified correct |
| Language Quality | Garbled/unnatural | Natural, grammatically correct |
| Completeness | Misses key points | Fully answers the question |
| Confidence | Over/under-confident | Appropriate certainty |
Grading every response with an LLM is expensive. Optimize:
const GRADE_STRATEGY = {
failures: true, // Always grade failures (understand why)
sample_rate: 0.1, // Grade 10% of passes (spot check)
regression_only: false, // Grade all categories
};
With a failure rate around 10%, that works out to grading all failures plus 10% of the passes, or roughly 20% of total queries graded.
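A small gate that applies the strategy per result could look like this (the result shape with pass and category fields is an assumption):

function shouldGrade(result, strategy = GRADE_STRATEGY) {
  if (!result.pass && strategy.failures) return true; // always grade failures
  if (strategy.regression_only && result.category !== "regression") return false;
  return Math.random() < strategy.sample_rate; // spot-check a sample of passing responses
}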
Grade this AI response on 4 criteria (1-5 scale):
Query: {query}
Response: {response}
Expected outcome: {expected}
Criteria:
1. Factual Accuracy: Is the data correct?
2. Language Quality: Is the Hebrew natural?
3. Completeness: Does it fully answer the question?
4. Confidence: Is the certainty level appropriate?
Return JSON: {"accuracy": N, "language": N, "completeness": N, "confidence": N, "overall": N, "issues": ["..."]}
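A minimal judge call, assuming the @anthropic-ai/sdk Messages API, a JUDGE_PROMPT template string holding the rubric above, and a judge model of your choice:

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function gradeResponse(query, response, expected) {
  const prompt = JUDGE_PROMPT
    .replace("{query}", query)
    .replace("{response}", response)
    .replace("{expected}", expected);
  const message = await anthropic.messages.create({
    model: "claude-sonnet-4-5", // assumption: substitute whichever judge model you use
    max_tokens: 512,
    messages: [{ role: "user", content: prompt }],
  });
  return JSON.parse(message.content[0].text); // expects the JSON shape requested in the prompt
}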
Proactively review agent transcripts to catch issues before users report them.
| Signal | Meaning | Action |
|---|---|---|
| Timeout (>30s) | Query too complex or stuck | Investigate, add optimization |
| Zero tools called | Agent hallucinated instead of acting | Fix tool forcing, add guidance |
| Error in response | Unhandled edge case | Add error handling |
| Negative feedback | User unhappy with response | Analyze and fix pattern |
Run a weekly review of production transcripts, scanning for the signals above; the sketch below shows one way to automate the first pass.
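A triage sketch, assuming transcripts carry duration, tool-call, and feedback metadata (the field names are illustrative):

function triageTranscripts(transcripts) {
  return transcripts
    .map((t) => {
      const signals = [];
      if (t.durationMs > 30_000) signals.push("timeout"); // >30s: stuck or too complex
      if (t.toolCalls.length === 0) signals.push("zero_tools"); // likely hallucinated instead of acting
      if (/error|exception/i.test(t.response)) signals.push("error"); // unhandled edge case leaked out
      if (t.feedback === "negative") signals.push("negative_feedback");
      return { id: t.id, signals };
    })
    .filter((t) => t.signals.length > 0); // only flagged transcripts need human review
}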
Automatically promote queries from “capability” to “regression” when they’ve proven stable.
capability → saturated → regression
Graduation criteria:
- 5+ consecutive passes
- 7+ days observed
- All recent runs successful
Demotion (regression → investigation):
- Any failure in a regression query
- Demoted until fixed and re-graduated
Without saturation tracking, your regression set is static: queries you fixed 3 months ago might never get promoted. Saturation monitoring automates the promotion, as sketched below.
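A sketch of the promotion/demotion rule, assuming each query keeps an oldest-first run history of { pass, timestamp } entries:

const DAY_MS = 24 * 60 * 60 * 1000;

function nextCategory(current, history) {
  if (history.length === 0) return current;
  const lastFive = history.slice(-5);
  const saturated =
    lastFive.length === 5 &&
    lastFive.every((run) => run.pass) && // 5+ consecutive passes, all recent runs successful
    Date.now() - history[0].timestamp >= 7 * DAY_MS; // observed for 7+ days
  if (current === "capability" && saturated) return "regression"; // graduate
  if (current === "regression" && !history[history.length - 1].pass) {
    return "investigation"; // demoted until fixed and re-graduated
  }
  return current;
}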
A meta-pattern that combines generation and evaluation into an iterative improvement loop.
Generate → Evaluate → PASS → Done
Generate → Evaluate → NEEDS_IMPROVEMENT → Feedback → Regenerate (max 3 iterations)
async function evaluatorOptimizer(generateFn, evaluateFn, maxIterations = 3) {
let output = await generateFn();
for (let i = 0; i < maxIterations; i++) {
const evaluation = await evaluateFn(output);
if (evaluation.result === "PASS") {
return { output, iterations: i + 1, status: "passed" };
}
// Feed evaluation feedback back into generation
output = await generateFn({
previousOutput: output,
feedback: evaluation.feedback,
});
}
return { output, iterations: maxIterations, status: "max_iterations" };
}
| Use Case | Generator | Evaluator |
|---|---|---|
| Prompt refinement | Write system prompt | Test against 10 sample queries |
| Code generation | Write function | Run tests against the function |
| Document writing | Draft documentation | Check completeness and accuracy |
| SQL generation | Generate SQL query | Validate against schema + test |
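For example, wiring the loop to the SQL row above might look like the following; generateSql and validateSql are placeholders for your own generator and schema check, with validateSql returning { result, feedback }:

const result = await evaluatorOptimizer(
  // Generator: produce SQL, folding in feedback from the previous attempt if any
  async (ctx) => generateSql("monthly revenue by branch", ctx?.feedback),
  // Evaluator: validate against the schema and return PASS or NEEDS_IMPROVEMENT + feedback
  async (sql) => validateSql(sql),
);
console.log(result.status, result.output);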
Source: Anthropic evaluator_optimizer.ipynb