Understanding AI Agent Evals

Anthropic's guide to evaluating AI agents, rewritten for Product Managers who want to ship reliable AI products.

Why You Should Care About This as an AI PM

Here's the thing about AI products: they work 80% of the time in your demo, then mysteriously fail in production at the worst possible moment. You've seen it. Your sales team gives a perfect demo to a customer. Two weeks later, that same customer emails saying "it's not working right."

This is where evaluations (evals) come in. Think of evals as your quality control system before launch. But here's what makes AI evals different from traditional software testing—your AI agent might give a different answer every single time, even with identical inputs. So how do you test that?

Anthropic (the team behind Claude) just published their playbook after working with dozens of teams building AI agents. Their main insight? Most teams wait too long to build evals, then scramble to figure out why their agent keeps failing. Let me translate what they learned into stuff you can actually use as a PM.

The TL;DR for PMs:
  • Evals are how you test AI products before they break in production
  • You need different types of graders: code-based (fast, cheap), model-based (flexible), and human (gold standard)
  • Capability evals help you improve; regression evals prevent backsliding
  • Start early with 20-50 simple tests based on real failures
  • AI non-determinism means you need new metrics like pass@k (probability of success within k tries)

What You Need to Know (Broken Down Simply)

1. What Are Evals, Really?

What Anthropic says: "Evaluations are test suites that measure agent capabilities and catch regressions before deployment."

What this means for you: Evals are like unit tests, but for AI behavior. Instead of checking if a function returns "42", you're checking if your AI agent can book a flight correctly, or if it hallucinates fake data.

Here's the structure they recommend:

  • Task: A single test case (e.g., "Extract invoice total from this PDF")
  • Trial: Running that task multiple times (because AI is non-deterministic)
  • Grader: Logic that scores whether the agent succeeded
  • Transcript: Complete record of what the agent did (tool calls, reasoning, output)

PM decision point: You need evals BEFORE launch, not after. Build them as you build the product.
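The Task / Trial / Grader / Transcript structure above can be sketched in a few dataclasses. This is a hypothetical illustration of the concepts, not any real harness's API; the class and field names are assumptions that simply mirror the article's vocabulary.

```python
from dataclasses import dataclass, field

@dataclass
class Transcript:
    tool_calls: list      # every tool invocation the agent made
    reasoning: str        # the agent's intermediate reasoning / notes
    output: str           # final answer returned to the user

@dataclass
class Trial:
    transcript: Transcript
    passed: bool          # grader verdict for this single run

@dataclass
class Task:
    prompt: str                          # e.g. "Extract invoice total from this PDF"
    trials: list = field(default_factory=list)

    def pass_rate(self) -> float:
        """Fraction of trials that succeeded (0.0 if none have run yet)."""
        if not self.trials:
            return 0.0
        return sum(t.passed for t in self.trials) / len(self.trials)
```

Running the same Task several times (several Trials) is what turns a single pass/fail into the pass-rate statistics discussed later.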

2. Three Types of Graders (And When to Use Each)

This is where it gets practical. You have three ways to grade your AI agent's performance:

Code-based graders: Old-school programming checks

  • Examples: String matching, checking if output is valid JSON, verifying a file was created
  • Pros: Fast, cheap, deterministic, easy to debug
  • Cons: Brittle—fails when AI finds a valid but different solution
  • When to use: Clear right/wrong answers (e.g., "did the agent call the correct API?")
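As a minimal sketch of what "code-based grader" means in practice, here are two deterministic checks matching the examples above (valid JSON, correct API call). The function names are my own; any real harness will have its own conventions.

```python
import json

def is_valid_json(output: str) -> bool:
    """Grader: did the agent produce parseable JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def called_correct_api(tool_calls: list, expected: str) -> bool:
    """Grader: did the agent invoke the expected API at some point?"""
    return expected in tool_calls
```

Because these return the same answer every run, they are cheap to put in CI and trivial to debug, which is exactly the trade-off described above.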

Model-based graders: Use another LLM to grade the output

  • Examples: Ask Claude to score if response is "helpful and accurate on a 1-10 scale"
  • Pros: Flexible, handles nuance, scales well
  • Cons: Non-deterministic, costs money, needs calibration
  • When to use: Subjective quality (tone, completeness, clarity)
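A model-based grader boils down to: send the output plus a rubric to a judge model, parse a score, compare to a threshold. In this sketch, `call_llm` is a stand-in for whatever client your stack actually uses (Anthropic SDK, OpenAI SDK, etc.); it is an assumption for illustration, not a real function.

```python
RUBRIC = """Rate the following response for helpfulness and accuracy
on a 1-10 scale. Reply with only the number.

Response:
{response}"""

def llm_grade(response: str, call_llm, threshold: int = 7) -> bool:
    """Ask a judge model to score a response; pass if score >= threshold."""
    raw = call_llm(RUBRIC.format(response=response))
    try:
        score = int(raw.strip())
    except ValueError:
        return False  # unparseable judge output counts as a failure
    return score >= threshold
```

Note the defensive parsing: judge models sometimes ignore "reply with only the number," which is one reason these graders need calibration.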

Human graders: Actual people reviewing outputs

  • Examples: Subject matter experts, crowdsourced reviews
  • Pros: Gold standard, catches edge cases
  • Cons: Expensive, slow, doesn't scale
  • When to use: Calibrating your model-based graders, spot-checking production

PM decision point: Start with code-based graders for speed. Add model-based graders for quality. Use humans to validate both are working.

3. Capability vs Regression Evals

Capability evals: Tests where you expect to fail (for now)

  • Purpose: Measure improvement over time
  • Example: "Can the agent handle 10-step workflows?" (currently: 20% pass rate, goal: 80%)
  • You run these to see if your changes are making the agent smarter

Regression evals: Tests you should pass every time

  • Purpose: Don't break what's already working
  • Example: "Can the agent still format dates correctly?" (should be: 100% pass rate)
  • These are your safety net in CI/CD

Here's the lifecycle: Today's capability eval (60% pass rate, improving) becomes tomorrow's regression eval (must maintain 100%). It's like graduating a feature from beta to stable.

PM decision point: Track both. Capability evals tell you if you're getting better. Regression evals tell you if you broke something.
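The graduation lifecycle can be made mechanical: once a capability eval's pass rate clears a bar, it moves into the regression suite. A toy sketch, where the 95% threshold is my assumption, not a number from Anthropic:

```python
GRADUATION_THRESHOLD = 0.95  # assumed cutoff for promoting an eval

def triage(evals: dict) -> tuple:
    """Split eval names into (regression suite, still-capability) by pass rate.

    `evals` maps eval name -> observed pass rate in [0, 1].
    """
    regression, capability = [], []
    for name, pass_rate in evals.items():
        if pass_rate >= GRADUATION_THRESHOLD:
            regression.append(name)   # must stay green from now on
        else:
            capability.append(name)   # still measuring improvement
    return regression, capability
```

Running this triage periodically is one lightweight way to keep the "beta → stable" graduation honest.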

4. Agent-Specific Eval Strategies

Different types of agents need different eval approaches. Anthropic breaks it down:

Coding agents: Use deterministic graders

  • Run the code, check if tests pass
  • Verify files were created/modified correctly
  • Use static analysis to check code quality
  • Example benchmark: SWE-bench (GitHub issues dataset)

Conversational agents: Use simulated users + multi-dimensional success

  • Have a second LLM play the customer role
  • Measure: Did it solve the issue? In how many turns? Was the tone appropriate?
  • Example: Customer support bot resolves return request in <5 messages with polite tone

Research agents: Combine multiple grader types

  • Check: Are claims grounded in sources? Are sources high-quality? Is coverage comprehensive?
  • Needs frequent calibration with domain experts
  • Highly subjective—"comprehensive" means different things to different people

Computer-use agents: Run in sandboxed environments

  • Give the agent a browser or OS, measure if it completes tasks
  • Trade-off: Screenshots (faster) vs DOM extraction (more tokens, slower)
  • Example benchmarks: WebArena, OSWorld
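The simulated-user pattern for conversational agents is worth one concrete sketch: a loop where a second model plays the customer and we count turns until resolution. Both `agent_reply` and `simulated_user_reply` are hypothetical stand-ins for your own model calls; the 5-message cap mirrors the example above.

```python
MAX_TURNS = 5  # e.g. "resolve return request in <5 messages"

def run_dialogue(agent_reply, simulated_user_reply, opening: str) -> tuple:
    """Simulate one conversation; return (resolved, turns_used).

    agent_reply(message) -> agent's answer string
    simulated_user_reply(answer) -> (next user message, resolved: bool)
    """
    message = opening
    for turn in range(1, MAX_TURNS + 1):
        answer = agent_reply(message)
        message, resolved = simulated_user_reply(answer)
        if resolved:
            return True, turn
    return False, MAX_TURNS
```

Turn count is only one dimension; in practice you would also run a model-based grader over the transcript for tone and correctness, per the multi-dimensional success criteria above.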

5. Understanding Non-Determinism Metrics

This is where AI evals differ from traditional software testing. Your agent might fail on the first attempt but succeed on the third. So how do you measure reliability?

pass@k: What's the chance you succeed in k attempts?

  • If pass@1 = 50%, you succeed on first try half the time
  • If pass@5 = 80%, you succeed within 5 tries 80% of the time
  • Use this when: You can afford retries (batch jobs, background tasks)

pass^k: What's the chance all k trials succeed?

  • If each trial has 75% success, then pass^3 = 0.75³ ≈ 42%
  • More conservative—tells you consistency
  • Use this when: You need reliability (customer-facing, real-time)

PM decision point: Know which metric matches your use case. A research tool can tolerate retries. A customer chatbot cannot.

6. Common Pitfalls (And How to Avoid Them)

Anthropic shares real mistakes teams make. Here's what to watch for:

  • Over-specifying graders: Don't check every step, just the outcome. Agents might find valid approaches you didn't anticipate.
  • Shared state between trials: Each test should run in a clean environment, or else the agent "learns" from previous runs and inflates scores.
  • Ambiguous tasks: If two domain experts can't agree on the right answer, your eval is broken—not your agent.
  • Class imbalance: Test where behavior should AND shouldn't occur. (Example: Test that agent triggers search when needed AND doesn't trigger when not needed)
  • Taking scores at face value: Always read transcripts. Understand why the agent failed. Your grader might be rejecting valid solutions.
  • Ignoring saturation: If you hit 100% pass rate, the eval is no longer useful for improvement. You need harder tests.

What You Can Do After Reading This

If you're defining requirements for AI features:

  • Write 5-10 example tasks the agent should handle successfully (these become your first evals)
  • For each task, define what "success" looks like—be specific but not over-specified
  • Identify which grader type fits: code-based for deterministic outcomes, model-based for quality, human for calibration
  • Add "build evals" as a deliverable in your sprint planning—not a nice-to-have, a requirement
  • Ask: Do we need pass@k (retries OK) or pass^k (reliability critical)?

If you're working with engineering on AI products:

  • Share this article's framework with your team
  • Propose: Start with 20-50 simple tests based on past failures or edge cases
  • Discuss: Which eval harness should we use? (Anthropic mentions Harbor, Promptfoo, Braintrust, LangSmith)
  • Set a rule: No changes to production without passing regression evals
  • Schedule weekly transcript reviews—spend 30 minutes reading actual agent failures together
  • Calibrate your model-based graders against human judgment monthly

If you're building your first AI agent:

  • Start evals NOW, even with just 10 test cases from your development testing
  • Use a mix: 70% code-based (fast feedback), 20% model-based (quality), 10% human (spot-check)
  • Run evals in CI/CD before every deployment
  • Track metrics over time: Are we improving? Are we backsliding?
  • Read Anthropic's original article (link below) and their cookbook for technical examples
  • Remember: It's easier to build evals alongside development than to reverse-engineer them from production

The meta-lesson: Evals aren't just testing—they're how you build confidence in your AI product. Teams with strong evals ship faster because they know what works and what doesn't. Teams without evals debug production fires reactively.


Understanding This from First Principles

Let me break down why evals matter and how they work, from the ground up.

Why do evals exist in the first place?

Traditional software is deterministic: same input → same output, always. You test once, it passes, you're confident.

AI agents are non-deterministic: same input → different outputs, based on randomness (temperature, sampling). You test once, it passes, but that tells you almost nothing about the next run.

Evals solve this by testing multiple times and measuring statistical reliability. Instead of "does it work?", you ask "what's the probability it works?" This is a fundamental shift in how we think about quality.

What's the core trade-off?

You have three dimensions in eval design:

  • Speed: Code-based graders run in milliseconds. Model-based take seconds. Humans take hours.
  • Cost: Code is free. Models cost money per eval. Humans are expensive.
  • Quality: Code is brittle. Models are flexible but noisy. Humans are the gold standard.

You can't maximize all three. The art of evals is choosing the right tool for each test case. Use code for known patterns, models for quality checks, humans for calibration.

When do you need evals?

Anthropic's teams discovered evals are valuable at three stages:

  1. Pre-launch (development): Catch failures before they reach users. This is where most teams should start.
  2. CI/CD (deployment): Automated gate—regression evals must pass or deployment blocks. Prevents backsliding.
  3. Production (monitoring): Sample live traffic, run evals offline, detect drift. Your agent might degrade over time as user patterns change.
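Stage 2, the CI/CD gate, is simple enough to sketch: collect regression eval results and return a nonzero exit code if anything failed, so the pipeline blocks the deploy. How you obtain `results` depends on your harness (Promptfoo, Braintrust, etc.); this function only shows the gating logic.

```python
import sys

def gate(results: dict) -> int:
    """Return exit code 0 if all regression evals passed, 1 otherwise.

    `results` maps eval name -> bool (passed).
    """
    failures = [name for name, passed in results.items() if not passed]
    for name in failures:
        print(f"REGRESSION FAILURE: {name}", file=sys.stderr)
    return 1 if failures else 0
```

Wiring the return value into `sys.exit()` in a CI step is what turns "we have regression evals" into "nothing ships without passing them."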

Think of evals as a Swiss cheese model: Multiple layers of defense. Automated evals catch 80% of issues. A/B testing catches 15%. User feedback catches the last 5%. No single method is perfect, but together they create reliability.

Why is this hard for AI products specifically?

Because we're testing behavior, not code. In traditional software, you control the logic. In AI, you control the inputs and hope the behavior emerges.

This means:

  • Your agent might be right in ways you didn't anticipate (hence: don't over-specify graders)
  • Your agent might fail differently each time (hence: run multiple trials)
  • Your graders might disagree with reality (hence: calibrate with humans)
  • Your evals might become obsolete as the agent improves (hence: graduation from capability → regression)

The mindset shift: You're not testing if code runs correctly. You're measuring probabilistic reliability of intelligent behavior. That requires new tools, new metrics, and new thinking.

Read the Original Article

"Demystifying evals for AI agents"
By Mikaela Grace, Jeremy Hadfield, Rodrigo Olivares, and Jiri De Jonghe
Published by Anthropic Engineering on January 9, 2026

Read Original on Anthropic →

This rewrite is my interpretation for AI Product Managers. For technical implementation details, deep dives on eval harnesses, and code examples, read Anthropic's original piece and their cookbook.