How to Evaluate an AI Agent Before You Trust It in Production

The agent works in the demo. It handled the five examples you gave it. The team is excited.

Then it goes to production and does something no one anticipated. A misrouted customer request. A hallucinated data field in a downstream report. A loop that triggered 40 LLM calls on a malformed input and sent an alert to your CFO.

This is the gap between “it works” and “it’s ready.” Closing it requires a deliberate evaluation discipline: not more testing of happy paths, but structured evaluation of the failure modes that actually matter in production.

This post is for CTOs, VPs of Engineering, and staff engineers who are operationalizing AI agents. We’ll cover what rigorous agent evaluation looks like, the metrics that separate production-ready agents from demo-ready ones, and how to build an eval gate that actually catches regressions.

Why Standard Software Testing Isn’t Enough

Traditional software testing operates on deterministic systems. Given the same input, you get the same output. Unit tests verify logic. Integration tests verify interfaces. Your CI pipeline tells you definitively whether something is broken.

AI agents don’t work this way. They’re probabilistic by nature. The same input can produce different outputs across runs. A perfectly correct response on Tuesday can become a subtly wrong response on Thursday after a model provider pushes an update. Prompt changes that look minor can cause capability regressions that don’t surface until production.

The failure modes are also qualitatively different:

Hallucination: The agent returns a confident, well-formatted, wrong answer.
Task refusal: The agent declines a valid request because the prompt is over-constrained.
Routing failures: The agent sends the right type of request to the wrong downstream system.
Graceful degradation failures: The agent fails silently instead of escalating when it should.
Cost regressions: A prompt change causes the agent to make 3x more tool calls per invocation.

None of these are caught by unit tests on your tool functions. They require evaluation against representative inputs with explicit success criteria.

The Four Evaluation Layers

A production-grade agent evaluation framework has four layers. Each catches a different class of failure.

Layer 1: Deterministic unit tests

The foundation. These are traditional tests on the deterministic components of your system:

Tool function correctness (given this GitHub API response, does parse_issue() return the right struct?)
State transformation logic (does route_request() return "escalate" when confidence is below the threshold?)
Input/output schema validation (does the agent’s output conform to the expected Pydantic model?)

These should live in your CI pipeline and block merges on failure. They’re necessary but not sufficient — they tell you your plumbing works, not that your agent behaves correctly.

Minimum bar: >90% line coverage on all deterministic functions. All tool parsers, routers, and state transformers covered.
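A Layer 1 test for the routing rule mentioned above might look like the following minimal sketch. The route_request() name comes from the example above; the Classification type and the 0.7 threshold are assumptions made up for illustration, not values from any real system.

```python
from dataclasses import dataclass

# Hypothetical threshold, chosen only for this example.
CONFIDENCE_THRESHOLD = 0.7


@dataclass
class Classification:
    """Hypothetical classifier result handed to the router."""
    intent: str
    confidence: float


def route_request(result: Classification, threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return the downstream route, or "escalate" when confidence is below the threshold."""
    if result.confidence < threshold:
        return "escalate"
    return result.intent


def test_low_confidence_escalates():
    assert route_request(Classification("billing", 0.42)) == "escalate"


def test_confident_request_is_routed():
    assert route_request(Classification("billing", 0.93)) == "billing"


def test_threshold_boundary_is_routed():
    # In this sketch, a score exactly at the threshold counts as confident.
    assert route_request(Classification("billing", 0.7)) == "billing"
```

Tests like these run under pytest with no LLM call involved, which is why they can block merges cheaply.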

Layer 2: LLM-as-judge evaluation

This is where you test the LLM-dependent behavior using a separate LLM as an automated evaluator. For each test case in your dataset:

Run the agent on the input
Pass the agent’s output to a judge LLM with a structured scoring prompt
Record the score (binary pass/fail or 1–5 scale) with reasoning

The judge LLM evaluates dimensions like:

Task completion: Did the agent accomplish what was asked?
Factual accuracy: Is the response grounded in the provided context, or did it hallucinate?
Format adherence: Did the output match the required schema or structure?
Tone and appropriateness: (For customer-facing agents) Is the response appropriate for the audience?

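This approach scales to hundreds of test cases without requiring human review of every output. The judge isn’t perfect, and it is itself an LLM, so pin its model version and scoring prompt to keep scores comparable across runs. Its main advantage is auditability: you can review the judge’s reasoning for any flagged case.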

Minimum bar: 25 representative cases for a PR gate; 100 cases for a production promotion audit.
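The three steps above can be sketched as a small loop. This is an illustration under stated assumptions: EvalCase, Verdict, run_judged_eval, and the two stub functions are hypothetical names, and the stubs stand in for real agent and judge-model calls so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    case_id: str
    input: str
    criteria: str  # what a correct response must contain, written for the judge


@dataclass
class Verdict:
    passed: bool
    reasoning: str


Agent = Callable[[str], str]
Judge = Callable[[str, str, str], Verdict]  # (input, output, criteria) -> Verdict


def run_judged_eval(cases: list[EvalCase], agent: Agent, judge: Judge) -> list[dict]:
    """Run the agent on each case, score it with the judge, keep the reasoning."""
    results = []
    for case in cases:
        output = agent(case.input)
        verdict = judge(case.input, output, case.criteria)
        results.append({
            "case_id": case.case_id,
            "output": output,
            "passed": verdict.passed,
            "reasoning": verdict.reasoning,
        })
    return results


def pass_rate(results: list[dict]) -> float:
    return sum(r["passed"] for r in results) / len(results) if results else 0.0


# Stand-ins so the sketch runs without network access. Replace both with real
# calls to your agent and to a judge model with a structured scoring prompt.
def stub_agent(text: str) -> str:
    return "route:billing" if "invoice" in text else "route:unknown"


def stub_judge(text: str, output: str, criteria: str) -> Verdict:
    ok = criteria in output
    return Verdict(ok, f"expected {criteria!r} in output, got {output!r}")


cases = [
    EvalCase("c1", "Where is my invoice?", "route:billing"),
    EvalCase("c2", "Reset my password", "route:account"),
]
results = run_judged_eval(cases, stub_agent, stub_judge)
print(f"pass rate: {pass_rate(results):.0%}")
```

Keeping the judge behind a plain callable makes it easy to swap judge models or prompts without touching the loop, and storing the reasoning next to each score is what makes flagged cases reviewable later.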

Layer 3: Golden dataset regression testing

Maintain a curated set of “golden” examples: inputs with known-correct, human-verified outputs. These are your regression anchors. Every time you change a prompt, update a model, or modify agent logic, run the golden dataset and score each output against its reference, for example by exact match on structured fields and a similarity measure or judge verdict on free text.

Golden datasets should include:

Happy path examples: Straightforward cases the agent should handle easily
Edge cases: Inputs at the boundary of the agent’s capability
Known past failures: Cases that caused production incidents — once fixed, add them to the golden set so they can’t regress

Track your golden dataset score over time. If it drops more than 5 percentage points after a change, that’s a regression signal — even if the new behavior looks correct on new examples.

Minimum bar: 10–20 golden cases per agent. Review and expand quarterly.
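One way to turn the golden set into a tracked score is sketched below. It assumes plain-text outputs compared with the standard library's difflib; the 0.8 similarity cutoff is an arbitrary illustrative value, and the function names are hypothetical. The 5-point drop rule is the one described above.

```python
from difflib import SequenceMatcher
from typing import Callable


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a crude stand-in for a semantic metric."""
    return SequenceMatcher(None, a, b).ratio()


def golden_score(
    agent: Callable[[str], str],
    golden: list[tuple[str, str]],  # (input, human-verified reference output)
    min_similarity: float = 0.8,    # assumed cutoff, tune per agent
) -> float:
    """Percentage of golden cases whose output is close enough to its reference."""
    hits = sum(similarity(agent(inp), ref) >= min_similarity for inp, ref in golden)
    return 100.0 * hits / len(golden)


def is_regression(previous_score: float, current_score: float, max_drop: float = 5.0) -> bool:
    """Flag a drop of more than max_drop percentage points between runs."""
    return previous_score - current_score > max_drop
```

For example, is_regression(92.0, 86.0) flags a regression, while is_regression(92.0, 88.0) does not. String similarity is a blunt instrument for free text, so a judge verdict may be a better comparison for agents with open-ended outputs.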

Layer 4: Production shadow testing

Before fully promoting a new agent version to production, run it in shadow mode alongside the current version for a defined soak period. Shadow mode means:

The current version handles live traffic and produces real outputs
The new version processes the same inputs in parallel but its outputs are logged, not acted on
You compare outputs across a sample large enough to be meaningful (at least 50 live requests, preferably 100 or more)

Shadow testing catches the failure modes that synthetic datasets miss: unusual input distributions, edge cases your dataset authors didn’t anticipate, real-world timing and latency behaviors.

Minimum bar: 24-hour shadow soak for minor agent updates; 72-hour soak for major revisions or model changes.
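The shadow-mode contract described above can be sketched in a few lines. The function and field names are hypothetical, and the sketch runs the candidate inline for simplicity; a real deployment would more likely run it asynchronously so the shadow path cannot add latency to live traffic.

```python
from typing import Callable


def handle_with_shadow(
    request: str,
    current: Callable[[str], str],
    candidate: Callable[[str], str],
    log: list[dict],
) -> str:
    """Serve the current version's output; record the candidate's output without acting on it."""
    live_output = current(request)
    try:
        shadow_output = candidate(request)
        error = None
    except Exception as exc:  # a shadow failure must never affect live traffic
        shadow_output, error = None, repr(exc)
    log.append({
        "input": request,
        "live": live_output,
        "shadow": shadow_output,
        "shadow_error": error,
        "match": error is None and live_output == shadow_output,
    })
    return live_output


def disagreement_rate(log: list[dict]) -> float:
    """Share of shadowed requests where the candidate errored or differed."""
    return sum(not entry["match"] for entry in log) / len(log) if log else 0.0
```

Exact equality is the simplest possible comparison. For free-text outputs, the logged pairs would usually be scored afterwards with the same judge or similarity check used in the offline eval.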

The Metrics That Actually Matter

Not all metrics are equally useful. Here’s a two-tier framework for what to measure:

Hard gates (must pass to ship)

These are binary. If an agent fails any of these, it does not go to production, no exceptions:

Metric	Threshold	What it catches
Task completion rate	≥ 90% on eval dataset	Basic capability
Hallucination rate	≤ 5%	Factual reliability
Schema conformance	100%	Integration breakage
Max-iteration violation rate	0%	Runaway loops

Soft gates (monitor closely, investigate regressions)

These don’t block a release on their own, but a significant regression triggers a mandatory review:

Metric	Threshold	What it catches
Latency (p95)	≤ prior version + 20%	Performance regressions
Average cost per invocation	≤ prior version + 15%	Cost regressions from prompt changes
Escalation rate	Within ±10% of baseline	Sensitivity shifts
User satisfaction (where measurable)	No statistically significant drop	Quality regressions

The distinction matters. Hard gates protect you from shipping agents that are broken. Soft gate monitoring catches slow-moving regressions before they become incidents.
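The two tiers translate fairly directly into code. The sketch below uses the thresholds from the tables above; the metric key names are hypothetical, user satisfaction is left out because it needs a significance test, and the ±10% escalation band is interpreted as relative to the baseline, which is an assumption.

```python
# Hard gates from the table above: metric -> (comparison, threshold).
HARD_GATES = {
    "task_completion_rate": (">=", 0.90),
    "hallucination_rate": ("<=", 0.05),
    "schema_conformance": (">=", 1.00),
    "max_iteration_violation_rate": ("<=", 0.00),
}


def check_hard_gates(metrics: dict[str, float]) -> list[str]:
    """Return one message per failed hard gate; an empty list means ship-eligible."""
    failures = []
    for name, (op, threshold) in HARD_GATES.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing from eval results")
            continue
        ok = value >= threshold if op == ">=" else value <= threshold
        if not ok:
            failures.append(f"{name}: {value} violates {op} {threshold}")
    return failures


def check_soft_gates(current: dict[str, float], baseline: dict[str, float]) -> list[str]:
    """Return warnings that should trigger a review but not block the release."""
    warnings = []
    if current["latency_p95"] > baseline["latency_p95"] * 1.20:
        warnings.append("latency_p95 regressed by more than 20%")
    if current["avg_cost"] > baseline["avg_cost"] * 1.15:
        warnings.append("avg_cost regressed by more than 15%")
    # Assumption: the band is relative (10% of the baseline rate), not 10 points.
    drift = abs(current["escalation_rate"] - baseline["escalation_rate"])
    if drift > 0.10 * baseline["escalation_rate"]:
        warnings.append("escalation_rate moved more than 10% from baseline")
    return warnings
```

A CI script would exit non-zero when check_hard_gates returns anything, and post the check_soft_gates warnings for a reviewer to triage.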

Building the Eval Dataset

The quality of your evaluation is bounded by the quality of your dataset. A 100-case eval set full of easy happy-path examples will give you high scores on an agent that fails on the cases that actually occur in production.

Here’s how to build a dataset that’s actually representative:

Start with real production examples. Once you have any production traffic, sample anonymized inputs from it. These are the inputs your agent actually encounters, not the ones you imagined it would encounter. Weight your dataset toward high-frequency patterns.

Deliberately include failure modes. For every failure mode you can anticipate — malformed inputs, ambiguous requests, inputs that could trigger hallucination, edge cases at classification boundaries — include 3–5 examples. These are the cases your agent needs to handle gracefully.

Add examples from past incidents. Every production incident that traced back to an agent failure should contribute at least 2–3 cases to your eval dataset. This is how you prevent regressions on the exact failures that hurt you before.

Include adversarial cases. Inputs designed to trip up the agent: prompt injection attempts, requests that ask the agent to do something outside its scope, inputs with subtly incorrect context. Knowing how your agent fails under adversarial conditions is production-relevant information.

Synthetic bootstrapping for new agents. If you don’t have production traffic yet, generate synthetic examples using a capable LLM with explicit instructions to generate diverse, realistic cases including edge cases. Have a human review the dataset before using it as a gate. Synthetic datasets are better than nothing, but they’re not a substitute for production data.
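A dataset built this way is easier to keep honest if each case carries metadata about why it exists. The sketch below assumes a JSONL file where each case may have hypothetical failure_mode and incident_id fields, and checks the minimum counts suggested above (3 per anticipated failure mode, 2 per incident).

```python
import json
from collections import Counter
from typing import Iterable


def load_cases(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL: one eval case per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def coverage_gaps(
    cases: list[dict],
    expected_failure_modes: list[str],
    min_per_mode: int = 3,
    min_per_incident: int = 2,
) -> list[str]:
    """List failure modes and incidents that have too few cases in the dataset."""
    by_mode = Counter(c["failure_mode"] for c in cases if c.get("failure_mode"))
    by_incident = Counter(c["incident_id"] for c in cases if c.get("incident_id"))
    gaps = []
    for mode in expected_failure_modes:
        if by_mode[mode] < min_per_mode:
            gaps.append(f"failure mode {mode!r}: {by_mode[mode]} cases, want >= {min_per_mode}")
    for incident, count in sorted(by_incident.items()):
        if count < min_per_incident:
            gaps.append(f"incident {incident!r}: {count} cases, want >= {min_per_incident}")
    return gaps
```

With a real file, this would be called as coverage_gaps(load_cases(open(path)), ["malformed_input", "prompt_injection"]), where the mode names are whatever taxonomy your team settles on.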

The Eval Gate in Practice

Here’s what a practical eval gate looks like in a CI/CD pipeline:

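# .github/workflows/ci-eval.yml
name: Agent Eval Gate

on:
  pull_request:
    paths:
      - "agents/**"
      - "prompts/**"

# The default token needs write access to comment on the PR.
permissions:
  contents: read
  pull-requests: write

env:
  # Placeholder value: set this to the slug of the agent under test.
  AGENT_SLUG: my-agent

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Assumes the eval scripts' Python dependencies are available on the
      # runner; add an install step before this one if they are not.
      - name: Run eval dataset
        run: |
          python scripts/run_eval.py \
            --agent ${{ env.AGENT_SLUG }} \
            --dataset datasets/${{ env.AGENT_SLUG }}-eval.jsonl \
            --output eval-results.json
        env:
          LANGSMITH_API_KEY: ${{ secrets.LANGSMITH_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

      # Expected to exit non-zero when a hard gate fails, which fails the job.
      - name: Check hard gates
        run: python scripts/check_eval_gates.py eval-results.json

      # Runs even when a gate fails so reviewers still see the results.
      - name: Post results to PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const results = require('./eval-results.json');
            // Posts the raw JSON as an indented code block; swap in your own
            // formatter for a friendlier summary.
            const json = JSON.stringify(results, null, 2);
            const body = 'Agent eval results:\n\n' + json.replace(/^/gm, '    ');
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body,
            });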

The gate runs on any PR that touches agent code or prompts. It posts results directly to the PR so reviewers can see the eval impact of the change before approving. A hard gate failure fails the job, which blocks the merge once the job is configured as a required status check in branch protection.

This takes an afternoon to set up and prevents the class of regressions that are hardest to catch in code review — the ones where the logic looks fine but the LLM behavior changed.

What to Do When Your Agent Fails Eval

Failing an eval gate is not a bad outcome. It means you caught a problem before production. Here’s how to handle it:

Investigate the specific failures first. Don’t immediately adjust the prompt or change the model. Look at the cases that failed and understand why they failed. Is it a capability gap? A data issue? A prompt ambiguity? The root cause determines the fix.

Check whether the failures are real or evaluation artifacts. Sometimes eval failures reveal problems with the eval setup rather than the agent. A case scored as failing because its expected output is wrong is a dataset quality issue; one scored as failing because the judge LLM misunderstood the task is a judge prompt issue. Neither is an agent issue. Fix the case or the judge prompt before changing the agent.

Make one change at a time. If you adjust the prompt and the model simultaneously, you can’t tell which change fixed the failure or which introduced a new one. Treat agent debugging the same as any other debugging: isolate variables.

Re-run the full eval after each change. Don’t spot-check the failing cases after your fix. Run the full dataset — a fix that addresses 3 failing cases sometimes introduces 2 new failures elsewhere.

Document the root cause and the fix. Agent failure patterns are institutional knowledge. The next person who encounters a similar failure mode should be able to find what caused it and how it was resolved.
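Re-running the full eval after each change is most useful when the two runs are compared case by case, not just by headline score. A minimal sketch, assuming each run is stored as a mapping from a case ID to pass/fail (the function name and shape are hypothetical):

```python
def diff_runs(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Compare two eval runs case by case: which cases were fixed, which newly fail."""
    fixed = sorted(c for c, ok in after.items() if ok and before.get(c) is False)
    newly_failing = sorted(c for c, ok in after.items() if not ok and before.get(c) is True)
    return {"fixed": fixed, "newly_failing": newly_failing}


before = {"c1": False, "c2": False, "c3": True, "c4": True}
after = {"c1": True, "c2": True, "c3": False, "c4": True}
print(diff_runs(before, after))
```

Here the headline pass rate improves from 2 of 4 to 3 of 4, but the diff shows that c3 now fails, which a spot-check of only the originally failing cases would have missed.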

Keeping Production Agents Honest

Evaluation isn’t just a pre-deployment gate. Production agents drift. Model providers push updates. Prompt behaviors change over time. Data distributions shift.

Weekly golden dataset spot-checks. Run your golden dataset against your production agent once a week. Track the score over time. A downward trend with no planned change on your side is a signal worth investigating, since it suggests something outside your control has changed, such as a provider-side model update.

Monitor escalation rate as a proxy metric. For agents with a confidence-based escalation mechanism, the escalation rate is a useful proxy for agent health. A sudden increase in escalation rate often means the agent is encountering inputs it wasn’t designed for — or that model behavior has shifted.

Log cases that hit edge conditions. Every time your agent hits a max-iteration guard, fails schema validation, or triggers an error handler, log the input. Review these logs weekly. They’re a real-time signal of where your agent’s boundaries are and how often real traffic is hitting them.

Run a quarterly eval audit. Every 90 days, review your eval dataset for staleness. Are the cases still representative of production traffic? Have you added cases from recent incidents? Are your hard gate thresholds still calibrated to current risk tolerance?
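The edge-condition logging described above is simple to wire into the agent loop itself. The sketch below shows a max-iteration guard that logs the offending input before failing; the loop shape, the limit of 8, and all names are assumptions for illustration, not a reference implementation of any agent framework.

```python
import json
import logging
from typing import Callable, Optional

logger = logging.getLogger("agent.edge_conditions")

MAX_ITERATIONS = 8  # hypothetical limit, tune per agent


class MaxIterationsExceeded(RuntimeError):
    pass


# A step takes the current state and returns (new_state, final_answer_or_None).
Step = Callable[[str], tuple[str, Optional[str]]]


def run_agent_loop(user_input: str, step: Step, max_iterations: int = MAX_ITERATIONS) -> str:
    """Run steps until one returns an answer; log the input if the guard trips."""
    state = user_input
    for _ in range(max_iterations):
        state, answer = step(state)
        if answer is not None:
            return answer
    # Structured log line so weekly reviews can aggregate by "kind".
    logger.warning("edge_condition %s", json.dumps({
        "kind": "max_iterations",
        "limit": max_iterations,
        "input": user_input,
    }))
    raise MaxIterationsExceeded(f"no answer after {max_iterations} iterations")
```

The same pattern applies to schema validation failures and error handlers: log the input with a stable kind field at the point where the guard fires, so the weekly review is a query and not an investigation.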

The Evaluation Maturity Progression

If you’re starting from scratch, here’s the progression that makes sense:

Month 1: Unit tests on all deterministic functions + 25-case LLM-as-judge eval for your primary agent. Run manually before every release.

Month 2: Automate the eval gate in CI. Add 10 golden cases. Start logging edge-condition hits.

Month 3: Expand eval dataset to 100 cases using real production samples. Add the second-tier soft-gate metrics. Implement weekly golden dataset spot-checks.

Month 6: Full shadow-testing protocol for major releases. Quarterly eval audits. All agents in your fleet covered by their own eval dataset.

This progression is achievable without a dedicated ML platform team. It’s engineering discipline applied to a probabilistic system — harder than testing deterministic software, but tractable if you build the infrastructure incrementally.

The Cost of Not Doing This

An agent that fails gracefully in production — escalating to a human when it’s uncertain, logging its reasoning, respecting its operational boundaries — is an asset. An agent that fails silently, confidently, and at scale is a liability.

The cost of a production incident from an unvalidated agent isn’t just the direct impact of the failure. It’s the investigation time, the trust deficit with stakeholders, the engineering hours spent firefighting instead of building, and the organizational momentum lost when leadership pulls back on AI investment after a high-profile failure.

The evaluation discipline described here is not expensive. A robust eval framework for a single agent takes 2–3 days to build properly. The cost of the incidents it prevents is measured in weeks.

Build Agents You Can Actually Trust in Production

The Diagnostic Sprint includes a full assessment of your agent architecture, evaluation gaps, and operational runbook — then gives you the framework to run it yourself. 4–6 weeks. Full knowledge transfer.
