How to Build a CI/CD Pipeline for AI Agents (That Actually Works)

Agentic Runbook

Shipping a traditional microservice is hard. Shipping an AI agent is a different kind of hard.

With a microservice, “does it work” has a binary answer. With an AI agent, “does it work” means: does it produce the right output, in the right format, within acceptable latency and cost bounds, on a distribution of inputs you’ve never fully enumerated — and does it still do that after you updated the prompt?

A CI/CD pipeline that doesn’t account for this will fail you on a Tuesday at 2 AM when a prompt tweak causes your @finance-bot to start formatting invoice numbers incorrectly.

This post describes the three-stage CI/CD architecture we’ve standardized across every agent we build. It’s opinionated, battle-tested, and designed to be stolen.


Why Traditional CI/CD Breaks for Agents

Traditional CI gates check:

  • Does the code compile?
  • Do the unit tests pass?
  • Does the Docker image build?

For a deterministic service, that’s enough. For an AI agent, it’s nowhere near enough. Here’s what’s missing:

1. The “right answer” is probabilistic.
An agent that passes unit tests today might regress next week when a model provider updates their weights. Your CI gate needs to catch model drift — not just code errors.

2. Prompts are code — but they don’t live in your linter.
A prompt change can completely alter agent behavior without touching a single .py file. If your pipeline doesn’t version and test prompts, you’re flying blind.

3. Cost is a first-class concern.
A regression that causes your agent to call GPT-4o three times per invocation instead of once costs real money at scale. LLM cost needs to be a gate, not an afterthought.

4. Eval datasets rot.
If your CI eval dataset isn’t growing, it’s getting less representative. A static 25-record golden set stops being meaningful after three months of production traffic.


The Three-Stage Pipeline

We use a three-stage model for every agent we ship. All three stages run in GitHub Actions.

Stage 1: PR Gate

Trigger: pull_request → any branch targeting main
Must pass before merge.

Four workflows run in parallel:

ci-lint.yml — Code quality gate

  • ruff for Python linting (< 5 seconds)
  • mypy for type checking
  • bandit for security anti-patterns (catches hardcoded secrets, shell injection)
  • Prompt file existence check: every agent must have agents/{slug}/prompts/system.txt
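
The prompt file existence check is simple enough to sketch in a few lines. The version below is a minimal illustration, not the actual workflow step: the function name missing_system_prompts is hypothetical, and it assumes every subdirectory of agents/ is an agent.

```python
from pathlib import Path


def missing_system_prompts(agents_root: Path) -> list[str]:
    """Return the slugs of agent directories that lack prompts/system.txt."""
    missing = []
    # Only directories count as agents; files such as registry.yaml are skipped.
    for agent_dir in sorted(p for p in agents_root.iterdir() if p.is_dir()):
        if not (agent_dir / "prompts" / "system.txt").is_file():
            missing.append(agent_dir.name)
    return missing
```

A CI step could call this with Path("agents") and exit non-zero when the returned list is not empty.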

ci-test.yml — Unit and integration tests

  • pytest with 80% coverage floor
  • Node-level unit tests (mocked LLM responses — never call real APIs in CI)
  • Integration tests against test fixtures (Qdrant test collection, stub Slack webhook)
  • Timeout: 10 minutes hard cap

ci-eval.yml — LLM evaluation gate

  • Loads golden eval dataset from LangSmith (ar-{slug}-eval-v*)
  • Minimum dataset size: 25 records (blocks merge if below threshold)
  • Runs 4 P0 hard-gate metrics:
    • Task completion rate ≥ 90%
    • Format compliance ≥ 95%
    • Hallucination rate ≤ 5% (LLM-as-judge)
    • Tool call accuracy ≥ 85%
  • Any P0 failure: merge blocked, no exceptions
  • Runs 4 P1 soft-gate metrics (warning only, not blocking):
    • P95 latency
    • Token efficiency (output tokens / task value score)
    • Self-consistency (same input → consistent output across 3 runs)
    • Adversarial robustness score
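
The P0 decision itself can be sketched as a pure function over metrics that have already been computed. This is an illustration under assumptions: the metric keys and the evaluate_p0 name are hypothetical, and fetching the dataset and scores from LangSmith is left out entirely. Only the thresholds come from the list above.

```python
MIN_DATASET_SIZE = 25

# Thresholds for the four P0 hard gates. "min" means the value must be at
# least the threshold; "max" means it must not exceed it.
P0_GATES = {
    "task_completion_rate": (0.90, "min"),
    "format_compliance": (0.95, "min"),
    "hallucination_rate": (0.05, "max"),
    "tool_call_accuracy": (0.85, "min"),
}


def evaluate_p0(metrics: dict[str, float], dataset_size: int) -> list[str]:
    """Return blocking failures; an empty list means the gate passes."""
    failures = []
    if dataset_size < MIN_DATASET_SIZE:
        failures.append(
            f"dataset has {dataset_size} records, minimum is {MIN_DATASET_SIZE}"
        )
    for name, (threshold, direction) in P0_GATES.items():
        value = metrics.get(name)
        if value is None:
            # A missing metric is treated as a failure rather than a pass.
            failures.append(f"{name}: metric missing")
        elif direction == "min" and value < threshold:
            failures.append(f"{name}: {value:.2%} is below {threshold:.0%}")
        elif direction == "max" and value > threshold:
            failures.append(f"{name}: {value:.2%} is above {threshold:.0%}")
    return failures
```

Returning a list of reasons instead of a boolean lets the workflow print every failing metric in the PR check output before exiting non-zero.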

ci-registry-check.yml — Fleet registry validation

  • Validates agents/registry.yaml against schema (14 required fields)
  • Slug uniqueness check
  • Directory alignment check (every directory has a registry entry, every registry entry has a directory)
  • Status lifecycle guard (active requires Fleet YAML)
  • Blocks merge on any failure
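
Three of these checks (slug uniqueness, directory alignment, and the lifecycle guard) can be sketched as one function over registry entries that have already been parsed from YAML. The field names slug, status, and fleet_yaml are assumptions made for illustration, as is the check_registry name. The 14-field schema validation is not shown and is assumed to have run first, so every entry has a slug.

```python
from collections import Counter


def check_registry(entries: list[dict], agent_dirs: set[str]) -> list[str]:
    """Return registry hygiene errors; an empty list means the check passes."""
    errors = []
    slugs = [entry["slug"] for entry in entries]

    # Slug uniqueness.
    for slug, count in Counter(slugs).items():
        if count > 1:
            errors.append(f"duplicate slug: {slug}")

    # Directory alignment, checked in both directions.
    registered = set(slugs)
    for slug in sorted(agent_dirs - registered):
        errors.append(f"directory without registry entry: {slug}")
    for slug in sorted(registered - agent_dirs):
        errors.append(f"registry entry without directory: {slug}")

    # Status lifecycle guard: an active agent needs a Fleet YAML.
    for entry in entries:
        if entry.get("status") == "active" and not entry.get("fleet_yaml"):
            errors.append(f"active agent missing Fleet YAML: {entry['slug']}")
    return errors
```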

Stage 2: Merge to Main

Trigger: Push to main (after PR merge)
Automatic.

cd-staging.yml — Staging deployment

  • Builds Docker image, tags with commit SHA
  • Deploys to staging Fleet instance
  • Runs smoke test: GET /health → expect {"status": "healthy"}
  • Posts result to #exec Slack channel: commit SHA + health check status
  • If health check fails: auto-revert via git revert + alert to #exec
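
The smoke test can be sketched with the standard library alone. The function names and the 5-second timeout below are assumptions, not details of the real workflow. The response check is kept separate from the network call so it can be unit-tested without a running service.

```python
import json
import urllib.request


def is_healthy(status_code: int, body: bytes) -> bool:
    """True only for a 200 response whose JSON body has status == "healthy"."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def smoke_test(base_url: str, timeout: float = 5.0) -> bool:
    """GET {base_url}/health and report whether the service looks healthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return is_healthy(resp.status, resp.read())
    except OSError:  # URLError, HTTPError, and socket timeouts are all OSError
        return False
```

Treating every network failure as "unhealthy" keeps the revert decision simple: anything other than a clean healthy response triggers it.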

Staging is ephemeral — it’s not a long-lived environment. We use it for a 15-minute soak before production promotion.

Stage 3: Production Promotion

Trigger: Manual approval gate
Requires CTO or founder approval in GitHub.

cd-production.yml — Production deployment

  • Requires 1 approver on the production deployment workflow
  • Deploys to production Fleet instance
  • Updates agents/registry.yaml status field: deploy_ready → active
  • Creates git tag: {agent-slug}/v{semver} (immutable)
  • Posts go-live confirmation to #exec
  • Activates @ops-bot health monitoring (5-minute polling interval)
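
Two of the steps above, the status transition and the tag name, lend themselves to a small sketch. The helper names are hypothetical, the registry entry is assumed to be a dict with a status key, and the version check covers only plain MAJOR.MINOR.PATCH versions.

```python
import re

# Core semver only; pre-release and build metadata suffixes are not handled.
SEMVER_CORE = re.compile(r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$")


def release_tag(slug: str, version: str) -> str:
    """Build a tag of the form {agent-slug}/v{semver}."""
    if not SEMVER_CORE.match(version):
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    return f"{slug}/v{version}"


def promote(entry: dict) -> dict:
    """Return a copy of the registry entry moved from deploy_ready to active."""
    if entry.get("status") != "deploy_ready":
        raise ValueError(f"cannot promote from status {entry.get('status')!r}")
    return {**entry, "status": "active"}
```

For example, release_tag("finance-bot", "1.4.0") returns "finance-bot/v1.4.0".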

Why a manual gate?
We’ve seen teams break prod because they forgot to update an env var, or the staging soak hadn’t caught a slow edge-case regression. The manual gate adds 60 seconds and catches these. Worth it every time.


Prompt Versioning

Prompts live in agents/{slug}/prompts/. Every prompt file is:

  • Version-controlled in git (same branch/PR discipline as code)
  • Pinned in AGENTS.md via prompt_sha: <sha> field
  • Tested in CI via the eval gate (prompt changes must pass P0 metrics before merge)

When you update a prompt, you’re creating a PR. That PR runs the full eval suite against the new prompt. If task completion drops below 90%, the PR fails. You then fix the prompt or update the eval dataset (with justification in the PR description); you do not bypass the gate.

This is the discipline that prevents “works in staging, breaks in prod because someone tweaked the system prompt.”
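
One way to enforce the pin is to hash the prompt file and compare it with the value recorded in AGENTS.md. The sketch below makes assumptions the post does not confirm: that the pin is a SHA-256 digest of the file contents, and that it appears on its own line as "prompt_sha: <hex>". The function names are hypothetical.

```python
import hashlib
import re

# Matches a line such as "prompt_sha: 3a7bd3e2..." anywhere in AGENTS.md.
PIN_PATTERN = re.compile(r"^\s*prompt_sha:\s*([0-9a-f]+)\s*$", re.MULTILINE)


def prompt_digest(prompt_text: str) -> str:
    """SHA-256 hex digest of the prompt contents (assumed pin format)."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def pin_matches(prompt_text: str, agents_md: str) -> bool:
    """True when AGENTS.md contains a pin equal to the prompt's digest."""
    match = PIN_PATTERN.search(agents_md)
    return match is not None and match.group(1) == prompt_digest(prompt_text)
```

With a check like this in the lint stage, a prompt edit that does not also update the pin fails fast, before the slower eval run starts.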


Eval Dataset Governance

The eval dataset is the hardest part to get right. Common failure modes:

Too small. A 25-record dataset is the minimum floor for a PR gate. By the time an agent hits production, you should be at 100+ records. If you’re not, a passing eval gives you false confidence about coverage.

Not adversarial enough. Happy-path inputs are easy. The eval dataset needs edge cases: empty inputs, ambiguous requests, inputs that should trigger a refusal, inputs that are just outside the agent’s scope.

Not updated after production incidents. Every production incident is an eval record waiting to be written. When an agent misbehaves in production, the first thing you do is write a test case that reproduces the failure. Then you fix it. This is how the dataset stays representative.

Our process:

  1. Start with 25 synthetic records (bootstrapped from the agent spec using a generation script)
  2. Add 5 records per week from real-world production traces (anonymized per ADR-029)
  3. At 100 records, switch to a Transfer-phase audit that reviews the full dataset for drift and coverage gaps
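
The growth schedule implies a timeline that is worth making explicit. A short calculation, assuming the 5-records-per-week pace holds and no records are retired:

```python
import math


def weeks_to_reach(target: int, start: int = 25, per_week: int = 5) -> int:
    """Weeks of steady growth needed to go from start to target records."""
    return max(0, math.ceil((target - start) / per_week))


print(weeks_to_reach(100))  # 15
```

Starting from 25 records, the 100-record audit point arrives after 15 weeks, which is roughly three and a half months of production traffic.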

Cost as a CI Gate

Token cost is a P2 soft-gate metric in our pipeline — it doesn’t block a merge, but it generates a warning that appears in the PR. Here’s why we track it at the PR level:

A seemingly minor prompt change — adding two sentences of context — can increase average token count by 15% across all invocations. At 10,000 invocations/week, that’s real money. Catching it at PR time (when the delta is obvious) is far better than discovering it in the weekly cost roll-up.

We use LangSmith’s token tracking per run, normalized by the eval dataset, and compare against the main branch baseline. If cost per invocation increases more than 20% from baseline, it’s flagged. Not blocked — but the engineer has to explicitly acknowledge it in the PR.
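
The comparison against baseline can be sketched as follows. It assumes the per-run costs for both branches have already been exported as plain lists of numbers (how they are pulled from LangSmith is not shown), and the function names are hypothetical. Only the 20% threshold comes from the text above.

```python
from statistics import fmean

FLAG_THRESHOLD = 0.20  # flag when cost per invocation rises more than 20%


def relative_increase(baseline: float, candidate: float) -> float:
    """Fractional change in cost per invocation relative to baseline."""
    if baseline <= 0:
        raise ValueError("baseline cost must be positive")
    return (candidate - baseline) / baseline


def needs_acknowledgement(
    baseline_runs: list[float],
    candidate_runs: list[float],
    threshold: float = FLAG_THRESHOLD,
) -> bool:
    """True when the PR's mean cost per run exceeds baseline by the threshold."""
    increase = relative_increase(fmean(baseline_runs), fmean(candidate_runs))
    return increase > threshold
```

Both lists should come from runs over the same eval dataset, otherwise the difference reflects the inputs rather than the prompt change.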


Rollback Procedure

When things go wrong in production (they will):

  1. Identify: @ops-bot P1 alert fires (error rate > 5% over 15 minutes)
  2. Isolate: disable the Fleet agent’s webhook endpoint (takes 2 minutes)
  3. Assess: is this a code regression or a model drift issue?
  4. Rollback: git revert {merge-commit-sha} → PR → fast-track merge (eval gate still runs, takes ~3 minutes) → automatic staging + production deploy
  5. Verify: GET /health healthy, P1 alert clears
  6. Post-mortem: within 24 hours, document the root cause, identify the eval gap, and write a new test case

Target MTTR for P1: ≤ 30 minutes. The rollback procedure itself takes ~10 minutes. The remaining 20 minutes is diagnosis and communication overhead — which is where runbooks pay off.
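
The alert condition in step 1 can be sketched as a sliding window over recent invocations. This is an illustration, not how @ops-bot is implemented: it reads "error rate > 5% over 15 minutes" as the share of failed invocations in the last 15 minutes, and it assumes events are recorded in timestamp order. The class name is hypothetical.

```python
from collections import deque


class ErrorRateWindow:
    """Tracks invocation outcomes and flags an error rate above a threshold."""

    def __init__(self, window_seconds: float = 15 * 60, threshold: float = 0.05):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events: deque[tuple[float, bool]] = deque()  # (timestamp, is_error)

    def record(self, timestamp: float, is_error: bool) -> None:
        self._events.append((timestamp, is_error))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop events that have aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def should_alert(self, now: float) -> bool:
        self._evict(now)
        if not self._events:
            return False
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events) > self.threshold
```

An empty window never alerts, so a fully disabled endpoint (step 2) does not keep the P1 firing once the old events age out.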


The Registry as Source of Truth

Every CI pipeline we’ve described references agents/registry.yaml — the fleet registry. This is intentional. The registry is the authoritative record of:

  • What agents exist
  • Their current status (staged / deploy_ready / active / deprecated)
  • Which LangSmith project and eval dataset are associated
  • Who owns them

The ci-registry-check.yml workflow enforces registry hygiene on every PR. You cannot merge a new agent directory without a corresponding registry entry. You cannot mark an agent active without a Fleet YAML. The registry doesn’t drift because the CI won’t let it.


Summary

The CI/CD architecture that works for AI agents has three non-negotiable components traditional pipelines lack:

  1. An eval gate — LLM-as-judge metrics against a golden dataset, P0 blocking, P1 advisory, runs on every PR
  2. Prompt versioning — prompts in git, pinned in AGENTS.md, tested like code
  3. A manual production promotion gate — 60 seconds of human judgment before the agent goes live

Everything else — lint, unit tests, staging deploys, health checks — is standard DevOps discipline applied to a new artifact type.

The teams that skip the eval gate and the production gate are the same teams that ship a prompt tweak on Friday afternoon and spend Saturday rolling it back. The pipeline overhead is low. The incident cost is high. The math is obvious.
