How to Build a CI/CD Pipeline for AI Agents (That Actually Works)
Shipping a traditional microservice is hard. Shipping an AI agent is a different kind of hard.
With a microservice, “does it work” has a binary answer. With an AI agent, “does it work” means: does it produce the right output, in the right format, within acceptable latency and cost bounds, on a distribution of inputs you’ve never fully enumerated — and does it still do that after you updated the prompt?
A CI/CD pipeline that doesn’t account for this will fail you on a Tuesday at 2 AM when a prompt tweak causes your @finance-bot to start formatting invoice numbers incorrectly.
This post describes the three-stage CI/CD architecture we’ve standardized across every agent we build. It’s opinionated, battle-tested, and designed to be stolen.
Why Traditional CI/CD Breaks for Agents
Traditional CI gates check:
- Does the code compile?
- Do the unit tests pass?
- Does the Docker image build?
For a deterministic service, that’s enough. For an AI agent, it’s nowhere near enough. Here’s what’s missing:
1. The “right answer” is probabilistic.
An agent that passes unit tests today might regress next week when a model provider updates their weights. Your CI gate needs to catch model drift — not just code errors.
2. Prompts are code — but they don’t live in your linter.
A prompt change can completely alter agent behavior without touching a single .py file. If your pipeline doesn’t version and test prompts, you’re flying blind.
3. Cost is a first-class concern.
A regression that causes your agent to call GPT-4o three times per invocation instead of once costs real money at scale. LLM cost needs to be a gate, not an afterthought.
4. Eval datasets rot.
If your CI eval dataset isn’t growing, it’s getting less representative. A static 25-record golden set stops being meaningful after three months of production traffic.
The Three-Stage Pipeline
We use a three-stage model for every agent we ship. All three stages run in GitHub Actions.
Stage 1: PR Gate
Trigger: pull_request → any branch targeting main
Must pass before merge.
Four workflows run in parallel:
ci-lint.yml — Code quality gate
- ruff for Python linting (< 5 seconds)
- mypy for type checking
- bandit for security anti-patterns (catches hardcoded secrets, shell injection)
- Prompt file existence check: every agent must have agents/{slug}/prompts/system.txt
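The prompt file existence check is simple enough to sketch in a few lines. This is a minimal illustration, not the actual workflow step: the function name find_agents_missing_prompt is hypothetical, and it assumes every subdirectory of agents/ is an agent.

```python
from pathlib import Path


def find_agents_missing_prompt(agents_root: Path) -> list[str]:
    """Return the slugs of agent directories that lack prompts/system.txt."""
    missing = []
    # Only directories count as agents; files such as registry.yaml are skipped.
    for agent_dir in sorted(p for p in agents_root.iterdir() if p.is_dir()):
        if not (agent_dir / "prompts" / "system.txt").is_file():
            missing.append(agent_dir.name)
    return missing
```

A CI step could call find_agents_missing_prompt(Path("agents")) and exit non-zero when the returned list is not empty.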
ci-test.yml — Unit and integration tests
- pytest with 80% coverage floor
- Node-level unit tests (mocked LLM responses; never call real APIs in CI)
- Integration tests against test fixtures (Qdrant test collection, stub Slack webhook)
- Timeout: 10 minutes hard cap
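To make "mocked LLM responses" concrete, here is a sketch of a node-level test. The node summarize_node and its llm.invoke interface are hypothetical stand-ins for whatever your agent framework provides; the point is only that the test never touches a real API.

```python
from unittest.mock import Mock


def summarize_node(state: dict, llm) -> dict:
    """Hypothetical agent node: asks the LLM for a summary and stores it."""
    reply = llm.invoke(f"Summarize: {state['text']}")
    return {**state, "summary": reply.strip()}


def test_summarize_node_uses_mocked_llm():
    # The mock stands in for the model client, so no network call is made.
    fake_llm = Mock()
    fake_llm.invoke.return_value = "  Invoice INV-001 is overdue.  "

    result = summarize_node({"text": "long invoice thread"}, fake_llm)

    assert result["summary"] == "Invoice INV-001 is overdue."
    fake_llm.invoke.assert_called_once_with("Summarize: long invoice thread")
```

Because the response is fixed, the test is deterministic and checks the node's own logic (prompt construction and post-processing), which is all a unit test should cover here.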
ci-eval.yml — LLM evaluation gate
- Loads golden eval dataset from LangSmith (ar-{slug}-eval-v*)
- Minimum dataset size: 25 records (blocks merge if below threshold)
- Runs 4 P0 hard-gate metrics:
  - Task completion rate ≥ 90%
  - Format compliance ≥ 95%
  - Hallucination rate ≤ 5% (LLM-as-judge)
  - Tool call accuracy ≥ 85%
- Any P0 failure: merge blocked, no exceptions
- Runs 4 P1 soft-gate metrics (warning only, not blocking):
  - P95 latency
  - Token efficiency (output tokens / task value score)
  - Self-consistency (same input → consistent output across 3 runs)
  - Adversarial robustness score
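The P0 gate logic can be sketched as a pure function over already-computed metrics. The thresholds come from the list above; the metric key names and the function evaluate_p0 are assumptions made for illustration, and computing the metrics themselves (for example via LangSmith) is out of scope here.

```python
# P0 thresholds from the list above: metric key -> (comparison, limit).
# The key names are illustrative assumptions, not a fixed schema.
P0_GATES = {
    "task_completion_rate": (">=", 0.90),
    "format_compliance": (">=", 0.95),
    "hallucination_rate": ("<=", 0.05),
    "tool_call_accuracy": (">=", 0.85),
}
MIN_DATASET_SIZE = 25


def evaluate_p0(metrics: dict[str, float], dataset_size: int) -> list[str]:
    """Return blocking failures; an empty list means the gate passes."""
    failures = []
    if dataset_size < MIN_DATASET_SIZE:
        failures.append(
            f"dataset has {dataset_size} records, need {MIN_DATASET_SIZE}"
        )
    for name, (op, limit) in P0_GATES.items():
        value = metrics.get(name)
        if value is None:
            # A metric that was never computed must not pass silently.
            failures.append(f"{name}: missing")
            continue
        ok = value >= limit if op == ">=" else value <= limit
        if not ok:
            failures.append(f"{name}: {value:.2f} violates {op} {limit:.2f}")
    return failures
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a gate that passes when the evaluator crashes or skips a metric is not a gate.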
ci-registry-check.yml — Fleet registry validation
- Validates agents/registry.yaml against schema (14 required fields)
- Slug uniqueness check
- Directory alignment check (every directory has a registry entry, every registry entry has a directory)
- Status lifecycle guard (active requires Fleet YAML)
- Blocks merge on any failure
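Three of these checks (slug uniqueness, directory alignment, and the status lifecycle guard) can be sketched against an already-parsed registry. The field names slug, status, and fleet_yaml are assumptions for illustration; the real schema has 14 required fields that are not listed here, so schema validation is left out.

```python
def check_registry(entries: list[dict], agent_dirs: set[str]) -> list[str]:
    """Return registry errors; an empty list means the registry is consistent.

    entries: parsed registry records (assumed keys: slug, status, fleet_yaml).
    agent_dirs: names of the directories found under agents/.
    """
    errors = []
    slugs = [e["slug"] for e in entries]

    # Slug uniqueness.
    duplicates = sorted({s for s in slugs if slugs.count(s) > 1})
    errors += [f"duplicate slug: {s}" for s in duplicates]

    # Directory alignment, checked in both directions.
    registered = set(slugs)
    errors += [
        f"directory without registry entry: {s}"
        for s in sorted(agent_dirs - registered)
    ]
    errors += [
        f"registry entry without directory: {s}"
        for s in sorted(registered - agent_dirs)
    ]

    # Status lifecycle guard: active requires a Fleet YAML.
    for e in entries:
        if e.get("status") == "active" and not e.get("fleet_yaml"):
            errors.append(f"active agent without Fleet YAML: {e['slug']}")
    return errors
```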
Stage 2: Merge to Main
Trigger: Push to main (after PR merge)
Automatic.
cd-staging.yml — Staging deployment
- Builds Docker image, tags with commit SHA
- Deploys to staging Fleet instance
- Runs smoke test: GET /health → expect {"status": "healthy"}
- Posts result to #exec Slack channel: commit SHA + health check status
- If health check fails: auto-revert via git revert + alert to #exec
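The smoke test amounts to one HTTP request and one JSON check. Here is a standard-library sketch; the function names and the 5-second timeout are assumptions, and the Slack notification and revert steps are not shown.

```python
import json
from urllib.request import urlopen


def is_healthy(body: bytes) -> bool:
    """True only if the body is a JSON object with status == "healthy"."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def smoke_test(base_url: str, timeout: float = 5.0) -> bool:
    """GET {base_url}/health and check the response body."""
    try:
        with urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200 and is_healthy(resp.read())
    except OSError:  # URLError, HTTPError, and socket timeouts
        return False
```

Keeping the body check separate from the network call makes the parsing logic testable without a running server.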
Staging is ephemeral — it’s not a long-lived environment. We use it for a 15-minute soak before production promotion.
Stage 3: Production Promotion
Trigger: Manual approval gate
Requires CTO or founder approval in GitHub.
cd-production.yml — Production deployment
- Requires 1 approver on the production deployment workflow
- Deploys to production Fleet instance
- Updates agents/registry.yaml status field: deploy_ready → active
- Creates git tag: {agent-slug}/v{semver} (immutable)
- Posts go-live confirmation to #exec
- Activates @ops-bot health monitoring (5-minute polling interval)
Why a manual gate?
We’ve seen teams break prod because they forgot to update an env var, or the staging soak hadn’t caught a slow edge-case regression. The manual gate adds 60 seconds and catches these. Worth it every time.
Prompt Versioning
Prompts live in agents/{slug}/prompts/. Every prompt file is:
- Version-controlled in git (same branch/PR discipline as code)
- Pinned in AGENTS.md via prompt_sha: <sha> field
- Tested in CI via the eval gate (prompt changes must pass P0 metrics before merge)
When you update a prompt, you’re creating a PR. That PR runs the full eval suite against the new prompt. If task completion drops below 90%, the PR fails. You then fix the prompt or update the eval dataset (with justification in the PR description); you do not bypass the gate.
This is the discipline that prevents “works in staging, breaks in prod because someone tweaked the system prompt.”
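One way to enforce the pin is to recompute the prompt's hash in CI and compare it with the value recorded in AGENTS.md. The sketch below assumes prompt_sha is a SHA-256 hex digest of the prompt file's contents and that it appears on its own line; the post does not specify either, so treat both as illustrative.

```python
import hashlib
import re
from typing import Optional


def prompt_sha(prompt_text: str) -> str:
    """SHA-256 hex digest of the prompt text (assumed hashing scheme)."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def pinned_sha(agents_md: str) -> Optional[str]:
    """Extract the value of a 'prompt_sha: <sha>' line, if present."""
    match = re.search(
        r"^\s*prompt_sha:\s*([0-9a-f]+)\s*$", agents_md, re.MULTILINE
    )
    return match.group(1) if match else None


def pin_matches(prompt_text: str, agents_md: str) -> bool:
    """True if AGENTS.md pins exactly the current prompt contents."""
    return pinned_sha(agents_md) == prompt_sha(prompt_text)
```

With a check like this, editing the prompt without updating the pin fails CI, which forces the change through the same PR as the eval run.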
Eval Dataset Governance
The eval dataset is the hardest part to get right. Common failure modes:
Too small. A 25-record dataset is the minimum floor for a PR gate. By the time an agent hits production, you should be at 100+ records. If you’re not, your eval coverage gives you false confidence.
Not adversarial enough. Happy-path inputs are easy. The eval dataset needs edge cases: empty inputs, ambiguous requests, inputs that should trigger a refusal, inputs that are just outside the agent’s scope.
Not updated after production incidents. Every production incident is an eval record waiting to be written. When an agent misbehaves in production, the first thing you do is write a test case that reproduces the failure. Then you fix it. This is how the dataset stays representative.
Our process:
- Start with 25 synthetic records (bootstrapped from the agent spec using a generation script)
- Add 5 records per week from real-world production traces (anonymized per ADR-029)
- At 100 records, switch to a Transfer-phase audit that reviews the full dataset for drift and coverage gaps
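As a quick sanity check on the cadence above, the time to reach the 100-record audit threshold follows directly from the starting size and the weekly growth rate. The helper name weeks_to_reach is hypothetical.

```python
import math


def weeks_to_reach(target: int, start: int = 25, per_week: int = 5) -> int:
    """Weeks needed to grow the dataset from start to target records."""
    if per_week <= 0:
        raise ValueError("per_week must be positive")
    return max(0, math.ceil((target - start) / per_week))
```

With the defaults, weeks_to_reach(100) is 15: starting from 25 records and adding 5 per week, the dataset reaches 100 records after roughly 15 weeks of production traces.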
Cost as a CI Gate
Token cost is a P2 soft-gate metric in our pipeline — it doesn’t block a merge, but it generates a warning that appears in the PR. Here’s why we track it at the PR level:
A seemingly minor prompt change — adding two sentences of context — can increase average token count by 15% across all invocations. At 10,000 invocations/week, that’s real money. Catching it at PR time (when the delta is obvious) is far better than discovering it in the weekly cost roll-up.
We use LangSmith’s token tracking per run, normalized by the eval dataset, and compare against the main branch baseline. If cost per invocation increases more than 20% from baseline, it’s flagged. Not blocked — but the engineer has to explicitly acknowledge it in the PR.
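The comparison itself is a relative delta against the main branch baseline. A minimal sketch, assuming both costs have already been averaged over the same eval dataset (the function names and message format are illustrative):

```python
from typing import Optional


def cost_delta(baseline_cost: float, pr_cost: float) -> float:
    """Relative change in cost per invocation versus the baseline."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return (pr_cost - baseline_cost) / baseline_cost


def cost_warning(
    baseline_cost: float, pr_cost: float, threshold: float = 0.20
) -> Optional[str]:
    """Return a warning message if cost rose by more than the threshold."""
    delta = cost_delta(baseline_cost, pr_cost)
    if delta > threshold:
        return (
            f"cost per invocation up {delta:.0%} vs main "
            f"(threshold {threshold:.0%})"
        )
    return None
```

Returning a message instead of raising keeps this a soft gate: the workflow can post the warning on the PR without failing the check.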
Rollback Procedure
When things go wrong in production (they will):
- Identify: @ops-bot P1 alert fires (error rate > 5% over 15 minutes)
- Isolate: disable the Fleet agent’s webhook endpoint (takes 2 minutes)
- Assess: is this a code regression or a model drift issue?
- Rollback: git revert {merge-commit-sha} → PR → fast-track merge (eval gate still runs, takes ~3 minutes) → automatic staging + production deploy
- Verify: GET /health healthy, P1 alert clears
- Post-mortem (within 24 hours): root cause, eval gap identified, new test case written
Target MTTR for P1: ≤ 30 minutes. The rollback procedure itself takes ~10 minutes. The remaining 20 minutes is diagnosis and communication overhead — which is where runbooks pay off.
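The P1 trigger in the first step (error rate above 5% over 15 minutes) can be modeled as a sliding window over request outcomes. This is a sketch of the rule only, not of how @ops-bot is implemented; the class name and its interface are assumptions.

```python
from collections import deque


class ErrorRateWindow:
    """Tracks request outcomes and reports the error rate over a time window."""

    def __init__(self, window_seconds: float = 15 * 60, threshold: float = 0.05):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events: deque = deque()  # (timestamp, is_error), oldest first

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def record(self, timestamp: float, is_error: bool) -> None:
        """Record one request outcome; timestamps must be non-decreasing."""
        self._events.append((timestamp, is_error))
        self._evict(timestamp)

    def error_rate(self, now: float) -> float:
        self._evict(now)
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)

    def should_alert(self, now: float) -> bool:
        """True when the windowed error rate strictly exceeds the threshold."""
        return self.error_rate(now) > self.threshold
```

A poller running at the 5-minute interval mentioned earlier could call should_alert(now) on each tick. An empty window reports a rate of 0.0, so an idle agent never pages anyone.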
The Registry as Source of Truth
Every CI pipeline we’ve described references agents/registry.yaml — the fleet registry. This is intentional. The registry is the authoritative record of:
- What agents exist
- Their current status (staged / deploy_ready / active / deprecated)
- Which LangSmith project and eval dataset are associated
- Who owns them
The ci-registry-check.yml workflow enforces registry hygiene on every PR. You cannot merge a new agent directory without a corresponding registry entry. You cannot mark an agent active without a Fleet YAML. The registry doesn’t drift because the CI won’t let it.
Summary
The CI/CD architecture that works for AI agents has three non-negotiable components traditional pipelines lack:
- An eval gate — LLM-as-judge metrics against a golden dataset, P0 blocking, P1 advisory, runs on every PR
- Prompt versioning — prompts in git, pinned in AGENTS.md, tested like code
- A manual production promotion gate — 60 seconds of human judgment before the agent goes live
Everything else — lint, unit tests, staging deploys, health checks — is standard DevOps discipline applied to a new artifact type.
The teams that skip the eval gate and the production gate are the same teams that ship a prompt tweak on Friday afternoon and spend Saturday rolling it back. The pipeline overhead is low. The incident cost is high. The math is obvious.
Ready to operationalize your agent fleet?
The Diagnostic Sprint identifies your highest-leverage agentic use cases and delivers the first production-ready agents — with full CI/CD, eval gates, and runbooks included.