The AI Agent Stack: What Every CTO Needs to Know in 2026

If you’re a CTO evaluating AI agents in 2026, the core challenge isn’t deciding whether to build — it’s deciding what to build with. The tooling ecosystem has matured rapidly, but it’s also fragmented. You have credible choices at every layer of the stack, and the wrong pick at any layer creates technical debt that’s expensive to unwind.

This guide documents the production AI agent stack we use at Agentic Runbook — the decisions we’ve already made, the pitfalls we’ve already hit, and the framework that helps engineering leaders evaluate their own architecture before committing.

There are six layers, and each requires a decision. Let’s go through them.

Layer 1: The LLM Layer

The foundation of any agent is the model it reasons with. In 2026, the three most common choices for production workloads are GPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), and Gemini 1.5 Pro (Google). Each has a different profile.

GPT-4o is the default for most teams for good reason: it has the broadest tool-calling support, the most extensive fine-tuning ecosystem, and the deepest integrations with the orchestration frameworks that matter (LangGraph, in particular). For agents that need reliable structured output, JSON mode, and function calling at scale, GPT-4o is the lowest-friction choice.

Claude 3.5 Sonnet has a longer context window than GPT-4o (200K tokens versus 128K), and it tends both to reason more carefully through complex multi-step tasks and to hallucinate less on knowledge-intensive workflows. If your agent is processing long documents, writing code, or making consequential decisions where over-confidence is a risk, Claude 3.5 Sonnet is worth serious evaluation.

Gemini 1.5 Pro is the right call when you’re already deep in the Google Cloud ecosystem or need native multimodal capabilities (images, audio, video) as first-class inputs. Its 1M token context window is unmatched for use cases where the entire relevant corpus fits in a single prompt.

The common pitfall: committing to a single model across all agent tasks. Production agents benefit from tiered model execution — use a faster, cheaper model (GPT-4o mini, Claude Haiku) for classification, routing, and low-stakes subtasks; reserve the frontier model for the steps that actually require it. This can cut inference costs by 60–80% without meaningful quality loss.
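
To see where savings of that size can come from, here is a back-of-the-envelope calculation. The per-call prices and the 80/20 traffic split are hypothetical placeholders, not vendor quotes; substitute your own measured numbers.

```python
# Hypothetical average cost per call in dollars; replace with your own measurements.
FRONTIER_COST_PER_CALL = 0.030
SMALL_COST_PER_CALL = 0.002


def blended_cost(small_share: float) -> float:
    """Average cost per call when `small_share` of calls go to the small model."""
    return small_share * SMALL_COST_PER_CALL + (1 - small_share) * FRONTIER_COST_PER_CALL


def savings(small_share: float) -> float:
    """Fractional saving versus sending every call to the frontier model."""
    return 1 - blended_cost(small_share) / FRONTIER_COST_PER_CALL


# Route 80% of calls (classification, routing) to the small model.
print(f"{savings(0.8):.0%}")  # prints 75%
```

The saving depends entirely on the price gap and on how much traffic can safely move to the smaller tier, so measure both before relying on any headline figure.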

Decision criteria: Start with GPT-4o unless you have a specific reason not to. Evaluate Claude 3.5 Sonnet for document-heavy or high-stakes reasoning tasks. Build the model call behind an abstraction layer so you can swap without rearchitecting.
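
A minimal sketch of such an abstraction layer, combined with tiered routing, might look like the following. The class name, tier names, and task types are illustrative assumptions, and the stub functions stand in for real SDK calls so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A model client is any callable that maps a prompt to a completion.
ModelFn = Callable[[str], str]


@dataclass
class ModelRouter:
    """Routes each task type to a model tier. All names here are illustrative."""
    tiers: Dict[str, ModelFn]       # tier name -> client
    task_tiers: Dict[str, str]      # task type -> tier name
    default_tier: str = "frontier"  # unknown task types fall back to this tier

    def complete(self, task_type: str, prompt: str) -> str:
        tier = self.task_tiers.get(task_type, self.default_tier)
        return self.tiers[tier](prompt)


# Stub clients stand in for vendor SDK calls.
def small_model(prompt: str) -> str:
    return f"[small] {prompt}"


def frontier_model(prompt: str) -> str:
    return f"[frontier] {prompt}"


router = ModelRouter(
    tiers={"small": small_model, "frontier": frontier_model},
    task_tiers={"classify": "small", "route": "small"},
)
print(router.complete("classify", "Is this a refund request?"))
print(router.complete("plan", "Draft a migration plan."))
```

Because the rest of the agent only calls router.complete(), swapping a vendor means changing one entry in tiers rather than touching every call site.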

Layer 2: The Orchestration Layer

The orchestration layer is where most teams make their most consequential architectural mistake. The question is not “should I use an orchestration framework?” — it’s “which one, and do I understand what it’s actually for?”

LangGraph is the right choice for production AI agents. It models agent workflows as directed graphs where nodes represent steps (LLM calls, tool calls, routing logic, human checkpoints) and edges represent control flow. State is a first-class primitive: you define a typed state schema, and every node can read from and write to that state. State persists across transitions via checkpointing, enabling pause-and-resume, human-in-the-loop, and fault-tolerant execution.

The graph model matters because real agent workflows are not linear. They branch (“did the API call succeed? if not, retry with a modified query”), they loop (“keep searching until the answer meets the quality threshold”), and they pause (“request human approval before sending this email”). LangGraph handles all of these natively. Raw LangChain chains require increasingly complex workarounds.
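
The branching and looping described above can be illustrated with a tiny plain-Python state machine. This is a conceptual sketch only and deliberately does not use LangGraph’s API; the node names, the quality score, and the retry limit are assumptions made for the example.

```python
from typing import Callable, Dict, TypedDict


class State(TypedDict):
    query: str
    answer: str
    quality: float
    attempts: int


def search(state: State) -> State:
    # Stand-in for an LLM or tool call: quality improves with each attempt.
    attempts = state["attempts"] + 1
    return {**state, "answer": f"draft {attempts}", "quality": 0.4 * attempts, "attempts": attempts}


def finalize(state: State) -> State:
    return {**state, "answer": state["answer"] + " (final)"}


def route_after_search(state: State) -> str:
    # Conditional edge: loop until the threshold is met or retries run out.
    if state["quality"] >= 0.8 or state["attempts"] >= 3:
        return "finalize"
    return "search"


NODES: Dict[str, Callable[[State], State]] = {"search": search, "finalize": finalize}
EDGES: Dict[str, Callable[[State], str]] = {
    "search": route_after_search,
    "finalize": lambda state: "END",
}


def run(state: State, entry: str = "search") -> State:
    node = entry
    while node != "END":
        state = NODES[node](state)   # execute the node
        node = EDGES[node](state)    # pick the next node from the updated state
    return state


result = run({"query": "q", "answer": "", "quality": 0.0, "attempts": 0})
print(result["answer"], result["attempts"])  # prints: draft 2 (final) 2
```

A framework adds what this toy lacks: checkpointing between nodes, pause and resume, and tracing.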

Why not raw LangChain? LangChain’s chain abstraction is productive for single-pass pipelines — RAG, document Q&A, simple extraction. It’s the wrong abstraction the moment your workflow needs persistent state, conditional branching, or multi-agent coordination. Teams that start with LangChain chains and try to grow them into production agents end up maintaining framework-adjacent code: using LangChain’s components while building their own orchestration layer on top. That’s the worst of both worlds.

Why not raw OpenAI function calling? It’s viable for simple, single-agent workflows. It breaks down at multi-agent coordination, complex state management, and observability. You end up rebuilding what LangGraph already provides.

Decision criteria: Default to LangGraph for any workflow that will run in production. Use LangChain components (document loaders, embedding wrappers, output parsers) as building blocks within the LangGraph orchestration layer — that’s a clean pattern that gives you both.

Layer 3: The Observability Layer

You cannot operate production AI agents without observability. This is not a nice-to-have — it’s the mechanism that tells you whether your agent is working correctly, where latency is accumulating, what happened during a specific failure, and whether a model change improved or degraded quality.

LangSmith is the observability layer for the LangGraph stack. It captures full traces for every execution: every LLM call, tool invocation, and graph node transition, with inputs, outputs, latency, and token counts logged per step. For a complex multi-agent workflow with 30 LLM calls and 15 tool calls, you get a structured trace that shows exactly what happened and in what order.
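
To make the idea of a per-step trace concrete, here is a plain-Python sketch of the kind of record a tracing layer collects. It is not LangSmith’s implementation or API; the TRACE buffer, the traced decorator, and the lookup_order tool are hypothetical names for illustration.

```python
import time
from functools import wraps
from typing import Any, Callable, Dict, List

TRACE: List[Dict[str, Any]] = []  # hypothetical in-process trace buffer


def traced(step_name: str) -> Callable:
    """Record inputs, output, and latency for each call of the wrapped step."""
    def decorator(fn: Callable) -> Callable:
        @wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            output = fn(*args, **kwargs)
            TRACE.append({
                "step": step_name,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": output,
                "latency_ms": (time.perf_counter() - start) * 1000,
            })
            return output
        return wrapper
    return decorator


@traced("lookup_order")
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}  # stand-in for a tool call


lookup_order("A-100")
print(TRACE[0]["step"], TRACE[0]["output"])
```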

Beyond tracing, LangSmith provides three capabilities that become critical at production scale:

Evaluations: Run your agent against a labeled dataset and score outputs against defined criteria. This is how you measure whether a model upgrade actually improved quality, or just changed the error pattern.
Cost tracking: Token usage and cost per trace, aggregated by agent, user, or time period. This is how you catch model calls that are consuming disproportionate budget.
Feedback collection: Capture explicit thumbs-up/thumbs-down signals or annotation labels from human reviewers, tied to specific traces. This builds your eval dataset over time.
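
The evaluation idea reduces to a small loop: run the agent over labeled examples and score the outputs. The sketch below uses a made-up three-row dataset and a keyword classifier as a stand-in agent; it illustrates the concept rather than LangSmith’s evaluation API.

```python
from typing import Callable, Dict, List

# Hypothetical labeled dataset: inputs paired with expected labels.
DATASET: List[Dict[str, str]] = [
    {"input": "I want my money back", "expected": "refund"},
    {"input": "Where is my package?", "expected": "shipping"},
    {"input": "Cancel my subscription", "expected": "cancellation"},
]


def baseline_agent(text: str) -> str:
    # Stand-in for the real agent: a keyword classifier.
    lowered = text.lower()
    if "money back" in lowered:
        return "refund"
    if "package" in lowered:
        return "shipping"
    return "other"


def exact_match_accuracy(agent: Callable[[str], str], dataset: List[Dict[str, str]]) -> float:
    """Fraction of examples where the agent output equals the expected label."""
    hits = sum(agent(row["input"]) == row["expected"] for row in dataset)
    return hits / len(dataset)


score = exact_match_accuracy(baseline_agent, DATASET)
print(f"accuracy: {score:.2f}")  # prints accuracy: 0.67
```

Running the same dataset before and after a model change turns "it feels better" into a number you can compare.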

The common pitfall: adding LangSmith after launch. Teams that skip observability during development end up debugging production issues from user reports and print() statements. Instrumenting LangSmith from day one — it’s a two-line setup — means you have trace history from the first run.

Decision criteria: Use LangSmith. Set LANGCHAIN_TRACING_V2=true, along with your LangSmith API key in LANGCHAIN_API_KEY, in your environment from the first day of development. Define at least two automated eval criteria before you ship to production.
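
One way to apply that setting from Python during local development is sketched below. Tracing also needs an API key, read from LANGCHAIN_API_KEY; the project name "agent-dev" and the tracing_ready helper are assumptions for illustration. In deployed environments these values would normally come from your secrets manager rather than from code.

```python
import os

# setdefault keeps any value already exported in the shell or CI environment.
os.environ.setdefault("LANGCHAIN_TRACING_V2", "true")
# Optional: group traces under a project. "agent-dev" is a hypothetical name.
os.environ.setdefault("LANGCHAIN_PROJECT", "agent-dev")


def tracing_ready() -> bool:
    """True when tracing is switched on and an API key is present."""
    return (
        os.environ.get("LANGCHAIN_TRACING_V2") == "true"
        and bool(os.environ.get("LANGCHAIN_API_KEY"))
    )


if not tracing_ready():
    # Never hard-code the key; export it in the shell or inject it as a secret.
    print("Tracing is not fully configured: export LANGCHAIN_API_KEY before running the agent.")
```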

Layer 4: The Memory Layer

Memory in an AI agent isn’t a single thing — it’s two distinct layers that serve different purposes and require different implementations.

Short-term memory (in-context) is the state that exists within a single agent run. In LangGraph, this is the state object: a TypedDict that holds the message history, intermediate results, tool outputs, and any other information the agent needs during the current execution. Short-term memory is fast, cheap, and automatically scoped to the run — it doesn’t require an external store.

The design question for short-term memory is what goes in the state schema. Bloated state objects slow down every LLM call that receives them (more tokens to process), enlarge every checkpoint, and complicate debugging. Keep the state schema minimal: only the data that needs to flow between nodes.
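
A minimal state schema might look like the sketch below. The field names are assumptions. The Annotated reducer follows the convention LangGraph uses to decide how a node’s update is merged into a field; the apply_update helper is a plain-Python imitation of that merge so the example runs without the library.

```python
import operator
from typing import Annotated, Any, Dict, TypedDict, get_type_hints


class AgentState(TypedDict):
    # Updates to this field are appended rather than overwritten.
    messages: Annotated[list[str], operator.add]
    query: str
    attempts: int


def apply_update(state: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Merge a node's partial update into the state, honoring per-field reducers."""
    hints = get_type_hints(AgentState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        reducer = getattr(hints[key], "__metadata__", (None,))[0]
        merged[key] = reducer(state[key], value) if reducer else value
    return merged


state = {"messages": ["user: hi"], "query": "hi", "attempts": 0}
state = apply_update(state, {"messages": ["assistant: hello"], "attempts": 1})
print(state["messages"], state["attempts"])
```

Three fields are enough for this toy agent; anything a node can recompute cheaply is a candidate for leaving out of the schema.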

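Long-term memory (persistent) is information that should survive beyond a single run: user preferences, prior conversation history, completed task records, domain knowledge. In LangGraph, persistence starts with checkpointers: MemorySaver for development/testing (in-memory, not durable) and AsyncPostgresSaver for production (persists to PostgreSQL, supports concurrent sessions, survives restarts). Checkpointers save state per thread, meaning one conversation or task; information that must be shared across threads, such as user preferences, belongs in LangGraph’s separate store interface or in your own database.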

For semantic long-term memory — facts and documents that should be retrieved by meaning, not by exact key — you need a vector store. The production choice at Agentic Runbook is Qdrant: self-hostable, fast at scale, clean Python client, and straightforward integration with LangGraph’s retrieval nodes. Alternatives include Pinecone (managed, less operational overhead) and pgvector (PostgreSQL extension, good for teams that want to minimize infrastructure footprint).
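
Retrieval by meaning comes down to comparing embedding vectors. The toy sketch below uses made-up three-dimensional vectors and a linear scan; a real deployment would get embeddings from a model and delegate the search to a vector store such as the ones named above. Nothing here is Qdrant’s API.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy 3-dimensional embeddings, invented for illustration.
MEMORY: Dict[str, List[float]] = {
    "User prefers weekly summaries": [0.9, 0.1, 0.0],
    "Invoice 88 was disputed in March": [0.1, 0.9, 0.2],
    "Deploys are frozen on Fridays": [0.0, 0.2, 0.9],
}


def retrieve(query_vec: List[float], top_k: int = 1) -> List[Tuple[str, float]]:
    """Return the top_k stored facts most similar to the query vector."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in MEMORY.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


print(retrieve([0.2, 0.8, 0.1])[0][0])  # prints Invoice 88 was disputed in March
```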

The common pitfall: using MemorySaver in production. It’s fine for development — fast setup, no dependencies. But it stores state in memory, which means every restart loses all history. A production agent that forgets everything on redeploy is not a production agent.
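
The difference is easy to demonstrate with a toy checkpointer. In this sketch SQLite stands in for PostgreSQL and a plain dict stands in for an in-process saver; the class and its methods are hypothetical and are not LangGraph’s checkpointer interface.

```python
import json
import os
import sqlite3
import tempfile
from typing import Optional


class SqliteCheckpointer:
    """Toy durable checkpointer keyed by thread_id; SQLite stands in for PostgreSQL."""

    def __init__(self, path: str) -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints (thread_id TEXT PRIMARY KEY, state TEXT)"
        )

    def put(self, thread_id: str, state: dict) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?)",
            (thread_id, json.dumps(state)),
        )
        self.conn.commit()

    def get(self, thread_id: str) -> Optional[dict]:
        row = self.conn.execute(
            "SELECT state FROM checkpoints WHERE thread_id = ?", (thread_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None


db_path = os.path.join(tempfile.mkdtemp(), "checkpoints.db")

in_memory = {"thread-1": {"step": 3}}              # what an in-process saver holds
SqliteCheckpointer(db_path).put("thread-1", {"step": 3})

in_memory = {}                                     # simulate a process restart
restored = SqliteCheckpointer(db_path).get("thread-1")
print(in_memory.get("thread-1"), restored)         # prints: None {'step': 3}
```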

Decision criteria: Start with MemorySaver in development. Migrate to AsyncPostgresSaver before any production deployment. Add a vector store only when your agent genuinely needs semantic retrieval — it’s real complexity. Don’t add it by default.

Layer 5: The Tool Layer

Tools are how agents interact with the world: APIs, databases, file systems, external services. The tool layer is where agents create real value — and where they create real risk.

The significant development in 2025–2026 is the emergence of MCP (Model Context Protocol) as a standardization layer for tool integrations. MCP is an open protocol (originally from Anthropic, now broadly adopted) that defines a standard interface for exposing tools to agents. Instead of every team writing custom tool wrappers for Slack, GitHub, Salesforce, and their internal APIs, MCP-compatible tools expose a standard schema that any MCP-compatible agent can discover and call.
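
For a sense of what that standard schema looks like on the wire, the sketch below shows the general shape of an MCP tool description and of the JSON-RPC 2.0 request used to invoke it. The search_issues tool and its arguments are hypothetical; check the protocol specification for the full set of fields.

```python
import json

# Shape of one tool entry as returned by an MCP server's tools/list method.
# The tool itself ("search_issues") is hypothetical.
tool_description = {
    "name": "search_issues",
    "description": "Search issues in a tracker by keyword.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

# JSON-RPC 2.0 request an agent-side client would send to invoke that tool.
call_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "search_issues", "arguments": {"query": "login timeout"}},
}

print(json.dumps(call_request, indent=2))
```

Because every server describes its tools this way, an agent can discover and call them without tool-specific wrapper code.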

The practical impact: the ecosystem of pre-built, production-quality MCP tool servers is growing rapidly. For common integration targets — web search, code execution, database queries, file operations — you can often use an existing MCP server rather than building a custom tool. This reduces the tool layer from a bespoke engineering problem to a configuration problem.

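For tools without MCP coverage, LangGraph’s tool node abstraction is the right pattern: define a Python function with a typed signature and a docstring, then annotate it with @tool (from LangChain core). The decorator generates the JSON schema from the signature, and the tool node executes the model’s tool calls and can return tool errors to the model instead of crashing the run. Retries are not automatic; configure them per node with a retry policy.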
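
To show what schema generation from a typed signature involves, here is a plain-Python approximation. It is not the @tool decorator’s implementation, it handles only a few primitive types, and the get_invoice tool is hypothetical.

```python
import inspect
from typing import Any, Callable, Dict

_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def schema_from_signature(fn: Callable) -> Dict[str, Any]:
    """Build a JSON-schema-style tool description from type hints and docstring."""
    properties: Dict[str, Any] = {}
    required = []
    for name, param in inspect.signature(fn).parameters.items():
        # Unknown or missing annotations fall back to "string" in this sketch.
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


def get_invoice(invoice_id: str, include_lines: bool = False) -> dict:
    """Fetch an invoice by ID (hypothetical tool)."""
    return {"invoice_id": invoice_id, "lines": [] if include_lines else None}


schema = schema_from_signature(get_invoice)
print(schema["parameters"]["required"])  # prints ['invoice_id']
```

The practical lesson is that the type hints and the docstring are the interface the model sees, so they deserve the same care as a public API.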

The common pitfall: tools with no guardrails. An agent that can write to a production database or send emails to external recipients needs explicit bounds: read-only vs. read-write permissions, human approval for high-stakes actions, audit logging for every tool call. LangGraph’s interrupt() mechanism is the primitive for human-in-the-loop approval gates.
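
Those three bounds (permissions, approval, audit logging) can be sketched as a thin wrapper around tool calls. All names below are hypothetical. The sketch signals a pending approval with an exception to stay library-free; in a LangGraph agent the pause would go through interrupt() instead.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set


class ApprovalRequired(Exception):
    """Raised when a high-stakes call needs a human decision first."""


@dataclass
class GuardedTools:
    allowed: Set[str]            # tools this agent may call at all
    needs_approval: Set[str]     # subset that requires human sign-off
    audit_log: List[Dict[str, Any]] = field(default_factory=list)

    def call(self, name: str, fn: Callable[..., Any], approved: bool = False, **kwargs: Any) -> Any:
        entry = {"tool": name, "args": kwargs, "outcome": "denied"}
        self.audit_log.append(entry)  # log every attempt, including refused ones
        if name not in self.allowed:
            raise PermissionError(f"tool {name!r} is not permitted")
        if name in self.needs_approval and not approved:
            entry["outcome"] = "awaiting_approval"
            raise ApprovalRequired(name)
        entry["outcome"] = "executed"
        return fn(**kwargs)


def send_email(to: str, body: str) -> str:
    return f"sent to {to}"  # stand-in for a real side effect


tools = GuardedTools(allowed={"send_email"}, needs_approval={"send_email"})
try:
    tools.call("send_email", send_email, to="a@example.com", body="hi")
except ApprovalRequired:
    # A human reviewer approves, and the call is retried with approved=True.
    result = tools.call("send_email", send_email, approved=True, to="a@example.com", body="hi")

print(result, [entry["outcome"] for entry in tools.audit_log])
```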

Decision criteria: Evaluate MCP compatibility before building a custom tool. For custom tools, define the permission model before writing the first line of code. Log every tool call — LangSmith does this automatically.

Layer 6: The Deployment Layer

The deployment layer is where many teams underinvest. Agents that run correctly in a local notebook frequently behave differently in production: different environment variables, different concurrency behavior, different memory pressure. Choose the deployment architecture to match the agent’s execution model.

Three patterns cover most production use cases:

AWS Lambda is the right choice for event-driven, short-duration agent invocations: webhook handlers, async task processors, document-triggered workflows. A single invocation is capped at 15 minutes, and Lambda’s cold start latency (500ms–2s) is acceptable for async tasks but unacceptable for synchronous user-facing interactions. Benefit: zero infrastructure management, pay-per-invocation pricing.

Amazon ECS (Fargate) is the right choice for long-running agents, streaming responses, and high-concurrency workloads. ECS containers have no cold start penalty, support WebSocket connections for streaming, and can be right-sized for memory-intensive agents. Benefit: predictable latency, full control over runtime environment.

Fleet by LangSmith is a managed deployment platform purpose-built for LangGraph agents. It handles containerization, scaling, and LangSmith integration automatically. If you want to minimize deployment operational overhead and you’re already committed to the LangGraph/LangSmith stack, Fleet is worth evaluating. Benefit: fastest path from LangGraph code to production deployment.

The common pitfall: deploying a stateful agent to a stateless function without thinking through checkpoint persistence. Lambda invocations share no memory, so state must live outside the function. If your agent uses AsyncPostgresSaver, that is already the case: each invocation reloads its checkpoint from PostgreSQL. If you’re using MemorySaver and wondering why state isn’t persisting across Lambda invocations, that’s the bug.
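
The pattern for a stateless function is to reload the checkpoint by thread ID at the start of every invocation and write it back at the end. The sketch below follows the handler(event, context) convention used by Lambda’s Python runtime; the CheckpointStore protocol, DictStore, and make_handler are hypothetical, and DictStore is only a stand-in for an external database.

```python
from typing import Any, Callable, Dict, Protocol


class CheckpointStore(Protocol):
    def load(self, thread_id: str) -> Dict[str, Any]: ...
    def save(self, thread_id: str, state: Dict[str, Any]) -> None: ...


class DictStore:
    """Stand-in for an external database; a real one would outlive the process."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, Any]] = {}

    def load(self, thread_id: str) -> Dict[str, Any]:
        return dict(self._rows.get(thread_id, {"steps_done": 0}))

    def save(self, thread_id: str, state: Dict[str, Any]) -> None:
        self._rows[thread_id] = dict(state)


def make_handler(store: CheckpointStore) -> Callable[..., Dict[str, Any]]:
    def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
        # Nothing is kept in module or function state between invocations:
        # every call reloads the checkpoint by thread_id and writes it back.
        state = store.load(event["thread_id"])
        state["steps_done"] += 1  # stand-in for one agent step
        store.save(event["thread_id"], state)
        return state
    return handler


handler = make_handler(DictStore())
handler({"thread_id": "t-1"})
second = handler({"thread_id": "t-1"})
print(second)  # prints {'steps_done': 2}
```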

Decision criteria: Lambda for event-driven async tasks. ECS for long-running or streaming agents. Fleet for teams that want managed deployment with native LangSmith integration.

The Full Stack, Summarized

Layer	Production Choice	Key Decision
LLM	GPT-4o / Claude 3.5 Sonnet	Task requirements; tiered execution for cost
Orchestration	LangGraph	Stateful graphs, not linear chains
Observability	LangSmith	Tracing + evals from day one
Memory	LangGraph state + AsyncPostgresSaver + Qdrant	Short-term in state; long-term in Postgres; semantic in vector store
Tools	MCP protocol + LangGraph tool nodes	Permission model before implementation
Deployment	Lambda / ECS / Fleet	Match to agent execution model

How Agentic Runbook Evaluates Your Stack

The most common situation we encounter: a team has already made several of these decisions — sometimes well, sometimes under time pressure — and is now hitting the limits of those choices in production. The memory layer isn’t persisting correctly. The model is expensive and slow. The agent’s control flow has grown too complex for the chain abstraction it was built on.

We designed the Diagnostic Sprint to give engineering leaders clarity on exactly these questions: what’s working, what’s creating hidden debt, and what needs to change before the agent scales. It’s a fixed-scope, fixed-price engagement that produces a written assessment and a concrete build plan.

Frequently Asked Questions

Q: What components make up an AI agent stack?

A production AI agent stack has six layers: the LLM (the reasoning engine), the orchestration framework (controls the workflow), observability (traces, evals, cost), memory (short-term state + long-term persistence), tools (APIs and integrations the agent can call), and deployment infrastructure. Missing or underinvesting in any layer creates operational problems at scale.

Q: Do I need LangGraph, or can I use raw OpenAI function calling?

Raw OpenAI function calling works for simple, single-agent workflows that don’t require persistent state, branching logic, or multi-agent coordination. As soon as you need any of those, and most production agents eventually do, you’re rebuilding LangGraph’s primitives from scratch. Start with LangGraph if there’s any real complexity in your workflow. The learning curve is a one-time cost; the wrong architecture is an ongoing one.

Q: How do I monitor AI agents in production?

LangSmith is the standard observability layer for LangGraph agents. It captures full execution traces automatically — every LLM call, tool invocation, and node transition with inputs, outputs, latency, and token usage. Beyond traces, LangSmith supports automated evaluations (running your agent against a labeled dataset), cost tracking, and human feedback collection. Set it up from day one; retrofitting observability is significantly harder.

Q: What’s the difference between an AI agent and a chatbot?

A chatbot takes a user input, generates a response, and stops. It’s a single-pass system. An AI agent takes a goal, plans a sequence of steps, calls tools to gather information or take actions, evaluates the results, and continues until the goal is achieved — or escalates to a human when it can’t. Agents can browse the web, write and execute code, query databases, send messages, and coordinate with other agents. The key distinction is autonomy over multi-step workflows versus single-turn response generation.

Not sure if your current stack will hold up in production?

The Diagnostic Sprint evaluates your AI agent architecture layer by layer — LLM selection, orchestration, observability, memory, tools, and deployment — and produces a written assessment with a concrete build plan. Fixed scope, fixed price.

Start with a Diagnostic Sprint

Agentic Runbook designs, builds, and transfers agentic AI systems for mid-market engineering, finance, and operations teams.