How to Choose the Right LLM for Your AI Agent (A Practical Framework)

Most AI agent cost overruns trace back to one decision made early in the project: picking the most powerful model available and using it for everything. It feels safe. It performs well in demos. Then the invoice arrives.

Model selection is one of the highest-leverage decisions in agent architecture. The wrong choice doesn’t just hurt your cloud bill — it produces slower agents, higher hallucination rates on tasks the model isn’t suited for, and brittle systems that break when a provider changes an API. This post gives you the framework to make the right call, task class by task class, before you commit to a production architecture.

The False Assumption: “Use the Most Powerful Model”

The instinct makes sense on the surface. Frontier models score better on benchmarks. They handle complex instructions. They’re less likely to produce obviously wrong output. Why not use them everywhere?

Three reasons.

Cost compounding. A multi-agent system making 8–15 LLM calls per workflow on a frontier model costs 10–20x more per invocation than the same workflow with appropriate model routing. At 5,000 invocations per week, that’s the difference between a $50 weekly bill and a $1,000 one — for identical output quality on most of your tasks.
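The compounding is easy to check with back-of-the-envelope arithmetic. The sketch below is illustrative only: the per-million-token prices, the 1,000 input tokens per call, and the split of a 10-call workflow across tiers are all assumptions chosen for the example, not figures from any provider’s price list.

```python
# Hypothetical input prices in USD per 1M tokens; substitute your providers' real rates.
PRICE_PER_MILLION = {"frontier": 15.00, "mid": 3.00, "fast": 0.25}


def cost_per_invocation(calls_by_tier: dict[str, int], tokens_per_call: int = 1_000) -> float:
    """Cost of one workflow run, given how many LLM calls land on each tier."""
    return sum(
        n_calls * tokens_per_call * PRICE_PER_MILLION[tier] / 1_000_000
        for tier, n_calls in calls_by_tier.items()
    )


def weekly_bill(calls_by_tier: dict[str, int], invocations_per_week: int = 5_000) -> float:
    """Weekly spend for a workflow that runs `invocations_per_week` times."""
    return cost_per_invocation(calls_by_tier) * invocations_per_week


# The same 10-call workflow under two routing strategies.
all_frontier = weekly_bill({"frontier": 10})
routed = weekly_bill({"frontier": 1, "mid": 2, "fast": 7})

print(f"all frontier: ${all_frontier:,.2f}/week")
print(f"routed:       ${routed:,.2f}/week")
print(f"ratio:        {all_frontier / routed:.1f}x")
```

The multiple you get depends on how many calls genuinely need the frontier tier and on output-token pricing, which this sketch ignores.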

Latency at the wrong layer. Frontier models are slower. For a classification node that fires on every request and needs to return in under 200ms to keep your p95 latency acceptable, running claude-opus-4 or gpt-4o is the wrong call — not because the model is bad, but because you’re paying for reasoning depth you don’t need at that step.

Capability mismatch doesn’t always favor the big model. Frontier models are trained to be helpful and thorough. On tasks like structured extraction or deterministic routing, that training can actually work against you. They over-explain, hedge when they should commit, and produce output that doesn’t conform to tight schemas as reliably as smaller, task-focused models. The most capable model and the most reliable model for a given task are often different models.

The right question isn’t “which model is best?” It’s “which model class handles this task reliably at the lowest cost?”

The Task Taxonomy: 4 Classes That Determine Model Requirements

Before you pick a model, you need to classify the task. Every step in every agent workflow falls into one of four task classes, each with different requirements for context window, reasoning depth, speed, and cost.

Class 1: Structured Extraction

What it is: Parsing raw input — documents, emails, API responses, user messages — into structured data. JSON output. Named entity extraction. Classification into a fixed label set. Schema population from free-form text.

What it requires: Schema adherence, consistency, and speed. It does not require deep multi-step reasoning. The model needs to recognize patterns and map them to a structure, not solve novel problems.

Model fit: Mid-tier or fast/cheap workhorses. claude-haiku-3-5, gpt-4o-mini, gemini-flash-2-0. Structured output modes (JSON mode, function calling with strict schemas) on these models are highly reliable. Running a frontier model here is unnecessary and expensive.

Key threshold: Schema conformance rate ≥ 99%. If your mid-tier model isn’t hitting that, improve your prompt or schema definition before upgrading the model.
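One way to measure that threshold is to replay a batch of raw model outputs through a strict validator and count the passes. The following is a minimal sketch using only the standard library; the invoice schema and the sample outputs are hypothetical, and a real pipeline would more likely validate against JSON Schema or a typed model.

```python
import json


def conforms(raw_output: str, required_fields: dict[str, type]) -> bool:
    """True if raw_output is a JSON object with exactly the required fields and types."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict) or set(parsed) != set(required_fields):
        return False
    return all(isinstance(parsed[name], expected) for name, expected in required_fields.items())


def conformance_rate(raw_outputs: list[str], required_fields: dict[str, type]) -> float:
    """Fraction of model outputs that conform to the schema."""
    if not raw_outputs:
        raise ValueError("need at least one output to measure conformance")
    return sum(conforms(o, required_fields) for o in raw_outputs) / len(raw_outputs)


# Hypothetical schema and outputs for an invoice-extraction step.
INVOICE_SCHEMA = {"vendor": str, "total": float, "currency": str}
sample_outputs = [
    '{"vendor": "Acme", "total": 120.5, "currency": "USD"}',
    '{"vendor": "Acme", "total": "120.50", "currency": "USD"}',  # total has the wrong type
    'Sure! Here is the JSON: {"vendor": "Acme"}',                # chatty preamble, not valid JSON
    '{"vendor": "Globex", "total": 99.0, "currency": "EUR"}',
]
rate = conformance_rate(sample_outputs, INVOICE_SCHEMA)
print(f"conformance: {rate:.0%}, meets 99% gate: {rate >= 0.99}")
```

The validator is deliberately strict: extra keys, missing keys, and wrapped or explanatory text all count as failures, because each of them would break a downstream parser.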

Class 2: Open-Ended Reasoning

What it is: Tasks that require multi-step inference, synthesis across sources, or judgment calls that don’t reduce to a lookup or pattern match. Drafting a recommendation from conflicting evidence. Debugging a complex system issue given partial logs. Constructing a plan with dependencies and tradeoffs.

What it requires: Genuine reasoning depth. Large context window (often 50K+ tokens). Instruction-following fidelity on complex, multi-part prompts. Tolerance for ambiguity with appropriate hedging.

Model fit: Frontier reasoning models. claude-opus-4, gpt-4o, gemini-ultra-2-0. This is where you actually need the capability premium. Use these models here and nowhere else.

Key threshold: Task completion rate ≥ 90% on your eval dataset. If you’re using a frontier model and still failing to hit that bar, the problem is your prompt design or your task definition, not model capability.

Class 3: Long-Context Synthesis

What it is: Reading and synthesizing large documents, conversation histories, or codebases. Summarization of 100-page reports. Q&A over a full knowledge base. Multi-document comparison. These tasks are context-heavy but often not deeply reasoning-intensive: the model is mostly doing retrieval and compression, not inference.

What it requires: A large context window (100K+ tokens), reliable retrieval from within context, and coherent output generation. The reasoning required is modest — it’s mostly “find, combine, and restate.”

Model fit: This is the trickiest class because the requirements point in different directions. Context window rules out most small models; reasoning depth requirements are moderate. The best fit is often a capable mid-tier model with a large context window, such as claude-sonnet-4 or gemini-flash-2-0 (roughly 1M-token context). Frontier models are appropriate here only when the synthesis requires significant judgment alongside the retrieval.

Key threshold: Faithfulness to source material ≥ 95% (measured by your eval). If the model is hallucinating content not present in the provided context, that’s a Class 3 failure — and it often means your context construction is the problem, not your model choice.

Class 4: Latency-Sensitive Real-Time

What it is: Steps in your agent workflow that are on the critical path for user-facing response time. Intent classification. Routing. Real-time decision nodes. Anywhere that adding 1–2 seconds of model latency directly degrades the user experience.

What it requires: Sub-500ms time to first token. High throughput. Consistency. It does not require significant capability — these tasks are almost always simple enough for the smallest available model.

Model fit: Fast/cheap workhorses only. claude-haiku-3-5, gpt-4o-mini, gemini-flash-2-0. Run these tasks on the fastest model that meets your accuracy threshold. In most cases, a well-prompted small model at this layer is indistinguishable in output quality from a frontier model — and 5–10x faster.

Key threshold: p95 latency ≤ 500ms for classification/routing nodes. If you’re not hitting that, check model size first. Context length second.
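Checking that threshold only takes a list of observed latencies. The sketch below computes p95 with the nearest-rank method; the sample data is made up, and the 500 ms default budget is the threshold stated above.

```python
def p95(latencies_ms: list[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # 1-based rank = ceil(0.95 * n), computed with integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_latency_gate(latencies_ms: list[float], budget_ms: float = 500.0) -> bool:
    """True if the p95 of the samples is within the latency budget."""
    return p95(latencies_ms) <= budget_ms


# Hypothetical samples: mostly fast calls with a slow tail.
samples = [120.0] * 90 + [450.0] * 6 + [900.0] * 4
print(p95(samples), meets_latency_gate(samples))
```

Averages hide the slow tail that users actually feel, which is why the gate is on p95 rather than the mean.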

The 3-Tier Model Hierarchy

Every production agent fleet should operate on a three-tier model hierarchy. The tiers are defined by capability and cost, and each maps to the task classes above.

Tier	Models	Cost per 1M tokens (input)	Best for
Tier 1: Frontier Reasoning	`claude-opus-4`, `gpt-4o`, `gemini-ultra-2-0`	$15–$30	Class 2 (open-ended reasoning)
Tier 2: Capable Mid-Tier	`claude-sonnet-4`, `gpt-4o-mini`, `gemini-flash-2-0`	$0.15–$3	Class 1 (structured extraction), Class 3 (long-context synthesis)
Tier 3: Fast/Cheap Workhorse	`claude-haiku-3-5`, `gpt-3.5-turbo`, small open-source models (Llama 3.1 8B, Mistral 7B)	$0.05–$0.50	Class 4 (latency-sensitive real-time), simple Class 1 tasks

The rule is simple: run every task on the lowest tier that meets your quality threshold. Move up a tier only when you have eval data showing the lower tier fails.

In practice, most production agent workflows look like this: 60–70% of LLM calls belong on Tier 3 (classification, routing, simple extraction), 20–30% belong on Tier 2 (structured extraction with moderate complexity, long-context reads), and 10–20% belong on Tier 1 (actual reasoning tasks). Teams that run everything on Tier 1 are paying Tier 1 prices for work that could be done on Tier 3.

The Routing Architecture

The mechanism that maps tasks to the right tier is a model router — a lightweight layer in your agent workflow that classifies each task at runtime and selects the appropriate model. Here’s a straightforward Python implementation:

from enum import Enum
from dataclasses import dataclass
from langchain_anthropic import ChatAnthropic
from langchain_openai import ChatOpenAI


class TaskClass(str, Enum):
    STRUCTURED_EXTRACTION = "structured_extraction"
    OPEN_ENDED_REASONING = "open_ended_reasoning"
    LONG_CONTEXT_SYNTHESIS = "long_context_synthesis"
    LATENCY_SENSITIVE = "latency_sensitive"


@dataclass
class ModelConfig:
    model_id: str
    provider: str
    max_tokens: int
    temperature: float = 0.0


# Tier mappings — swap model IDs as providers update
TIER_CONFIG: dict[TaskClass, ModelConfig] = {
    TaskClass.STRUCTURED_EXTRACTION: ModelConfig(
        model_id="claude-sonnet-4-5",
        provider="anthropic",
        max_tokens=4096,
        temperature=0.0,
    ),
    TaskClass.OPEN_ENDED_REASONING: ModelConfig(
        model_id="claude-opus-4-5",
        provider="anthropic",
        max_tokens=8192,
        temperature=0.2,
    ),
    TaskClass.LONG_CONTEXT_SYNTHESIS: ModelConfig(
        model_id="claude-sonnet-4-5",
        provider="anthropic",
        max_tokens=8192,
        temperature=0.1,
    ),
    TaskClass.LATENCY_SENSITIVE: ModelConfig(
        model_id="claude-3-5-haiku-latest",  # API ID for the model this post calls claude-haiku-3-5
        provider="anthropic",
        max_tokens=1024,
        temperature=0.0,
    ),
}


def get_model_for_task(task_class: TaskClass):
    """Return the appropriate LLM instance for a given task class."""
    config = TIER_CONFIG[task_class]
    if config.provider == "anthropic":
        return ChatAnthropic(
            model=config.model_id,
            max_tokens=config.max_tokens,
            temperature=config.temperature,
        )
    elif config.provider == "openai":
        return ChatOpenAI(
            model=config.model_id,
            max_tokens=config.max_tokens,
            temperature=config.temperature,
        )
    raise ValueError(f"Unknown provider: {config.provider}")


# Usage in a LangGraph node
def extract_entities_node(state: dict) -> dict:
    llm = get_model_for_task(TaskClass.STRUCTURED_EXTRACTION)
    # ... rest of node logic
    return state


def generate_analysis_node(state: dict) -> dict:
    llm = get_model_for_task(TaskClass.OPEN_ENDED_REASONING)
    # ... rest of node logic
    return state

A few things worth noting about this pattern:

The TIER_CONFIG dict is your single source of truth for model selection. When Anthropic releases claude-opus-5, you update one line — not every node in your graph.

temperature=0.0 on all extraction and routing tasks. You want output that is as repeatable as possible on those task classes. Temperature 0 minimizes sampling variation, although hosted APIs generally don’t guarantee bit-identical responses. Reserve non-zero temperature for creative or reasoning-heavy tasks where output variation is acceptable or desirable.

The TaskClass enum forces explicit classification at the call site. This is intentional friction — it makes every model call a deliberate decision rather than a default, and it gives you a clear surface for auditing model usage.

The 5 Signals That Tell You Your Model Choice Is Wrong

Once your agent is in production, these five signals are the canary in the coal mine. Any one of them warrants an immediate model selection review.

Signal 1: Cost per invocation is too high

Threshold: If any workflow step is costing more than $0.05 per invocation and the task maps to Class 1 or Class 4, you’re almost certainly using the wrong tier. For high-volume workflows (500+ invocations/day), that’s $750+ per month on a single step (500 × $0.05 × 30 days), most of it avoidable.

Diagnosis: Check your LangSmith traces. Sort by token cost per run. If the expensive calls are routing or extraction nodes, downgrade the model.
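If you export per-run costs from your tracing tool, the check can be automated. This sketch assumes a flat list of run records with `step`, `task_class`, and `cost_usd` fields; those field names are hypothetical and are not a LangSmith schema.

```python
from collections import defaultdict

COST_CEILING_USD = 0.05  # Signal 1 threshold
CHEAP_CLASSES = {"structured_extraction", "latency_sensitive"}  # Class 1 and Class 4


def overpriced_steps(runs: list[dict]) -> dict[str, float]:
    """Return {step: mean cost per invocation} for Class 1/4 steps above the ceiling."""
    costs = defaultdict(list)
    for run in runs:
        if run["task_class"] in CHEAP_CLASSES:
            costs[run["step"]].append(run["cost_usd"])
    means = {step: sum(values) / len(values) for step, values in costs.items()}
    return {step: mean for step, mean in means.items() if mean > COST_CEILING_USD}


# Hypothetical exported trace rows.
runs = [
    {"step": "route_intent", "task_class": "latency_sensitive", "cost_usd": 0.08},
    {"step": "route_intent", "task_class": "latency_sensitive", "cost_usd": 0.06},
    {"step": "extract_entities", "task_class": "structured_extraction", "cost_usd": 0.004},
    {"step": "generate_analysis", "task_class": "open_ended_reasoning", "cost_usd": 0.30},
]
flagged = overpriced_steps(runs)
for step, mean_cost in flagged.items():
    monthly = mean_cost * 500 * 30  # 500 invocations/day over 30 days
    print(f"{step}: ${mean_cost:.3f}/invocation, about ${monthly:,.0f}/month")
```

The reasoning node is expensive too, but it is excluded on purpose: a Class 2 step is allowed to cost more.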

Signal 2: Latency is unacceptable

Threshold: p95 latency > 3 seconds for a user-facing synchronous workflow. p95 > 500ms for any classification or routing node.

Diagnosis: If latency is concentrated at a specific node, and that node is a Class 4 task, you’re running a Tier 1 model where Tier 3 belongs. The fix is model downgrade, not infrastructure scaling.

Signal 3: Task completion rate below threshold

Threshold: < 85% on your eval dataset for any task class.

Diagnosis: Counterintuitively, upgrading the model is often not the first step here. Start by reviewing whether the task is being correctly classified. If you’re routing a Class 2 task to a Tier 3 model, no amount of prompt engineering will fix it — that’s a routing issue. If you’ve confirmed the tier is correct and completion rate is still low, then evaluate whether a tier upgrade actually moves the metric before committing to it.

Signal 4: Hallucination rate is elevated

Threshold: > 5% factual errors on your eval dataset. > 1% for financial, legal, or customer-facing contexts.

Diagnosis: Hallucination is often misattributed to model capability. More often, it’s a context problem — the model doesn’t have the information it needs to answer accurately. Check your retrieval layer before upgrading your model. If context quality is good and hallucination persists, a Tier 1 model may be warranted for that task.

Signal 5: Over-refusal rate is elevated

Threshold: > 3% of valid, in-scope requests result in a refusal or an overly hedged non-answer.

Diagnosis: Frontier models can be more prone to refusing borderline requests than mid-tier models, which is often attributed to more conservative safety tuning. If your over-refusal rate is high and the requests are legitimately in-scope, a mid-tier model with more targeted system prompt instructions may actually outperform the frontier model on this metric.
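The five thresholds can be collected into a single check that runs against a metrics snapshot for one node. This is a sketch: the metric names and the three flags are assumptions made for the example, while the numeric limits are the ones stated in the signals above.

```python
def triggered_signals(
    metrics: dict[str, float],
    *,
    cheap_task_class: bool,      # True for Class 1 or Class 4 steps
    routing_node: bool = False,  # True for classification/routing nodes
    high_stakes: bool = False,   # financial, legal, or customer-facing contexts
) -> list[str]:
    """Return the names of the production signals that this snapshot trips."""
    latency_limit_ms = 500 if routing_node else 3_000
    hallucination_limit = 0.01 if high_stakes else 0.05
    checks = {
        "cost": cheap_task_class and metrics["cost_per_invocation_usd"] > 0.05,
        "latency": metrics["p95_latency_ms"] > latency_limit_ms,
        "task_completion": metrics["task_completion_rate"] < 0.85,
        "hallucination": metrics["hallucination_rate"] > hallucination_limit,
        "over_refusal": metrics["over_refusal_rate"] > 0.03,
    }
    return [name for name, tripped in checks.items() if tripped]


# Hypothetical snapshot for a routing node.
snapshot = {
    "cost_per_invocation_usd": 0.08,
    "p95_latency_ms": 420,
    "task_completion_rate": 0.91,
    "hallucination_rate": 0.02,
    "over_refusal_rate": 0.05,
}
print(triggered_signals(snapshot, cheap_task_class=True, routing_node=True))
```

Running this on a schedule turns the signals into alerts rather than something you notice on the invoice.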

The Model Selection Decision Checklist

Before committing a model to any step in a production agent, work through this checklist:

Ten checks. If any are unresolved, the model choice isn’t ready for production.

The Mistake of Model Monoculture

A production agent fleet that runs on a single model — even a very good one — has a structural fragility that most teams don’t appreciate until something goes wrong.

Provider outages hit your entire fleet simultaneously. If every agent in your system runs on OpenAI and OpenAI has a degraded API, every workflow is impaired at the same time. A fleet with Anthropic calls in the critical path and OpenAI calls in non-critical steps has natural blast radius reduction.
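Blast radius reduction can also be made explicit with a fallback chain: try the primary provider, and fall through to a second one on failure. The sketch below uses stand-in callables rather than real SDK clients, so every function name in it is hypothetical.

```python
from typing import Callable, Sequence


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain errors out."""


def call_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return (name, response) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # real code should catch the SDK's specific error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stand-ins for provider clients.
def degraded_primary(prompt: str) -> str:
    raise TimeoutError("primary provider timed out")


def healthy_secondary(prompt: str) -> str:
    return f"handled: {prompt}"


used, answer = call_with_fallback(
    "classify this ticket",
    [("primary", degraded_primary), ("secondary", healthy_secondary)],
)
print(used, "->", answer)
```

A fallback model should pass the same eval gates as the primary for that task; otherwise an outage silently becomes a quality regression. If you are already on LangChain, its runnables offer a `with_fallbacks` method that plays a similar role.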

Model deprecations force emergency migrations. LLM providers deprecate models — sometimes with 3–6 months notice, sometimes less. A fleet built around a single model faces a single, high-urgency migration event. A fleet already using 2–3 models has the muscle memory and the codebase patterns to handle model updates as routine operations.

No cost optimization lever. If you’re on a single model, your only cost levers are context reduction and caching. A multi-tier fleet gives you a powerful additional lever: downgrading tasks that don’t need expensive models.

Benchmark your real tasks against multiple providers. Provider performance on standard benchmarks (MMLU, HumanEval, etc.) often doesn’t predict relative performance on your specific task distribution. Teams that run a single provider because they “picked the best one from the benchmarks” have frequently not tested whether that model is actually best for their specific tasks.

The practical minimum: every production fleet should have at least 2 model tiers in active use. Tier 1 for reasoning-intensive tasks, Tier 2 or Tier 3 for classification, extraction, and routing. This is achievable without significant additional infrastructure — it’s a routing config change, not an architectural overhaul.

Evaluation-Driven Model Selection

Benchmarks are a starting point, not a decision. GPT-4o scores higher than claude-haiku-3-5 on standard reasoning benchmarks. That tells you nothing about whether claude-haiku-3-5 is sufficient for your customer intent classification task, or whether it outperforms gpt-4o-mini on your specific structured extraction schema.

The rule is: don’t commit a model to production until you’ve run your eval dataset on it.

This is codified in ADR-004 (LLM Provider Selection Standards) and ADR-024 (Agent Eval Framework) — both of which require eval-gated model selection before production promotion. The spec is straightforward:

Build a task-specific eval dataset (minimum 25 cases for a PR gate; 100 cases for a production promotion audit per ADR-024).
Define your hard gates: task completion rate, schema conformance, hallucination rate.
Run each candidate model against the dataset. Record the pass rate, cost per invocation, and p95 latency.
Select the lowest tier that passes all hard gates.
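The selection step above can be sketched in a few lines. Everything here is illustrative: the `EvalResult` fields, the gate values, and the scores are invented for the example and are not taken from ADR-024 or from any real eval run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalResult:
    model_id: str
    tier: int  # 1 = frontier, 2 = mid-tier, 3 = fast/cheap
    task_completion: float
    schema_conformance: float
    hallucination_rate: float
    cost_per_invocation: float
    p95_latency_ms: float


def passes_hard_gates(r: EvalResult) -> bool:
    """Example gates; the right values depend on the task class."""
    return (
        r.task_completion >= 0.90
        and r.schema_conformance >= 0.99
        and r.hallucination_rate <= 0.05
    )


def select_model(results: list[EvalResult]) -> Optional[EvalResult]:
    """Pick the cheapest tier (highest tier number) that passes every gate."""
    passing = [r for r in results if passes_hard_gates(r)]
    if not passing:
        return None
    # Prefer the highest tier number, then the lowest cost within that tier.
    return max(passing, key=lambda r: (r.tier, -r.cost_per_invocation))


# Invented scores for three candidates on the same eval dataset.
results = [
    EvalResult("claude-haiku-3-5", 3, 0.88, 0.992, 0.03, 0.002, 380),
    EvalResult("claude-sonnet-4", 2, 0.93, 0.995, 0.02, 0.012, 1400),
    EvalResult("claude-opus-4", 1, 0.96, 0.997, 0.01, 0.090, 3200),
]
choice = select_model(results)
print(choice.model_id if choice else "no candidate passed the hard gates")
```

Cost and latency are recorded but not gated here; add them to `passes_hard_gates` if the task has a hard budget.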

This takes a day to set up for a new task. It takes 20 minutes to run for a model comparison. It prevents the expensive, embarrassing scenario of discovering in production that the “best” model was the wrong choice for your specific task distribution.

A few evaluation practices that matter in model selection specifically:

Test the tier below your intuition first. Most engineers reaching for claude-opus-4 haven’t tested whether claude-sonnet-4 or even claude-haiku-3-5 would hit the quality bar. Run the cheapest plausible model first. Move up only when you have data showing it fails.

Separate model evaluation from prompt evaluation. If your eval scores are low, the problem might be your prompt, not your model. Run the same prompt across 2–3 model tiers and compare results. If all tiers score similarly, the prompt is probably the bottleneck. If there’s a clear performance cliff between tiers, the model matters.

Re-evaluate when models are updated. The model behind an unpinned alias such as gpt-4o-mini can differ from the one that shipped at release, because providers publish new snapshots over time. A quarterly re-eval of your model choices against your production eval dataset catches silent regressions and performance improvements you should be taking advantage of.

Track eval results over time. LangSmith’s evaluation dataset tooling makes it easy to run the same dataset repeatedly and track scores over time. A model that was adequate 6 months ago may still be adequate today, or it may have regressed, or a cheaper alternative may have improved past it. You don’t know unless you measure.

Putting It Together

Model selection isn’t a one-time decision. It’s an ongoing operational practice. The framework:

Classify every task into one of the four task classes before selecting a model.
Start at the bottom of the tier hierarchy and move up only when eval data justifies it.
Run your own eval dataset before committing to any model for any production task. Benchmarks are a starting point; your tasks are the real test.
Implement a model router so model selection is centralized, auditable, and changeable without touching individual nodes.
Monitor the five production signals — cost, latency, task completion, hallucination, over-refusal — and treat regressions as model selection questions before reaching for architectural changes.
Maintain multi-tier diversity in your fleet. Single-provider monoculture is a fragility, not a simplification.
Re-evaluate quarterly. The model landscape moves fast. The optimal configuration for your fleet today may not be optimal in six months.

The teams operating reliable, cost-efficient agent systems at scale aren’t using the most powerful models. They’re using the right models — matched to task class, validated against real eval data, and tuned as their workloads evolve. That discipline is what separates a sustainable agent operation from a demo that became a production liability.
