LangSmith Observability: How to Debug and Monitor AI Agents in Production

You can’t debug what you can’t see. That’s the core problem with AI agents in production: they’re non-deterministic, they call tools, they sometimes loop, they hallucinate in subtle ways, and a print() statement tells you almost nothing useful about why a run went wrong.

LangSmith is LangChain’s observability platform for LLM applications. If you’re building with LangChain or LangGraph, it’s the most direct path to understanding what your agents are actually doing at runtime. This post covers the setup, the key features you’ll use daily, how to run evaluations, and what specifically to watch in a multi-agent production system.

What LangSmith Does

LangSmith gives you four things:

Tracing — Every LLM call, tool invocation, and chain step is captured as a structured trace with inputs, outputs, latency, token counts, and cost.
Evaluations — Run automated tests against your traces: LLM-as-judge, exact match, custom scorers. Track score trends over time as you iterate on prompts and logic.
Prompt management — Version, compare, and deploy prompts from a central hub. Pull the active prompt version at runtime rather than hardcoding strings in source files.
Datasets — Build test sets from production traces or manually curated examples. Feed them into your eval harness.

Used together, these give you the feedback loop that makes iterating on agents tractable.

Setup: Three Environment Variables

LangSmith traces automatically when these three variables are set:

export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=your_api_key_here
export LANGCHAIN_PROJECT=my-agent-project

Get your API key at smith.langchain.com. The free tier covers substantial usage — enough to run and debug most agents during development.

That’s it. No code changes. Every subsequent chain.invoke(), agent.run(), or graph.invoke() call will emit a trace.

In production, set these as environment variables in your deployment environment (Cloudflare Pages secrets, AWS Parameter Store, etc.) rather than committing them to source.
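
If you want the service to fail fast when tracing is misconfigured, a startup check is one option. The following is a minimal sketch, not part of LangSmith: the function name `missing_tracing_vars` is hypothetical, and only the three variable names come from the setup above.

```python
import os

# The three variables named in the setup section above.
REQUIRED_TRACING_VARS = (
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_PROJECT",
)


def missing_tracing_vars(env=None):
    """Return the names of required tracing variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_TRACING_VARS if not env.get(name)]
```

At startup you might call `missing_tracing_vars()` and log or abort if the returned list is not empty. Passing a plain dict instead of `os.environ` keeps the check easy to unit test.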

Project Naming Convention

LANGCHAIN_PROJECT groups traces. We recommend a naming convention:

{service-name}-{environment}

examples:
  research-agent-prod
  inbound-triage-staging
  document-processor-dev

This makes it easy to compare behavior across environments and spot regressions at a glance.
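
To keep names consistent across services, the convention can be generated rather than typed by hand. This is a small sketch under assumptions: the helper `project_name` and the allowed environment list are illustrative, not something LangSmith requires.

```python
import re

# Assumed set of environments; adjust to match your own deployment stages.
ALLOWED_ENVIRONMENTS = ("dev", "staging", "prod")


def project_name(service: str, environment: str) -> str:
    """Build a '{service-name}-{environment}' project name."""
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    # Lowercase, then collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", service.lower()).strip("-")
    if not slug:
        raise ValueError("service name must contain at least one letter or digit")
    return f"{slug}-{environment}"
```

The result can be assigned to `LANGCHAIN_PROJECT` in your deployment configuration, so a typo in one environment does not silently split traces across two projects.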

Reading a Trace

Once tracing is active, navigate to your project in the LangSmith UI and open any trace. Here’s what you’re looking at:

The span tree shows the hierarchy of calls within a single run. An outer AgentExecutor span contains child spans for each LLM call and each tool invocation. A LangGraph run shows each node as a distinct span. The nesting mirrors your code’s execution path.

Each span shows:

Input (the full prompt or tool arguments)
Output (the raw LLM response or tool return value)
Latency (milliseconds — useful for identifying bottlenecks)
Token usage (prompt tokens, completion tokens, total)
Cost (estimated, based on the model’s pricing)
Status (success, error, or timeout)

What to look for when debugging:

Unexpected inputs: the prompt your agent is actually receiving is often different from what you think you’re sending. Check the raw input in the span.
Tool call arguments: when a tool call fails or returns garbage, look at what arguments the LLM chose to pass. The problem is usually in the argument, not the tool.
Context truncation: if your agent’s answers are getting progressively worse in a long conversation, check token counts. You may be approaching the context window limit, in which case earlier messages are being dropped.
Loop detection: in LangGraph, if you see the same node repeating more times than expected, the router logic isn’t exiting cleanly. The trace will show you exactly what state the router received each time.
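
The loop check in the last item can also be done mechanically once you have the ordered node names from a trace. The sketch below assumes you have already copied or exported those names into a plain list; the function `repeated_nodes` and the default threshold are hypothetical.

```python
from collections import Counter


def repeated_nodes(node_names, max_visits=3):
    """Return {node: visit_count} for nodes visited more than max_visits times."""
    counts = Counter(node_names)
    return {node: n for node, n in counts.items() if n > max_visits}
```

An empty result means no node exceeded the limit. A non-empty one tells you which node to open in the span tree to inspect the state the router received on each pass.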

Evaluations

Tracing tells you what happened. Evaluations tell you whether it was correct.

The Basic Evaluation Pattern

from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# Define an evaluator — here, an LLM-as-judge checking answer quality
def correctness_evaluator(run, example):
    """Score 1 if the answer correctly addresses the question, 0 otherwise."""
    from langchain_openai import ChatOpenAI
    from langchain_core.messages import HumanMessage

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    question = example.inputs["question"]
    expected = example.outputs["expected_answer"]
    actual = run.outputs["answer"]

    prompt = f"""You are evaluating an AI answer.
Question: {question}
Expected answer: {expected}
Actual answer: {actual}

Is the actual answer correct and responsive to the question? Reply with only "correct" or "incorrect"."""

    response = llm.invoke([HumanMessage(content=prompt)])
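    # Compare the whole verdict: "incorrect" contains "correct", so a substring check always passes.
    score = 1 if response.content.strip().lower().rstrip(".") == "correct" else 0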
    return {"key": "correctness", "score": score}


# Run evaluation against a dataset you've built in LangSmith
results = evaluate(
    lambda inputs: your_agent.invoke(inputs),
    data="your-dataset-name",
    evaluators=[correctness_evaluator],
    experiment_prefix="gpt4o-mini-v2",
)

The results appear in LangSmith’s Experiments view, where you can compare runs side by side.

What to Evaluate

For most agents, start with three metrics:

Correctness — Does the output correctly answer the question or complete the task? Use LLM-as-judge with a well-specified rubric, or exact-match if your output format is structured.

Groundedness — Does the output only use information present in the provided context? Critical for RAG systems. An ungrounded answer is a hallucination.

Task completion — For multi-step agents, did the agent complete the requested action? This is often a binary pass/fail based on side effects (was the record created? was the email sent?) rather than output content.
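
Not every metric needs an LLM judge. As a rough sketch, here are an exact-match evaluator and a task-completion evaluator that use the same `(run, example)` signature as the evaluator above. The `answer` and `expected_answer` fields follow that earlier example; `actions_completed` and `required_actions` are hypothetical fields you would have to record yourself.

```python
from types import SimpleNamespace


def exact_match_evaluator(run, example):
    """Score 1 when the answer equals the expected answer after trimming and lowercasing."""
    actual = str(run.outputs["answer"]).strip().lower()
    expected = str(example.outputs["expected_answer"]).strip().lower()
    return {"key": "exact_match", "score": int(actual == expected)}


def task_completion_evaluator(run, example):
    """Score 1 when every required action appears in the run's completed actions.

    Assumes the agent reports side effects in a hypothetical
    "actions_completed" list in its outputs.
    """
    completed = set(run.outputs.get("actions_completed", []))
    required = set(example.outputs.get("required_actions", []))
    return {"key": "task_completion", "score": int(required <= completed)}


# Stand-in objects for a quick local check; a real evaluation passes run and example objects.
run = SimpleNamespace(outputs={"answer": " Paris ", "actions_completed": ["email_sent"]})
example = SimpleNamespace(
    outputs={"expected_answer": "paris", "required_actions": ["email_sent"]}
)
print(exact_match_evaluator(run, example))
print(task_completion_evaluator(run, example))
```

Deterministic evaluators like these are cheap to run on every deploy, which leaves the LLM judge for the cases where output format is too loose for string comparison.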

Building Your Dataset

The fastest way to build a dataset is from production traces. In the LangSmith UI:

Filter traces for runs that produced good outputs (you know these by spot-checking)
Select them and click “Add to Dataset”
Use that dataset as your regression set — any prompt change should not decrease performance against it

This is the flywheel: ship → trace → curate good examples → eval → iterate → ship.
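
The same curation step can be scripted if you prefer code over the UI. The sketch below is hedged: it assumes `client` is a `langsmith.Client` whose `create_dataset` and `create_examples` methods accept the arguments shown, and that your curated runs are available as plain dicts with `inputs` and `outputs`. The helper names are hypothetical.

```python
def runs_to_examples(runs, input_key="question", output_key="answer"):
    """Split curated run records into parallel input and output lists."""
    inputs, outputs = [], []
    for run in runs:
        # Skip runs missing either side; they make poor regression examples.
        if input_key not in run.get("inputs", {}) or output_key not in run.get("outputs", {}):
            continue
        inputs.append({"question": run["inputs"][input_key]})
        outputs.append({"expected_answer": run["outputs"][output_key]})
    return inputs, outputs


def upload_examples(client, dataset_name, runs):
    """Create a dataset and add the curated examples; returns how many were added.

    Assumption: client.create_dataset(dataset_name=...) returns an object with
    an `id`, and client.create_examples takes inputs, outputs, and dataset_id.
    """
    inputs, outputs = runs_to_examples(runs)
    dataset = client.create_dataset(dataset_name=dataset_name)
    client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)
    return len(inputs)
```

The output key is renamed to `expected_answer` so the examples line up with the evaluator shown earlier, which reads `example.outputs["expected_answer"]`.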

Prompt Management

Hardcoding prompts in source files creates two problems: you can’t iterate on prompts without a code deployment, and you lose version history. LangSmith’s prompt hub solves both.

Push a Prompt

from langchain_core.prompts import ChatPromptTemplate
from langsmith import Client

client = Client()

client.push_prompt(
    "researcher-system-prompt",
    object=ChatPromptTemplate.from_messages([
        ("system", "You are a research assistant. Given a question, produce detailed, "
                   "factual research notes as structured bullet points. "
                   "Be specific. Include relevant facts, figures, and context."),
        ("human", "{question}"),
    ]),
)

Pull a Prompt at Runtime

prompt = client.pull_prompt("researcher-system-prompt")
# Use in your chain/agent as normal

Why This Matters

Prompt versions are immutable — you can always roll back
You can A/B test prompts by pulling different versions in different environments
Non-engineers can iterate on prompts in the UI without touching code
LangSmith shows which prompt version was active for any given trace — so if quality dropped on a specific date, you can correlate it with the prompt change that happened that day

In practice: store all your system prompts in LangSmith Hub. Pull them by name (and optionally version) at startup. Commit the prompt name and version to your deployment config.
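
One way to commit prompt names and versions to deployment config is a simple mapping that is resolved into identifiers at startup. This sketch assumes `pull_prompt` accepts a `name:version` identifier; the helper `prompt_ref`, the second prompt name, and the version string `a1b2c3d4` are placeholders for illustration.

```python
def prompt_ref(name, version=None):
    """Return 'name' for the latest version or 'name:version' for a pinned one."""
    if not name or ":" in name:
        raise ValueError("name must be non-empty and must not already include a version")
    return f"{name}:{version}" if version else name


# Hypothetical deployment config: prompt name mapped to a pinned version (None = latest).
PROMPTS = {
    "researcher-system-prompt": "a1b2c3d4",
    "triage-system-prompt": None,
}

refs = {name: prompt_ref(name, version) for name, version in PROMPTS.items()}
```

At startup you would then call something like `client.pull_prompt(refs["researcher-system-prompt"])`. Pinning in config means a rollback is a one-line change that goes through the same review as any other deploy.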

What to Monitor in a Multi-Agent System

Single-agent observability is straightforward — one trace, one answer, was it right? Multi-agent systems have more failure modes to watch.

Agent-Level Metrics

Per-agent token consumption — Different agents have very different prompt sizes. An agent that’s consuming 10x more tokens than expected probably has a prompt that’s grown unchecked or a retrieval step that’s pulling too much context.

Per-agent error rate — If one agent is failing 20% of the time while others are at 1%, that’s where to focus. LangSmith lets you filter traces by error status within a project.

Latency per node — In LangGraph, each node is a separate span. If your end-to-end latency spiked, the span tree will show you exactly which node got slower. Common causes: tool latency, context window growing, model tier change.
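
If you export run data for your own dashboards, the first two metrics reduce to a small aggregation. The sketch below assumes each run is a plain dict with hypothetical `agent`, `total_tokens`, and `error` keys; it does not call LangSmith itself.

```python
from collections import defaultdict


def agent_stats(runs):
    """Compute average tokens per run and error rate for each agent."""
    totals = defaultdict(lambda: {"runs": 0, "errors": 0, "tokens": 0})
    for run in runs:
        stats = totals[run["agent"]]
        stats["runs"] += 1
        stats["errors"] += 1 if run.get("error") else 0
        stats["tokens"] += run.get("total_tokens", 0)
    return {
        agent: {
            "avg_tokens": s["tokens"] / s["runs"],
            "error_rate": s["errors"] / s["runs"],
        }
        for agent, s in totals.items()
    }
```

Comparing `avg_tokens` across agents makes the 10x outlier visible at a glance, and `error_rate` gives you the per-agent number to sort by when deciding where to look first.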

System-Level Metrics

Unexpected loops — Set up an eval that scores whether the graph terminated in the expected number of steps. More steps than expected usually means a routing bug or a poorly-constrained loop exit condition.

Handoff fidelity — When agent A passes state to agent B, is agent B receiving what agent A intended to send? The inter-node state is visible in each node’s input span. Schema drift between agents (one adds a field, another expects it to not be there) is a common production bug.

Tool failure cascades — When a tool fails (network error, bad API response, timeout), does the agent handle it gracefully or does it hallucinate a result and continue as if nothing happened? Trace the tool’s output span — if it returned an error, check what the LLM did with it.
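
Handoff fidelity in particular lends itself to an automated check. The following is a minimal sketch, assuming inter-agent state is a dict and that you maintain the list of keys each receiving agent expects. The function `handoff_problems` is hypothetical.

```python
def handoff_problems(state, required, allowed=None):
    """Compare a handed-off state dict against the keys the receiver expects.

    Returns (missing, unexpected): required keys that are absent, and keys
    outside the allowed set. When `allowed` is omitted, only required keys
    are allowed.
    """
    required = set(required)
    allowed = required if allowed is None else set(allowed) | required
    keys = set(state)
    return sorted(required - keys), sorted(keys - allowed)
```

Running this at the top of each node, or as an evaluator over node inputs taken from traces, surfaces schema drift as an explicit failure rather than a downstream quality drop.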

Setting Up Alerts

LangSmith doesn’t have built-in alerting (as of mid-2026), so the practical pattern is:

Run evaluations on a schedule (daily or per-deploy via CI)
Export results to a monitoring dashboard or post to Slack
Gate deploys on eval score thresholds in your CI pipeline

A simple CI eval gate:

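# In your CI pipeline, after deploy to staging
import sys

from langsmith.evaluation import evaluate

results = evaluate(
    lambda inputs: agent.invoke(inputs),
    data="regression-dataset",
    evaluators=[correctness_evaluator],
)

# Assumes the SDK's row shape: each row holds its evaluator outputs under
# row["evaluation_results"]["results"], each with a .key and a .score.
scores = [
    res.score
    for row in results
    for res in row["evaluation_results"]["results"]
    if res.key == "correctness" and res.score is not None
]
avg_score = sum(scores) / len(scores) if scores else 0.0

if avg_score < 0.85:
    print(f"Eval failed: {avg_score:.2%} correctness. Blocking deploy.")
    sys.exit(1)

print(f"Eval passed: {avg_score:.2%} correctness.")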
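
For the "post to Slack" step, a sketch using only the standard library might look like this. It assumes you have a Slack incoming webhook URL, which accepts a JSON body with a `text` field. The function names and message format are illustrative.

```python
import json
import urllib.request


def build_slack_payload(project, avg_score, threshold=0.85):
    """Build the JSON body for a Slack incoming webhook message."""
    status = "passed" if avg_score >= threshold else "FAILED"
    text = f"[{project}] eval {status}: {avg_score:.2%} correctness (threshold {threshold:.0%})"
    return {"text": text}


def post_to_slack(webhook_url, payload, timeout=10):
    """POST the payload to a Slack incoming webhook URL and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping payload construction separate from the network call means the message format can be tested in CI without a real webhook. The webhook URL itself belongs in your secrets store alongside the API key.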

LangSmith in the Agentic Runbook Stack

At Agentic Runbook, LangSmith is non-negotiable on every engagement. We set it up in the first sprint before a single line of agent logic is written. Here’s the standard setup:

Day 1: Tracing enabled, project named {client-slug}-{environment}, all agents and chains instrumented. Every developer on the engagement can see live traces.

Week 1: Baseline dataset built from the first 50 good traces. Correctness + groundedness evals running against it. Score baseline recorded.

Ongoing: Every prompt change is a new version in the Hub. Every deploy runs the eval suite. Score trends tracked in the project dashboard. Anything below baseline blocks the release.

Handoff: Client receives the LangSmith project with full trace history, the eval datasets, the prompt versions, and documentation of how to interpret the dashboard. They can continue iterating after we leave.

This is what “observable” means in practice. It’s not a bolt-on — it’s the foundation the entire delivery sits on.

Want your agent system built for observability from day one?

Our Diagnostic Sprint identifies your highest-value automation candidates and designs the system architecture — including a LangSmith observability layer — before a line of code is written.

Frequently Asked Questions

Q: What is LangSmith used for?

A: LangSmith is an observability and evaluation platform for LLM applications built with LangChain or LangGraph. It captures traces (the full input/output chain of every LLM call, tool invocation, and chain step in a run), runs automated evaluations against your traces, manages prompt versions, and stores curated datasets for regression testing. It’s used primarily during development to debug agent behavior, and in production to monitor quality, detect regressions, and track performance over time.

Q: Is LangSmith only for LangChain?

A: LangSmith works best with LangChain and LangGraph, which emit traces automatically. For non-LangChain code, you can use the LangSmith SDK directly to wrap any function or LLM call and emit traces manually. In practice, if you’re using LangGraph for your agent orchestration layer (which we recommend), you get full tracing with zero extra code.

Q: How much does LangSmith cost?

A: LangSmith has a free tier that covers a meaningful number of traces per month — sufficient for development and small production workloads. Paid plans scale with trace volume. As of mid-2026, the Developer plan is free up to 5,000 traces/month. Check smith.langchain.com/pricing for current tiers, as pricing changes regularly.

Q: What’s the difference between tracing and evaluation in LangSmith?

A: Tracing is passive observation — it records what happened during a run. Evaluation is active assessment — it scores whether what happened was correct. You need tracing to do useful evaluation (your trace gives you the inputs and outputs to score), but tracing alone doesn’t tell you whether the output was good. Both are necessary for a production-ready agent system.

Q: Can I use LangSmith for agents not built with LangChain?

A: Yes, though with more manual instrumentation. The LangSmith Python SDK lets you use the @traceable decorator or the RunTree API to wrap any Python function and emit traces. You lose the automatic span hierarchy that LangChain provides, but you can still get meaningful trace data. For agents built with non-LangChain frameworks (LlamaIndex, custom, etc.), this is the practical approach.