How AI Agents Handle Memory: Short-Term, Long-Term, and Semantic Retrieval
One of the first questions engineering teams ask when they start evaluating AI agents seriously is: does the agent actually remember anything?
It’s a reasonable question, and the answer is more nuanced than most introductory material suggests. AI agents can maintain several different types of memory simultaneously — but each type works differently, costs different amounts to implement, and is appropriate for different use cases. Conflating them leads to either over-engineering (adding a vector database to a workflow that doesn’t need one) or under-engineering (assuming in-context state is enough for a use case that requires persistent long-term recall).
This post gives you a precise mental model for the three memory layers, with concrete implementation patterns for each using LangGraph. By the end, you’ll know which layer your agent needs — and which ones you can skip.
The Three Memory Layers
Production AI agents use memory across three distinct layers:
- Short-term / in-context memory — what the agent knows right now, within the current run
- Long-term / episodic memory — what the agent can recall across runs, from a persistent store
- Semantic retrieval memory — what the agent can look up by meaning, from a vector store
Each layer answers a different question. Each requires different infrastructure. Let’s take them in order.
Layer 1: Short-Term / In-Context Memory
Short-term memory is the simplest layer. It’s the information the agent holds in its current execution context — the equivalent of working memory in human cognition.
In LangGraph, short-term memory is implemented through the state object: a typed dictionary that flows through the graph and is readable and writable by every node. You define the state schema at the top of your agent:
from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    current_task: str
    tool_results: list[dict]
    retry_count: int
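Every node receives the current state as its first argument and returns a dictionary of updates. LangGraph merges those updates into the state before passing it to the next node: a field with a reducer is combined with its existing value, and a field without one is overwritten. The messages field uses the add_messages reducer, which appends new messages to the existing list (and replaces an existing message only when the incoming one carries the same ID) rather than replacing the whole list. This is how conversation history accumulates within a run.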
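To make the merge step concrete, here is a simplified, pure-Python model of it. This is a sketch for intuition only, not LangGraph's implementation: merge_update and append_reducer are hypothetical names, and the real add_messages reducer also handles message IDs.

```python
from typing import Any, Callable

# Hypothetical, simplified model of per-field reducers; not LangGraph's actual code.
Reducer = Callable[[Any, Any], Any]


def append_reducer(existing: list, new: list) -> list:
    """Combine by appending, which is how new messages accumulate."""
    return existing + new


def merge_update(state: dict, update: dict, reducers: dict[str, Reducer]) -> dict:
    """Return a new state: fields with a reducer are combined, all others are overwritten."""
    merged = dict(state)
    for key, value in update.items():
        if key in reducers and key in merged:
            merged[key] = reducers[key](merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"messages": ["hi"], "retry_count": 0}
update = {"messages": ["hello, how can I help?"], "retry_count": 1}
new_state = merge_update(state, update, {"messages": append_reducer})
print(new_state)
```

The messages list grows while retry_count is simply replaced, which is the distinction to keep in mind when deciding which state fields need a reducer.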
What belongs in short-term state:
- The message history for the current conversation or task
- Intermediate results from tool calls (API responses, database query results)
- Routing flags that control which branch the graph takes next
- Retry counters and error state
- The current task description or goal
What does not belong in short-term state:
- Large documents or corpora (put them in retrieval; don’t bloat the context)
- Historical data from previous sessions (that’s long-term memory)
- Static configuration values (use environment variables or a config object)
The cost of short-term memory is straightforward: every token in the state’s message history is sent to the LLM on every call, so a bloated message history is a direct cost multiplier. For long-running agents with many intermediate steps, consider message trimming (keeping only the N most recent messages, or as many as fit a token budget) or message summarization (periodically compressing older history into a single summary message).
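from langchain_core.messages import RemoveMessage, trim_messages

def trim_node(state: AgentState) -> dict:
    trimmed = trim_messages(
        state["messages"],
        max_tokens=4096,
        strategy="last",
        token_counter=len,  # counts each message as one token; replace with a real token counter in production
    )
    # add_messages merges by ID, so returning the trimmed list alone would not delete anything.
    # Emit a RemoveMessage for every message that did not survive trimming.
    kept_ids = {m.id for m in trimmed}
    return {"messages": [RemoveMessage(id=m.id) for m in state["messages"] if m.id not in kept_ids]}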
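To see why history length acts as a cost multiplier, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that every step adds a fixed number of tokens to the history and makes one LLM call that resends the whole history. cumulative_prompt_tokens and its parameters are hypothetical names, not part of LangGraph.

```python
from typing import Optional


def cumulative_prompt_tokens(
    tokens_per_step: int,
    steps: int,
    max_history_steps: Optional[int] = None,
) -> int:
    """Total prompt tokens sent over a run, assuming the full (or trimmed) history is resent each step."""
    total = 0
    for step in range(1, steps + 1):
        # Without trimming the history holds `step` entries; with trimming it is capped.
        history = step if max_history_steps is None else min(step, max_history_steps)
        total += history * tokens_per_step
    return total


untrimmed = cumulative_prompt_tokens(tokens_per_step=500, steps=20)
trimmed = cumulative_prompt_tokens(tokens_per_step=500, steps=20, max_history_steps=4)
print(untrimmed)  # 105000
print(trimmed)    # 37000
```

Under these assumptions the untrimmed total grows quadratically with the number of steps, while the trimmed total grows linearly once the cap is reached.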
Short-term memory is where you start. For many workflows — document extraction, single-session Q&A, one-off task automation — it’s the only layer you need.
Layer 2: Long-Term / Episodic Memory
Long-term memory is what allows an agent to remember things across sessions. Without it, every conversation starts from scratch: the agent has no knowledge of what it discussed with this user last week, no record of tasks it completed previously, and no continuity across restarts.
In LangGraph, long-term memory is implemented via checkpointers: persistent backends that save a snapshot of graph state after each step of graph execution and restore it on the next invocation. The checkpointer is configured when you compile the graph:
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver

# `workflow` is the StateGraph(AgentState) you have already built.

# Development: in-memory checkpointer (not durable)
memory = MemorySaver()
graph = workflow.compile(checkpointer=memory)

# Production: PostgreSQL checkpointer (durable, concurrent-safe).
# `async with` must run inside an async function.
async with AsyncPostgresSaver.from_conn_string(
    "postgresql://user:password@host:5432/dbname"
) as checkpointer:
    await checkpointer.setup()  # creates the checkpoint tables; needed on first use
    graph = workflow.compile(checkpointer=checkpointer)
With a checkpointer configured, every invocation of the graph takes a thread ID — a unique identifier for the conversation or session:
from langchain_core.messages import HumanMessage

config = {"configurable": {"thread_id": "user-123-session-456"}}
result = await graph.ainvoke(
    {"messages": [HumanMessage(content="What did we discuss last time?")]},
    config=config,
)
LangGraph automatically loads the previous state for that thread ID before running the graph, and saves the updated state afterward. The agent has full access to the previous conversation history, prior tool results, and any state values set in past runs.
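The load-before, save-after cycle keyed by thread ID can be modelled in a few lines of plain Python. This is a toy for intuition, not how LangGraph's checkpointers are implemented: ToyCheckpointer and run_turn are hypothetical names, and the store is an ordinary dict (so, like MemorySaver, it would not survive a restart).

```python
import copy


class ToyCheckpointer:
    """Hypothetical stand-in for a checkpointer: one saved state per thread ID."""

    def __init__(self) -> None:
        self._store: dict[str, dict] = {}

    def load(self, thread_id: str) -> dict:
        # A thread that has never been seen starts with an empty history.
        return copy.deepcopy(self._store.get(thread_id, {"messages": []}))

    def save(self, thread_id: str, state: dict) -> None:
        self._store[thread_id] = copy.deepcopy(state)


def run_turn(checkpointer: ToyCheckpointer, thread_id: str, user_message: str) -> dict:
    """Load prior state for the thread, apply one update, then persist the result."""
    state = checkpointer.load(thread_id)
    state["messages"].append(user_message)
    checkpointer.save(thread_id, state)
    return state


cp = ToyCheckpointer()
run_turn(cp, "user-123-session-456", "Plan the Q3 migration.")
second = run_turn(cp, "user-123-session-456", "What did we discuss last time?")
other = run_turn(cp, "user-999-session-001", "Hello")
print(len(second["messages"]), len(other["messages"]))  # 2 1
```

The second call on the same thread sees the first message, while a different thread ID starts empty.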
MemorySaver vs. AsyncPostgresSaver
MemorySaver stores checkpoints in memory. It’s fast, has no dependencies, and is the right choice during development. The critical limitation: it’s not durable. Every application restart loses all checkpoint data. An agent with MemorySaver in production will forget everything every time you redeploy.
AsyncPostgresSaver stores checkpoints in a PostgreSQL database. It’s durable, concurrent-safe, and supports multiple simultaneous sessions without state collision. This is the production choice. If you don’t already have PostgreSQL in your infrastructure, a managed instance (RDS, Supabase, Railway) is the lowest-overhead path to production-durable agent memory.
The migration from MemorySaver to AsyncPostgresSaver is a configuration change, not an architectural change — the graph code is identical. Change it before any production deployment, not after.
What Long-Term Memory Enables
With episodic memory, your agent can:
- Continue a multi-session task without re-explaining context each time
- Remember user preferences and prior decisions
- Maintain a running record of completed actions (useful for audit trails)
- Resume a long-running workflow after a failure or restart
The common pitfall with long-term memory: using thread IDs inconsistently. If your application generates a new thread ID for every request, the agent can never look up prior history even with a durable checkpointer. Thread IDs need to map to a meaningful unit of continuity — typically a user ID, a conversation ID, or a task ID — and that mapping needs to be consistent across requests.
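One way to keep thread IDs consistent is to derive them in a single place from identifiers your application already has. The sketch below is a minimal example under that assumption; thread_config is a hypothetical helper, and the ID format simply mirrors the "user-123-session-456" example used earlier.

```python
def thread_config(user_id: str, session_id: str) -> dict:
    """Build the invocation config from stable identifiers, so the same inputs always map to the same thread."""
    if not user_id or not session_id:
        raise ValueError("user_id and session_id must be non-empty")
    return {"configurable": {"thread_id": f"user-{user_id}-session-{session_id}"}}


config_a = thread_config("123", "456")
config_b = thread_config("123", "456")
print(config_a)
```

Because the ID is computed rather than randomly generated per request, every request for the same user and session resolves to the same checkpointed history.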
Layer 3: Semantic Retrieval Memory
Semantic retrieval is the most powerful — and most overused — memory layer. It’s the right tool for a specific problem: retrieving information by meaning, when you don’t know the exact key or query that will surface the relevant content.
The canonical use case is a knowledge base: a corpus of documents, policies, prior conversations, or domain knowledge that the agent should be able to search. Semantic retrieval uses vector embeddings — numerical representations of meaning — to find the documents most similar to a query, even when the query doesn’t share exact words with the target documents.
How It Works
- Indexing: Documents are chunked into segments (typically 512–1024 tokens), each chunk is converted to a vector embedding using an embedding model (e.g., text-embedding-3-small from OpenAI), and the embeddings are stored in a vector database.
- Retrieval: At query time, the agent’s question or context is converted to an embedding using the same model. The vector database returns the N chunks whose embeddings are most similar to the query embedding (by cosine similarity or dot product).
- Augmentation: The retrieved chunks are injected into the agent’s prompt context before the LLM call, giving the model access to the relevant information without that information needing to fit in the permanent context window.
This pattern is commonly called RAG (Retrieval-Augmented Generation).
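The retrieval step can be sketched without any vector database. The example below ranks chunks by cosine similarity using small hand-written vectors as stand-ins for real embeddings; the texts, vectors, and function names are all illustrative assumptions, and a production system would get the vectors from an embedding model and delegate the search to the vector store.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the k chunk texts whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy index: (chunk text, made-up 3-dimensional "embedding").
index = [
    ("Refunds are processed within 5 business days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on public holidays.", [0.1, 0.9, 0.0]),
    ("Password resets require email verification.", [0.0, 0.1, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # stand-in for the embedding of a refund question
results = top_k(query_vec, index, k=2)
print(results)
```

The augmentation step then amounts to joining the returned chunks into the prompt before the LLM call.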
Implementation with Qdrant and LangGraph
Qdrant is the vector store we use in production at Agentic Runbook: it is self-hostable, fast at scale, and has a clean Python client plus a LangChain integration (langchain-qdrant) that can be called from any LangGraph node.
from qdrant_client import QdrantClient
from langchain_qdrant import QdrantVectorStore
from langchain_openai import OpenAIEmbeddings

# Initialize the vector store
client = QdrantClient(url="http://localhost:6333")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = QdrantVectorStore(
    client=client,
    collection_name="knowledge_base",
    embedding=embeddings,
)

# Retrieval node in the LangGraph graph
def retrieve_node(state: AgentState) -> dict:
    query = state["messages"][-1].content
    docs = vector_store.similarity_search(query, k=5)
    retrieved_content = "\n\n".join([doc.page_content for doc in docs])
    return {
        "tool_results": state["tool_results"] + [{"type": "retrieval", "content": retrieved_content}]
    }
The retrieval node slots into the graph like any other node. Typically it runs before the main LLM reasoning node, so the model has access to the retrieved context when formulating its response.
When Does Your Agent Need a Vector Store?
You need a vector store when:
- The agent needs to search a corpus of documents that’s too large to fit in the context window
- The relevant information can’t be retrieved by exact key lookup (it needs to be found by meaning)
- The knowledge base changes over time and new content should be immediately searchable
You don’t need a vector store when:
- All relevant information fits in the agent’s context window
- Information can be retrieved by exact key (use a database query, not semantic search)
- The agent doesn’t need to search any external knowledge base at all
Adding a vector store introduces real operational complexity: you need to maintain the indexing pipeline, monitor embedding freshness, tune chunking and retrieval parameters, and handle index failures. Don’t add this layer unless the use case genuinely requires it.
Retrieval Quality Patterns
The most common failure mode with vector retrieval isn’t the vector database itself — it’s poor retrieval quality: the agent is given context, but the context doesn’t actually contain the information it needs.
Three patterns improve retrieval quality significantly:
Chunking strategy matters. Chunks that are too small lose context; chunks that are too large dilute relevance. For most prose documents, 512–768 tokens with 10–15% overlap is a reasonable starting point. For structured documents (tables, code), match the chunk boundary to the document structure rather than a fixed token count.
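A fixed-size chunker with overlap can be sketched as a sliding window. The version below is a minimal illustration that treats list items as tokens (in practice you would count tokens with the tokenizer that matches your embedding model); chunk_tokens is a hypothetical helper, and the sizes in the example are deliberately tiny.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size, each sharing `overlap` tokens with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end; avoid emitting a redundant tail chunk
    return chunks


tokens = [str(i) for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With a chunk size of 4 and an overlap of 1, each chunk repeats the last token of the previous one, so a sentence that straddles a boundary still appears whole in at least part of one chunk.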
Query rewriting improves recall. The user’s raw query is often a poor vector search input — it’s ambiguous, uses conversational language, or doesn’t match the terminology in the corpus. Add a rewriting step that converts the user’s query into a more explicit retrieval query before hitting the vector store.
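# `llm` is assumed to be a chat model instance created elsewhere in your application.
def rewrite_query_node(state: AgentState) -> dict:
    rewrite_prompt = f"""Rewrite the following user query as a precise retrieval query
for searching a technical knowledge base. Be explicit and specific.
User query: {state["messages"][-1].content}
Retrieval query:"""
    rewritten = llm.invoke(rewrite_prompt)
    # The rewritten query is stored in current_task, so the retrieval node must read
    # state["current_task"] (not the raw last message) for the rewrite to take effect.
    return {"current_task": rewritten.content}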
Hybrid retrieval beats pure semantic search. Combining semantic similarity with keyword-based (BM25) retrieval catches cases where exact term matching matters. Qdrant supports hybrid search natively. For production knowledge base agents, enable it.
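To show what combining the two result lists can look like, here is a sketch of reciprocal rank fusion (RRF), one common way to merge rankings. It is offered as an illustration of the idea, not as a description of what your vector store does internally; the document IDs are made up and the constant k=60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists: each document scores 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


semantic_hits = ["doc_a", "doc_b", "doc_c"]  # ranked by embedding similarity
keyword_hits = ["doc_c", "doc_a", "doc_d"]   # ranked by keyword (e.g. BM25) score
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that rank well in both lists rise to the top, while a document found by only one method (such as an exact-term match in doc_d) still makes it into the candidate set.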
When Does Your Agent Need Each Layer?
| Use Case | Short-Term | Long-Term | Semantic |
|---|---|---|---|
| Single-turn document extraction | ✅ | ❌ | ❌ |
| Multi-turn customer support | ✅ | ✅ | ❌ |
| Knowledge base Q&A | ✅ | ❌ | ✅ |
| Persistent personal assistant | ✅ | ✅ | ✅ |
| Code generation (stateless) | ✅ | ❌ | ❌ |
| Compliance document search | ✅ | ❌ | ✅ |
| Multi-session project workflow | ✅ | ✅ | ❌ |
The most expensive mistake is adding long-term or semantic memory to an agent that doesn’t need it. Each layer adds operational surface area. Start with short-term only; add each subsequent layer when you have a concrete use case that requires it.
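The table reduces to two questions, which can be written down as a small decision helper. This is only a restatement of the rule of thumb above in code; layers_needed and its parameter names are hypothetical.

```python
def layers_needed(needs_cross_session_continuity: bool, needs_search_by_meaning: bool) -> list[str]:
    """Start with short-term state and add a layer only when a concrete requirement calls for it."""
    layers = ["short-term"]
    if needs_cross_session_continuity:
        layers.append("long-term")
    if needs_search_by_meaning:
        layers.append("semantic")
    return layers


print(layers_needed(False, False))  # single-turn document extraction
print(layers_needed(True, False))   # multi-turn customer support
print(layers_needed(True, True))    # persistent personal assistant
```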
How to Test Memory Correctly
Memory is one of the hardest agent capabilities to test because failures are often subtle: the agent seems to remember something, but it’s actually confabulating; or the agent has the right information in context but fails to use it.
Test short-term memory by verifying that information introduced early in a conversation is correctly referenced later. Construct a multi-turn test case where the agent receives a fact in message 2 and must correctly apply it in message 8. If the agent fails, examine whether the fact is still in the message history at that point — it may have been trimmed.
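Test long-term memory by running two separate invocations with the same thread ID. The second invocation should have access to state set in the first. If it doesn’t, either the checkpointer isn’t persisting correctly or the thread ID differs between calls. Note that MemorySaver passes this test as long as both invocations run in the same process, so to verify durability, run the second invocation in a fresh process (or after a restart) against AsyncPostgresSaver.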
Test semantic retrieval by constructing questions whose answers exist in the indexed corpus. Measure recall (did the retrieval surface the right chunk?) separately from generation quality (did the agent correctly use the retrieved chunk?). These are different failure modes with different fixes.
For all three layers, LangSmith traces are your primary debugging tool: every node’s inputs and outputs are logged, so you can see exactly what was in the state, what was retrieved, and what the model received before generating its response.
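Retrieval recall can be measured without involving the LLM at all. The sketch below computes recall@k over a small labelled test set, assuming one expected chunk per question; recall_at_k and the chunk IDs are hypothetical, and the retrieved lists would come from your own retriever in practice.

```python
def recall_at_k(retrieved_ids: list[list[str]], expected_ids: list[str], k: int) -> float:
    """Fraction of questions whose expected chunk appears among the top k retrieved chunks."""
    if len(retrieved_ids) != len(expected_ids):
        raise ValueError("need one retrieved list per expected chunk")
    if not expected_ids:
        return 0.0
    hits = sum(
        1
        for retrieved, expected in zip(retrieved_ids, expected_ids)
        if expected in retrieved[:k]
    )
    return hits / len(expected_ids)


# One ranked result list per test question, plus the chunk that actually holds the answer.
retrieved = [["c1", "c7", "c3"], ["c9", "c2", "c4"], ["c5", "c6", "c8"]]
expected = ["c7", "c4", "c1"]
print(recall_at_k(retrieved, expected, k=2))
print(recall_at_k(retrieved, expected, k=3))
```

If recall is low, the fix lies in chunking, query rewriting, or retrieval settings; if recall is high but answers are still wrong, the problem is on the generation side.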
What This Means for Your Architecture
Memory architecture is one of the earliest decisions in a production agent build — and one of the hardest to change later, because it affects the state schema, the persistence infrastructure, and the retrieval pipeline simultaneously.
The teams that get into trouble are usually the ones who added semantic retrieval speculatively (“we might need a knowledge base someday”) and then had to maintain an indexing pipeline and a vector store that the agent rarely actually queries. Or the ones who shipped with MemorySaver and discovered in production that the agent resets every time the pod restarts.
The right sequence: start with short-term state only. Add long-term checkpointing when you have a concrete continuity requirement. Add a vector store when you have a corpus that needs semantic search. Design the memory architecture for the use case you actually have, not the one you imagine you might have.
Frequently Asked Questions
Q: Do AI agents remember between conversations?
Only if they’re configured with a durable checkpointer and a consistent thread ID. Without a checkpointer, a LangGraph graph keeps state only for the duration of a single invocation. To enable cross-session memory, you need AsyncPostgresSaver (or an equivalent durable backend) and a thread ID scheme that maps to a stable session or user identifier. Without both, the agent starts fresh every time.
Q: What is a vector store and does my agent need one?
A vector store is a database that stores numerical representations of meaning (embeddings) and supports similarity search — finding documents that are semantically similar to a query, even without exact keyword matches. Your agent needs one if it must search a document corpus that’s too large to fit in the context window. If all relevant information can be retrieved by exact lookup, or if it fits in the context directly, you don’t need a vector store. It adds real operational complexity — don’t add it speculatively.
Q: How does LangGraph handle agent state?
LangGraph uses a TypedDict-based state schema that you define when building the graph. Every node receives the current state as input and returns a dictionary of updates; LangGraph merges those updates into the state using per-field reducer functions (e.g., add_messages for message lists). State is passed between nodes automatically, persisted via a checkpointer if one is configured, and fully inspectable in LangSmith traces. Because the state is explicit, typed, and checkpointed, it is easier to reason about in complex, multi-step workflows than the implicit memory objects in LangChain’s older memory module.
Q: What’s the difference between RAG and agent memory?
RAG (Retrieval-Augmented Generation) is a specific pattern where the agent queries a vector store to retrieve relevant context before generating a response. It’s one implementation of semantic retrieval memory — the third layer described in this post. Agent memory is a broader concept that includes short-term in-context state, long-term episodic persistence via checkpointers, and semantic retrieval. RAG is a subset. An agent can have robust short-term and long-term memory without any RAG at all; conversely, a basic RAG pipeline doesn’t necessarily have persistent cross-session memory.