How to Add Memory to Claude: Conversation History & Persistent Memory Tutorial

Learn how to add short-term conversation history and long-term persistent memory to Claude API apps. Covers messages array, context compression, vector DB memory, and tool-based recall.

How to Add Memory to Claude: A Developer's Complete Tutorial

One of the most common questions from developers building with Claude is: "How do I make Claude remember things?"

Out of the box, each Claude API call is stateless — the model has no idea what was said in a previous request. But building a useful AI assistant, chatbot, or agent almost always requires some form of memory. The good news is Claude's architecture makes this surprisingly clean to implement once you understand the layers.

This tutorial covers four memory patterns, from the simplest (conversation history in the messages array) to production-grade (external vector database with semantic retrieval). You'll have working code for each approach.

Why Claude Has No Built-in Memory (and Why That's Fine)

Claude processes whatever you put in its context window — nothing more, nothing less. This is a design choice, not a limitation. Stateless APIs are easier to scale, reason about, and secure. The responsibility of what context to include belongs to your application layer.

Think of it this way: Claude is an extremely capable reasoning engine. Memory is a retrieval problem. Separating the two gives you full control over privacy, relevance, and cost.

There are four distinct memory needs most apps have:

  • Within-session memory: remember what was said earlier in this conversation
  • Cross-session memory: remember facts from previous sessions
  • Knowledge memory: recall from a document corpus (RAG)
  • Episodic memory: recall specific past interactions by similarity

Let's implement each.


    Pattern 1: Conversation History (The Messages Array)

    This is the foundation. Claude's API uses a messages array where you pass the full conversation history with every request.

    import anthropic
    
    client = anthropic.Anthropic()
    
    # Start with an empty history
    conversation_history = []
    
    def chat(user_message: str) -> str:
        # Append the new user message
        conversation_history.append({
            "role": "user",
            "content": user_message
        })
        
        # Pass full history to Claude every time
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=conversation_history
        )
        
        assistant_message = response.content[0].text
        
        # Append Claude's response so it's included in the next turn
        conversation_history.append({
            "role": "assistant",
            "content": assistant_message
        })
        
        return assistant_message
    
    # Test it
    print(chat("My name is Sarah and I'm building a SaaS product for dentists."))
    print(chat("What are the key features I should prioritize?"))
    print(chat("What was my name again?"))  # Claude will remember

    This works well for single-session conversations. The catch: context has a cost. Because the full history is resent on every turn, the input tokens per request grow with the conversation, and eventually you'll hit the model's context limit (though Claude's large context window, 200K tokens as standard and up to 1M on some models, makes this a distant concern for most apps).
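To get a feel for how that cost grows, here is a rough sketch that estimates the input tokens sent on each turn when the full history is resent. The `estimate_tokens` helper and its ratio of about 4 characters per token are assumptions for illustration only, not Claude's actual tokenizer. For real numbers, read the usage figures returned with each API response.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: ~4 characters per token)."""
    return max(1, len(text) // 4)


def input_tokens_per_turn(turns: list[tuple[str, str]]) -> list[int]:
    """Estimate input tokens sent on each turn when the full history is resent.

    `turns` is a list of (user_message, assistant_reply) pairs.
    """
    history_tokens = 0
    per_turn = []
    for user_message, assistant_reply in turns:
        # The request contains all prior messages plus the new user message
        history_tokens += estimate_tokens(user_message)
        per_turn.append(history_tokens)
        # The reply joins the history and is resent on every later turn
        history_tokens += estimate_tokens(assistant_reply)
    return per_turn


# Five identical turns: a 400-character question and a 2000-character answer
turns = [("a" * 400, "b" * 2000)] * 5
print(input_tokens_per_turn(turns))  # [100, 700, 1300, 1900, 2500]
```

Each request is larger than the last even though every individual message is the same size, so the total tokens billed over a conversation grow roughly with the square of its length.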

    Trimming Old Messages

    For very long conversations, implement a sliding window:

    MAX_HISTORY_MESSAGES = 20
    
    def chat_with_trim(user_message: str) -> str:
        conversation_history.append({"role": "user", "content": user_message})
        
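        # Keep only the last N messages
        trimmed_history = conversation_history[-MAX_HISTORY_MESSAGES:]
        # The API expects the first message to come from the user, so drop
        # any leading assistant message left over by the slice
        while trimmed_history and trimmed_history[0]["role"] != "user":
            trimmed_history = trimmed_history[1:]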
        
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=trimmed_history
        )
        
        assistant_message = response.content[0].text
        conversation_history.append({"role": "assistant", "content": assistant_message})
        return assistant_message
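A sliding window throws old turns away entirely. An alternative is to compress them: replace the older part of the history with a short summary and keep the recent turns verbatim. The sketch below is one way to do that, under stated assumptions. `compress_history` and its `summarize` parameter are hypothetical names, and `summarize` is any function you supply that turns a list of messages into a summary string (in practice, probably a call to a small Claude model). It assumes message contents are plain strings.

```python
from typing import Callable

Message = dict  # {"role": "user" | "assistant", "content": str}


def compress_history(
    history: list[Message],
    keep_recent: int,
    summarize: Callable[[list[Message]], str],
) -> list[Message]:
    """Replace all but roughly the last `keep_recent` messages with a summary.

    Returns the history unchanged if it is already short enough.
    """
    if keep_recent <= 0 or len(history) <= keep_recent:
        return list(history)

    split = len(history) - keep_recent
    # Start the recent slice on a user turn so that user/assistant
    # alternation is preserved after the summary pair
    while split < len(history) and history[split]["role"] != "user":
        split += 1

    old, recent = history[:split], history[split:]
    summary = summarize(old)
    # Carry the summary as a user/assistant pair at the head of the history
    return [
        {"role": "user", "content": f"Summary of our earlier conversation:\n{summary}"},
        {"role": "assistant", "content": "Understood. I'll keep that in mind."},
        *recent,
    ]


def naive_summarize(messages: list[Message]) -> str:
    """Stand-in summarizer for local testing: first 40 chars of each message."""
    return "\n".join(f"{m['role']}: {m['content'][:40]}" for m in messages)


history = []
for i in range(6):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

compressed = compress_history(history, keep_recent=4, summarize=naive_summarize)
print(len(history), "->", len(compressed))  # 12 -> 6
```

Passing the summarizer in as a parameter keeps the compression logic testable without network access. The trade-off against plain trimming is one extra model call whenever the history is compressed.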


    Pattern 2: Cross-Session Memory with a System Prompt

    For apps where users return across multiple sessions, you need to persist facts between sessions. The cleanest way is to store key facts in a database and inject them into the system prompt at the start of each session.

    import json
    import anthropic
    from datetime import datetime
    
    client = anthropic.Anthropic()
    
    # Simulated user profile store (use PostgreSQL/Redis in production)
    user_profiles = {}
    
    def get_user_facts(user_id: str) -> str:
        """Retrieve stored facts about the user."""
        profile = user_profiles.get(user_id, {})
        if not profile:
            return "No prior information about this user."
        
        facts = []
        if profile.get("name"):
            facts.append(f"Name: {profile['name']}")
        if profile.get("role"):
            facts.append(f"Role: {profile['role']}")
        if profile.get("context"):
            facts.append(f"Context: {profile['context']}")
        if profile.get("last_session"):
            facts.append(f"Last spoke: {profile['last_session']}")
        
        return "\n".join(facts) if facts else "No stored facts yet."
    
    def save_user_facts(user_id: str, facts: dict):
        """Update the user's profile."""
        if user_id not in user_profiles:
            user_profiles[user_id] = {}
        user_profiles[user_id].update(facts)
        user_profiles[user_id]["last_session"] = datetime.now().strftime("%Y-%m-%d")
    
    def build_system_prompt(user_id: str) -> str:
        user_facts = get_user_facts(user_id)
        return f"""You are a helpful AI assistant with memory of this user.
    
    What you know about this user:
    {user_facts}
    
    When the user shares new personal information (name, role, goals, preferences), 
    remember it as part of the conversation context. Be natural — don't announce 
    that you're remembering things."""
    
    def chat_with_memory(user_id: str, message: str, history: list) -> str:
        system_prompt = build_system_prompt(user_id)
        
        history.append({"role": "user", "content": message})
        
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=system_prompt,
            messages=history
        )
        
        reply = response.content[0].text
        history.append({"role": "assistant", "content": reply})
        return reply
    
    # Usage
    user_id = "user_123"
    session_history = []
    
    # Session 1
    save_user_facts(user_id, {"name": "Sarah", "role": "founder", "context": "building dental SaaS"})
    
    r1 = chat_with_memory(user_id, "What features should I prioritize for my SaaS?", session_history)
    print(r1)
    
    # New session — history is reset, but facts persist
    session_history = []
    r2 = chat_with_memory(user_id, "Hey, I'm back! What were we discussing?", session_history)
    print(r2)  # Claude will recall Sarah's context from the system prompt
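The in-memory `user_profiles` dict disappears when the process exits, so nothing survives between real sessions. As a minimal sketch of a persistent store, the example below keeps the same kind of facts in SQLite via Python's standard `sqlite3` module. The table layout and the `ProfileStore` class are assumptions made for illustration. The code above suggests PostgreSQL or Redis for production, and the same shape should carry over.

```python
import sqlite3
from datetime import datetime


class ProfileStore:
    """Hypothetical key/value fact store, one row per (user, fact key)."""

    def __init__(self, path: str = "profiles.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS user_facts (
                   user_id TEXT NOT NULL,
                   key     TEXT NOT NULL,
                   value   TEXT NOT NULL,
                   PRIMARY KEY (user_id, key)
               )"""
        )

    def save_facts(self, user_id: str, facts: dict) -> None:
        """Insert or overwrite facts, and stamp the session date."""
        rows = dict(facts)
        rows["last_session"] = datetime.now().strftime("%Y-%m-%d")
        with self.conn:  # commits on success, rolls back on error
            self.conn.executemany(
                "INSERT OR REPLACE INTO user_facts (user_id, key, value) "
                "VALUES (?, ?, ?)",
                [(user_id, k, str(v)) for k, v in rows.items()],
            )

    def get_facts(self, user_id: str) -> dict:
        """Return all stored facts for a user as a plain dict."""
        cursor = self.conn.execute(
            "SELECT key, value FROM user_facts WHERE user_id = ?", (user_id,)
        )
        return dict(cursor.fetchall())


store = ProfileStore(":memory:")  # pass a file path to persist across runs
store.save_facts("user_123", {"name": "Sarah", "role": "founder"})
store.save_facts("user_123", {"role": "CEO"})  # overwrites the earlier role
print(store.get_facts("user_123"))
```

To wire this into the example above, `get_user_facts` would read from `store.get_facts(user_id)` instead of the dict, and `save_user_facts` would delegate to `store.save_facts`.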

    Auto-Extracting Facts with Claude

    Instead of manually calling save_user_facts, use Claude to extract facts to remember:

    def extract_facts_to_remember(conversation: list) -> dict:
        """Use Claude to extract key facts from a conversation."""
        extraction_prompt = """Review this conversation and extract facts worth remembering 
    about the user for future sessions. Return JSON only.
    
    Example output:
    {
      "name": "Sarah",
      "role": "SaaS founder", 
      "goals": "launch dental software by Q3",
      "preferences": "prefers concise answers"
    }
    
    Return {} if nothing worth storing."""
        
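        # Send the transcript as a single user message. If the raw history were
        # passed and ended on an assistant turn, Claude would try to continue
        # that turn instead of extracting facts.
        # Assumes each message's content is a plain string.
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in conversation)
        
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",  # Haiku is fast and cheap for extraction
            max_tokens=512,
            system=extraction_prompt,
            messages=[{"role": "user", "content": transcript}]
        )
        
        try:
            facts = json.loads(response.content[0].text)
        except json.JSONDecodeError:
            return {}
        # Guard against valid JSON that is not an object (e.g. a list)
        return facts if isinstance(facts, dict) else {}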
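Model output is not guaranteed to be bare JSON: it sometimes arrives wrapped in a Markdown code fence or with a sentence of preamble. A hedged sketch of a more forgiving parser follows. `parse_facts` is a hypothetical helper, not part of the Anthropic SDK. It pulls out the outermost `{...}` span and keeps only non-empty string values.

```python
import json


def parse_facts(raw: str) -> dict:
    """Extract a flat dict of string facts from model output, or {} on failure."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        data = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    # Keep only non-empty string values so the profile stays flat and printable
    return {
        str(k): v.strip()
        for k, v in data.items()
        if isinstance(v, str) and v.strip()
    }


fence = "`" * 3  # build the Markdown fence marker without typing it literally
fenced = (
    fence
    + 'json\n{"name": "Sarah", "role": "SaaS founder", "age": 41, "goals": ""}\n'
    + fence
)
print(parse_facts(fenced))                    # {'name': 'Sarah', 'role': 'SaaS founder'}
print(parse_facts("Nothing worth storing."))  # {}
```

Dropping non-string values is a design choice for this sketch, made because the profile is later rendered line by line into the system prompt. Relax the filter if you want to store numbers or lists.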


    Pattern 3: Vector Database Memory (Semantic Recall)

    For knowledge-intensive apps — where you need Claude to recall from thousands of past interactions or documents — inject retrieved context using a vector database.

    This is the classic RAG (Retrieval-Augmented Generation) pattern applied to memory.

    from anthropic import Anthropic
    import numpy as np
    
    client = Anthropic()
    
    # In production: use Pinecone, Weaviate, pgvector, or Chroma
    # Here we simulate with a simple in-memory store
    memory_store = []  # List of {"text": str, "embedding": list[float]}
    
    def get_embedding(text: str) -> list[float]:
        """Get an embedding for a piece of text from a dedicated embeddings model."""
        # In production, use a dedicated embedding model (e.g., Voyage AI, OpenAI embeddings)
        # Claude itself doesn't have an embeddings endpoint — use Voyage AI (Anthropic's recommended partner)
        # For this tutorial, we simulate with a placeholder
        # voyage_client.embed([text], model="voyage-3").embeddings[0]
        raise NotImplementedError("Replace with your embedding provider")
    
    def cosine_similarity(a: list[float], b: list[float]) -> float:
        a_arr, b_arr = np.array(a), np.array(b)
        return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))
    
    def remember(text: str):
        """Store a memory with its embedding."""
        embedding = get_embedding(text)
        memory_store.append({"text": text, "embedding": embedding})
    
    def recall(query: str, top_k: int = 3) -> list[str]:
        """Retrieve the most relevant memories for a query."""
        if not memory_store:
            return []
        
        query_embedding = get_embedding(query)
        scored = [
            (cosine_similarity(query_embedding, m["embedding"]), m["text"])
            for m in memory_store
        ]
        scored.sort(key=lambda x: x[0], reverse=True)
        return [text for _, text in scored[:top_k]]
    
    def chat_with_vector_memory(message: str, history: list) -> str:
        # Retrieve relevant past memories
        relevant_memories = recall(message)
        
        memory_context = ""
        if relevant_memories:
            memory_context = "\n\nRelevant memories:\n" + "\n".join(
                f"- {m}" for m in relevant_memories
            )
        
        system = f"""You are a helpful assistant with access to past conversation memories.
    {memory_context}
    
    Use these memories naturally when relevant. Don't mention that you're reading from memory."""
        
        history.append({"role": "user", "content": message})
        
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=system,
            messages=history
        )
        
        reply = response.content[0].text
        history.append({"role": "assistant", "content": reply})
        
        # Optionally store this exchange as a new memory
        remember(f"User said: {message}. Assistant replied: {reply[:200]}")
        
        return reply

    Tip: For production, swap the in-memory store for pgvector (if you're already on PostgreSQL/Neon) or Pinecone. Voyage AI is Anthropic's recommended embedding provider and integrates cleanly with Claude workflows.
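Because `get_embedding` above raises `NotImplementedError`, the Pattern 3 code cannot run until an embedding provider is plugged in. For local experiments, a deterministic stand-in can exercise the store-and-recall flow. The sketch below uses a hashed bag-of-words vector. `toy_embedding` is a hypothetical helper and is not a semantic embedding (it only matches shared words), so treat it as test scaffolding and use a real embeddings model for anything else.

```python
import hashlib
import math
import re

DIM = 256  # arbitrary vector size for the toy embedding


def toy_embedding(text: str) -> list[float]:
    """Hashed bag-of-words vector, L2-normalized. Matches words, not meaning."""
    vector = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # Stable hash so the same word maps to the same slot in every run
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vector[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


def similarity(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity for normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


memories = [
    "User is building a SaaS product for dentists",
    "User prefers concise answers",
    "User's team writes backend services in Rust",
]
store = [(toy_embedding(m), m) for m in memories]

# Rank stored memories against a query and keep the best match
query = toy_embedding("Which features matter for a dentists SaaS product?")
best = max(store, key=lambda item: similarity(query, item[0]))
print(best[1])
```

The query shares several words with the first memory, so that is the one retrieved. A real embedding model would also match paraphrases such as "dental software", which this word-overlap stand-in cannot do.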

    Pattern 4: Tool-Based Memory (Most Flexible)

    The most powerful pattern: give Claude tools to read and write its own memory. Claude decides what to remember and when to recall it.

    import json
    import anthropic
    
    client = anthropic.Anthropic()
    
    # Simple file-backed memory (use a database in production)
    MEMORY_FILE = "claude_memory.json"
    
    def load_memories() -> dict:
        try:
            with open(MEMORY_FILE, "r") as f:
                return json.load(f)
        except FileNotFoundError:
            return {}
    
    def save_memories(memories: dict):
        with open(MEMORY_FILE, "w") as f:
            json.dump(memories, f, indent=2)
    
    # Define memory tools for Claude
    tools = [
        {
            "name": "remember",
            "description": "Save an important fact or piece of information to long-term memory. Use this when the user shares something that should be recalled in future sessions.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "key": {
                        "type": "string",
                        "description": "A short identifier for this memory (e.g., 'user_name', 'user_goal', 'preferred_style')"
                    },
                    "value": {
                        "type": "string",
                        "description": "The information to remember"
                    }
                },
                "required": ["key", "value"]
            }
        },
        {
            "name": "recall",
            "description": "Retrieve a previously saved memory by key.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "key": {
                        "type": "string",
                        "description": "The key of the memory to retrieve"
                    }
                },
                "required": ["key"]
            }
        },
        {
            "name": "list_memories",
            "description": "List all stored memory keys so you know what information is available.",
            "input_schema": {
                "type": "object",
                "properties": {}
            }
        }
    ]
    
    def handle_tool_call(tool_name: str, tool_input: dict) -> str:
        memories = load_memories()
        
        if tool_name == "remember":
            memories[tool_input["key"]] = tool_input["value"]
            save_memories(memories)
            return f"Saved: {tool_input['key']} = {tool_input['value']}"
        
        elif tool_name == "recall":
            value = memories.get(tool_input["key"])
            return value if value else f"No memory found for key: {tool_input['key']}"
        
        elif tool_name == "list_memories":
            if not memories:
                return "No memories stored yet."
            return "Stored memory keys: " + ", ".join(memories.keys())
        
        return "Unknown tool"
    
    def chat_with_tool_memory(message: str, history: list) -> str:
        history.append({"role": "user", "content": message})
        
        system = """You are a helpful assistant with the ability to save and retrieve memories.
        
    At the start of conversations, use list_memories to see what you know about the user.
    When users share important information (name, preferences, goals, context), use remember() to save it.
    Use recall() when you need specific information about the user."""
        
        while True:
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                system=system,
                tools=tools,
                messages=history
            )
            
            # If Claude wants to use a tool, handle it
            if response.stop_reason == "tool_use":
                history.append({"role": "assistant", "content": response.content})
                
                tool_results = []
                for block in response.content:
                    if block.type == "tool_use":
                        result = handle_tool_call(block.name, block.input)
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": result
                        })
                
                history.append({"role": "user", "content": tool_results})
                continue
            
            # Claude gave a final response
            reply = ""
            for block in response.content:
                if hasattr(block, "text"):
                    reply += block.text
            
            history.append({"role": "assistant", "content": reply})
            return reply
    
    # Test
    history = []
    print(chat_with_tool_memory("Hi! I'm Alex, a backend engineer working on a Rust microservices project.", history))
    print(chat_with_tool_memory("What do you know about me?", history))
    
    # Start a new session — memories persist because they're saved to disk
    history = []
    print(chat_with_tool_memory("Hey, do you remember me?", history))
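The tool handler is ordinary Python, so it can be exercised without calling the API at all. The sketch below is one possible restructuring, not the code above: a hypothetical `MemoryTools` class holds the store in memory, dispatches by tool name, and wraps each result in the same `tool_result` shape the loop sends back. It returns an error string for malformed input instead of raising, on the assumption that a readable message is more useful to the model than a crashed loop. The `toolu_example` ID is made up for the demo.

```python
class MemoryTools:
    """In-memory stand-in for the file-backed memory tools (hypothetical)."""

    def __init__(self):
        self.memories: dict[str, str] = {}

    def remember(self, key: str, value: str) -> str:
        self.memories[key] = value
        return f"Saved: {key} = {value}"

    def recall(self, key: str) -> str:
        return self.memories.get(key) or f"No memory found for key: {key}"

    def list_memories(self) -> str:
        if not self.memories:
            return "No memories stored yet."
        return "Stored memory keys: " + ", ".join(self.memories)

    def handle(self, tool_name: str, tool_input: dict) -> str:
        """Dispatch a tool call by name; report bad calls as text."""
        handlers = {
            "remember": self.remember,
            "recall": self.recall,
            "list_memories": self.list_memories,
        }
        handler = handlers.get(tool_name)
        if handler is None:
            return f"Unknown tool: {tool_name}"
        try:
            return handler(**tool_input)
        except TypeError as exc:  # missing or unexpected arguments
            return f"Invalid input for {tool_name}: {exc}"

    def tool_result(self, tool_use_id: str, tool_name: str, tool_input: dict) -> dict:
        """Build the tool_result block to send back in the next user message."""
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": self.handle(tool_name, tool_input),
        }


tools_impl = MemoryTools()
print(tools_impl.handle("list_memories", {}))
print(tools_impl.handle("remember", {"key": "user_name", "value": "Alex"}))
# "toolu_example" is a placeholder ID; real IDs come from the tool_use block
print(tools_impl.tool_result("toolu_example", "recall", {"key": "user_name"}))
```

Keeping the store behind a small class also makes it straightforward to swap the dict for a database later without touching the chat loop.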


    Choosing the Right Memory Pattern

    Pattern                 | Best For                                  | Complexity | Cost
    ------------------------|-------------------------------------------|------------|-------
    Messages array          | Single-session chatbots                   | Low        | Low
    System prompt injection | User profile apps, returning users        | Low-Medium | Low
    Vector DB (RAG)         | Knowledge-heavy apps, large memory stores | High       | Medium
    Tool-based memory       | Agents, complex workflows                 | Medium     | Medium

    Start with Pattern 1. Add Pattern 2 (system prompt injection) when users return across sessions. Only reach for vector memory or tool-based memory when your memory store grows beyond a few hundred items or you need semantic search.

    Production Checklist

    Before shipping a Claude app with memory, verify:

    • [ ] Privacy: Are you storing only what you need? Can users view and delete their memories?
    • [ ] Token budget: Log token usage per request. Memory injection increases costs.
    • [ ] Relevance decay: Old memories may become incorrect. Add a timestamp and prune memories older than 90 days.
    • [ ] Context ordering: Put memory context in the system prompt, not at the start of the messages array. The system prompt is the designated place for background context, and this keeps the messages array a clean record of user and assistant turns.
    • [ ] Prompt caching: If your system prompt is large and stable (e.g., a big knowledge base), use Claude's prompt caching. Cached input tokens are billed at a fraction of the normal input price, which can cut the cost of that prefix by up to 90%.
    • [ ] Haiku for extraction: Use claude-haiku-4-5-20251001 for memory extraction and classification tasks. It's considerably cheaper than the larger models and fast enough for background jobs.
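The relevance-decay item can be sketched in a few lines. This assumes each memory is stored as a dict with an ISO 8601 `saved_at` timestamp, which is a schema invented here for illustration (the examples above store bare strings). The 90-day cutoff comes from the checklist.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_AGE = timedelta(days=90)


def make_memory(text: str, saved_at: Optional[datetime] = None) -> dict:
    """Create a memory record stamped with a UTC timestamp."""
    saved_at = saved_at or datetime.now(timezone.utc)
    return {"text": text, "saved_at": saved_at.isoformat()}


def prune_memories(memories: list[dict], now: Optional[datetime] = None) -> list[dict]:
    """Return only the memories saved within the last MAX_AGE."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - MAX_AGE
    return [
        m for m in memories
        if datetime.fromisoformat(m["saved_at"]) >= cutoff
    ]


# Fixed "now" so the example is reproducible
now = datetime(2025, 6, 1, tzinfo=timezone.utc)
memories = [
    make_memory("Launch planned for Q3", now - timedelta(days=10)),
    make_memory("Evaluating payment providers", now - timedelta(days=120)),
]
print([m["text"] for m in prune_memories(memories, now=now)])  # ['Launch planned for Q3']
```

Running a prune like this as a scheduled job, rather than on every request, keeps it off the latency path.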


    Key Takeaways

    • Claude has no built-in memory by design — your application owns the retrieval layer
    • The messages array handles within-session memory; persist facts to a database for cross-session recall
    • Vector databases unlock semantic memory retrieval at scale
    • Tool-based memory lets Claude manage its own recall, which is powerful for autonomous agents
    • Start simple (messages array + system prompt injection) and graduate to vector memory only when needed

    Next Steps

    Ready to build more sophisticated Claude agents? Check out our guide on Claude multi-agent orchestration or dive into the Claude API production best practices for patterns you can ship with confidence.

    If you're preparing for the Claude Certified Architect (CCA-F) exam, memory architecture is a key topic — context management, tool use, and agentic patterns all appear in the exam. Practice with our CCA exam guide and free practice questions.
