
How to Build a RAG System with Claude API (Complete Tutorial 2026)

Step-by-step tutorial to build a production-ready Retrieval-Augmented Generation system using the Claude API, Python, and ChromaDB. Includes full code examples.


Most AI demos are impressive. Most production AI apps disappoint — because they answer questions from thin air instead of your actual data.

Retrieval-Augmented Generation (RAG) closes that gap. Instead of relying on what the model memorized during training, RAG fetches relevant context from your documents at query time and passes it to Claude. The result: grounded, citable answers that are far less prone to hallucination.

This tutorial walks you through building a working RAG pipeline from scratch using the Claude API, ChromaDB (a lightweight vector database), and Python. By the end you'll have a system that can answer questions over a custom document set — and you'll understand every moving part.


What Is RAG and Why Does It Matter?

Retrieval-Augmented Generation was introduced in a 2020 Meta paper and has since become the dominant pattern for deploying LLMs over private knowledge bases. The idea is simple:

  • Index — chunk your documents, embed them into vectors, store in a vector DB
  • Retrieve — when a user asks a question, embed the query, search for similar chunks
  • Generate — pass the retrieved chunks as context to the LLM, ask it to answer

RAG matters because:

    • LLMs don't know your data. Claude's training cutoff means it can't answer questions about your internal docs, recent product updates, or proprietary knowledge.
    • Context windows aren't infinite workarounds. Dumping entire codebases or documentation sets into every prompt is expensive and slow.
    • Hallucination rates drop sharply when models are given explicit retrieved context to cite.

    For the Claude Certified Architect (CCA) exam, RAG is listed as a core architectural pattern. Knowing how to implement it correctly — chunking strategy, embedding choice, retrieval scoring — is exam-critical and career-critical.


    Architecture Overview

    Here's the full pipeline we're building:

    Documents (PDFs, text, markdown)
            ↓
      Chunker (word-based splitter with overlap)
            ↓
      Embedder (Voyage AI voyage-3)
            ↓
      Vector Store (ChromaDB — local)
            ↓
      Query Engine
            ↓
      Claude API (claude-sonnet-4-6) → Answer

    Tools we'll use:
    • anthropic — the official Claude Python SDK
    • chromadb — lightweight, file-backed vector database
    • voyageai — client for Voyage AI's embedding models, which Anthropic recommends for retrieval with Claude (an offline sentence-transformers alternative is sketched below)
    • pypdf — PDF loading
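
    If you can't call an external embedding API, a local model is a workable drop-in for the embedding step. Here's a minimal sketch assuming the sentence-transformers package and the all-MiniLM-L6-v2 checkpoint (both are assumptions, not part of the stack above); it would replace the Voyage calls used later in the tutorial:

    from sentence_transformers import SentenceTransformer

    # Small, CPU-friendly local model; any sentence-transformers checkpoint works.
    st_model = SentenceTransformer("all-MiniLM-L6-v2")

    def embed_offline(texts: list[str]) -> list[list[float]]:
        """Embed texts locally and return plain Python lists for ChromaDB."""
        return st_model.encode(texts, normalize_embeddings=True).tolist()

    Note that embeddings from different models aren't interchangeable: pick one model and use it for both indexing and querying.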


    Step 1: Install Dependencies

    pip install anthropic chromadb voyageai pypdf

    Set your API keys:

    export ANTHROPIC_API_KEY="sk-ant-..."
    export VOYAGE_API_KEY="pa-..."   # free tier available at voyageai.com


    Step 2: Load and Chunk Your Documents

    Chunking is the most underrated step in RAG. Too small and chunks lose context; too large and retrieval becomes noisy. A chunk size of around 512 tokens with 64-token overlap is a solid starting point for technical documentation; the simple splitter below approximates this with word counts, which is close enough for a prototype.

    from pypdf import PdfReader
    
    def load_pdf(file_path: str) -> str:
        """Extract all text from a PDF."""
        reader = PdfReader(file_path)
        text = ""
        for page in reader.pages:
            text += page.extract_text() + "\n"
        return text
    
    def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
        """Split text into overlapping chunks by word count."""
        words = text.split()
        chunks = []
        start = 0
        while start < len(words):
            end = start + chunk_size
            chunk = " ".join(words[start:end])
            chunks.append(chunk)
            start += chunk_size - overlap
        return chunks
    
    # Example: load a PDF and chunk it
    raw_text = load_pdf("docs/anthropic_usage_policy.pdf")
    chunks = chunk_text(raw_text)
    print(f"Loaded {len(chunks)} chunks")

    Chunking tips:
    • For code: chunk by function/class boundary, not word count
    • For markdown: split on ## headings to preserve semantic units (a sketch follows this list)
    • For structured data: consider JSON chunks with metadata fields
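
    For the markdown case, a minimal heading-based splitter might look like this (a sketch; the regex and the choice to split on level-2 headings are assumptions to adapt to your docs):

    import re

    def chunk_markdown(text: str) -> list[str]:
        """Split markdown on level-2 headings so each section stays intact."""
        # Split immediately before every line that starts with "## ", keeping
        # the heading attached to the section that follows it.
        sections = re.split(r"(?m)^(?=## )", text)
        return [s.strip() for s in sections if s.strip()]

    If any single section is still too long, you can feed it through chunk_text() above as a second pass.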


    Step 3: Embed and Store in ChromaDB

    Now we embed each chunk into a vector and store it in ChromaDB. We'll use Voyage AI's voyage-3 model — it's optimized for retrieval and is the embedding model Anthropic recommends in their official cookbook.

    import voyageai
    import chromadb
    
    # Initialize clients
    voyage_client = voyageai.Client()
    chroma_client = chromadb.PersistentClient(path="./chroma_db")
    
    # Create (or load) a collection
    collection = chroma_client.get_or_create_collection(
        name="my_documents",
        metadata={"hnsw:space": "cosine"}  # cosine similarity for text
    )
    
    def embed_and_store(chunks: list[str], source: str = "unknown"):
        """Embed chunks and upsert into ChromaDB."""
        # Voyage AI supports batch embedding (up to 128 at once)
        batch_size = 64
        for i in range(0, len(chunks), batch_size):
            batch = chunks[i : i + batch_size]
            result = voyage_client.embed(batch, model="voyage-3", input_type="document")
            embeddings = result.embeddings
    
            collection.upsert(
                ids=[f"{source}-chunk-{i + j}" for j in range(len(batch))],
                embeddings=embeddings,
                documents=batch,
                metadatas=[{"source": source, "chunk_index": i + j} for j in range(len(batch))]
            )
        print(f"Stored {len(chunks)} chunks from '{source}'")
    
    # Index your documents
    embed_and_store(chunks, source="anthropic_usage_policy")

    The PersistentClient saves the database to disk — your embeddings survive between runs without re-indexing.


    Step 4: Retrieve Relevant Chunks

    At query time, embed the user's question using the same model (with input_type="query" — this matters for retrieval quality), then search ChromaDB for the closest chunks.

    def retrieve(query: str, top_k: int = 5) -> list[dict]:
        """Embed a query and return the top-k most relevant chunks."""
        query_embedding = voyage_client.embed(
            [query], model="voyage-3", input_type="query"
        ).embeddings[0]
    
        results = collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k,
            include=["documents", "metadatas", "distances"]
        )
    
        chunks = []
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0]
        ):
            chunks.append({
                "text": doc,
                "source": meta["source"],
                "score": 1 - dist  # convert cosine distance to similarity
            })
        return chunks

    Retrieval tuning levers:
    • top_k: more context vs. more noise. Start at 5, tune up to 10.
    • input_type="query": improves retrieval by using a query-optimized embedding mode.
    • hnsw:space="cosine": better than L2 distance for semantic text similarity.
    • Score threshold: filter out chunks below 0.65 similarity to cut noise (see the sketch below).
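
    Here's a minimal sketch of the score-threshold lever, layered on top of retrieve(); the 0.65 cutoff is just the starting point from the list above, not a universal constant:

    MIN_SCORE = 0.65  # tune per corpus; higher means stricter, fewer chunks

    def retrieve_filtered(query: str, top_k: int = 5) -> list[dict]:
        """Retrieve top-k chunks, then drop anything below the similarity cutoff."""
        candidates = retrieve(query, top_k=top_k)
        return [c for c in candidates if c["score"] >= MIN_SCORE]

    If the filter leaves nothing, the caller can return the "I don't have enough information" response directly instead of calling Claude.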

    Step 5: Generate with Claude

    Now we build the prompt and call the Claude API. The key pattern is the RAG system prompt — it tells Claude to answer only from the provided context and to cite its sources.

    import anthropic
    
    claude_client = anthropic.Anthropic()
    
    RAG_SYSTEM_PROMPT = """You are a helpful assistant that answers questions based ONLY on the provided context.
    
    Rules:
    1. Answer using only the information in the context sections below.
    2. If the context does not contain enough information to answer, say "I don't have enough information in my knowledge base to answer that."
    3. Cite the source at the end of your answer using [Source: <source name>].
    4. Be concise but complete — include all relevant details from the context."""
    
    def ask(query: str) -> str:
        """Full RAG pipeline: retrieve context, then generate an answer."""
        # 1. Retrieve relevant chunks
        retrieved = retrieve(query, top_k=5)
    
        # 2. Format context block
        context_blocks = []
        for i, chunk in enumerate(retrieved):
            context_blocks.append(
                f"[Context {i+1} | Source: {chunk['source']} | Score: {chunk['score']:.2f}]\n{chunk['text']}"
            )
        context_str = "\n\n---\n\n".join(context_blocks)
    
        # 3. Build the user message
        user_message = f"""Context:
    {context_str}
    
    ---
    
    Question: {query}"""
    
        # 4. Call Claude
        response = claude_client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=RAG_SYSTEM_PROMPT,
            messages=[{"role": "user", "content": user_message}]
        )
    
        return response.content[0].text
    
    # Try it
    answer = ask("What types of content does Anthropic's usage policy prohibit?")
    print(answer)


    Step 6: Putting It All Together

    Here's a minimal CLI wrapper that ties every step into a single script:

    # rag_pipeline.py
    import os
    import anthropic
    import chromadb
    import voyageai
    from pypdf import PdfReader
    
    # --- Config ---
    EMBED_MODEL = "voyage-3"
    CLAUDE_MODEL = "claude-sonnet-4-6"
    CHUNK_SIZE = 512
    CHUNK_OVERLAP = 64
    TOP_K = 5
    
    # --- Clients ---
    voyage = voyageai.Client()
    chroma = chromadb.PersistentClient(path="./chroma_db")
    claude = anthropic.Anthropic()
    collection = chroma.get_or_create_collection("my_documents", metadata={"hnsw:space": "cosine"})
    
    # --- Core functions (from above) ---
    # load_pdf(), chunk_text(), embed_and_store(), retrieve(), ask()
    # ... (paste functions here)
    
    if __name__ == "__main__":
        import sys

        if len(sys.argv) < 3:
            print("Usage: python rag_pipeline.py index <file> | ask <question>")
            sys.exit(1)

        if sys.argv[1] == "index":
            path = sys.argv[2]
            text = load_pdf(path) if path.endswith(".pdf") else open(path).read()
            chunks = chunk_text(text)
            embed_and_store(chunks, source=os.path.basename(path))

        elif sys.argv[1] == "ask":
            query = " ".join(sys.argv[2:])
            print(ask(query))

    Usage:

    # Index a document
    python rag_pipeline.py index docs/my_policy.pdf
    
    # Ask a question
    python rag_pipeline.py ask "What are the key safety requirements?"


    Common RAG Pitfalls (and How to Fix Them)

    • Chunks too large. Symptom: irrelevant context dilutes the answer. Fix: reduce chunk_size to 256-512.
    • Wrong input_type. Symptom: low retrieval accuracy. Fix: use input_type="query" for questions and "document" for indexing.
    • No overlap. Symptom: answers miss cross-chunk context. Fix: set overlap to 10-15% of chunk size.
    • Stale index. Symptom: answers based on old docs. Fix: re-run embed_and_store() after doc updates and use source-based upserts (see the sketch below).
    • Hallucination persists. Symptom: Claude adds info not in the context. Fix: harden the system prompt, e.g. "never add information not present in the context."
    • Cosine score too low. Symptom: retrieved chunks are irrelevant. Fix: filter out results where score < 0.65 and answer "not enough info."
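
    A minimal sketch of the source-based refresh, assuming ChromaDB's metadata-filtered delete (the where= clause) and the embed_and_store() helper from Step 3:

    def reindex_source(chunks: list[str], source: str):
        """Replace every stored chunk for a source with freshly embedded ones."""
        # Drop the old chunks for this source so stale text can't be retrieved.
        collection.delete(where={"source": source})
        # Re-embed and upsert the new chunks under the same source name.
        embed_and_store(chunks, source=source)

    # Example: refresh after the policy document changes
    # reindex_source(chunk_text(load_pdf("docs/anthropic_usage_policy.pdf")),
    #                source="anthropic_usage_policy")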

    Scaling Beyond the Basics

    Once your prototype works, here's what production RAG looks like:

    Hybrid Search — combine vector similarity with BM25 keyword search. Vector search handles semantic queries ("what does X mean"), BM25 handles exact-match queries ("find all mentions of clause 4.2"). Libraries like rank_bm25 integrate cleanly.

    Reranking — after retrieving 20 candidates, use a cross-encoder reranker (Voyage's rerank-2 model works well with voyage_client.rerank()) to reorder them by relevance before passing to Claude. This alone typically improves answer quality by 15-20%. A sketch follows below.

    Metadata filtering — store rich metadata with each chunk (author, date, document type, department) and filter at query time: collection.query(..., where={"source": "legal_docs"}).

    Streaming responses — for UI responsiveness, use claude_client.messages.stream() to stream tokens as they generate.

    Evaluation — track answer quality with RAGAS metrics (faithfulness, context precision, context recall). Run automated evals before every model or prompt change.
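
    A minimal sketch of the reranking step, assuming the voyageai SDK's rerank() method and its rerank-2 model (check the Voyage docs for current model names and the exact signature before relying on this):

    def retrieve_and_rerank(query: str, candidates: int = 20, final_k: int = 5) -> list[dict]:
        """Pull a wide candidate set from ChromaDB, then let a reranker pick the best."""
        pool = retrieve(query, top_k=candidates)
        documents = [c["text"] for c in pool]

        reranked = voyage_client.rerank(
            query=query,
            documents=documents,
            model="rerank-2",
            top_k=final_k,
        )
        # Each rerank result carries the index of the original document plus a relevance score.
        return [
            {**pool[r.index], "rerank_score": r.relevance_score}
            for r in reranked.results
        ]

    Swap this in for retrieve() inside ask() and the rest of the pipeline stays the same.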

    Key Takeaways

    • RAG separates knowledge storage (vector DB) from reasoning (Claude), making your AI grounded and auditable
    • Chunking strategy and embedding quality are the biggest levers on retrieval accuracy — tune these before changing the LLM
    • Use input_type="query" at query time and input_type="document" at indexing time with Voyage AI embeddings
    • The RAG system prompt must explicitly constrain Claude to the provided context to prevent hallucination
    • Production RAG adds hybrid search + reranking on top of the baseline vector retrieval


    Next Steps

    Ready to go deeper? If you're preparing for the Claude Certified Architect (CCA) exam, RAG architecture is a core topic — you'll need to understand retrieval strategies, chunking tradeoffs, and prompt patterns cold.

    Explore our CCA Practice Test Bank — 200+ questions covering RAG, multi-agent systems, tool use, and the Anthropic API in exam format. Free sample questions available, no signup required.

    You might also want to read our guides on Claude multi-agent orchestration and Claude API prompt caching — both are natural extensions once your RAG pipeline is running in production.
