
How to Build a RAG System with Claude API (Complete Tutorial 2026)

Step-by-step tutorial to build a production-ready Retrieval-Augmented Generation system using the Claude API, Python, and ChromaDB. Includes full code examples.


Most AI demos are impressive. Most production AI apps disappoint — because they answer questions from thin air instead of your actual data.

Retrieval-Augmented Generation (RAG) closes that gap. Instead of relying on what the model memorized during training, RAG fetches relevant context from your documents at query time and passes it to Claude. The result: grounded, citable answers that are far less prone to hallucination.

This tutorial walks you through building a working RAG pipeline from scratch using the Claude API, ChromaDB (a lightweight vector database), and Python. By the end you'll have a system that can answer questions over a custom document set — and you'll understand every moving part.


What Is RAG and Why Does It Matter?

Retrieval-Augmented Generation was introduced in a 2020 Meta paper and has since become the dominant pattern for deploying LLMs over private knowledge bases. The idea is simple:

  • Index — chunk your documents, embed them into vectors, store in a vector DB
  • Retrieve — when a user asks a question, embed the query, search for similar chunks
  • Generate — pass the retrieved chunks as context to the LLM, ask it to answer

RAG matters because:

    • LLMs don't know your data. Claude's training cutoff means it can't answer questions about your internal docs, recent product updates, or proprietary knowledge.
    • Context windows aren't infinite workarounds. Dumping entire codebases or documentation sets into every prompt is expensive and slow.
    • Hallucination rates drop sharply when models are given explicit retrieved context to cite.

    For the Claude Certified Architect (CCA) exam, RAG is listed as a core architectural pattern. Knowing how to implement it correctly — chunking strategy, embedding choice, retrieval scoring — is exam-critical and career-critical.


    Architecture Overview

    Here's the full pipeline we're building:

    Documents (PDFs, text, markdown)
            ↓
      Chunker (word-based splitter with overlap)
            ↓
      Embedder (Voyage AI voyage-3)
            ↓
      Vector Store (ChromaDB — local)
            ↓
      Query Engine
            ↓
      Claude API (claude-sonnet-4-6) → Answer

    Tools we'll use:
    • anthropic — the official Claude Python SDK
    • chromadb — lightweight, file-backed vector database
    • voyageai — client for Voyage AI's embedding models, which Anthropic recommends for retrieval with Claude (an offline sentence-transformers alternative is sketched below)
    • pypdf — PDF loading
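
    If you can't call an external embedding API, a local model is a workable drop-in for the embedding step. Here's a minimal sketch assuming the sentence-transformers package and the all-MiniLM-L6-v2 checkpoint (both are assumptions, not part of the stack above); it would replace the Voyage calls used later in the tutorial:

    from sentence_transformers import SentenceTransformer

    # Small, CPU-friendly local model; any sentence-transformers checkpoint works.
    st_model = SentenceTransformer("all-MiniLM-L6-v2")

    def embed_offline(texts: list[str]) -> list[list[float]]:
        """Embed texts locally and return plain Python lists for ChromaDB."""
        return st_model.encode(texts, normalize_embeddings=True).tolist()

    Note that embeddings from different models aren't interchangeable: pick one model and use it for both indexing and querying.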


    Step 1: Install Dependencies

    pip install anthropic chromadb voyageai pypdf

    Set your API keys:

    export ANTHROPIC_API_KEY="sk-ant-..."
    export VOYAGE_API_KEY="pa-..."   # free tier available at voyageai.com


    Step 2: Load and Chunk Your Documents

    Chunking is the most underrated step in RAG. Too small and chunks lose context; too large and retrieval becomes noisy. A chunk size of around 512 tokens with 64-token overlap is a solid starting point for technical documentation; the simple splitter below approximates this with word counts, which is close enough for a prototype.

    from pypdf import PdfReader
    
    def load_pdf(file_path: str) -> str:
        """Extract all text from a PDF."""
        reader = PdfReader(file_path)
        text = ""
        for page in reader.pages:
            text += page.extract_text() + "\n"
        return text
    
    def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
        """Split text into overlapping chunks by word count."""
        words = text.split()
        chunks = []
        start = 0
        while start < len(words):
            end = start + chunk_size
            chunk = " ".join(words[start:end])
            chunks.append(chunk)
            start += chunk_size - overlap
        return chunks
    
    # Example: load a PDF and chunk it
    raw_text = load_pdf("docs/anthropic_usage_policy.pdf")
    chunks = chunk_text(raw_text)
    print(f"Loaded {len(chunks)} chunks")

    Chunking tips:
    • For code: chunk by function/class boundary, not word count
    • For markdown: split on ## headings to preserve semantic units (a sketch follows this list)
    • For structured data: consider JSON chunks with metadata fields
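
    For the markdown case, a minimal heading-based splitter might look like this (a sketch; the regex and the choice to split on level-2 headings are assumptions to adapt to your docs):

    import re

    def chunk_markdown(text: str) -> list[str]:
        """Split markdown on level-2 headings so each section stays intact."""
        # Split immediately before every line that starts with "## ", keeping
        # the heading attached to the section that follows it.
        sections = re.split(r"(?m)^(?=## )", text)
        return [s.strip() for s in sections if s.strip()]

    If any single section is still too long, you can feed it through chunk_text() above as a second pass.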


    Step 3: Embed and Store in ChromaDB

    Now we embed each chunk into a vector and store it in ChromaDB. We'll use Voyage AI's voyage-3 model — it's optimized for retrieval and is the embedding model Anthropic recommends in their official cookbook.

    import voyageai
    import chromadb
    
    # Initialize clients
    voyage_client = voyageai.Client()
    chroma_client = chromadb.PersistentClient(path="./chroma_db")
    
    # Create (or load) a collection
    collection = chroma_client.get_or_create_collection(
        name="my_documents",
        metadata={"hnsw:space": "cosine"}  # cosine similarity for text
    )
    
    def embed_and_store(chunks: list[str], source: str = "unknown"):
        """Embed chunks and upsert into ChromaDB."""
        # Voyage AI supports batch embedding (up to 128 at once)
        batch_size = 64
        for i in range(0, len(chunks), batch_size):
            batch = chunks[i : i + batch_size]
            result = voyage_client.embed(batch, model="voyage-3", input_type="document")
            embeddings = result.embeddings
    
            collection.upsert(
                ids=[f"{source}-chunk-{i + j}" for j in range(len(batch))],
                embeddings=embeddings,
                documents=batch,
                metadatas=[{"source": source, "chunk_index": i + j} for j in range(len(batch))]
            )
        print(f"Stored {len(chunks)} chunks from '{source}'")
    
    # Index your documents
    embed_and_store(chunks, source="anthropic_usage_policy")

    The PersistentClient saves the database to disk — your embeddings survive between runs without re-indexing.


    Step 4: Retrieve Relevant Chunks

    At query time, embed the user's question using the same model (with input_type="query" — this matters for retrieval quality), then search ChromaDB for the closest chunks.

    def retrieve(query: str, top_k: int = 5) -> list[dict]:
        """Embed a query and return the top-k most relevant chunks."""
        query_embedding = voyage_client.embed(
            [query], model="voyage-3", input_type="query"
        ).embeddings[0]
    
        results = collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k,
            include=["documents", "metadatas", "distances"]
        )
    
        chunks = []
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0]
        ):
            chunks.append({
                "text": doc,
                "source": meta["source"],
                "score": 1 - dist  # convert cosine distance to similarity
            })
        return chunks

    Retrieval tuning levers:
    • top_k: more context vs. more noise. Start at 5, tune up to 10.
    • input_type="query": improves retrieval by using a query-optimized embedding mode.
    • hnsw:space="cosine": better than L2 distance for semantic text similarity.
    • Score threshold: filter out chunks below 0.65 similarity to cut noise (see the sketch below).
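
    Here's a minimal sketch of the score-threshold lever, layered on top of retrieve(); the 0.65 cutoff is just the starting point from the list above, not a universal constant:

    MIN_SCORE = 0.65  # tune per corpus; higher means stricter, fewer chunks

    def retrieve_filtered(query: str, top_k: int = 5) -> list[dict]:
        """Retrieve top-k chunks, then drop anything below the similarity cutoff."""
        candidates = retrieve(query, top_k=top_k)
        return [c for c in candidates if c["score"] >= MIN_SCORE]

    If the filter leaves nothing, the caller can return the "I don't have enough information" response directly instead of calling Claude.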

    Step 5: Generate with Claude

    Now we build the prompt and call the Claude API. The key pattern is the RAG system prompt — it tells Claude to answer only from the provided context and to cite its sources.

    import anthropic
    
    claude_client = anthropic.Anthropic()
    
    RAG_SYSTEM_PROMPT = """You are a helpful assistant that answers questions based ONLY on the provided context.
    
    Rules:
    1. Answer using only the information in the context sections below.
    2. If the context does not contain enough information to answer, say "I don't have enough information in my knowledge base to answer that."
    3. Cite the source at the end of your answer using [Source: <source name>].
    4. Be concise but complete — include all relevant details from the context."""
    
    def ask(query: str) -> str:
        """Full RAG pipeline: retrieve context, then generate an answer."""
        # 1. Retrieve relevant chunks
        retrieved = retrieve(query, top_k=5)
    
        # 2. Format context block
        context_blocks = []
        for i, chunk in enumerate(retrieved):
            context_blocks.append(
                f"[Context {i+1} | Source: {chunk['source']} | Score: {chunk['score']:.2f}]\n{chunk['text']}"
            )
        context_str = "\n\n---\n\n".join(context_blocks)
    
        # 3. Build the user message
        user_message = f"""Context:
    {context_str}
    
    ---
    
    Question: {query}"""
    
        # 4. Call Claude
        response = claude_client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=RAG_SYSTEM_PROMPT,
            messages=[{"role": "user", "content": user_message}]
        )
    
        return response.content[0].text
    
    # Try it
    answer = ask("What types of content does Anthropic's usage policy prohibit?")
    print(answer)


    Step 6: Putting It All Together

    Here's a minimal CLI wrapper that ties every step into a single script:

    # rag_pipeline.py
    import os
    import anthropic
    import chromadb
    import voyageai
    from pypdf import PdfReader
    
    # --- Config ---
    EMBED_MODEL = "voyage-3"
    CLAUDE_MODEL = "claude-sonnet-4-6"
    CHUNK_SIZE = 512
    CHUNK_OVERLAP = 64
    TOP_K = 5
    
    # --- Clients ---
    voyage = voyageai.Client()
    chroma = chromadb.PersistentClient(path="./chroma_db")
    claude = anthropic.Anthropic()
    collection = chroma.get_or_create_collection("my_documents", metadata={"hnsw:space": "cosine"})
    
    # --- Core functions (from above) ---
    # load_pdf(), chunk_text(), embed_and_store(), retrieve(), ask()
    # ... (paste functions here)
    
    if __name__ == "__main__":
        import sys

        if len(sys.argv) < 3:
            print("Usage: python rag_pipeline.py index <file> | ask <question>")
            sys.exit(1)

        if sys.argv[1] == "index":
            path = sys.argv[2]
            text = load_pdf(path) if path.endswith(".pdf") else open(path).read()
            chunks = chunk_text(text)
            embed_and_store(chunks, source=os.path.basename(path))

        elif sys.argv[1] == "ask":
            query = " ".join(sys.argv[2:])
            print(ask(query))

    Usage:

    # Index a document
    python rag_pipeline.py index docs/my_policy.pdf
    
    # Ask a question
    python rag_pipeline.py ask "What are the key safety requirements?"


    Common RAG Pitfalls (and How to Fix Them)

    • Chunks too large. Symptom: irrelevant context dilutes the answer. Fix: reduce chunk_size to 256-512.
    • Wrong input_type. Symptom: low retrieval accuracy. Fix: use input_type="query" for questions and "document" for indexing.
    • No overlap. Symptom: answers miss cross-chunk context. Fix: set overlap to 10-15% of chunk size.
    • Stale index. Symptom: answers based on old docs. Fix: re-run embed_and_store() after doc updates and use source-based upserts (see the sketch below).
    • Hallucination persists. Symptom: Claude adds info not in the context. Fix: harden the system prompt, e.g. "never add information not present in the context."
    • Cosine score too low. Symptom: retrieved chunks are irrelevant. Fix: filter out results where score < 0.65 and answer "not enough info."
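
    A minimal sketch of the source-based refresh, assuming ChromaDB's metadata-filtered delete (the where= clause) and the embed_and_store() helper from Step 3:

    def reindex_source(chunks: list[str], source: str):
        """Replace every stored chunk for a source with freshly embedded ones."""
        # Drop the old chunks for this source so stale text can't be retrieved.
        collection.delete(where={"source": source})
        # Re-embed and upsert the new chunks under the same source name.
        embed_and_store(chunks, source=source)

    # Example: refresh after the policy document changes
    # reindex_source(chunk_text(load_pdf("docs/anthropic_usage_policy.pdf")),
    #                source="anthropic_usage_policy")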

    Scaling Beyond the Basics

    Once your prototype works, here's what production RAG looks like:

    Hybrid Search — combine vector similarity with BM25 keyword search. Vector search handles semantic queries ("what does X mean"), BM25 handles exact-match queries ("find all mentions of clause 4.2"). Libraries like rank_bm25 integrate cleanly.

    Reranking — after retrieving 20 candidates, use a cross-encoder reranker (Voyage's rerank-2 model works well with voyage_client.rerank()) to reorder them by relevance before passing to Claude. This alone typically improves answer quality by 15-20%. A sketch follows below.

    Metadata filtering — store rich metadata with each chunk (author, date, document type, department) and filter at query time: collection.query(..., where={"source": "legal_docs"}).

    Streaming responses — for UI responsiveness, use claude_client.messages.stream() to stream tokens as they generate.

    Evaluation — track answer quality with RAGAS metrics (faithfulness, context precision, context recall). Run automated evals before every model or prompt change.
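
    A minimal sketch of the reranking step, assuming the voyageai SDK's rerank() method and its rerank-2 model (check the Voyage docs for current model names and the exact signature before relying on this):

    def retrieve_and_rerank(query: str, candidates: int = 20, final_k: int = 5) -> list[dict]:
        """Pull a wide candidate set from ChromaDB, then let a reranker pick the best."""
        pool = retrieve(query, top_k=candidates)
        documents = [c["text"] for c in pool]

        reranked = voyage_client.rerank(
            query=query,
            documents=documents,
            model="rerank-2",
            top_k=final_k,
        )
        # Each rerank result carries the index of the original document plus a relevance score.
        return [
            {**pool[r.index], "rerank_score": r.relevance_score}
            for r in reranked.results
        ]

    Swap this in for retrieve() inside ask() and the rest of the pipeline stays the same.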

    Key Takeaways

    • RAG separates knowledge storage (vector DB) from reasoning (Claude), making your AI grounded and auditable
    • Chunking strategy and embedding quality are the biggest levers on retrieval accuracy — tune these before changing the LLM
    • Use input_type="query" at query time and input_type="document" at indexing time with Voyage AI embeddings
    • The RAG system prompt must explicitly constrain Claude to the provided context to prevent hallucination
    • Production RAG adds hybrid search + reranking on top of the baseline vector retrieval


    Next Steps

    Ready to go deeper? If you're preparing for the Claude Certified Architect (CCA) exam, RAG architecture is a core topic — you'll need to understand retrieval strategies, chunking tradeoffs, and prompt patterns cold.

    Explore our CCA Practice Test Bank — 200+ questions covering RAG, multi-agent systems, tool use, and the Anthropic API in exam format. Free sample questions available, no signup required.

    You might also want to read our guides on Claude multi-agent orchestration and Claude API prompt caching — both are natural extensions once your RAG pipeline is running in production.
