How to Build a RAG System with Claude API (Complete Tutorial 2026)
Step-by-step tutorial to build a production-ready Retrieval-Augmented Generation system using the Claude API, Python, and ChromaDB. Includes full code examples.
Most AI demos are impressive. Most production AI apps disappoint — because they answer questions from thin air instead of your actual data.
Retrieval-Augmented Generation (RAG) closes that gap. Instead of relying on what the model memorized during training, RAG fetches relevant context from your documents at query time and passes it to Claude. The result: grounded, accurate, citable answers that don't hallucinate.
This tutorial walks you through building a working RAG pipeline from scratch using the Claude API, ChromaDB (a lightweight vector database), and Python. By the end you'll have a system that can answer questions over a custom document set — and you'll understand every moving part.
What Is RAG and Why Does It Matter?
Retrieval-Augmented Generation was introduced in a 2020 Meta paper and has since become the dominant pattern for deploying LLMs over private knowledge bases. The idea is simple: at query time, retrieve the most relevant passages from your document store, inject them into the prompt as context, and let the model generate an answer grounded in that context.
RAG matters because:
- LLMs don't know your data. Claude's training cutoff means it can't answer questions about your internal docs, recent product updates, or proprietary knowledge.
- Context windows aren't infinite workarounds. Dumping entire codebases or documentation sets into every prompt is expensive and slow.
- Hallucination rates drop sharply when models are given explicit retrieved context to cite.
For the Claude Certified Architect (CCA) exam, RAG is listed as a core architectural pattern. Knowing how to implement it correctly — chunking strategy, embedding choice, retrieval scoring — is exam-critical and career-critical.
Architecture Overview
Here's the full pipeline we're building:
Documents (PDFs, text, markdown)
↓
Chunker (RecursiveTextSplitter)
↓
Embedder (Voyage AI via Claude SDK)
↓
Vector Store (ChromaDB — local)
↓
Query Engine
↓
Claude API (claude-sonnet-4-6) → Answer

You'll need four packages:

- `anthropic` — the official Claude Python SDK
- `chromadb` — lightweight, file-backed vector database
- `voyageai` — Anthropic's preferred embedding model (or use `sentence-transformers` if offline)
- `pypdf` — PDF loading
Step 1: Install Dependencies
```bash
pip install anthropic chromadb voyageai pypdf
```

Set your API keys:

```bash
export ANTHROPIC_API_KEY="sk-ant-..."
export VOYAGE_API_KEY="pa-..."  # free tier available at voyageai.com
```

Step 2: Load and Chunk Your Documents
Chunking is the most underrated step in RAG. Too small and chunks lose context; too large and retrieval becomes noisy. A chunk size of around 512 tokens (approximated by word count in the code below) with 64-token overlap is a solid starting point for technical documentation.
```python
import os
from pathlib import Path
from pypdf import PdfReader

def load_pdf(file_path: str) -> str:
    """Extract all text from a PDF."""
    reader = PdfReader(file_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() + "\n"
    return text

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping chunks by word count."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        start += chunk_size - overlap
    return chunks

# Example: load a PDF and chunk it
raw_text = load_pdf("docs/anthropic_usage_policy.pdf")
chunks = chunk_text(raw_text)
print(f"Loaded {len(chunks)} chunks")
```

Chunking tips by content type:

- For code: chunk by function/class boundary, not word count
- For markdown: split on `##` headings to preserve semantic units
- For structured data: consider JSON chunks with metadata fields
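The markdown tip above can be sketched with a simple heading-based splitter. This is a minimal illustration, not a full markdown parser: it ignores nested headings and won't skip `##` lines inside code fences.

```python
def chunk_markdown(text: str) -> list[str]:
    """Split markdown into chunks at '##' headings, keeping each
    heading together with the body that follows it. Minimal sketch:
    no nesting, no code-fence awareness."""
    chunks = []
    current: list[str] = []
    for line in text.splitlines():
        # Start a new chunk whenever a level-2 heading begins
        if line.startswith("## ") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = "# Title\nintro\n## Setup\ninstall it\n## Usage\nrun it"
sections = chunk_markdown(doc)
print(len(sections))  # 3 chunks: preamble, Setup, Usage
```

Each chunk keeps its heading, so the retrieved text carries its own context label for free.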
Step 3: Embed and Store in ChromaDB
Now we embed each chunk into a vector and store it in ChromaDB. We'll use Voyage AI's voyage-3 model — it's optimized for retrieval and is the embedding model Anthropic recommends in their official cookbook.
```python
import voyageai
import chromadb

# Initialize clients
voyage_client = voyageai.Client()
chroma_client = chromadb.PersistentClient(path="./chroma_db")

# Create (or load) a collection
collection = chroma_client.get_or_create_collection(
    name="my_documents",
    metadata={"hnsw:space": "cosine"}  # cosine similarity for text
)

def embed_and_store(chunks: list[str], source: str = "unknown"):
    """Embed chunks and upsert into ChromaDB."""
    # Voyage AI supports batch embedding (up to 128 at once)
    batch_size = 64
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i : i + batch_size]
        result = voyage_client.embed(batch, model="voyage-3", input_type="document")
        embeddings = result.embeddings
        collection.upsert(
            ids=[f"{source}-chunk-{i + j}" for j in range(len(batch))],
            embeddings=embeddings,
            documents=batch,
            metadatas=[{"source": source, "chunk_index": i + j} for j in range(len(batch))]
        )
    print(f"Stored {len(chunks)} chunks from '{source}'")

# Index your documents
embed_and_store(chunks, source="anthropic_usage_policy")
```

The `PersistentClient` saves the database to disk — your embeddings survive between runs without re-indexing.
Step 4: Retrieve Relevant Chunks
At query time, embed the user's question using the same model (with input_type="query" — this matters for retrieval quality), then search ChromaDB for the closest chunks.
```python
def retrieve(query: str, top_k: int = 5) -> list[dict]:
    """Embed a query and return the top-k most relevant chunks."""
    query_embedding = voyage_client.embed(
        [query], model="voyage-3", input_type="query"
    ).embeddings[0]
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )
    chunks = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0]
    ):
        chunks.append({
            "text": doc,
            "source": meta["source"],
            "score": 1 - dist  # convert cosine distance to similarity
        })
    return chunks
```

| Parameter | Effect |
|---|---|
| `top_k` | More context vs. more noise — start at 5, tune up to 10 |
| `input_type="query"` | Improves retrieval by using a query-optimized embedding mode |
| `hnsw:space="cosine"` | Better than L2 for semantic text similarity |
| Score threshold | Filter out chunks below 0.65 similarity to cut noise |
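The score-threshold row is worth implementing explicitly. Here is a minimal filter over the output shape of `retrieve()`, using the 0.65 cutoff suggested above (the exact threshold is a tuning choice, not a fixed rule):

```python
MIN_SCORE = 0.65  # similarity cutoff; tune per corpus

def filter_by_score(chunks: list[dict], min_score: float = MIN_SCORE) -> list[dict]:
    """Drop retrieved chunks whose similarity falls below the threshold.
    If nothing survives, the caller should answer 'not enough info'
    rather than generate from noisy context."""
    return [c for c in chunks if c["score"] >= min_score]

# Toy example using the dict shape produced by retrieve()
retrieved = [
    {"text": "refund policy details...", "source": "policy.pdf", "score": 0.82},
    {"text": "unrelated blurb", "source": "blog.md", "score": 0.41},
]
kept = filter_by_score(retrieved)
print(len(kept))  # 1 — the low-scoring chunk is dropped
```

Calling this between `retrieve()` and the generation step keeps marginally related chunks out of the prompt.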
Step 5: Generate with Claude
Now we build the prompt and call the Claude API. The key pattern is the RAG system prompt — it tells Claude to answer only from the provided context and to cite its sources.
```python
import anthropic

claude_client = anthropic.Anthropic()

RAG_SYSTEM_PROMPT = """You are a helpful assistant that answers questions based ONLY on the provided context.

Rules:
1. Answer using only the information in the context sections below.
2. If the context does not contain enough information to answer, say "I don't have enough information in my knowledge base to answer that."
3. Cite the source at the end of your answer using [Source: <source name>].
4. Be concise but complete — include all relevant details from the context."""

def ask(query: str) -> str:
    """Full RAG pipeline: retrieve context, then generate an answer."""
    # 1. Retrieve relevant chunks
    retrieved = retrieve(query, top_k=5)

    # 2. Format context block
    context_blocks = []
    for i, chunk in enumerate(retrieved):
        context_blocks.append(
            f"[Context {i+1} | Source: {chunk['source']} | Score: {chunk['score']:.2f}]\n{chunk['text']}"
        )
    context_str = "\n\n---\n\n".join(context_blocks)

    # 3. Build the user message
    user_message = f"""Context:
{context_str}

---

Question: {query}"""

    # 4. Call Claude
    response = claude_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=RAG_SYSTEM_PROMPT,
        messages=[{"role": "user", "content": user_message}]
    )
    return response.content[0].text

# Try it
answer = ask("What types of content does Anthropic's usage policy prohibit?")
print(answer)
```

Step 6: Putting It All Together
Here's a minimal CLI wrapper that ties every step into a single script:
```python
# rag_pipeline.py
import os
import sys
import anthropic
import chromadb
import voyageai
from pypdf import PdfReader

# --- Config ---
EMBED_MODEL = "voyage-3"
CLAUDE_MODEL = "claude-sonnet-4-6"
CHUNK_SIZE = 512
CHUNK_OVERLAP = 64
TOP_K = 5

# --- Clients ---
voyage = voyageai.Client()
chroma = chromadb.PersistentClient(path="./chroma_db")
claude = anthropic.Anthropic()
collection = chroma.get_or_create_collection("docs", metadata={"hnsw:space": "cosine"})

# --- Core functions (from above) ---
# load_pdf(), chunk_text(), embed_and_store(), retrieve(), ask()
# ... (paste functions here)

if __name__ == "__main__":
    if sys.argv[1] == "index":
        path = sys.argv[2]
        text = load_pdf(path) if path.endswith(".pdf") else open(path).read()
        chunks = chunk_text(text)
        embed_and_store(chunks, source=os.path.basename(path))
    elif sys.argv[1] == "ask":
        query = " ".join(sys.argv[2:])
        print(ask(query))
```

```bash
# Index a document
python rag_pipeline.py index docs/my_policy.pdf

# Ask a question
python rag_pipeline.py ask "What are the key safety requirements?"
```

Common RAG Pitfalls (and How to Fix Them)
| Problem | Symptom | Fix |
|---|---|---|
| Chunks too large | Irrelevant context dilutes the answer | Reduce chunk_size to 256-512 |
| Wrong `input_type` | Low retrieval accuracy | Use `input_type="query"` for questions, `"document"` for indexing |
| No overlap | Answers miss cross-chunk context | Set overlap to 10-15% of chunk size |
| Stale index | Answers based on old docs | Re-run embed_and_store() after doc updates, use source-based upsert |
| Hallucination persists | Claude adds info not in context | Harden system prompt: "never add information not present in the context" |
| Cosine score too low | Retrieved chunks are irrelevant | Filter results where score < 0.65 and return "not enough info" |
Scaling Beyond the Basics
Once your prototype works, here's what production RAG looks like:
Hybrid Search — combine vector similarity with BM25 keyword search. Vector search handles semantic queries ("what does X mean"), BM25 handles exact-match queries ("find all mentions of clause 4.2"). Libraries like `rank_bm25` integrate cleanly.
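One simple way to merge the two result lists is reciprocal rank fusion (RRF). Here is a minimal, dependency-free sketch; it assumes each retriever returns document IDs ranked best-first, and the toy IDs are illustrative only:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score each doc id by the sum of
    1 / (k + rank) over every ranking it appears in, then sort by
    total score. k=60 is the conventional damping constant."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc7"]  # e.g. from ChromaDB
bm25_hits = ["doc1", "doc4", "doc3"]    # e.g. from rank_bm25
print(rrf_fuse([vector_hits, bm25_hits]))
# → ['doc1', 'doc3', 'doc4', 'doc7']
```

Documents that appear near the top of both lists float to the front, without needing to normalize the two retrievers' incompatible score scales.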
Reranking — after retrieving 20 candidates, use a cross-encoder reranker (Voyage's rerank-2 model works well with voyage_client.rerank()) to reorder them by relevance before passing to Claude. This alone typically improves answer quality by 15-20%.
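In outline, the retrieve-then-rerank pattern looks like the sketch below. The `score_fn` parameter is a stand-in for the cross-encoder call; the token-overlap scorer here is a hypothetical placeholder so the example runs offline, not how a real reranker scores.

```python
from typing import Callable

def retrieve_then_rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    final_k: int = 5,
) -> list[str]:
    """Stage 1 supplies a broad candidate list (e.g. top 20 from the
    vector store); stage 2 re-scores each (query, doc) pair with a
    more expensive model and keeps the best final_k."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:final_k]]

# Stand-in scorer: token overlap (a real system calls a cross-encoder here)
def overlap_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

docs = ["refund window is 30 days", "shipping rates table", "refund requires receipt"]
print(retrieve_then_rerank("what is the refund window", docs, overlap_score, final_k=2))
# → ['refund window is 30 days', 'refund requires receipt']
```

Swapping `overlap_score` for a cross-encoder call gives you the production version without changing the pipeline's shape.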
Metadata filtering — store rich metadata with each chunk (author, date, document type, department) and filter at query time: collection.query(..., where={"source": "legal_docs"}).
Streaming responses — for UI responsiveness, use claude_client.messages.stream() to stream tokens as they generate.
Evaluation — track answer quality with RAGAS metrics (faithfulness, context precision, context recall). Run automated evals before every model or prompt change.
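RAGAS estimates these metrics with an LLM judge; here is a toy context-precision calculation, with hand-labeled relevance, just to show what the metric measures:

```python
def context_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant to the
    question. Toy version with known labels; RAGAS estimates relevance
    with an LLM judge instead of ground-truth sets."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for cid in retrieved_ids if cid in relevant_ids)
    return hits / len(retrieved_ids)

# 4 chunks retrieved, 2 of them relevant → precision 0.5
print(context_precision(["c1", "c2", "c3", "c4"], {"c1", "c3"}))  # 0.5
```

Tracking this number across chunking or `top_k` changes tells you whether a retrieval tweak helped before you ever look at generated answers.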
Key Takeaways
- RAG separates knowledge storage (vector DB) from reasoning (Claude), making your AI grounded and auditable
- Chunking strategy and embedding quality are the biggest levers on retrieval accuracy — tune these before changing the LLM
- Always use `input_type="query"` vs `"document"` when using Voyage AI embeddings
- The RAG system prompt must explicitly constrain Claude to the provided context to prevent hallucination
- Production RAG adds hybrid search + reranking on top of the baseline vector retrieval
Next Steps
Ready to go deeper? If you're preparing for the Claude Certified Architect (CCA) exam, RAG architecture is a core topic — you'll need to understand retrieval strategies, chunking tradeoffs, and prompt patterns cold.
Explore our CCA Practice Test Bank — 200+ questions covering RAG, multi-agent systems, tool use, and the Anthropic API in exam format. Free sample questions available, no signup required.

You might also want to read our guides on Claude multi-agent orchestration and Claude API prompt caching — both are natural extensions once your RAG pipeline is running in production.