# Claude API Prompt Caching: Complete Guide to Cutting API Costs by 90%
If you're using the Claude API at scale, your biggest cost driver is almost always input tokens — and most of those tokens are identical across every request. Your system prompt doesn't change. Your product documentation doesn't change. Your few-shot examples don't change.
Prompt caching lets Claude "remember" the processed version of your prompt's static content so you don't pay full price to re-process it on every call. The result: up to 90% savings on input costs for the repeated portions.
This guide covers everything you need to know — how it works, when to use it, implementation with code, and real-world ROI calculations.
## What Is Claude Prompt Caching?
Prompt caching is an Anthropic API feature that stores a preprocessed snapshot of your prompt's static sections on Anthropic's servers. When a subsequent request starts with the same cached content, Claude skips reprocessing those tokens and charges you at the cache read rate instead of the full input rate.
The economics are straightforward:
| Token type | Cost relative to standard input |
|---|---|
| Normal input tokens | 1× (baseline) |
| Cache write tokens | 1.25× (slight premium for storage) |
| Cache read tokens | 0.1× (90% discount) |
The math pays off fast: a single cache hit saves 0.9× on the cached tokens, which more than covers the 0.25× write premium. You break even on the very first hit and bank real savings from then on.
Cache entries expire after 5 minutes by default (ephemeral caching), and the TTL resets each time the cache is read. An extended 1-hour TTL is available in beta for workloads with longer gaps between requests.
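To see why a single hit is enough, here's a quick sketch of the token-cost multipliers from the table above (pure arithmetic, no API calls):

```python
# Cost in "standard input token" units, using the multipliers above:
# normal input = 1x, cache write = 1.25x, cache read = 0.1x.

def cost_without_cache(tokens: int, requests: int) -> float:
    """Every request reprocesses all tokens at the normal rate."""
    return tokens * requests * 1.0

def cost_with_cache(tokens: int, requests: int) -> float:
    """First request writes the cache; the rest read from it."""
    return tokens * (1.25 + 0.1 * (requests - 1))

# Two requests (one write + one hit): 1.35x vs 2x the token cost,
# so a single cache hit already recoups the 0.25x write premium.
print(cost_without_cache(10_000, 2))  # 20000.0
print(cost_with_cache(10_000, 2))     # 13500.0
```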
## When Should You Use Prompt Caching?
Caching has a small overhead on first use (the write cost), so it's not always the right tool. Here's a quick decision framework:
**Cache it when:**

- Your system prompt is longer than ~1,000 tokens
- You're processing the same document or knowledge base repeatedly
- You make multiple API calls per session or per user interaction
- You have few-shot examples that stay constant across calls

**Skip it when:**

- Your entire prompt is highly dynamic (changes every request)
- You're making one-off queries with no repeated content
- Your static sections are under 500 tokens (overhead not worth it)
- You need sub-second latency on first response (cache writes add a few hundred ms)
The clearest signal: if you copy-paste the same block of text into every API request, that's your caching candidate.
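That checklist can be folded into a rough go/no-go helper. The thresholds below are this article's rules of thumb, not API limits, and `should_cache` is an illustrative helper name, not part of any SDK:

```python
def should_cache(static_tokens: int, calls_per_5_min: float) -> bool:
    """Rough heuristic from the checklist above: cache when the static
    prefix is big enough and traffic arrives within the 5-minute TTL."""
    if static_tokens < 1000:       # small prefix: write overhead not worth it
        return False
    return calls_per_5_min >= 1.0  # need at least one hit before expiry

print(should_cache(50_000, 3.0))   # big manual, steady traffic -> True
print(should_cache(400, 10.0))     # tiny prefix -> False
print(should_cache(50_000, 0.2))   # one call every 25 minutes -> False
```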
## How Prompt Caching Works Under the Hood
When you mark a prompt section for caching with cache_control, Anthropic stores the KV (key-value) cache of the processed transformer state for that prefix. On subsequent requests, if your prompt starts with the same cached prefix, Claude reads from the stored state rather than recomputing it.
A few important mechanics to understand:
- **Caching is prefix-based.** The cache key is the exact token sequence from the start of the prompt up to the breakpoint, so any change earlier in the prefix is a cache miss.
- **You place `cache_control` markers** to tell Claude where the static prefix ends and the dynamic section begins.
- **There is a minimum cacheable length.** Prefixes below the model's minimum (1,024 tokens for most Claude models) are processed normally even if marked.
- **Reads refresh the TTL.** Each cache hit resets the 5-minute clock, so steady traffic keeps the cache warm.

## Step-by-Step Implementation
### Basic Example: Caching a System Prompt
Here's the minimal implementation in Python using the Anthropic SDK:
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": """You are a senior software engineer specializing in Python and distributed systems.

You follow these principles when reviewing code:
- Prioritize correctness and safety over performance
- Flag any security vulnerabilities (SQL injection, XSS, command injection)
- Suggest specific, actionable improvements with code examples
- Explain the WHY behind each suggestion, not just the what
- Check for error handling, edge cases, and resource cleanup

You are thorough but concise — avoid repeating yourself or adding filler commentary.
Always format your review with: Summary, Issues Found (severity-ranked), Suggestions, and Score (1-10).""",
            "cache_control": {"type": "ephemeral"}  # <-- marks this for caching
        }
    ],
    messages=[
        {
            "role": "user",
            # code_to_review holds the function source you want reviewed
            "content": "Please review this Python function:\n\n" + code_to_review
        }
    ]
)

print(response.content[0].text)

# Check cache performance in usage stats
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
```
On the **first request**, you'll see `cache_creation_input_tokens` populated. On subsequent requests with the same system prompt, `cache_read_input_tokens` will populate instead — at 10% of the cost.
### Caching Large Documents or Knowledge Bases
The real power of prompt caching shows up when you're injecting large context — product documentation, codebases, legal documents, etc.:

```python
import anthropic

client = anthropic.Anthropic()

# Imagine this is your 50,000-token product manual
product_manual = open("product_manual.txt").read()

def answer_support_question(customer_question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": "You are a helpful customer support agent. Answer questions based on the product documentation below.\n\n" + product_manual,
                "cache_control": {"type": "ephemeral"}  # Cache the manual
            }
        ],
        messages=[
            {"role": "user", "content": customer_question}
        ]
    )
    return response.content[0].text

# First call: writes cache (1.25x cost for the manual tokens)
answer_support_question("How do I reset my password?")

# Every subsequent call: reads from cache (0.1x cost)
answer_support_question("What payment methods do you accept?")
answer_support_question("How do I cancel my subscription?")
```
With a 50,000-token product manual and 100 queries per day, the savings are significant (see the ROI section below).
### Caching in Multi-Turn Conversations
For chatbots, you can cache the system prompt and grow the conversation:

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are Zeon, an AI learning assistant specialized in Claude and Anthropic APIs.
You help developers understand how to build effective AI applications.
[... 2000 more tokens of instructions and context ...]"""

conversation_history = []

def chat(user_message: str) -> str:
    conversation_history.append({"role": "user", "content": user_message})

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"}  # System prompt cached every turn
            }
        ],
        messages=conversation_history
    )

    assistant_message = response.content[0].text
    conversation_history.append({"role": "assistant", "content": assistant_message})
    return assistant_message

# System prompt is cached after turn 1 — savings compound across the conversation
chat("What is prompt caching?")
chat("How does it work with multi-turn conversations?")
chat("Show me a code example")
```
## Real-World ROI: How Much Can You Actually Save?
### Scenario 1: Customer Support Bot with Product Manual
- Product manual: 50,000 tokens
- Daily queries: 100
- Model: Claude Sonnet 4.6 (input: ~$3/1M tokens)
**Without caching:**

100 queries × 50,000 tokens × $0.000003 = $15.00/day = $450/month

**With caching:**

Cache write (once per 5-min expiry, ~20/day): 20 × 50,000 × $0.000003 × 1.25 = $3.75/day
Cache reads (80/day): 80 × 50,000 × $0.000003 × 0.1 = $1.20/day
Total: $4.95/day = ~$148/month
**Monthly savings: $302 (67% reduction)**
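As a sanity check on the Scenario 1 arithmetic (assuming Sonnet input at $3 per million tokens):

```python
PRICE = 3 / 1_000_000   # $ per input token (Sonnet, assumed)
MANUAL = 50_000         # tokens in the product manual
QUERIES = 100           # queries per day

no_cache = QUERIES * MANUAL * PRICE      # every query reprocesses the manual
writes = 20 * MANUAL * PRICE * 1.25      # ~20 cache expiries per day
reads = 80 * MANUAL * PRICE * 0.10       # remaining 80 queries hit the cache
with_cache = writes + reads

print(f"${no_cache:.2f}/day vs ${with_cache:.2f}/day")     # $15.00/day vs $4.95/day
print(f"Monthly savings: ${(no_cache - with_cache) * 30:.2f}")
```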
### Scenario 2: Code Review Pipeline
- Codebase context: 200,000 tokens
- Weekly PR reviews: 50
- Model: Claude Opus 4.6 (input: ~$15/1M tokens)
**Without caching:**

50 reviews × 200,000 tokens × $0.000015 = $150/week = $600/month

**With caching** (assuming each week's reviews run close enough together that one cache write per batch suffices):

Cache write (once per weekly batch): 200,000 × $0.000015 × 1.25 = $3.75/week
Cache reads (50 reviews): 50 × 200,000 × $0.0000015 = $15.00/week
Monthly cost: ($3.75 + $15.00) × 4 = ~$75
**Monthly savings: $525 (87% reduction)**
### Scenario 3: Content Generation with Fixed Templates
- Template + examples: 5,000 tokens
- Blog posts per month: 80
- Model: Claude Sonnet 4.6
**Without caching:**

80 × 5,000 × $0.000003 = $1.20/month (low volume — caching payoff is smaller)

**With caching:**

Cache writes + reads: ~$0.50/month
For low-volume use cases under ~5,000 tokens, the savings are real but modest. Caching becomes transformative at **high token counts and high request volumes**.
## Tracking Cache Performance
Don't guess — measure. The API response includes detailed usage data:

```python
def log_cache_metrics(response):
    usage = response.usage
    input_tokens = usage.input_tokens
    cache_writes = getattr(usage, 'cache_creation_input_tokens', 0) or 0
    cache_reads = getattr(usage, 'cache_read_input_tokens', 0) or 0

    # Effective cost in standard-input units: writes at 1.25x, reads at 0.1x
    total_effective_input = cache_writes * 1.25 + cache_reads * 0.1 + input_tokens
    cache_hit_rate = cache_reads / max(cache_writes + cache_reads, 1) * 100

    print(f"Cache hit rate: {cache_hit_rate:.1f}%")
    print(f"Effective cost multiplier: "
          f"{total_effective_input / max(cache_writes + cache_reads + input_tokens, 1):.2f}x")

    return {
        "cache_hit_rate": cache_hit_rate,
        "cache_writes": cache_writes,
        "cache_reads": cache_reads,
        "input_tokens": input_tokens
    }
```
**Target benchmarks:**
- Cache hit rate > 70%: Excellent — you've optimized well
- Cache hit rate 40-70%: Good — room for improvement
- Cache hit rate < 40%: Investigate why misses are high (TTL expiry, content variation, short sessions)
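Those benchmark bands can be encoded directly into your monitoring. `classify_hit_rate` is an illustrative helper, not part of any SDK:

```python
def classify_hit_rate(hit_rate_pct: float) -> str:
    """Map a cache hit rate (in percent) onto the benchmark bands above."""
    if hit_rate_pct > 70:
        return "excellent"
    if hit_rate_pct >= 40:
        return "good"
    return "investigate"

print(classify_hit_rate(85.0))  # excellent
print(classify_hit_rate(55.0))  # good
print(classify_hit_rate(20.0))  # investigate
```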
## Common Mistakes to Avoid
**1. Putting dynamic content before static content**
The cache key is prefix-based. If your dynamic user input comes before your system prompt in the message, the cache never hits.

```python
# Wrong — prepending user input changes the prefix every request, breaking the cache key
system = user_input + "\n\n" + static_instructions

# Right — static content first, dynamic content after
system = static_instructions  # cached
messages = [{"role": "user", "content": user_input}]  # not cached
```
**2. Not accounting for the 5-minute TTL in low-frequency apps**
If your app makes one request every 10 minutes per user, your cache will frequently expire between calls. You'll be paying cache write costs (1.25x) without getting cache read savings. Either batch requests, warm the cache proactively, or accept that caching isn't the right tool for that workload.
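One way to warm the cache proactively is a timer that re-touches the cached prefix before the TTL lapses. A minimal sketch: `touch_cache` is a hypothetical callable you supply (e.g. a `messages.create` call with `max_tokens=1` and the same cached system block your real requests use), and the 60-second safety margin is an illustrative choice:

```python
import threading
from typing import Callable

def start_cache_warmer(touch_cache: Callable[[], None],
                       ttl_seconds: float = 300.0,
                       margin_seconds: float = 60.0) -> threading.Timer:
    """Re-run touch_cache() shortly before each TTL expiry so the cached
    prefix stays warm between real requests. Returns the first timer."""
    interval = ttl_seconds - margin_seconds  # refresh with a safety margin

    def tick() -> None:
        touch_cache()  # minimal request that starts with the cached prefix
        t = threading.Timer(interval, tick)
        t.daemon = True
        t.start()

    timer = threading.Timer(interval, tick)
    timer.daemon = True
    timer.start()
    return timer
```

Warming only pays off when the keep-alive pings cost less than the cache rewrites they prevent, so run the numbers for your token volume first.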
**3. Caching content that actually changes**
User preferences, real-time data, session state — these don't belong in the cached prefix. If you cache content that changes across requests, you'll get cache misses or — worse — stale context.
**4. Forgetting to check cache metrics after deployment**
Set up logging from day one. A silent cache miss in production means you're paying full price and not knowing it.
**5. Over-caching short prompts**
If your static section is under 500 tokens, the overhead of managing cache writes may not be worth the savings. Run the math for your specific token counts and request volumes.
## Advanced Pattern: Combining Caching with Batching
For maximum cost reduction, combine prompt caching with the Message Batches API:

```python
import anthropic

client = anthropic.Anthropic()

# Collect customer questions before sending
batch_requests = []
for question in customer_questions_queue:
    batch_requests.append({
        "custom_id": f"question-{question.id}",
        "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 512,
            "system": [
                {
                    "type": "text",
                    "text": product_manual_text,
                    "cache_control": {"type": "ephemeral"}
                }
            ],
            "messages": [{"role": "user", "content": question.text}]
        }
    })

# Batching gives a 50% discount on API pricing;
# caching gives a 90% discount on repeated input tokens.
batch = client.beta.messages.batches.create(requests=batch_requests)
```
Batching and caching stack independently — the batch discount applies to the overall API pricing tier, while caching reduces the effective token cost within each request.
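Assuming the two discounts stack multiplicatively as described (the 50% batch rate applied on top of the cache read rate), the cached portion of a batched request works out to:

```python
BATCH_DISCOUNT = 0.5   # batch requests bill at 50% of standard pricing
CACHE_READ = 0.1       # cache reads bill at 10% of the normal input rate

combined = BATCH_DISCOUNT * CACHE_READ
print(f"cached tokens in a batch: {combined:.2f}x the normal input rate")  # 0.05x
```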
## Key Takeaways

- **Cache write cost is 1.25× normal input; cache read cost is 0.1×** — a single cache hit already covers the write premium
- **5-minute TTL** — design your caching strategy around this for high-frequency workloads
- **Static prefix first** — cache keys are prefix-based, so put documents/instructions before user content
- **Measure everything** — `cache_creation_input_tokens` and `cache_read_input_tokens` in the API response tell you exactly what's happening
- **High ROI scenarios:** large context documents (50K+ tokens), high request volume (100+/day), multi-turn conversations
## Next Steps
Ready to dig deeper into Claude's API features?
- Claude API Tutorial for Beginners — Start here if you're new to the API
- Claude Multi-Agent Orchestration Tutorial — Build systems where agents hand off context efficiently (caching helps here too)
- Claude Certified Architect (CCA) Exam Guide — Prompt caching is testable on the CCA exam
If you're preparing for the Claude Certified Architect certification, prompt caching is a topic you need to understand cold — both the implementation details and the cost model. Our CCA Practice Test Bank includes questions on caching strategies, API optimization, and cost management.
Start practicing for free → Take our sample CCA questions and see where your Claude API knowledge stands.

**Ready to Start Practicing?**
300+ scenario-based practice questions covering all 5 CCA domains. Detailed explanations for every answer.
Free CCA Study Kit
Get domain cheat sheets, anti-pattern flashcards, and weekly exam tips. No spam, unsubscribe anytime.