
# Claude API Prompt Caching: Complete Guide to Cutting API Costs by 90%

Learn how Claude's prompt caching works, when to use it, and how to implement it with code examples. Save up to 90% on API input token costs.


If you're using the Claude API at scale, your biggest cost driver is almost always input tokens — and most of those tokens are identical across every request. Your system prompt doesn't change. Your product documentation doesn't change. Your few-shot examples don't change.

Prompt caching lets Claude "remember" the processed version of your prompt's static content so you don't pay full price to re-process it on every call. The result: up to 90% savings on input costs for the repeated portions.

This guide covers everything you need to know — how it works, when to use it, implementation with code, and real-world ROI calculations.

## What Is Claude Prompt Caching?

Prompt caching is an Anthropic API feature that stores a preprocessed snapshot of your prompt's static sections on Anthropic's servers. When a subsequent request starts with the same cached content, Claude skips reprocessing those tokens and charges you at the cache read rate instead of the full input rate.

The economics are straightforward:

| Token type | Cost relative to standard input |
|---|---|
| Normal input tokens | 1× (baseline) |
| Cache write tokens | 1.25× (slight premium for storage) |
| Cache read tokens | 0.1× (90% discount) |

The math pays off fast: a single cache hit saves 0.9× on the cached tokens, which more than covers the 0.25× write premium. You're net positive after the first hit, and every additional hit is almost pure savings.
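To make the break-even point concrete, here's a quick back-of-the-envelope calculation. The prefix size and price below are illustrative assumptions (a 10,000-token cached prefix at roughly $3 per million input tokens), not figures from any specific bill:

```python
PRICE_PER_TOKEN = 3 / 1_000_000   # assumed Sonnet-class input price: $3 per 1M tokens
PREFIX_TOKENS = 10_000            # hypothetical static prefix

base = PREFIX_TOKENS * PRICE_PER_TOKEN   # cost to process the prefix once, uncached
write = base * 1.25                      # first request pays the cache write premium
read = base * 0.10                       # every cache hit pays the read rate

for n in (1, 2, 10, 100):
    uncached = n * base
    cached = write + (n - 1) * read
    print(f"{n:>3} requests: ${uncached:.4f} uncached vs ${cached:.4f} cached")
```

Running this shows the crossover immediately: the cached path costs slightly more for a single request and is cheaper from the second request onward.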

Cache entries expire after 5 minutes by default (ephemeral caching). Extended caching with longer TTLs is available in beta for higher-volume use cases.

## When Should You Use Prompt Caching?

Caching has a small overhead on first use (the write cost), so it's not always the right tool. Here's a quick decision framework:

Cache it when:
  • Your system prompt is longer than ~1,000 tokens
  • You're processing the same document or knowledge base repeatedly
  • You make multiple API calls per session or per user interaction
  • You have few-shot examples that stay constant across calls

Skip caching when:
  • Your entire prompt is highly dynamic (changes every request)
  • You're making one-off queries with no repeated content
  • Your static sections are under 500 tokens (overhead not worth it)
  • You need sub-second latency on first response (cache writes add a few hundred ms)

The clearest signal: if you copy-paste the same block of text into every API request, that's your caching candidate.
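As a rough, illustrative way to encode that decision framework (the thresholds here are assumptions pulled from the lists above, not official guidance):

```python
def caching_looks_worthwhile(static_prefix_tokens: int, requests_per_5_min: float) -> bool:
    """Rule of thumb: the static prefix must be big enough, and it must be
    reused at least once before the default 5-minute cache TTL expires."""
    big_enough = static_prefix_tokens >= 1_000   # assumed threshold from the checklist above
    reused_in_time = requests_per_5_min >= 2     # one write plus at least one read per TTL window
    return big_enough and reused_in_time

print(caching_looks_worthwhile(50_000, 20))   # busy support bot with a large manual -> True
print(caching_looks_worthwhile(300, 0.2))     # short prompt, sporadic traffic       -> False
```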

## How Prompt Caching Works Under the Hood

When you mark a prompt section for caching with cache_control, Anthropic stores the KV (key-value) cache of the processed transformer state for that prefix. On subsequent requests, if your prompt starts with the same cached prefix, Claude reads from the stored state rather than recomputing it.

A few important mechanics to understand:

  • Cache keys are prefix-based. The cache only matches if your request starts with the exact same cached content. Even a single character difference anywhere in the cached prefix causes a cache miss.
  • Order matters. Place your static content (system prompt, documents, instructions) at the start of your prompt. Dynamic content (user input, variables) goes after the cached section.
  • Breakpoints define boundaries. You use cache_control to tell Claude where the static prefix ends and the dynamic section begins.
  • The 5-minute TTL resets on use. Each cache read extends the TTL — so in an active conversation, the cache stays warm as long as you're making requests.

## Step-by-Step Implementation

### Basic Example: Caching a System Prompt

Here's the minimal implementation in Python using the Anthropic SDK:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": """You are a senior software engineer specializing in Python and distributed systems.

You follow these principles when reviewing code:
- Prioritize correctness and safety over performance
- Flag any security vulnerabilities (SQL injection, XSS, command injection)
- Suggest specific, actionable improvements with code examples
- Explain the WHY behind each suggestion, not just the what
- Check for error handling, edge cases, and resource cleanup

You are thorough but concise — avoid repeating yourself or adding filler commentary.
Always format your review with: Summary, Issues Found (severity-ranked), Suggestions, and Score (1-10).""",
            "cache_control": {"type": "ephemeral"}  # <-- marks this for caching
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "Please review this Python function:\n\n```python\ndef get_user(db, user_id):\n    query = f'SELECT * FROM users WHERE id = {user_id}'\n    return db.execute(query).fetchone()\n```"
        }
    ]
)

print(response.content[0].text)

# Check cache performance in usage stats
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
```

On the **first request**, you'll see `cache_creation_input_tokens` populated. On subsequent requests with the same system prompt, `cache_read_input_tokens` will populate instead — at 10% of the cost.
    
### Caching Large Documents or Knowledge Bases

The real power of prompt caching shows up when you're injecting large context — product documentation, codebases, legal documents, etc.:

```python
import anthropic

client = anthropic.Anthropic()

# Imagine this is your 50,000-token product manual
product_manual = open("product_manual.txt").read()

def answer_support_question(customer_question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": "You are a helpful customer support agent. Answer questions based on the product documentation below.\n\n" + product_manual,
                "cache_control": {"type": "ephemeral"}  # Cache the manual
            }
        ],
        messages=[
            {
                "role": "user",
                "content": customer_question
            }
        ]
    )
    return response.content[0].text

# First call: writes cache (1.25x cost for the manual tokens)
answer_support_question("How do I reset my password?")

# Every subsequent call: reads from cache (0.1x cost)
answer_support_question("What payment methods do you accept?")
answer_support_question("How do I cancel my subscription?")
```

With a 50,000-token product manual and 100 queries per day, the savings are significant (see the ROI section below).
    
### Caching in Multi-Turn Conversations

For chatbots, you can cache the system prompt and grow the conversation:

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are Zeon, an AI learning assistant specialized in Claude and Anthropic APIs.
You help developers understand how to build effective AI applications.
[... 2000 more tokens of instructions and context ...]"""

conversation_history = []

def chat(user_message: str) -> str:
    conversation_history.append({
        "role": "user",
        "content": user_message
    })

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"}  # System prompt cached every turn
            }
        ],
        messages=conversation_history
    )

    assistant_message = response.content[0].text
    conversation_history.append({
        "role": "assistant",
        "content": assistant_message
    })
    return assistant_message

# System prompt is cached after turn 1 — savings compound across the conversation
chat("What is prompt caching?")
chat("How does it work with multi-turn conversations?")
chat("Show me a code example")
```

    
## Real-World ROI: How Much Can You Actually Save?

### Scenario 1: Customer Support Bot with Product Manual

- Product manual: 50,000 tokens
- Daily queries: 100
- Model: Claude Sonnet 4.6 (input: ~$3/1M tokens)

**Without caching:**

100 queries × 50,000 tokens × $0.000003 = $15.00/day = $450/month

**With caching:**

Cache writes (once per 5-minute window, ~20/day): 20 × 50,000 × $0.000003 × 1.25 = $3.75/day

Cache reads (80/day): 80 × 50,000 × $0.000003 × 0.1 = $1.20/day

Total: $4.95/day = ~$148/month

**Monthly savings: $302 (67% reduction)**
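To sanity-check these figures (or plug in your own traffic), a small cost model like the sketch below reproduces the scenario 1 numbers; the prices and volumes are the assumptions stated above:

```python
def monthly_input_cost(prefix_tokens, queries_per_day, price_per_mtok,
                       cache_writes_per_day=0, days=30):
    """Approximate monthly input-token cost; cache_writes_per_day=0 means no caching."""
    per_token = price_per_mtok / 1_000_000
    if cache_writes_per_day == 0:
        daily = queries_per_day * prefix_tokens * per_token
    else:
        reads_per_day = queries_per_day - cache_writes_per_day
        daily = (cache_writes_per_day * prefix_tokens * per_token * 1.25
                 + reads_per_day * prefix_tokens * per_token * 0.10)
    return daily * days

print(monthly_input_cost(50_000, 100, 3))                           # ~$450 without caching
print(monthly_input_cost(50_000, 100, 3, cache_writes_per_day=20))  # ~$148.50 with caching
```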
    
### Scenario 2: Code Review Pipeline

- Codebase context: 200,000 tokens
- Weekly PR reviews: 50
- Model: Claude Opus 4.6 (input: ~$15/1M tokens)

**Without caching:**

50 reviews × 200,000 tokens × $0.000015 = $150/week = ~$600/month

**With caching:**

Cache writes (~4 warm-ups/month): 4 × 200,000 × $0.000015 × 1.25 = ~$15/month

Cache reads (50 reviews/week): 50 × 200,000 × $0.0000015 = $15.00/week = ~$60/month

Monthly cost: ~$75

**Monthly savings: $525 (87% reduction)**
    
### Scenario 3: Content Generation with Fixed Templates

- Template + examples: 5,000 tokens
- Blog posts per month: 80
- Model: Claude Sonnet 4.6

**Without caching:**

80 × 5,000 × $0.000003 = $1.20/month (low volume — caching payoff is smaller)

**With caching:**

Cache writes + reads: ~$0.50/month

For low-volume use cases under ~5,000 tokens, the savings are real but modest. Caching becomes transformative at **high token counts and high request volumes**.
    
## Tracking Cache Performance

Don't guess — measure. The API response includes detailed usage data:

```python
def log_cache_metrics(response):
    usage = response.usage
    input_tokens = usage.input_tokens
    cache_writes = getattr(usage, 'cache_creation_input_tokens', 0)
    cache_reads = getattr(usage, 'cache_read_input_tokens', 0)

    # Weight each token type by its cost multiplier (write 1.25x, read 0.1x)
    total_effective_input = cache_writes * 1.25 + cache_reads * 0.1 + input_tokens

    # Small epsilon in the denominator avoids division by zero when nothing is cached
    cache_hit_rate = cache_reads / (cache_writes + cache_reads + 0.001) * 100

    print(f"Cache hit rate: {cache_hit_rate:.1f}%")
    print(f"Effective cost multiplier: {total_effective_input / (cache_writes + cache_reads + input_tokens):.2f}x")

    return {
        "cache_hit_rate": cache_hit_rate,
        "cache_writes": cache_writes,
        "cache_reads": cache_reads,
        "input_tokens": input_tokens
    }
```

    
**Target benchmarks:**
- Cache hit rate > 70%: Excellent — you've optimized well
- Cache hit rate 40-70%: Good — room for improvement
- Cache hit rate < 40%: Investigate why misses are high (TTL expiry, content variation, short sessions)
    
## Common Mistakes to Avoid

**1. Putting dynamic content before static content**

The cache key is prefix-based. If your dynamic user input comes before your system prompt in the message, the cache never hits.

```python
# Wrong — user input breaks the cache key
system = user_input + "\n\n" + static_instructions

# Right — static content first, dynamic content after
system = static_instructions                           # cached
messages = [{"role": "user", "content": user_input}]   # not cached
```

    
**2. Not accounting for the 5-minute TTL in low-frequency apps**

If your app makes one request every 10 minutes per user, your cache will frequently expire between calls. You'll be paying cache write costs (1.25x) without getting cache read savings. Either batch requests, warm the cache proactively, or accept that caching isn't the right tool for that workload.
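If you do choose to warm the cache proactively, one simple approach is a keep-alive ping that re-reads the cached prefix before the TTL lapses. This is an illustrative sketch only; each ping is billed as a cache read of the full prefix plus a one-token completion, so check that the pings actually cost less than simply re-writing the cache when the next real request arrives:

```python
import threading

import anthropic

client = anthropic.Anthropic()

def keep_cache_warm(system_blocks, interval_seconds=240):
    """Re-read the cached prefix every ~4 minutes so the 5-minute TTL never lapses.
    system_blocks is the same system list (with cache_control markers) used by real requests."""
    def ping():
        client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1,
            system=system_blocks,
            messages=[{"role": "user", "content": "ping"}],
        )
        threading.Timer(interval_seconds, ping).start()
    ping()
```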
    
**3. Caching content that actually changes**

User preferences, real-time data, session state — these don't belong in the cached prefix. If you cache content that changes across requests, you'll get cache misses or — worse — stale context.

**4. Forgetting to check cache metrics after deployment**

Set up logging from day one. A silent cache miss in production means you're paying full price without knowing it.

**5. Over-caching short prompts**

If your static section is under 500 tokens, the overhead of managing cache writes may not be worth the savings. Run the math for your specific token counts and request volumes.
    
## Advanced Pattern: Combining Caching with Batching

For maximum cost reduction, combine prompt caching with the Messages Batching API:

```python
import anthropic

client = anthropic.Anthropic()

# Collect 10 customer questions before sending
batch_requests = []

for question in customer_questions_queue:
    batch_requests.append({
        "custom_id": f"question-{question.id}",
        "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 512,
            "system": [
                {
                    "type": "text",
                    "text": product_manual_text,
                    "cache_control": {"type": "ephemeral"}
                }
            ],
            "messages": [{"role": "user", "content": question.text}]
        }
    })

# Batching gives a 50% discount on API pricing
# Caching gives a 90% discount on cached input tokens
# Combined: very significant cost reduction
batch = client.beta.messages.batches.create(requests=batch_requests)
```

Batching and caching stack independently — the batch discount applies to the overall API pricing tier, while caching reduces the effective token cost within each request.
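As a rough illustration of how the two discounts compound on the cached portion of a batched request (assuming, as described above, that the batch discount applies on top of cache-read pricing):

```python
CACHE_READ_MULTIPLIER = 0.10   # cache hits cost ~10% of the standard input rate
BATCH_MULTIPLIER = 0.50        # the Batches API halves the price of a request

effective = CACHE_READ_MULTIPLIER * BATCH_MULTIPLIER
print(f"Cached tokens in a batched request cost ~{effective:.2f}x the standard rate")  # ~0.05x
```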

## Key Takeaways

  • **Cache write cost is 1.25× normal input; cache read cost is 0.1×** — the write premium is recovered by the first cache hit
  • **5-minute TTL** — design your caching strategy around this for high-frequency workloads
  • **Static prefix first** — cache keys are prefix-based, so put documents/instructions before user content
  • **Measure everything** — `cache_creation_input_tokens` and `cache_read_input_tokens` in the API response tell you exactly what's happening
  • **High ROI scenarios:** large context documents (50K+ tokens), high request volume (100+/day), multi-turn conversations

## Next Steps

Ready to dig deeper into Claude's API features?

If you're preparing for the Claude Certified Architect certification, prompt caching is a topic you need to understand cold — both the implementation details and the cost model. Our CCA Practice Test Bank includes questions on caching strategies, API optimization, and cost management.

Start practicing for free → Take our sample CCA questions and see where your Claude API knowledge stands.
