
# Claude API Prompt Caching: Complete Guide to Cutting API Costs by 90%

Learn how Claude's prompt caching works, when to use it, and how to implement it with code examples. Save up to 90% on API input token costs.


If you're using the Claude API at scale, your biggest cost driver is almost always input tokens — and most of those tokens are identical across every request. Your system prompt doesn't change. Your product documentation doesn't change. Your few-shot examples don't change.

Prompt caching lets Claude "remember" the processed version of your prompt's static content so you don't pay full price to re-process it on every call. The result: up to 90% savings on input costs for the repeated portions.

This guide covers everything you need to know — how it works, when to use it, implementation with code, and real-world ROI calculations.

## What Is Claude Prompt Caching?

Prompt caching is an Anthropic API feature that stores a preprocessed snapshot of your prompt's static sections on Anthropic's servers. When a subsequent request starts with the same cached content, Claude skips reprocessing those tokens and charges you at the cache read rate instead of the full input rate.

The economics are straightforward:

| Token type | Cost relative to standard input |
|---|---|
| Normal input tokens | 1× (baseline) |
| Cache write tokens | 1.25× (slight premium for storage) |
| Cache read tokens | 0.1× (90% discount) |

The math pays off fast: a single cache hit saves 0.9× on the cached tokens, which more than covers the 0.25× write premium. You're net positive after the first hit, and every additional hit is almost pure savings.
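To make the break-even point concrete, here's a quick back-of-the-envelope calculation. The prefix size and price below are illustrative assumptions (a 10,000-token cached prefix at roughly $3 per million input tokens), not figures from any specific bill:

```python
PRICE_PER_TOKEN = 3 / 1_000_000   # assumed Sonnet-class input price: $3 per 1M tokens
PREFIX_TOKENS = 10_000            # hypothetical static prefix

base = PREFIX_TOKENS * PRICE_PER_TOKEN   # cost to process the prefix once, uncached
write = base * 1.25                      # first request pays the cache write premium
read = base * 0.10                       # every cache hit pays the read rate

for n in (1, 2, 10, 100):
    uncached = n * base
    cached = write + (n - 1) * read
    print(f"{n:>3} requests: ${uncached:.4f} uncached vs ${cached:.4f} cached")
```

Running this shows the crossover immediately: the cached path costs slightly more for a single request and is cheaper from the second request onward.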

Cache entries expire after 5 minutes by default (ephemeral caching). Extended caching with longer TTLs is available in beta for higher-volume use cases.

## When Should You Use Prompt Caching?

Caching has a small overhead on first use (the write cost), so it's not always the right tool. Here's a quick decision framework:

Cache it when:
  • Your system prompt is longer than ~1,000 tokens
  • You're processing the same document or knowledge base repeatedly
  • You make multiple API calls per session or per user interaction
  • You have few-shot examples that stay constant across calls

Skip caching when:
  • Your entire prompt is highly dynamic (changes every request)
  • You're making one-off queries with no repeated content
  • Your static sections are under 500 tokens (overhead not worth it)
  • You need sub-second latency on first response (cache writes add a few hundred ms)

The clearest signal: if you copy-paste the same block of text into every API request, that's your caching candidate.
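As a rough, illustrative way to encode that decision framework (the thresholds here are assumptions pulled from the lists above, not official guidance):

```python
def caching_looks_worthwhile(static_prefix_tokens: int, requests_per_5_min: float) -> bool:
    """Rule of thumb: the static prefix must be big enough, and it must be
    reused at least once before the default 5-minute cache TTL expires."""
    big_enough = static_prefix_tokens >= 1_000   # assumed threshold from the checklist above
    reused_in_time = requests_per_5_min >= 2     # one write plus at least one read per TTL window
    return big_enough and reused_in_time

print(caching_looks_worthwhile(50_000, 20))   # busy support bot with a large manual -> True
print(caching_looks_worthwhile(300, 0.2))     # short prompt, sporadic traffic       -> False
```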

## How Prompt Caching Works Under the Hood

When you mark a prompt section for caching with cache_control, Anthropic stores the KV (key-value) cache of the processed transformer state for that prefix. On subsequent requests, if your prompt starts with the same cached prefix, Claude reads from the stored state rather than recomputing it.

A few important mechanics to understand:

  • Cache keys are prefix-based. The cache only matches if your request starts with the exact same cached content. Even a single character difference anywhere in the cached prefix causes a cache miss.
  • Order matters. Place your static content (system prompt, documents, instructions) at the start of your prompt. Dynamic content (user input, variables) goes after the cached section.
  • Breakpoints define boundaries. You use cache_control to tell Claude where the static prefix ends and the dynamic section begins.
  • The 5-minute TTL resets on use. Each cache read extends the TTL — so in an active conversation, the cache stays warm as long as you're making requests.

## Step-by-Step Implementation

### Basic Example: Caching a System Prompt

Here's the minimal implementation in Python using the Anthropic SDK:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": """You are a senior software engineer specializing in Python and distributed systems.

You follow these principles when reviewing code:
- Prioritize correctness and safety over performance
- Flag any security vulnerabilities (SQL injection, XSS, command injection)
- Suggest specific, actionable improvements with code examples
- Explain the WHY behind each suggestion, not just the what
- Check for error handling, edge cases, and resource cleanup

You are thorough but concise — avoid repeating yourself or adding filler commentary.
Always format your review with: Summary, Issues Found (severity-ranked), Suggestions, and Score (1-10).""",
            "cache_control": {"type": "ephemeral"}  # <-- marks this for caching
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "Please review this Python function:\n\n```python\ndef get_user(db, user_id):\n    query = f'SELECT * FROM users WHERE id = {user_id}'\n    return db.execute(query).fetchone()\n```"
        }
    ]
)

print(response.content[0].text)

# Check cache performance in usage stats
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
```

On the **first request**, you'll see `cache_creation_input_tokens` populated. On subsequent requests with the same system prompt, `cache_read_input_tokens` will populate instead — at 10% of the cost.
    
### Caching Large Documents or Knowledge Bases

The real power of prompt caching shows up when you're injecting large context — product documentation, codebases, legal documents, etc.:

```python
import anthropic

client = anthropic.Anthropic()

# Imagine this is your 50,000-token product manual
product_manual = open("product_manual.txt").read()

def answer_support_question(customer_question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": "You are a helpful customer support agent. Answer questions based on the product documentation below.\n\n" + product_manual,
                "cache_control": {"type": "ephemeral"}  # Cache the manual
            }
        ],
        messages=[
            {
                "role": "user",
                "content": customer_question
            }
        ]
    )
    return response.content[0].text

# First call: writes cache (1.25x cost for the manual tokens)
answer_support_question("How do I reset my password?")

# Every subsequent call: reads from cache (0.1x cost)
answer_support_question("What payment methods do you accept?")
answer_support_question("How do I cancel my subscription?")
```

With a 50,000-token product manual and 100 queries per day, the savings are significant (see the ROI section below).
    
### Caching in Multi-Turn Conversations

For chatbots, you can cache the system prompt and grow the conversation:

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are Zeon, an AI learning assistant specialized in Claude and Anthropic APIs.
You help developers understand how to build effective AI applications.
[... 2000 more tokens of instructions and context ...]"""

conversation_history = []

def chat(user_message: str) -> str:
    conversation_history.append({
        "role": "user",
        "content": user_message
    })

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"}  # System prompt cached every turn
            }
        ],
        messages=conversation_history
    )

    assistant_message = response.content[0].text
    conversation_history.append({
        "role": "assistant",
        "content": assistant_message
    })
    return assistant_message

# System prompt is cached after turn 1 — savings compound across the conversation
chat("What is prompt caching?")
chat("How does it work with multi-turn conversations?")
chat("Show me a code example")
```

    
## Real-World ROI: How Much Can You Actually Save?

### Scenario 1: Customer Support Bot with Product Manual

- Product manual: 50,000 tokens
- Daily queries: 100
- Model: Claude Sonnet 4.6 (input: ~$3/1M tokens)

**Without caching:**

100 queries × 50,000 tokens × $0.000003 = $15.00/day = $450/month

**With caching:**

Cache writes (once per 5-minute window, ~20/day): 20 × 50,000 × $0.000003 × 1.25 = $3.75/day

Cache reads (80/day): 80 × 50,000 × $0.000003 × 0.1 = $1.20/day

Total: $4.95/day = ~$148/month

**Monthly savings: $302 (67% reduction)**
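To sanity-check these figures (or plug in your own traffic), a small cost model like the sketch below reproduces the scenario 1 numbers; the prices and volumes are the assumptions stated above:

```python
def monthly_input_cost(prefix_tokens, queries_per_day, price_per_mtok,
                       cache_writes_per_day=0, days=30):
    """Approximate monthly input-token cost; cache_writes_per_day=0 means no caching."""
    per_token = price_per_mtok / 1_000_000
    if cache_writes_per_day == 0:
        daily = queries_per_day * prefix_tokens * per_token
    else:
        reads_per_day = queries_per_day - cache_writes_per_day
        daily = (cache_writes_per_day * prefix_tokens * per_token * 1.25
                 + reads_per_day * prefix_tokens * per_token * 0.10)
    return daily * days

print(monthly_input_cost(50_000, 100, 3))                           # ~$450 without caching
print(monthly_input_cost(50_000, 100, 3, cache_writes_per_day=20))  # ~$148.50 with caching
```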
    
### Scenario 2: Code Review Pipeline

- Codebase context: 200,000 tokens
- Weekly PR reviews: 50
- Model: Claude Opus 4.6 (input: ~$15/1M tokens)

**Without caching:**

50 reviews × 200,000 tokens × $0.000015 = $150/week = ~$600/month

**With caching:**

Cache writes (~4 warm-ups/month): 4 × 200,000 × $0.000015 × 1.25 = ~$15/month

Cache reads (50 reviews/week): 50 × 200,000 × $0.0000015 = $15.00/week = ~$60/month

Monthly cost: ~$75

**Monthly savings: $525 (87% reduction)**
    
### Scenario 3: Content Generation with Fixed Templates

- Template + examples: 5,000 tokens
- Blog posts per month: 80
- Model: Claude Sonnet 4.6

**Without caching:**

80 × 5,000 × $0.000003 = $1.20/month (low volume — caching payoff is smaller)

**With caching:**

Cache writes + reads: ~$0.50/month

For low-volume use cases under ~5,000 tokens, the savings are real but modest. Caching becomes transformative at **high token counts and high request volumes**.
    
## Tracking Cache Performance

Don't guess — measure. The API response includes detailed usage data:

```python
def log_cache_metrics(response):
    usage = response.usage
    input_tokens = usage.input_tokens
    cache_writes = getattr(usage, 'cache_creation_input_tokens', 0)
    cache_reads = getattr(usage, 'cache_read_input_tokens', 0)

    # Weight each token type by its cost multiplier (write 1.25x, read 0.1x)
    total_effective_input = cache_writes * 1.25 + cache_reads * 0.1 + input_tokens

    # Small epsilon in the denominator avoids division by zero when nothing is cached
    cache_hit_rate = cache_reads / (cache_writes + cache_reads + 0.001) * 100

    print(f"Cache hit rate: {cache_hit_rate:.1f}%")
    print(f"Effective cost multiplier: {total_effective_input / (cache_writes + cache_reads + input_tokens):.2f}x")

    return {
        "cache_hit_rate": cache_hit_rate,
        "cache_writes": cache_writes,
        "cache_reads": cache_reads,
        "input_tokens": input_tokens
    }
```

    
**Target benchmarks:**
- Cache hit rate > 70%: Excellent — you've optimized well
- Cache hit rate 40-70%: Good — room for improvement
- Cache hit rate < 40%: Investigate why misses are high (TTL expiry, content variation, short sessions)
    
## Common Mistakes to Avoid

**1. Putting dynamic content before static content**

The cache key is prefix-based. If your dynamic user input comes before your system prompt in the message, the cache never hits.

```python
# Wrong — user input breaks the cache key
system = user_input + "\n\n" + static_instructions

# Right — static content first, dynamic content after
system = static_instructions                           # cached
messages = [{"role": "user", "content": user_input}]   # not cached
```

    
**2. Not accounting for the 5-minute TTL in low-frequency apps**

If your app makes one request every 10 minutes per user, your cache will frequently expire between calls. You'll be paying cache write costs (1.25x) without getting cache read savings. Either batch requests, warm the cache proactively, or accept that caching isn't the right tool for that workload.
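If you do choose to warm the cache proactively, one simple approach is a keep-alive ping that re-reads the cached prefix before the TTL lapses. This is an illustrative sketch only; each ping is billed as a cache read of the full prefix plus a one-token completion, so check that the pings actually cost less than simply re-writing the cache when the next real request arrives:

```python
import threading

import anthropic

client = anthropic.Anthropic()

def keep_cache_warm(system_blocks, interval_seconds=240):
    """Re-read the cached prefix every ~4 minutes so the 5-minute TTL never lapses.
    system_blocks is the same system list (with cache_control markers) used by real requests."""
    def ping():
        client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1,
            system=system_blocks,
            messages=[{"role": "user", "content": "ping"}],
        )
        threading.Timer(interval_seconds, ping).start()
    ping()
```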
    
**3. Caching content that actually changes**

User preferences, real-time data, session state — these don't belong in the cached prefix. If you cache content that changes across requests, you'll get cache misses or — worse — stale context.

**4. Forgetting to check cache metrics after deployment**

Set up logging from day one. A silent cache miss in production means you're paying full price without knowing it.

**5. Over-caching short prompts**

If your static section is under 500 tokens, the overhead of managing cache writes may not be worth the savings. Run the math for your specific token counts and request volumes.
    
## Advanced Pattern: Combining Caching with Batching

For maximum cost reduction, combine prompt caching with the Messages Batching API:

```python
import anthropic

client = anthropic.Anthropic()

# Collect 10 customer questions before sending
batch_requests = []

for question in customer_questions_queue:
    batch_requests.append({
        "custom_id": f"question-{question.id}",
        "params": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 512,
            "system": [
                {
                    "type": "text",
                    "text": product_manual_text,
                    "cache_control": {"type": "ephemeral"}
                }
            ],
            "messages": [{"role": "user", "content": question.text}]
        }
    })

# Batching gives a 50% discount on API pricing
# Caching gives a 90% discount on cached input tokens
# Combined: very significant cost reduction
batch = client.beta.messages.batches.create(requests=batch_requests)
```

Batching and caching stack independently — the batch discount applies to the overall API pricing tier, while caching reduces the effective token cost within each request.
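As a rough illustration of how the two discounts compound on the cached portion of a batched request (assuming, as described above, that the batch discount applies on top of cache-read pricing):

```python
CACHE_READ_MULTIPLIER = 0.10   # cache hits cost ~10% of the standard input rate
BATCH_MULTIPLIER = 0.50        # the Batches API halves the price of a request

effective = CACHE_READ_MULTIPLIER * BATCH_MULTIPLIER
print(f"Cached tokens in a batched request cost ~{effective:.2f}x the standard rate")  # ~0.05x
```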

## Key Takeaways

  • **Cache write cost is 1.25× normal input; cache read cost is 0.1×** — the write premium is recovered by the first cache hit
  • **5-minute TTL** — design your caching strategy around this for high-frequency workloads
  • **Static prefix first** — cache keys are prefix-based, so put documents/instructions before user content
  • **Measure everything** — `cache_creation_input_tokens` and `cache_read_input_tokens` in the API response tell you exactly what's happening
  • **High ROI scenarios:** large context documents (50K+ tokens), high request volume (100+/day), multi-turn conversations

## Next Steps

Ready to dig deeper into Claude's API features?

If you're preparing for the Claude Certified Architect certification, prompt caching is a topic you need to understand cold — both the implementation details and the cost model. Our CCA Practice Test Bank includes questions on caching strategies, API optimization, and cost management.

Start practicing for free → Take our sample CCA questions and see where your Claude API knowledge stands.
