Claude Cache Diagnostics: Finally Know Why Your Prompt Cache Is Missing

If you've ever watched cache_read_input_tokens drop to zero between turns — with no error, no log, no indication of what changed — you know the frustration. Claude's prompt cache saves up to 90% on repeated input tokens, but only when your prompt prefix is byte-for-byte identical to the previous request. One reordered tool definition, one timestamp interpolated into your system prompt, or one edit to an earlier message is enough to silently bust the cache.

Anthropic just shipped the fix: cache diagnostics, now in public beta. Pass the ID of your previous response, and the API tells you exactly where the two requests diverged — model, system prompt, tools, or messages — so you can fix the root cause instead of guessing.

This guide covers everything you need to enable it, read the results, and stop losing money to silent cache misses.

What Is Claude Cache Diagnostics?

Cache diagnostics is a new beta feature on the Claude API that compares two consecutive requests and reports the first point of divergence in their prompt prefixes. It launched as part of the Claude Platform improvements announced at Code with Claude London (May 19, 2026).

The feature works by storing a lightweight fingerprint of each request, keyed by the response id. On your next request, you pass that id as diagnostics.previous_message_id. The API rebuilds a fingerprint for the new request, compares it against the stored one, and attaches a diagnostics object to the response describing the earliest divergence.

Important constraints to know upfront:

Available on the Claude API only — not Amazon Bedrock or Vertex AI
Requires the beta header: anthropic-beta: cache-diagnosis-2026-04-07
Fingerprints contain only cryptographic hashes and token-count estimates — never raw prompt content
Scoped to your organization and workspace; fingerprints expire after a short time
Zero Data Retention (ZDR) eligible

Without cache diagnostics, the only signal that something went wrong was usage.cache_read_input_tokens dropping to zero. Now you get a precise cache_miss_reason with the type of divergence and an estimate of how many tokens were lost.

The 6 Cache Miss Reason Types — And What to Do About Each

When cache diagnostics detects a divergence, it returns a cache_miss_reason object with a type field and a cache_missed_input_tokens estimate. The API reports only the first divergence point — fix it, then check again for downstream issues.

Type	What happened	How to fix it
`model_changed`	The `model` field differed between turns — common with routers, A/B tests, or fallback logic	Lock the model constant for the lifetime of a cached conversation
`system_changed`	The `system` prompt changed — usually a timestamp, request ID, or user ID was interpolated	Make the system prompt a byte-stable constant; move dynamic data into the first `user` message after your cache breakpoint
`tools_changed`	The `tools` array was added to, removed from, reordered, or serialized non-deterministically	Send the same tool list in a fixed order on every turn; sort keys before JSON serialization
`messages_changed`	The model/system/tools matched, but an earlier message was edited, reordered, or truncated instead of appended	Treat conversation history as append-only; echo assistant `content` and tool results back verbatim
`previous_message_not_found`	No fingerprint exists for the supplied ID — the prior request may have lacked the beta header, came from a different workspace, or the fingerprint expired	Send the beta header on every turn; keep consecutive turns close together in time
`unavailable`	Diagnostic info was not available — often because `tool_choice`, `thinking`, `context_management`, or other prompt-affecting parameters changed, or the conversation is very long	Keep all prompt-affecting parameters constant for the full conversation lifetime

The cache_missed_input_tokens field gives you a magnitude estimate of tokens lost — treat it as directional, not a billing number, since it's derived from byte lengths before tokenization.

How to Enable Cache Diagnostics: Step-by-Step

Step 1: Add the Beta Header

Every request that participates in diagnostics must include:

anthropic-beta: cache-diagnosis-2026-04-07

Step 2: Set Up the First Turn

On the first turn, pass previous_message_id: null to opt in without a prior message to compare against:

pythonimport anthropic

client = anthropic.Anthropic()

SYSTEM = "You are a document analysis assistant. <document>...</document>"

# Turn 1: establish the cache, opt in to diagnostics
r1 = client.beta.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},
    system=SYSTEM,
    messages=[{"role": "user", "content": "Summarize section 1."}],
    diagnostics={"previous_message_id": None},
    betas=["cache-diagnosis-2026-04-07"],
)
print(f"Turn 1 ID: {r1.id}")

Step 3: Chain Subsequent Turns

Pass the previous response's id as previous_message_id on every subsequent turn:

python# Turn 2: reference the previous response ID
r2 = client.beta.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},
    system=SYSTEM,
    messages=[
        {"role": "user", "content": "Summarize section 1."},
        {"role": "assistant", "content": r1.content},
        {"role": "user", "content": "Now summarize section 2."},
    ],
    diagnostics={"previous_message_id": r1.id},
    betas=["cache-diagnosis-2026-04-07"],
)

# Check the result
diagnostics = r2.diagnostics
if diagnostics is None:
    print("No divergence detected — cache is working.")
elif diagnostics.cache_miss_reason is None:
    print("Comparison still pending — check next turn.")
else:
    print(f"Cache miss reason: {diagnostics.cache_miss_reason.type}")
    print(f"Tokens missed: {diagnostics.cache_miss_reason.cache_missed_input_tokens}")

TypeScript Example

typescriptimport Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const SYSTEM = "You are a document analysis assistant. <document>...</document>";

// Turn 1
const r1 = await client.beta.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 1024,
  cache_control: { type: "ephemeral" },
  system: SYSTEM,
  messages: [{ role: "user", content: "Summarize section 1." }],
  diagnostics: { previous_message_id: null },
  betas: ["cache-diagnosis-2026-04-07"]
});

// Turn 2
const r2 = await client.beta.messages.create({
  model: "claude-opus-4-7",
  max_tokens: 1024,
  cache_control: { type: "ephemeral" },
  system: SYSTEM,
  messages: [
    { role: "user", content: "Summarize section 1." },
    { role: "assistant", content: r1.content },
    { role: "user", content: "Now summarize section 2." }
  ],
  diagnostics: { previous_message_id: r1.id },
  betas: ["cache-diagnosis-2026-04-07"]
});

if (r2.diagnostics === null) {
  console.log("Cache working correctly.");
} else if (r2.diagnostics.cache_miss_reason === null) {
  console.log("Comparison pending.");
} else {
  console.log(`Miss reason: ${r2.diagnostics.cache_miss_reason.type}`);
}

Infographic listing the four reasons Claude prompt caches miss: model change, system prompt drift, reordered tool definitions, and edited earlier messages

Reading Diagnostics Alongside Usage Data

Cache diagnostics answers "did my request change?" while usage.cache_read_input_tokens answers "did the cache hit?" Combining them gives you the full picture:

`diagnostics` result	`cache_read_input_tokens`	What it means
`null`	High	Working perfectly. Prefix is stable, cache hit.
`null`	Low or zero	Cache expired. Your requests match but the entry was evicted. Consider the 1-hour cache TTL option.
`*_changed` type	Low or zero	Your bug. The request changed; fix the issue indicated by `type`.
`*_changed` type	High	Minor issue. Change occurred late in the prompt but an earlier breakpoint still hit. Worth fixing, low urgency.

The most common scenario for teams adding diagnostics for the first time is system_changed. The usual culprit: something like f"You are an assistant. Today is {datetime.now()}." in the system prompt. Timestamps, user IDs, trace IDs, and session tokens are all cache killers when embedded in the system prompt. Move them to the first user message instead.

Multi-Turn Conversation Loop Pattern

For production applications with long conversations, thread previous_message_id through a loop and log any divergences:

pythonimport anthropic

client = anthropic.Anthropic()
SYSTEM = "You are a document analysis assistant. <document>...</document>"

messages = []
prev_id = None

for i, user_message in enumerate([
    "Summarize section 1.",
    "Now section 2.",
    "Now section 3."
]):
    messages.append({"role": "user", "content": user_message})

    r = client.beta.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        cache_control={"type": "ephemeral"},
        system=SYSTEM,
        messages=messages,
        diagnostics={"previous_message_id": prev_id},
        betas=["cache-diagnosis-2026-04-07"],
    )

    # Log any cache misses immediately
    if r.diagnostics is not None and r.diagnostics.cache_miss_reason is not None:
        reason = r.diagnostics.cache_miss_reason
        print(f"[Turn {i+1}] Cache miss: {reason.type} "
              f"(~{reason.cache_missed_input_tokens} tokens lost)")

    messages.append({"role": "assistant", "content": r.content})
    prev_id = r.id
    
print(f"Cache tokens used: {r.usage.cache_read_input_tokens}")

Run this in your staging environment against representative conversation flows before deploying. You'll often find issues in minutes that would have taken hours of log analysis to track down.

Why This Matters for Your API Bill

Prompt caching on Claude saves up to 90% on repeated input tokens — a write costs 25% extra, but reads cost just 10% of the base price. For applications with large system prompts or long document contexts, the savings compound fast:

A 50,000-token system prompt at claude-opus-4-7 pricing: ~$0.75 per uncached request vs ~$0.075 per cached read
In a 100-turn conversation: $75 uncached vs $7.50 cached — a 90% cost reduction
One silent system_changed miss per turn across 1,000 daily users: $67,500 in unnecessary spend per month

Cache diagnostics turns a debugging nightmare into a five-minute fix. Enable it in staging, find the divergence type, fix the root cause, and confirm with the next run.

Key Takeaways

Claude cache diagnostics is in public beta — enable it with anthropic-beta: cache-diagnosis-2026-04-07
Pass diagnostics.previous_message_id on every turn: null on the first, the previous response ID on all subsequent turns
The API returns one of six cache_miss_reason types: model_changed, system_changed, tools_changed, messages_changed, previous_message_not_found, or unavailable
system_changed is the most common culprit — timestamps and dynamic values in your system prompt will bust the cache every time
cache_missed_input_tokens gives you a rough estimate of tokens lost per miss — use it to prioritize fixes
The feature qualifies for Zero Data Retention: only hashes and token counts are stored, never raw prompt content
Currently Claude API only — not available on Amazon Bedrock or Vertex AI

Next Steps

If you're building production applications on the Claude API, understanding prompt caching deeply is one of the highest-leverage optimizations available. Our Claude API Prompt Caching Guide covers the fundamentals — breakpoint placement, TTL options, multi-breakpoint strategies, and how to structure long-document workflows for maximum cache efficiency.

Ready to go deeper on Claude API development? The AI for Anything CCA-F Practice Test Bank includes hands-on questions covering the Claude Platform, caching strategies, agent architecture, and everything else on the Claude Certified Architect exam. Start with a free sample and see how your API knowledge stacks up.

Sources: Cache Diagnostics — Claude API Docs · Prompt Caching — Claude API Docs · Claude Platform Release Notes · Code with Claude London 2026

Claude Cache Diagnostics: Debug Prompt Cache Misses and Slash API Costs