Anthropic API Compatible Cache Instructions
Quick Start
Here’s a quick example of how to implement prompt caching in the Anthropic-compatible API using a cache_control block:
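A minimal sketch of such a request, assuming an Anthropic-compatible /v1/messages endpoint, Anthropic-style headers, and the {"type": "ephemeral"} form of cache_control; the URL, API key, and book text are placeholders to replace with your own values:

```python
import os

import requests

# Placeholder endpoint and credentials for an Anthropic-compatible Messages API;
# replace the URL, header names, and key source with what your provider documents.
API_URL = "https://<your-anthropic-compatible-endpoint>/v1/messages"
HEADERS = {
    "x-api-key": os.environ["API_KEY"],
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
}

payload = {
    "model": "MiniMax-M2",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are an AI assistant tasked with analyzing literary works.",
        },
        {
            # Large, static content: the entire text of the book.
            "type": "text",
            "text": "<the entire text of the book goes here>",
            # Mark the end of the cacheable prefix.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    "messages": [
        {"role": "user", "content": "Summarize the major themes of this book."}
    ],
}

response = requests.post(API_URL, headers=HEADERS, json=payload)
print(response.json()["usage"])  # cache write/read token counts; see Cache Performance below
```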
In this example, the full text of a book is placed in the prompt and marked with the cache_control parameter. This enables reuse of the large text across multiple API calls without reprocessing it each time. By changing only the user message, you can ask various questions about the book while utilizing the cached content, resulting in faster responses and reduced costs.
How Prompt Caching Works
When you send a request with prompt caching enabled:
- The system checks whether the prompt prefix before the specified cache breakpoint (cache_control) has been cached by a previous request.
- If found, it uses the cached version, significantly reducing processing time and costs.
- If not found, it processes the full prompt and caches it when generating the response.
Prompt caching is particularly useful for:
- Prompts with many examples
- Large amounts of context or background information
- Repetitive tasks with consistent instructions
- Long multi-turn conversations
Supported Models and Pricing
Prompt caching introduces a differentiated pricing structure. The table below shows the price per million tokens for each supported model:

| Model | Input | Output | Prompt Caching Read | Prompt Caching Write |
|---|---|---|---|---|
| MiniMax-M2 | $0.3 / M tokens | $1.2 / M tokens | $0.03 / M tokens | $0.375 / M tokens |
| MiniMax-M2-Stable (High Concurrency, Commercial Use) | $0.3 / M tokens | $1.2 / M tokens | $0.03 / M tokens | $0.375 / M tokens |
The table above reflects the following pricing multipliers for prompt caching:
- Cache write tokens are priced at 1.25 times the base input token price
- Cache read tokens are priced at 0.1 times the base input token price
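As a rough illustration of how these multipliers translate into cost, the sketch below prices a request from its usage counters using the MiniMax-M2 rates above; the token counts are invented for the example:

```python
# Illustrative cost estimate using the MiniMax-M2 prices from the table above.
BASE_INPUT = 0.30 / 1_000_000    # $ per input token
OUTPUT = 1.20 / 1_000_000        # $ per output token
CACHE_WRITE = BASE_INPUT * 1.25  # $0.375 / M tokens
CACHE_READ = BASE_INPUT * 0.10   # $0.03  / M tokens

def request_cost(input_tokens, cache_write_tokens, cache_read_tokens, output_tokens):
    """Estimate the cost of a single request from its usage counts."""
    return (input_tokens * BASE_INPUT
            + cache_write_tokens * CACHE_WRITE
            + cache_read_tokens * CACHE_READ
            + output_tokens * OUTPUT)

# Example: a 100,000-token prefix written to cache once, then read on a later call.
first_call = request_cost(50, 100_000, 0, 500)    # pays the 1.25x write premium
second_call = request_cost(50, 0, 100_000, 500)   # pays only 0.1x for the cached prefix
print(f"first call:  ${first_call:.4f}")   # about $0.0381
print(f"second call: ${second_call:.4f}")  # about $0.0036
```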
How to Implement Prompt Caching
Structuring Your Prompt
Place static, reusable content (tool definitions, system instructions, examples, etc.) at the beginning of your prompt. Mark the end of the cacheable content using the cache_control parameter.
Cache prefixes are created in the following order: tools → system → messages. This order forms a hierarchy where each level builds upon the previous ones.
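A minimal payload sketch of this ordering, with a single breakpoint at the end of the static content (the tool, instructions, and question below are placeholders):

```python
# Static content first (tools, then system), with the breakpoint at the end of it;
# only the final user message changes between calls.
payload = {
    "model": "MiniMax-M2",
    "max_tokens": 1024,
    "tools": [
        {
            "name": "lookup_order",  # placeholder tool definition
            "description": "Look up an order by id.",
            "input_schema": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    ],
    "system": [
        {
            "type": "text",
            "text": "<static instructions, policies, and few-shot examples>",
            # End of the reusable prefix: tools and system are cached together.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    "messages": [
        {"role": "user", "content": "Where is order 12345?"}  # varies per request
    ],
}
```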
Automatic Prefix Checking
You can use just one cache breakpoint at the end of your static content, and the system will automatically find the longest matching prefix. Three core principles:
- Cache content is cumulative: when you mark a block with cache_control, the cache entry is generated from all previous blocks in sequence. This means each cache entry depends on all content that came before it.
- Sequential prefix checking: the system checks for cache hits starting at the explicit cache breakpoint and moving toward earlier blocks, ensuring the longest possible prefix is matched.
- 20-block lookback window: the system checks up to 20 blocks before each explicit cache breakpoint. If no match is found after checking 20 blocks, it stops and moves to the previous explicit breakpoint (if any).
For example, suppose a request contains 30 content blocks and you set cache_control on block 30. If you then make repeated requests:
- If no block content is modified, the system will hit the cache for all content from blocks 1-30.
- If block 25 is modified, the system checks from block 30 toward earlier blocks and finds a match at block 24, so blocks 1-24 will hit the cache.
- If block 5 is modified, the system checks from block 30 back through block 11 (the full 20-block window) without finding a match, so the lookup stops and no cache is hit for this request.
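The matching rules above can be illustrated with a small client-side simulation. This is a conceptual sketch only (a single breakpoint, and it assumes the previous request left reusable cache entries at each block boundary up to its breakpoint), not the server's actual implementation:

```python
import hashlib

LOOKBACK = 20  # blocks checked before each explicit breakpoint

def prefix_key(blocks, end):
    """Key for the cumulative prefix blocks[0:end] (cache content is cumulative)."""
    return hashlib.sha256("\x00".join(blocks[:end]).encode()).hexdigest()

def longest_cached_prefix(blocks, breakpoint_index, cached_keys):
    """Walk from the breakpoint toward earlier blocks for at most LOOKBACK steps
    and return the length of the longest cached prefix (0 = no cache hit)."""
    for end in range(breakpoint_index, max(breakpoint_index - LOOKBACK, 0), -1):
        if prefix_key(blocks, end) in cached_keys:
            return end
    return 0

# Previous request: 30 blocks with cache_control on block 30. For illustration,
# assume it left usable cache entries at every block boundary up to the breakpoint.
blocks = [f"block {i}" for i in range(1, 31)]
cached = {prefix_key(blocks, end) for end in range(1, 31)}

print(longest_cached_prefix(blocks, 30, cached))  # 30: unmodified, the full prefix hits

edited_25 = blocks.copy()
edited_25[24] = "edited block 25"
print(longest_cached_prefix(edited_25, 30, cached))  # 24: blocks 1-24 still hit

edited_5 = blocks.copy()
edited_5[4] = "edited block 5"
print(longest_cached_prefix(edited_5, 30, cached))   # 0: nothing matches within the 20-block window
```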
What Can Be Cached
Most blocks in the request can be designated for caching with cache_control, including:
- Tools: tool definitions in the tools array
- System messages: content blocks in the system array
- Text messages: content blocks in the messages.content array, for both user and assistant turns
- Tool use and tool results: tool_use and tool_result blocks in the messages.content array, for both user and assistant turns
Each of these elements can be marked with cache_control to enable caching for that portion of the request.
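As an illustration of the last two bullet points, the payload sketch below places cache_control on a large tool_result block inside messages.content so that follow-up turns can reuse it; the tool name and content are hypothetical:

```python
# Payload sketch: caching a bulky tool_result block inside the conversation so
# follow-up turns can reuse it without reprocessing. The tool is hypothetical.
payload = {
    "model": "MiniMax-M2",
    "max_tokens": 1024,
    "tools": [
        {
            "name": "fetch_report",
            "description": "Fetch a quarterly report (hypothetical tool).",
            "input_schema": {
                "type": "object",
                "properties": {"quarter": {"type": "string"}},
                "required": ["quarter"],
            },
        },
    ],
    "messages": [
        {"role": "user", "content": "Fetch the quarterly report and summarize it."},
        {
            "role": "assistant",
            "content": [
                {"type": "tool_use", "id": "toolu_01", "name": "fetch_report",
                 "input": {"quarter": "Q3"}},
            ],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": "toolu_01",
                    "content": "<very large report text returned by the tool>",
                    # Cache everything up to and including this tool result.
                    "cache_control": {"type": "ephemeral"},
                },
                {"type": "text", "text": "Now list the top three risks it mentions."},
            ],
        },
    ],
}
```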
Cache Invalidation
Modifications to cached content can invalidate some or all of the cache. As described in Structuring Your Prompt, the cache follows the hierarchy tools → system → messages. Changes at each level invalidate that level and all subsequent levels.
Cache Performance
Monitor cache performance using the following API response fields in the usage object (or in the message_start event when streaming):
- cache_creation_input_tokens: number of tokens written to the cache when creating a new cache entry
- cache_read_input_tokens: number of tokens retrieved from the cache for this request
- input_tokens: number of input tokens not read from or used to create a cache (i.e., tokens after the last cache breakpoint)
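A small helper sketch for inspecting these counters on a parsed, non-streaming response body; the field names are the ones listed above, everything else is illustrative:

```python
def report_cache_usage(response_json: dict) -> None:
    """Print the cache-related counters from a Messages API response body."""
    usage = response_json.get("usage", {})
    written = usage.get("cache_creation_input_tokens", 0)
    read = usage.get("cache_read_input_tokens", 0)
    uncached = usage.get("input_tokens", 0)
    print(f"cache writes:   {written}")
    print(f"cache reads:    {read}")
    print(f"uncached input: {uncached}")
    print(f"total input:    {written + read + uncached}")
    if written == 0 and read == 0:
        print("no caching occurred - check breakpoint placement and prompt stability")

# Example with the kind of values described below:
report_cache_usage({"usage": {"cache_read_input_tokens": 100_000,
                              "cache_creation_input_tokens": 0,
                              "input_tokens": 50}})
```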
Understanding Token Composition
To calculate total input tokens:
Total input tokens = cache_read_input_tokens + cache_creation_input_tokens + input_tokens
Breakdown by position:
- cache_read_input_tokens: tokens before the breakpoint that are already cached (reads)
- cache_creation_input_tokens: tokens before the breakpoint that are being cached now (writes)
- input_tokens: tokens after the last breakpoint (not eligible for caching)
For example, a request that reads a 100,000-token cached prefix and adds a 50-token user message reports:
- cache_read_input_tokens: 100,000
- cache_creation_input_tokens: 0
- input_tokens: 50
- Total input tokens: 100,050
Because the cached prefix is reported separately, input_tokens will typically be much smaller than your total input.
Common Issues
If you’re experiencing unexpected caching behavior:
- Content consistency: verify that cached sections are identical across calls and marked with cache_control in the same locations
- Cache expiration: confirm that calls are made within the cache lifetime (5 minutes)
- Block count limit: for prompts with more than 20 content blocks, add additional cache_control parameters to ensure all content can be cached (the system automatically checks approximately 20 blocks before each breakpoint)
- Inactive cache breakpoints: a call supports up to 4 cache_control parameters. If more than 4 are specified, only the last 4 (counting from the end of the request) will be used
More Examples
The following code examples showcase various prompt caching patterns and demonstrate how to implement caching in different scenarios.
Large context caching example
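A payload sketch of this pattern, with a large legal document cached in the system prompt (document text and question are placeholders; send it as the body of a /v1/messages request as in the Quick Start sketch):

```python
# Payload sketch: a large legal document cached in the system prompt.
payload = {
    "model": "MiniMax-M2",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are an AI assistant specialized in analyzing legal agreements.",
        },
        {
            # Static, large context reused across many questions.
            "type": "text",
            "text": "<full text of the legal document goes here>",
            "cache_control": {"type": "ephemeral"},
        },
    ],
    "messages": [
        # Only this part changes between requests.
        {"role": "user", "content": "What are the termination clauses in this agreement?"}
    ],
}
```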
When set up correctly, you should see the following in the usage response:
First request:
- input_tokens: tokens in the user message only
- cache_creation_input_tokens: tokens in the entire system message, including the legal document
- cache_read_input_tokens: 0 (no cache hit on first request)
Subsequent requests (within the cache lifetime):
- input_tokens: tokens in the user message only
- cache_creation_input_tokens: 0 (no new cache creation)
- cache_read_input_tokens: tokens in the entire cached system message
Caching tool definitions
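A payload sketch of this pattern, with the cache breakpoint on the final tool definition (tool schemas are illustrative):

```python
# Payload sketch: a stable tool set cached by marking the final tool definition.
payload = {
    "model": "MiniMax-M2",
    "max_tokens": 1024,
    "tools": [
        {
            "name": "get_weather",
            "description": "Get the current weather for a given location.",
            "input_schema": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
        # ...any number of additional tools...
        {
            "name": "get_time",
            "description": "Get the current time in a given time zone.",
            "input_schema": {
                "type": "object",
                "properties": {"timezone": {"type": "string"}},
                "required": ["timezone"],
            },
            # Breakpoint on the final tool: all preceding tool definitions
            # become part of the cached prefix.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    "system": "You are a helpful assistant with access to weather and time tools.",
    "messages": [
        {"role": "user", "content": "What's the weather in San Francisco right now?"}
    ],
}
```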
The cache_control parameter is placed on the final tool (get_time) to designate all tools as part of the static prefix. All tool definitions, including get_weather and any other tools defined before get_time, will be cached as a single prefix. This approach is ideal when you have a consistent set of tools to reuse across multiple requests without reprocessing them each time.
First request:
- input_tokens: tokens in the user message
- cache_creation_input_tokens: tokens in all tool definitions and system prompt
- cache_read_input_tokens: 0 (no cache hit on first request)
Subsequent requests (within the cache lifetime):
- input_tokens: tokens in the user message
- cache_creation_input_tokens: 0 (no new cache creation)
- cache_read_input_tokens: tokens in all cached tool definitions and system prompt
Ongoing multi-turn conversation
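A payload sketch of this pattern for the third turn of a conversation (the conversation content is illustrative):

```python
# Payload sketch: incremental caching in an ongoing conversation. The system
# prompt keeps its own breakpoint; the final message carries a breakpoint that
# moves forward as the conversation grows.
payload = {
    "model": "MiniMax-M2",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "<static instructions and reference material>",
            # Re-cached automatically on the next request if it ever expires.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    "messages": [
        {"role": "user",
         "content": [{"type": "text", "text": "Hello, can you help me plan a trip?"}]},
        {"role": "assistant",
         "content": [{"type": "text", "text": "Of course - where would you like to go?"}]},
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Somewhere warm in December, ideally with good food.",
                    # Breakpoint on the latest turn: everything up to here is cached,
                    # and breakpoints from earlier requests still produce hits.
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
    ],
}
```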
In this example, the final block of the latest message is marked with cache_control to enable incremental caching of the conversation. The system automatically looks up and uses the longest previously cached prefix for subsequent messages. Blocks previously marked with cache_control don't need to be marked again; they will still result in cache hits (and cache refreshes) if accessed within 5 minutes.
Note that cache_control is also placed on the system message. This ensures that if it gets evicted from the cache (after not being used for more than 5 minutes), it will be re-cached on the next request.
This approach is ideal for maintaining context in ongoing conversations without repeatedly processing the same information.
When set up correctly, you should see the following in the usage response for each request:
- input_tokens: tokens in the new user message (typically minimal)
- cache_creation_input_tokens: tokens in the new assistant and user turns
- cache_read_input_tokens: tokens in the conversation up to the previous turn
Comprehensive use: Multiple cache breakpoints
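A payload sketch that uses all four available breakpoints to cache the tool definitions, a large system context, the older conversation history, and the latest turn independently (tool and content are illustrative):

```python
# Payload sketch: four breakpoints caching tools, system context, older history,
# and the latest turn independently (4 is the per-request cache_control limit).
payload = {
    "model": "MiniMax-M2",
    "max_tokens": 1024,
    "tools": [
        {
            "name": "search_documents",
            "description": "Search the indexed document collection (hypothetical tool).",
            "input_schema": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
            # Breakpoint 1: stable tool definitions.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    "system": [
        {"type": "text", "text": "You answer questions using the retrieved documents."},
        {
            "type": "text",
            "text": "<large retrieved document context goes here>",
            # Breakpoint 2: large static context.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    "messages": [
        {"role": "user",
         "content": [{"type": "text", "text": "<earlier turns of the conversation>"}]},
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": "<assistant reply ending the stable part of the history>",
                    # Breakpoint 3: conversation history up to the previous turn.
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What does the latest document say about renewal terms?",
                    # Breakpoint 4: the newest turn, cached for the next request.
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
    ],
}
```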
This comprehensive approach is ideal for:
- RAG applications with large document contexts
- Agent systems that use multiple tools
- Long-running conversations that maintain context
- Applications that need to optimize different parts of the prompt independently
OpenAI API Compatible Cache Instructions
Coming soon, stay tuned






