Automatic Caching: Passive caching that automatically identifies repeated context content, with no change to how you call the API (by contrast, the caching mode that requires explicitly setting parameters in the Anthropic API is called “Explicit Prompt Caching”; see Explicit Prompt Caching (Anthropic API))
Cost Reduction: Input tokens that hit the cache are billed at a lower rate, significantly reducing costs
Speed Improvement: Skips reprocessing of repeated content, speeding up model responses
This mechanism is particularly suitable for the following scenarios:
System prompt reuse: In multi-turn conversations, the system prompt typically remains unchanged
Fixed tool lists: The tools used for a given category of task are usually consistent
Multi-turn conversation history: In long conversations, earlier messages are resent verbatim with every request
Scenarios that meet the above conditions can use the caching mechanism to significantly reduce token consumption and speed up responses, as sketched below.
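For instance, in the multi-turn case, keeping the system prompt fixed and only appending to the history means every request shares a growing, byte-identical prefix. A minimal sketch (using the Anthropic SDK and the same model name as the examples below):

```python
import anthropic

client = anthropic.Anthropic()

# Fixed system prompt: identical on every turn, so it forms a cacheable prefix.
SYSTEM = "You are an AI assistant tasked with analyzing literary works."

history = []
for question in ["Summarize the opening chapter.", "How does the tone shift later on?"]:
    # Append-only history keeps earlier turns byte-identical between requests.
    history.append({"role": "user", "content": question})
    response = client.messages.create(
        model="MiniMax-M2.7",
        system=SYSTEM,
        messages=history,
        max_tokens=1024,
    )
    reply = "".join(block.text for block in response.content if block.type == "text")
    history.append({"role": "assistant", "content": reply})
    print(f"Cache Hit Tokens: {response.usage.cache_read_input_tokens}")
```

The complete examples below make the cache hit easy to observe by sending the same large prompt twice. First, with the Anthropic SDK: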
```python
import anthropic

client = anthropic.Anthropic()

response1 = client.messages.create(
    model="MiniMax-M2.7",
    system="You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.\n",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "<the entire contents of 'Pride and Prejudice'>"
                }
            ]
        },
    ],
    max_tokens=10240,
)

print("First request result:")
for block in response1.content:
    if block.type == "thinking":
        print(f"Thinking:\n{block.thinking}\n")
    elif block.type == "text":
        print(f"Output:\n{block.text}\n")

print(f"Input Tokens: {response1.usage.input_tokens}")
print(f"Output Tokens: {response1.usage.output_tokens}")
print(f"Cache Hit Tokens: {response1.usage.cache_read_input_tokens}")
```
Second Request - Reuse Cache
```python
response2 = client.messages.create(
    model="MiniMax-M2.7",
    system="You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.\n",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "<the entire contents of 'Pride and Prejudice'>"
                }
            ]
        },
    ],
    max_tokens=10240,
)

print("\nSecond request result:")
for block in response2.content:
    if block.type == "thinking":
        print(f"Thinking:\n{block.thinking}\n")
    elif block.type == "text":
        print(f"Output:\n{block.text}\n")

print(f"Input Tokens: {response2.usage.input_tokens}")
print(f"Output Tokens: {response2.usage.output_tokens}")
print(f"Cache Hit Tokens: {response2.usage.cache_read_input_tokens}")
```
The response's usage field reports cache usage: cache_read_input_tokens counts the input tokens served from the cache, so on the second request it should cover the shared prefix.
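As a quick sanity check, you can compute the fraction of input served from the cache. This is a sketch; it assumes that, as in the Anthropic API, input_tokens counts only uncached tokens:

```python
def cache_hit_rate(usage) -> float:
    """Fraction of input tokens served from the cache (0.0 if nothing cached)."""
    # Assumes input_tokens excludes cached tokens, as in the Anthropic usage schema.
    total = usage.input_tokens + usage.cache_read_input_tokens
    return usage.cache_read_input_tokens / total if total else 0.0

print(f"Cache hit rate: {cache_hit_rate(response2.usage):.1%}")
```

The same automatic caching works through the OpenAI-compatible Chat Completions API: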
```python
from openai import OpenAI

client = OpenAI()

response1 = client.chat.completions.create(
    model="MiniMax-M2.7",
    messages=[
        {"role": "system", "content": "You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.\n"},
        {"role": "user", "content": "<the entire contents of 'Pride and Prejudice'>"},
    ],
    # Set reasoning_split=True to separate thinking content into the reasoning_details field
    extra_body={"reasoning_split": True},
)

print("First request result:")
print(f"Response: {response1.choices[0].message.content}")
print(f"Total Tokens: {response1.usage.total_tokens}")
print(f"Cached Tokens: {response1.usage.prompt_tokens_details.cached_tokens if hasattr(response1.usage, 'prompt_tokens_details') else 0}")
```
Second Request - Reuse Cache
```python
response2 = client.chat.completions.create(
    model="MiniMax-M2.7",
    messages=[
        {"role": "system", "content": "You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.\n"},
        {"role": "user", "content": "<the entire contents of 'Pride and Prejudice'>"},
    ],
    # Set reasoning_split=True to separate thinking content into the reasoning_details field
    extra_body={"reasoning_split": True},
)

print("\nSecond request result:")
print(f"Response: {response2.choices[0].message.content}")
print(f"Total Tokens: {response2.usage.total_tokens}")
print(f"Cached Tokens: {response2.usage.prompt_tokens_details.cached_tokens if hasattr(response2.usage, 'prompt_tokens_details') else 0}")
```
The response's usage field reports cache usage: prompt_tokens_details.cached_tokens counts the prompt tokens served from the cache.
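A similar check works here. This is a sketch; it assumes cached_tokens is a subset of prompt_tokens, as in the OpenAI usage schema:

```python
def cached_ratio(usage) -> float:
    """Fraction of prompt tokens served from the cache."""
    # prompt_tokens_details may be absent on some responses; treat that as zero cached.
    details = getattr(usage, "prompt_tokens_details", None)
    cached = details.cached_tokens if details else 0
    return cached / usage.prompt_tokens if usage.prompt_tokens else 0.0

print(f"Cached ratio: {cached_ratio(response2.usage):.1%}")
```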
Caching applies to API calls with 512 or more input tokens
Caching uses prefix matching, with the prefix constructed in the order “tool list → system prompt → messages”; changing content in any earlier part prevents cache hits for everything after it
Place static or repeated content (tool list, system prompt, large reference text) at the beginning of the conversation and dynamic, per-request information at the end to maximize cache utilization (see the sketch after these notes)
Monitor cache performance through the usage tokens returned by the API, and analyze them regularly to refine your prompt-ordering strategy
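Putting these notes together, a minimal sketch (using the OpenAI-compatible client and the same model name as above): the static system prompt and reference text lead the message list, the dynamic question comes last, and each response's usage is printed so cache effectiveness can be tracked over time:

```python
from openai import OpenAI

client = OpenAI()

# Static prefix: identical bytes on every request, so it stays cache-friendly.
STATIC_MESSAGES = [
    {"role": "system", "content": "You are an AI assistant tasked with analyzing literary works."},
    {"role": "user", "content": "<the entire contents of 'Pride and Prejudice'>"},
]

def ask(question: str) -> str:
    # Dynamic content goes last so it never disturbs the cached prefix.
    response = client.chat.completions.create(
        model="MiniMax-M2.7",
        messages=STATIC_MESSAGES + [{"role": "user", "content": question}],
    )
    details = getattr(response.usage, "prompt_tokens_details", None)
    cached = details.cached_tokens if details else 0
    print(f"Cached {cached}/{response.usage.prompt_tokens} prompt tokens")
    return response.choices[0].message.content

ask("What role does irony play in the narration?")
ask("How is social class portrayed?")
```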