
Features

  • Automatic Caching: Passive caching that automatically identifies repeated context content, with no change to how you call the API (by contrast, the caching mode that requires explicitly setting parameters in the Anthropic API is called “Explicit Prompt Caching”; see Explicit Prompt Caching (Anthropic API))
  • Cost Reduction: Input tokens that hit the cache are billed at a lower price, significantly saving costs
  • Speed Improvement: Reduces processing time for repeated content, accelerating model response
This mechanism is particularly suitable for the following scenarios:
  • System prompt reuse: In multi-turn conversations, system prompts typically remain unchanged
  • Fixed tool lists: The set of tools used for a given class of tasks is usually the same across requests
  • Multi-turn conversation history: In complex conversations, historical messages often contain a lot of repeated information
In scenarios like these, the caching mechanism can noticeably reduce token consumption and speed up responses, as in the sketch below.
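
For example, in a multi-turn conversation the fixed system prompt plus the already-sent history form a stable prefix, and only the newest user turn is genuinely new input. A minimal sketch of that pattern (the prompt text and follow-up questions are illustrative only):

import anthropic

client = anthropic.Anthropic()  # uses ANTHROPIC_BASE_URL / ANTHROPIC_API_KEY from the environment (see Code Examples below)

SYSTEM = "You are an AI assistant tasked with analyzing literary works."
messages = []

for question in [
    "Summarize the main themes of 'Pride and Prejudice'.",
    "How does Elizabeth Bennet change over the course of the novel?",
]:
    # Each call resends the same system prompt and the full history;
    # that repeated prefix is what passive caching can reuse.
    messages.append({"role": "user", "content": question})
    response = client.messages.create(
        model="MiniMax-M2.1",
        system=SYSTEM,
        messages=messages,
        max_tokens=1024,
    )
    answer = "".join(b.text for b in response.content if b.type == "text")
    messages.append({"role": "assistant", "content": answer})
    print(f"Cache hit tokens: {response.usage.cache_read_input_tokens}")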

Code Examples

Install SDK
pip install anthropic
Environment Variable Setup
export ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic
export ANTHROPIC_API_KEY=${YOUR_API_KEY}
First Request - Establish Cache
import anthropic

client = anthropic.Anthropic()

response1 = client.messages.create(
    model="MiniMax-M2.1",
    system="You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.\n",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "<the entire contents of 'Pride and Prejudice'>"
                }
            ]
        },
    ],
    max_tokens=10240,
)

print("First request result:")
for block in response1.content:
    if block.type == "thinking":
        print(f"Thinking:\n{block.thinking}\n")
    elif block.type == "text":
        print(f"Output:\n{block.text}\n")
print(f"Input Tokens: {response1.usage.input_tokens}")
print(f"Output Tokens: {response1.usage.output_tokens}")
print(f"Cache Hit Tokens: {response1.usage.cache_read_input_tokens}")

Second Request - Reuse Cache
response2 = client.messages.create(
    model="MiniMax-M2.1",
    system="You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.\n",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "<the entire contents of 'Pride and Prejudice'>"
                }
            ]
        },
    ],
    max_tokens=10240,
)

print("\nSecond request result:")
for block in response2.content:
    if block.type == "thinking":
        print(f"Thinking:\n{block.thinking}\n")
    elif block.type == "text":
        print(f"Output:\n{block.text}\n")
print(f"Input Tokens: {response2.usage.input_tokens}")
print(f"Output Tokens: {response2.usage.output_tokens}")
print(f"Cache Hit Tokens: {response2.usage.cache_read_input_tokens}")
The response includes context cache token usage information:
{
    "usage": {
        "input_tokens": 108,
        "output_tokens": 91,
        "cache_creation_input_tokens": 0,
        "cache_read_input_tokens": 14813
    }
}

Important Notes

  • Caching applies to API calls with 512 or more input tokens
  • Caching uses prefix matching: the cached prefix is built in the order “tool list → system prompt → user messages”, so changing the content of any of these parts can reduce cache effectiveness (see the sketch below)
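
A sketch of how that prefix assembly plays out in practice: the tool list and system prompt stay byte-identical across requests, and only the trailing user message changes. The get_weather tool and travel-assistant prompt are made up for illustration, and this assumes the endpoint accepts the standard Anthropic tools parameter:

import anthropic

client = anthropic.Anthropic()

# The tool list and system prompt sit at the front of the cached prefix,
# so keep them identical across requests; only the user message varies.
tools = [
    {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
]
system = "You are a helpful travel assistant."

for city in ["Paris", "Tokyo"]:
    response = client.messages.create(
        model="MiniMax-M2.1",
        system=system,
        tools=tools,
        messages=[{"role": "user", "content": f"What should I pack for {city}?"}],
        max_tokens=1024,
    )
    print(city, response.usage.cache_read_input_tokens)

Note that the shared prefix still has to reach the 512-token minimum before it becomes cacheable, so a short tool list and system prompt may not benefit on their own.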

Best Practices

  • Place static or repeated content (tool list, system prompt, stable user messages) at the beginning of the conversation and put dynamic, user-specific information at the end, so that as much of the prefix as possible can be reused
  • Monitor cache performance through the usage tokens returned by the API and review them regularly to refine your usage strategy, as in the helper sketch below
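
A small helper that turns those usage fields into a per-request cache hit rate (the helper name and the idea of logging a ratio are just one way to do this):

from anthropic.types import Usage

def cache_hit_rate(usage: Usage) -> float:
    """Fraction of prompt tokens that were served from cache for one response."""
    cached = usage.cache_read_input_tokens or 0
    created = usage.cache_creation_input_tokens or 0
    total = usage.input_tokens + cached + created
    return cached / total if total else 0.0

# With the sample usage above (input_tokens=108, cache_read_input_tokens=14813)
# this is roughly 0.99, i.e. almost the entire prompt was served from cache.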

Pricing

Prompt caching uses differentiated pricing:
  • Cache hit tokens: Billed at discounted price
  • New input tokens: Billed at standard input price
  • Output tokens: Billed at standard output price
See the Pricing page for details.
Pricing example:
Assuming a standard input price of $10/1M tokens, a standard output price of $40/1M tokens, and a cache hit price of $1/1M tokens:

Single request token usage details:
- Total input tokens: 50000
- Cache hit tokens: 45000
- New input content tokens: 5000
- Output tokens: 1000

Billing calculation:
- New input content cost: 5000 × 10/1000000 = $0.05
- Cache cost: 45000 × 1/1000000 = $0.045
- Output cost: 1000 × 40/1000000 = $0.04
- Total cost: 0.05 + 0.045 + 0.04 = $0.135

Compared with no caching (50000 × 10/1000000 + 1000 × 40/1000000 = $0.54), this saves 75%.
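
The same arithmetic as a short script, using the assumed example rates above (not actual list prices):

# Assumed example rates in USD per token, taken from the example above.
INPUT_PRICE = 10 / 1_000_000     # standard input
OUTPUT_PRICE = 40 / 1_000_000    # standard output
CACHE_HIT_PRICE = 1 / 1_000_000  # cache hit

def request_cost(total_input: int, cache_hit: int, output: int) -> float:
    """Cost of one request when cache_hit of the total_input tokens came from cache."""
    new_input = total_input - cache_hit
    return new_input * INPUT_PRICE + cache_hit * CACHE_HIT_PRICE + output * OUTPUT_PRICE

with_cache = request_cost(50_000, 45_000, 1_000)   # 0.135
without_cache = request_cost(50_000, 0, 1_000)     # 0.54
print(f"${with_cache:.3f} vs ${without_cache:.2f}, saving {1 - with_cache / without_cache:.0%}")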

Further Reading

Cache Comparison

| | Prompt Caching (Passive) | Explicit Prompt Caching (Anthropic API) |
| --- | --- | --- |
| Usage | Automatically identifies and caches repeated content | Explicitly set cache_control in API |
| Billing | Cache hit tokens billed at discounted price; no additional charge for cache writes | Cache hit tokens billed at discounted price; first-time cache writes incur additional charges |
| Expiration | Expiration time automatically adjusted based on system load | 5-minute expiration, automatically renewed with continued use |
| Supported Models | MiniMax-M2.1 series | MiniMax-M2.1 series, MiniMax-M2 series |