Prompt Caching in the Claude API: Saving on Tokens

Author: Pyland
Published: 30.06.2026

Every request to Claude costs money. input_tokens counts everything you send: the system prompt, the conversation history, and the current question. If you make 100 requests with the same 2,000-token system prompt, you pay full price for those same 200,000 tokens over and over.

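Prompt caching stores the processed prefix of a prompt on Anthropic’s side. Writing the cache costs 1.25x the normal input price; later requests that reuse the same prefix pay only 10% of the normal input price for the cached portion. By default a cache entry lives for 5 minutes, and each hit refreshes that lifetime.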
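To get a feel for the numbers, here is a back-of-the-envelope sketch. It assumes the multipliers quoted in this article (1.25x for the first, cache-writing request and 0.1x for each later read) and that every request after the first hits the cache. The function name and the 2,000-token prompt are illustrative, not part of the API.

```python
def cached_prefix_cost(prefix_tokens: int, requests: int,
                       write_mult: float = 1.25, read_mult: float = 0.1) -> float:
    """Billed input-token equivalents for a shared prefix across `requests` calls.

    Assumes the first request writes the cache and all later ones read it.
    """
    if requests <= 0:
        return 0.0
    return prefix_tokens * (write_mult + read_mult * (requests - 1))


prefix_tokens = 2_000  # hypothetical system prompt size
requests = 100

without_cache = prefix_tokens * requests
with_cache = cached_prefix_cost(prefix_tokens, requests)

print(f"Without caching: {without_cache:,} token-equivalents")
print(f"With caching:    {with_cache:,.0f} token-equivalents")
print(f"Saved:           {1 - with_cache / without_cache:.1%}")
```

Under these assumptions the shared prefix costs roughly a ninth of what it would without caching. If requests are spaced so far apart that the cache expires between them, each one pays the 1.25x write price again and the saving disappears.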

How to Enable Caching

Change system from a plain string to a list of content blocks, and add cache_control to the block you want cached:

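import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            # Shortened here; a real prompt must meet the minimum cacheable size
            "text": "You are a Python tutor. You explain things simply...",
            # Everything up to and including this block becomes the cached prefix
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "What is a list comprehension?"}]
)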

Reading Cache Metrics

usage = response.usage
print(f"Regular input tokens:        {usage.input_tokens}")
print(f"Cache written (1.25x price): {usage.cache_creation_input_tokens}")
print(f"Cache read (0.1x price):     {usage.cache_read_input_tokens}")

  • First request: cache_creation_input_tokens > 0 (the cache is created, billed at 1.25x)
  • Subsequent requests: cache_read_input_tokens > 0 (the cache is read, billed at 0.1x)
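The two cases above can be turned into a small check for logs or tests. This is a minimal sketch: cache_status is a hypothetical helper, and the SimpleNamespace objects only stand in for response.usage so the snippet runs without an API call.

```python
from types import SimpleNamespace


def cache_status(usage) -> str:
    """Classify a usage object as 'write', 'read', 'write+read' or 'none'."""
    created = getattr(usage, "cache_creation_input_tokens", None) or 0
    read = getattr(usage, "cache_read_input_tokens", None) or 0
    if created and read:
        return "write+read"  # part of the prefix was reused, a new part was stored
    if created:
        return "write"
    if read:
        return "read"
    return "none"


# Stand-ins for response.usage (token counts are made up).
first = SimpleNamespace(input_tokens=12, cache_creation_input_tokens=1500,
                        cache_read_input_tokens=0)
second = SimpleNamespace(input_tokens=12, cache_creation_input_tokens=0,
                         cache_read_input_tokens=1500)
too_short = SimpleNamespace(input_tokens=300, cache_creation_input_tokens=None,
                            cache_read_input_tokens=None)

for name, usage in [("first", first), ("second", second), ("too_short", too_short)]:
    print(name, "->", cache_status(usage))
```

A result of "none" on a request you expected to hit the cache usually means the prefix changed, the cache expired, or the prompt was below the minimum size.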

Batch Processing — Where Caching Really Pays Off

# Reuses `client` from the first example
texts = ["Text 1", "Text 2", "Text 3"]  # ...and more

# The shared prefix: identical for every request, so it can be cached
SYSTEM = [{"type": "text", "text": "Long system prompt...", "cache_control": {"type": "ephemeral"}}]

total_cache_read = 0
for text in texts:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=SYSTEM,
        messages=[{"role": "user", "content": text}]
    )
    # From the second request onward, the prefix is read from the cache
    total_cache_read += response.usage.cache_read_input_tokens or 0

print(f"Input tokens read from cache (billed at 0.1x): {total_cache_read}")
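For a batch it can be more informative to look at all three counters together. The sketch below assumes you have collected response.usage from every call into a list. summarize_cache_usage is a hypothetical helper, and the fake usage objects exist only so the example runs offline.

```python
from types import SimpleNamespace


def summarize_cache_usage(usages):
    """Sum token counters over many responses and compute the cache-read share."""
    totals = {"input": 0, "cache_write": 0, "cache_read": 0}
    for u in usages:
        totals["input"] += u.input_tokens or 0
        totals["cache_write"] += u.cache_creation_input_tokens or 0
        totals["cache_read"] += u.cache_read_input_tokens or 0
    # input_tokens excludes cached tokens, so total input is the sum of all three.
    all_input = sum(totals.values())
    totals["hit_share"] = totals["cache_read"] / all_input if all_input else 0.0
    return totals


# Made-up batch: one cache write followed by two cache reads.
batch = [
    SimpleNamespace(input_tokens=20, cache_creation_input_tokens=1200, cache_read_input_tokens=0),
    SimpleNamespace(input_tokens=20, cache_creation_input_tokens=0, cache_read_input_tokens=1200),
    SimpleNamespace(input_tokens=20, cache_creation_input_tokens=0, cache_read_input_tokens=1200),
]

totals = summarize_cache_usage(batch)
print(totals)
```

The share of input tokens served from cache grows with the batch size, which is why long batches with one shared prefix benefit the most.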

When the Cache Is Created vs Read

Request   cache_creation   cache_read   Cost
1st       > 0              0            1.25x for the cached portion
2nd+      0                > 0          0.1x for the cached portion

What Can Be Cached

  • System prompt (the most common case)
  • Long documents at the start of messages
  • Few-shot examples
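The second item in the list, a long document at the start of messages, could look roughly like this. The sketch only builds the messages payload. build_document_messages is a hypothetical helper name, and the document text is a placeholder.

```python
def build_document_messages(document: str, question: str) -> list[dict]:
    """Build a messages list where the document block is marked for caching."""
    return [
        {
            "role": "user",
            "content": [
                # Stable part: identical across requests, so it can be cached
                {
                    "type": "text",
                    "text": document,
                    "cache_control": {"type": "ephemeral"},
                },
                # Variable part: comes after the cached block
                {"type": "text", "text": question},
            ],
        }
    ]


messages = build_document_messages(
    document="<long document text goes here>",
    question="Summarize the second chapter.",
)
print(messages[0]["content"][0]["cache_control"])
```

The result is passed as messages= to client.messages.create. The key design point is ordering: whatever changes between requests has to come after the cached block, because the cache matches on an identical prefix.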

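Minimum size for caching: 1024 tokens (for claude-sonnet-4-6). A shorter prompt is processed without caching even if it is marked with cache_control, and both cache counters in usage stay at zero.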
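One way to check the threshold before relying on caching is the SDK's token counting endpoint. The sketch below assumes client is an anthropic.Anthropic() instance and uses client.messages.count_tokens. is_long_enough_to_cache and the constant are hypothetical names, and the 1024 default is just the figure quoted above.

```python
MIN_CACHEABLE_TOKENS = 1024  # threshold quoted above; may differ per model


def is_long_enough_to_cache(client, model: str, system_text: str,
                            min_tokens: int = MIN_CACHEABLE_TOKENS) -> bool:
    """Return True if the system prompt likely meets the caching minimum.

    The count includes a one-word placeholder user message, so it slightly
    overestimates the size of the system prompt itself.
    """
    count = client.messages.count_tokens(
        model=model,
        system=system_text,
        messages=[{"role": "user", "content": "placeholder"}],
    )
    return count.input_tokens >= min_tokens
```

Typical use would be is_long_enough_to_cache(client, "claude-sonnet-4-6", my_system_prompt), where my_system_prompt is your own prompt text. Because of the placeholder message, treat results close to the threshold as uncertain and confirm with the cache counters of a real response.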

Calculating Real Savings

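def calc_cost(usage, model="sonnet"):
    """Return the cost of one request in USD, given response.usage."""
    # USD per 1M tokens: input, output, cache write, cache read
    rates = {"sonnet": (3.0, 15.0, 3.75, 0.3)}
    rate_in, rate_out, rate_cache_write, rate_cache_read = rates[model]
    cost = (
        usage.input_tokens * rate_in +
        usage.output_tokens * rate_out +
        # Cache counters can be None when caching is not used
        (usage.cache_creation_input_tokens or 0) * rate_cache_write +
        (usage.cache_read_input_tokens or 0) * rate_cache_read
    ) / 1_000_000
    return cost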
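To see the saving rather than just the cost, the same rates can be compared against a hypothetical baseline in which every input token is billed at the plain input price. This is a sketch under that assumption. cost_with_and_without_cache is an illustrative name, the rates are copied from calc_cost, and the usage object is a made-up stand-in for response.usage.

```python
from types import SimpleNamespace

# USD per 1M tokens, copied from the rates used in calc_cost
INPUT, OUTPUT, CACHE_WRITE, CACHE_READ = 3.0, 15.0, 3.75, 0.3


def cost_with_and_without_cache(usage):
    """Return (actual_cost, uncached_baseline) in USD for one response."""
    created = usage.cache_creation_input_tokens or 0
    read = usage.cache_read_input_tokens or 0
    actual = (usage.input_tokens * INPUT + usage.output_tokens * OUTPUT
              + created * CACHE_WRITE + read * CACHE_READ) / 1_000_000
    # Baseline: the same tokens, all billed at the regular input rate
    baseline = ((usage.input_tokens + created + read) * INPUT
                + usage.output_tokens * OUTPUT) / 1_000_000
    return actual, baseline


# Made-up cache hit: 2,000 tokens read from cache, 50 fresh input tokens.
hit = SimpleNamespace(input_tokens=50, output_tokens=200,
                      cache_creation_input_tokens=0, cache_read_input_tokens=2000)
actual, baseline = cost_with_and_without_cache(hit)
print(f"actual ${actual:.6f} vs uncached ${baseline:.6f}")
```

Note that output tokens are billed the same either way, so the relative saving on the whole request is smaller than the 90% discount on the cached prefix.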
