📝 LLM & AI

Prompt Caching in the Claude API: Saving on Tokens

Author

Pyland

📅

Published

30.06.2026

⏱️

Reading time

1 min

👁️

Views

📊

Level

Article

#llm

Every request to Claude costs money. input_tokens includes everything: the system prompt, conversation history, and the current question. If you make 100 requests with the same 500-token system prompt, you’re paying for 50,000 tokens needlessly.

Prompt caching stores part of the prompt on Anthropic’s side. Repeated requests that hit the cached content cost 10% of the normal price.

How to Enable Caching

Change system from a plain string to a list with cache_control:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "Ты — Python-тьютор. Объясняешь просто...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "Что такое list comprehension?"}]
)

Reading Cache Metrics

usage = response.usage
print(f"Обычные входящие токены:    {usage.input_tokens}")
print(f"Создан кеш (1.25x цена):   {usage.cache_creation_input_tokens}")
print(f"Прочитан кеш (0.1x цена):  {usage.cache_read_input_tokens}")

First request — cache_creation_input_tokens > 0 (cache is created, costs 1.25x)
Subsequent requests — cache_read_input_tokens > 0 (cache is read, costs 0.1x)

Batch Processing — Where Caching Really Pays Off

texts = ["Текст 1", "Текст 2", "Текст 3", ...]

SYSTEM = [{"type": "text", "text": "Длинный system prompt...", "cache_control": {"type": "ephemeral"}}]

total_saved = 0
for text in texts:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=SYSTEM,
        messages=[{"role": "user", "content": text}]
    )
    # From the second request onward, the cache is already being read
    saved = response.usage.cache_read_input_tokens
    total_saved += saved

print(f"Сэкономлено токенов через кеш: {total_saved}")

When the Cache Is Created vs Read

Request	cache_creation	cache_read	Cost
1st	> 0	0	1.25x for the cached portion
2nd+	0	> 0	0.1x for the cached portion

What Can Be Cached

System prompt (the most common case)
Long documents at the start of messages
Few-shot examples

Minimum size for caching: 1024 tokens (for claude-sonnet-4-6).

Calculating Real Savings

def calc_cost(usage, model="sonnet"):
    rates = {"sonnet": (3.0, 15.0, 3.75, 0.3)}  # in, out, cache_create, cache_read per 1M
    r = rates[model]
    cost = (
        usage.input_tokens * r[0] +
        usage.output_tokens * r[1] +
        (usage.cache_creation_input_tokens or 0) * r[2] +
        (usage.cache_read_input_tokens or 0) * r[3]
    ) / 1_000_000
    return cost

Prompt Caching in the Claude API: Saving on Tokens

How to Enable Caching

Reading Cache Metrics

Batch Processing — Where Caching Really Pays Off

When the Cache Is Created vs Read

What Can Be Cached

Calculating Real Savings

Your reaction to the article

Embeddings: Coordinates of Text in Semantic Space

RAG: Chatting with Documents via Vector Search

💬 Comments (0)

No comments yet

Similar articles

httpx: A Modern HTTP Client for Python

AI Agents: ReAct Loop and Autonomous Actions

Typer: CLI Applications Without the Boilerplate

Did you like the article?

Similar articles

httpx: A Modern HTTP Client for Python
📅 30.06.2026 👁️ 108

AI Agents: ReAct Loop and Autonomous Actions
📅 30.06.2026 👁️ 107

Typer: CLI Applications Without the Boilerplate
📅 30.06.2026 👁️ 88