My Anthropic API bill dropped 70 percent last month and I did not change a single model. I changed where the cache breakpoints went. Here are the core patterns I now use on every Claude integration I ship.
Pattern 1: Cache The System Prompt First
The system prompt is the cheapest win and most people skip it. My agents run with a 4,000 token system prompt that explains the role, the output format, the safety rules, and a few examples. That prompt never changes inside a session. Before caching, I paid full input price for those 4,000 tokens on every single call. With an agent that loops 30 times to finish a task, that is 120,000 tokens of pure repetition.
The fix is one parameter. I add a cache_control block with type: "ephemeral" to the last content item in the system prompt array. The first call writes the cache and costs slightly more (cache writes carry a small premium). Every call after that reads the cache at roughly one-tenth the input price.
Here is the rule I follow: the cached block has to be at least 1,024 tokens for Claude Sonnet, or it gets ignored silently. My 4,000 token prompt clears that easily. If your system prompt is short, this pattern does nothing, so do not bother adding the breakpoint to a 200 token instruction.
The order matters more than people expect. The cache works as a prefix. Everything before the breakpoint gets stored. Everything after it is read fresh. So I put the stable stuff (role, rules, examples) up top and the volatile stuff (user query, current date) down below the breakpoint. Reorder this wrong and your cache hit rate collapses because the prefix changes on every call.
One real number from my logs: a document-classification job that runs 2,000 times a day. The system prompt is 3,800 tokens. Caching it saved me around 6.8 million billed input tokens daily. That is the single largest line item I cut. If you only do one thing from this article, cache the system prompt.
Pattern 2: Cache Tool Definitions Separately
Tool definitions are the second forgotten cost. If you run Claude as an agent with function calling, your tool schemas can be huge. I have one agent with 14 tools and the JSON schemas alone come to 9,200 tokens. Those definitions sit at the very front of the request, before the system prompt, before anything else. They are also completely static across a session.
You can place a cache_control breakpoint right after the last tool definition. Now the tool block and the system prompt both get cached as separate prefixes, allowing you to maximize efficiency across multi-turn sessions.
[AgentUpdate Depth Analysis] Prompt caching is not merely a cost-saving feature; it represents a fundamental shift in the architectural feasibility of multi-turn AI Agent workflows. By offering explicit cache control via ephemeral breakpoints, Anthropic grants developers surgical precision compared to OpenAI's automatic, black-box caching. In agentic loops and MCP-heavy environments, tool schemas and system instructions often dominate over 80% of the token payload. Stratifying caches enables high-frequency, long-context agents to run with drastically reduced latency (TTFT) and operational costs. This architectural paradigm shift signals that the future of LLM application development is transitioning from naive prompt engineering to highly optimized state and context management engineering.