Prompt caching is a powerful feature that allows you to reuse large, static portions of your prompts across multiple API calls, reducing both costs and latency. Instead of sending the same context with every request, you send it once, cache it, and reference it for subsequent calls.
The result? A 90% cost reduction on cached content after the first request!
Using Claude 3.5 Haiku as an example:
| Scenario | Token Count | Cost per Request | Daily Cost (100 Requests) | Monthly Cost |
|---|---|---|---|---|
| Without Caching | 10,000 | $0.008 | $0.80 | $24.00 |
| With Caching | 10,000 | $0.0008* | $0.08 | $2.40 |
| Savings | - | 90% | $0.72/day | $21.60/month |
*After initial cache write (which costs 25% more than base rate)
- First Request: Your large context is sent and cached (25% premium on token cost)
- Subsequent Requests: Only new content is sent, cached content is referenced (90% discount)
- Cache Duration: 5 minutes by default (refreshes with each use)
- Minimum Size: 1,024 tokens for most Claude models (2,048 for the Haiku models); see the size check below
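Contexts below the minimum are processed normally (the `cache_control` marker is simply ignored), so it can be worth checking size up front. Here's a minimal sketch using the token-counting endpoint of the Anthropic Python SDK; the check itself is not part of this repo's demo script, and the threshold shown assumes a Haiku model:

import anthropic

client = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from the environment

# Illustrative: the large context you plan to cache
with open("data/sample_videos_metadata.json") as f:
    large_context = f.read()

count = client.messages.count_tokens(
    model="claude-3-5-haiku-20241022",
    system=[{"type": "text", "text": large_context}],
    messages=[{"role": "user", "content": "placeholder"}],
)

MIN_CACHEABLE = 2_048  # Haiku-class minimum; 1,024 for Sonnet/Opus-class models
if count.input_tokens < MIN_CACHEABLE:
    print("Context is below the caching minimum; cache_control would be ignored.")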
- Clone the repository:
git clone https://github.com/duanlightfoot/prompt-caching-basics.git
cd prompt-caching-basics
- Install dependencies:
pip install -r requirements.txt
- Set up your environment:
cp .env.example .env
# Edit .env and add your Anthropic API key
- Run the demo:
python prompt_caching_demo.py
prompt-caching-basics/
├── data/
│   └── sample_videos_metadata.json    # Sample data (10 videos, ~8KB)
├── images/
│   └── banner.png                     # Repository banner
├── prompt_caching_demo.py             # Main demonstration script
├── requirements.txt                   # Python dependencies
├── .env.example                       # Environment variables template
├── .gitignore                         # Git ignore file
└── README.md                          # This file
- Runs 4 different queries against the same cached data
- Shows real-time cost calculations
- Displays cache hit/miss status
- Calculates total savings
- Chat with the AI about the video data
- See caching in action with each message
- Watch costs drop after the first message
- Color-coded terminal output
- Clear cache hit/miss indicators
- Real-time cost breakdowns
- Token usage analysis
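The hit/miss indicators and token counts come straight from the `usage` block of each Messages API response (`cache_creation_input_tokens` and `cache_read_input_tokens`). A minimal sketch of that kind of check; the demo script's exact formatting may differ:

def report_cache_status(response) -> None:
    """Print cache hit/miss status and token usage for a Messages API response."""
    usage = response.usage
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0

    if read > 0:
        print("✅ CACHE HIT! Reusing previously cached content")
        print(f"  - Cached tokens read: {read:,}")
    else:
        print("❌ CACHE MISS! Creating new cache entry")
        print(f"  - Cache creation tokens: {created:,}")

    print(f"  - New tokens processed: {usage.input_tokens:,}")
    print(f"  - Output tokens generated: {usage.output_tokens:,}")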
Here's the key implementation:
import anthropic

client = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from the environment
json_data = open("data/sample_videos_metadata.json").read()

response = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": "You are an AI assistant..."  # Small context, no cache
        },
        {
            "type": "text",
            "text": f"# Large Data Context\n{json_data}",  # Large context
            "cache_control": {"type": "ephemeral"}  # ← THE MAGIC PARAMETER
        }
    ],
    messages=[{"role": "user", "content": "Your question here"}]
)
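The breakpoint caches the entire prompt prefix up to and including the marked block, so the small instruction block above it is stored as part of the same cache entry. Any follow-up request within the cache lifetime that resends that prefix unchanged is billed at the discounted rate, which you can confirm from the response's `usage` fields as in the sketch above.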
- Large static contexts (documentation, knowledge bases)
- Repeated queries against the same data
- Conversational AI with consistent system prompts
- Batch processing within 5-minute windows
- Development and testing with the same prompts
- Small prompts (under 1,024 tokens)
- Constantly changing contexts
- One-off queries with unique data
- Infrequent API calls (more than 5 minutes apart)*
*Note: You can use 1-hour caching for less frequent calls (2x base rate)
First Request Cost = tokens × base_rate × 1.25
Cached Request Cost = (new_tokens × base_rate) + (cached_tokens × base_rate × 0.1)
Savings = Original Cost - Cached Cost
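As a sketch, the same formulas in Python. The rate constant is illustrative ($0.80 per million input tokens for Claude 3.5 Haiku at the time of writing); with the demo's token counts it reproduces the figures in the sample output below:

HAIKU_INPUT_RATE = 0.80 / 1_000_000  # USD per input token (illustrative)

def first_request_cost(tokens: int, base_rate: float = HAIKU_INPUT_RATE) -> float:
    # Cache write: full price plus the 25% write premium
    return tokens * base_rate * 1.25

def cached_request_cost(new_tokens: int, cached_tokens: int,
                        base_rate: float = HAIKU_INPUT_RATE) -> float:
    # Cache read: new tokens at full price, cached tokens at 10%
    return new_tokens * base_rate + cached_tokens * base_rate * 0.1

# Roughly the demo's numbers: ~3,180 cached tokens, ~71 new tokens per follow-up
print(f"First request:  ${first_request_cost(3_180):.6f}")      # $0.003180 (matches Request #1 below)
print(f"Cached request: ${cached_request_cost(71, 3_180):.6f}")  # $0.000311 (matches Request #2 below)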
For less frequent API calls, use 1-hour caching:
"cache_control": {"type": "ephemeral", "ttl": "1h"} # 1-hour cache
Cost: 2x base rate to write, but holds for 60 minutes.
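A sketch of where the `ttl` field sits in a full request, reusing the `client` and `json_data` from the earlier example (depending on your account and SDK version, the 1-hour TTL may require a recent release or a beta opt-in; check the Anthropic docs):

response = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": f"# Large Data Context\n{json_data}",
            # Same ephemeral cache, but kept warm for 60 minutes instead of 5
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    messages=[{"role": "user", "content": "Your question here"}],
)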
You can cache different parts independently:
system=[
    {
        "type": "text",
        "text": "Tool definitions...",
        "cache_control": {"type": "ephemeral"}  # Cache tools
    },
    {
        "type": "text",
        "text": "Static instructions...",
        "cache_control": {"type": "ephemeral"}  # Cache instructions
    },
    {
        "type": "text",
        "text": "Dynamic context..."  # Don't cache changing data
    }
]
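Order matters: caching covers the prompt prefix, so put the most stable content first (tools, then static instructions) and leave frequently changing context uncached at the end; editing anything before a breakpoint invalidates the cache entries that follow it.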
==================================================
REQUEST #1 - CACHE ANALYSIS:
==================================================
❌ CACHE MISS! Creating new cache entry
- New tokens processed: 3,251
- Cache creation tokens: 3,180
💰 CACHE CREATION COST:
- One-time cache write cost: $0.003180
- (Future requests will save 90%)
==================================================
REQUEST #2 - CACHE ANALYSIS:
==================================================
✅ CACHE HIT! Reusing previously cached content
- Cached tokens read: 3,180
- New tokens processed: 71
💰 COST BREAKDOWN:
- Without cache: $0.002600
- With cache: $0.000311
- Saved: $0.002289 (88.0%)
- Massive Cost Savings: 90% reduction on repeated API calls
- Improved Latency: Faster responses on cached content
- Better UX: More responsive applications
- Scalability: Make AI features financially viable at scale
- Simple Implementation: One parameter change
- Get an Anthropic API key from console.anthropic.com
- Clone this repository
- Install dependencies
- Add your API key to .env
- Run the demo
- Implement in your own projects
- Save money! 💰
- Anthropic Prompt Caching Documentation
- Claude API Pricing
- OpenAI Prompt Caching
- Blog Post: From $720 to $72 Monthly
Contributions are welcome! Feel free to:
- Open issues for bugs or features
- Submit pull requests
- Share your caching strategies
- Report your cost savings
MIT License - feel free to use this in your projects!
Du'An Lightfoot
- GitHub: @duanlightfoot
- LinkedIn: duanlightfoot
- YouTube: LabEveryday
- Anthropic for implementing prompt caching
- The AI community for sharing cost optimization strategies
- Everyone who's overpaid for API calls (we've all been there!)
Remember: Every API call without caching is money left on the table. Start caching today! 🚀