Description
Use case
Summary
Introduce a prompt caching option that allows caching of the static portion of prompts, documents, and queries to optimize performance, reduce latency, and minimize costs.
Motivation
When building applications that rely on large prompt templates, many parts of the prompt (system instructions, reference documents, metadata, etc.) remain static across multiple requests. Currently, these repeated static tokens are re-sent and re-processed for every query, which:
- Increases latency due to redundant processing.
- Leads to higher costs, since repeated tokens contribute to billable usage.
- Adds unnecessary overhead when only the user's dynamic query changes.
Solution/User Experience
Proposed Solution
- Provide an opt-in cache_control flag (or similar) in the API to enable prompt caching.
- Allow cache checkpoints to be placed in the prompt: these mark the end of the static portion (the prefix) that can be cached.
- Only cache the prefix if it meets a minimum token count requirement (a rough sketch of this check follows the list).
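The last two points could interact roughly as in the sketch below. This is only an illustrative Python sketch under assumptions that are not part of the proposal: the `Message` structure, `count_tokens`, and `MIN_CACHEABLE_TOKENS` are hypothetical names, and the real threshold and tokenizer would be implementation-defined.

```python
# Hypothetical sketch: deciding whether a prompt prefix is cacheable.
# All names here are illustrative, not an existing API.
from dataclasses import dataclass
from typing import Optional

MIN_CACHEABLE_TOKENS = 1024  # assumed minimum prefix size; actual value TBD

@dataclass
class Message:
    role: str
    content: str
    cache_checkpoint: bool = False  # marks the end of the static prefix

def count_tokens(text: str) -> int:
    # Placeholder tokenizer; a real implementation would use the model's tokenizer.
    return len(text.split())

def cacheable_prefix(messages: list[Message]) -> Optional[list[Message]]:
    """Return the static prefix up to the last checkpoint, if it is large enough to cache."""
    checkpoint_idx = None
    for i, msg in enumerate(messages):
        if msg.cache_checkpoint:
            checkpoint_idx = i
    if checkpoint_idx is None:
        return None  # no checkpoint placed, nothing to cache
    prefix = messages[: checkpoint_idx + 1]
    prefix_tokens = sum(count_tokens(m.content) for m in prefix)
    return prefix if prefix_tokens >= MIN_CACHEABLE_TOKENS else None
```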
Key Details
API Sketch
{ "model": "...", "use_prompt_cache": true, "cache_checkpoints": [ { "location": "system", "after_message_index": 0 } ], "messages": [ { "role": "system", "content": "Static instructions..." }, { "role": "system", "content": { "text": "Long document content...", "cache_control": { "type": "checkpoint" } } }, { "role": "user", "content": "User query here" } ] }
Benefits
- Lower latency (skip re-processing static content)
- Reduced cost (fewer billable input tokens)
- Better performance for use cases with large or repeated context (documents, few-shot examples, system prompts)