A developer-centric approach to managing context and documents in Large Language Model applications
// Cache a document once
"[System Cache: legal_doc] Here's the contract: [DOCUMENT CONTENT]"
// Reference it in conversations  
"[System Cache Reference: legal_doc] What are the key terms?"
"[System Cache Reference: legal_doc] Are there any liability clauses?"
// Clean up when done
"[System Clean Cache: legal_doc]"
- Explicit Control: Developers manage cache lifecycle directly
- Session Isolation: Clean session boundaries prevent cache pollution
- Syntax-Driven: Simple, intuitive syntax for cache operations
- Framework Agnostic: Works with any LLM inference engine
- Mobile Optimized: Particularly valuable for resource-constrained devices
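On the application side, the three directives from the example above can be produced by small string helpers. This is a minimal sketch, not part of the specification: the function names `cache_message`, `reference_message`, and `clean_message` are hypothetical and were chosen only for illustration.

```python
def cache_message(cache_id: str, content: str) -> str:
    """Build a directive that stores `content` under `cache_id`."""
    return f"[System Cache: {cache_id}] {content}"


def reference_message(cache_id: str, question: str) -> str:
    """Build a query that refers to previously cached content by ID."""
    return f"[System Cache Reference: {cache_id}] {question}"


def clean_message(cache_id: str) -> str:
    """Build a directive that removes a single cache entry."""
    return f"[System Clean Cache: {cache_id}]"


# The document is sent once; later messages carry only the ID and the question.
conversation = [
    cache_message("legal_doc", "Here's the contract: [DOCUMENT CONTENT]"),
    reference_message("legal_doc", "What are the key terms?"),
    reference_message("legal_doc", "Are there any liability clauses?"),
    clean_message("legal_doc"),
]
```

Keeping the directive strings behind helpers like these means a later change to the syntax touches one place in the application.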
Current LLM applications face major challenges:
- Bandwidth Waste: Resending large documents with every request
- Cost Inefficiency: Paying to reprocess the same content repeatedly
- Poor Mobile Performance: Memory constraints make document handling difficult
- Cache Opacity: No control over what gets cached or when it's cleared
- Inconsistent APIs: Different caching approaches across inference engines
Explicit cache management through intuitive syntax:
// Session management
"[System Start Session]"                    // Clean session start
"[System Clean Cache]"                      // Clear all caches
// Content caching  
"[System Cache: doc1] [LARGE DOCUMENT]"     // Cache with ID
"[System Cache Reference: doc1] Question?"  // Reference cached content
// Advanced operations
"[System Cache: temp, ttl: 3600] content"   // Cache with expiration
"[System Cache Reference: doc1,doc2] ?"     // Multi-document queries
"[System Clean Cache: doc1,doc2]"           // Selective cleanup
- Core Specification - Complete technical spec
- Getting Started - Quick implementation guide
- API Reference - Full syntax reference
- Use Cases - Common patterns and examples
- Python Reference - Complete Python implementation
- JavaScript/Node.js - Web and Node.js support
- Integration Guides - llama.cpp, vLLM, MLC-LLM
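The directive syntax listed earlier (session start, caching with an optional `ttl`, multi-document references, and selective cleanup) could be parsed along the following lines. This is a sketch under assumptions, not the reference parser: the `Directive` type, the `parse_directive` function, and the rule that a comma-separated argument containing `:` is an option rather than an ID are all guesses based only on the examples shown in this README.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Matches a leading "[System <op>]" or "[System <op>: <args>]" directive.
# "Cache Reference" is listed before "Cache" so the longer name wins.
_DIRECTIVE = re.compile(
    r"^\s*\[System (Start Session|Clean Cache|Cache Reference|Cache)"
    r"(?::\s*([^\]]*))?\]\s*(.*)$",
    re.DOTALL,
)

_OPS = {
    "Start Session": "start_session",
    "Clean Cache": "clean",
    "Cache Reference": "reference",
    "Cache": "cache",
}


@dataclass
class Directive:
    op: str                                   # one of the values in _OPS
    ids: list = field(default_factory=list)   # cache IDs, in order
    options: dict = field(default_factory=dict)  # e.g. {"ttl": "3600"}
    body: str = ""                            # document text or question


def parse_directive(message: str) -> Optional[Directive]:
    """Return the parsed directive, or None for an ordinary message."""
    match = _DIRECTIVE.match(message)
    if match is None:
        return None
    name, raw_args, body = match.group(1), match.group(2) or "", match.group(3)
    ids, options = [], {}
    for part in raw_args.split(","):
        part = part.strip()
        if not part:
            continue
        if ":" in part:
            # Assumption: "key: value" arguments are options such as ttl.
            key, value = part.split(":", 1)
            options[key.strip()] = value.strip()
        else:
            ids.append(part)
    return Directive(_OPS[name], ids, options, body.strip())
```

A `Clean Cache` directive with an empty `ids` list corresponds to clearing all caches, while a non-empty list corresponds to selective cleanup.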
[Developer App]
       ↓
[Cache Parser] ← Processes [System Cache: id] syntax
       ↓
[Cache Manager] ← Stores content + computed KV states
       ↓
[Inference Engine] ← llama.cpp, vLLM, MLC-LLM, etc.
       ↓
[LLM Model] ← Unchanged model weights
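To make the Cache Manager layer in the diagram above more concrete, here is a toy in-memory version. It is a sketch under stated assumptions, not the project's implementation: the class and method names (`CacheManager`, `put`, `get`, `clean`, `build_prompt`) are hypothetical, it stores only document text, and it leaves out the computed KV states because their representation depends on the inference engine. The `ttl` is interpreted as seconds, which the README does not confirm.

```python
import time
from typing import Callable, Optional


class CacheManager:
    """Toy in-memory store keyed by cache ID (hypothetical API)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # cache_id -> (content, expires_at or None)

    def put(self, cache_id: str, content: str, ttl: Optional[float] = None) -> None:
        """Store content, optionally expiring `ttl` seconds from now."""
        expires_at = None if ttl is None else self._clock() + ttl
        self._entries[cache_id] = (content, expires_at)

    def get(self, cache_id: str) -> Optional[str]:
        """Return cached content, or None if missing or expired."""
        entry = self._entries.get(cache_id)
        if entry is None:
            return None
        content, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._entries[cache_id]  # evict lazily on access
            return None
        return content

    def clean(self, *cache_ids: str) -> None:
        """Remove the given IDs, or every entry when called with no IDs."""
        if not cache_ids:
            self._entries.clear()
            return
        for cache_id in cache_ids:
            self._entries.pop(cache_id, None)

    def build_prompt(self, cache_ids: list, question: str) -> str:
        """Expand referenced IDs into the text passed to the engine."""
        parts = []
        for cache_id in cache_ids:
            content = self.get(cache_id)
            if content is None:
                raise KeyError(f"unknown or expired cache id: {cache_id}")
            parts.append(content)
        return "\n\n".join(parts + [question])
```

An engine integration would presumably replace `build_prompt` with a lookup of stored KV states, so that the referenced documents are not re-encoded on each query.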
Document Q&A Example:
- Without caching: 50KB document × 10 queries = 500KB transferred
- With explicit caching: 50KB document sent once + 10 small queries (about 0.5KB each) = ~55KB transferred
- Bandwidth savings: ~90%
- Response time: 2-5x faster when cached KV states are reused
- API costs: Reduced, because the document is processed once rather than on every query
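The bandwidth figures above can be reproduced with a few lines of arithmetic. The sketch below assumes each query is about 0.5KB, which is what the ~55KB total implies. Unlike the 500KB figure above, it also counts the query text in the uncached case, so that total comes out slightly higher. The function name `transferred_kb` is hypothetical.

```python
def transferred_kb(doc_kb: float, query_kb: float, n_queries: int, cached: bool) -> float:
    """Total upload size for n_queries questions about one document."""
    if cached:
        # Document goes over the wire once, then only the questions.
        return doc_kb + n_queries * query_kb
    # Document is resent alongside every question.
    return n_queries * (doc_kb + query_kb)


without_cache = transferred_kb(50, 0.5, 10, cached=False)  # 505.0
with_cache = transferred_kb(50, 0.5, 10, cached=True)      # 55.0
savings = 1 - with_cache / without_cache                   # about 0.89
```

The savings grow with the number of queries per document and shrink as queries get larger relative to the document.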
Current Phase: Core Specification & Reference Implementation
Next Phase: Academic publication (COLM 2025)
Target: Industry adoption across LLM inference engines
We envision a future where:
- Every LLM inference engine supports explicit cache management
- Developers have fine-grained control over context and memory
- Mobile LLM apps can efficiently handle large documents
- Cache management becomes a standard part of LLM application architecture
We welcome contributions! Areas where we need help:
- Reference implementations for different inference engines
- Performance benchmarks and comparisons
- Mobile optimization strategies
- Security and isolation enhancements
- Documentation and examples
See CONTRIBUTING.md for detailed guidelines.
- Discussions: GitHub Discussions
- Issues: Bug reports and feature requests
- Twitter: Updates and announcements [@your-handle]
MIT License - see LICENSE file for details.
This project was inspired by the challenges faced by developers building document-heavy LLM applications and the need for standardized cache management across inference engines.
"Explicit is better than implicit" - The Zen of Python
This project addresses the lack of explicit cache management in current LLM inference systems, providing developers with fine-grained control over context caching for improved performance and cost efficiency.