A high-performance, maintainable microservice built with the Gin framework that provides an OpenAI-compatible `/v1/chat/completions` endpoint with intelligent prompt compression using RAG (Retrieval-Augmented Generation).
- OpenAI Compatibility: Fully compatible with OpenAI's chat completions API
- Intelligent Compression: Automatically compresses long prompts using semantic search and summarization
- Streaming Support: Supports both streaming and non-streaming responses
- Token Management: Built-in token counting and threshold-based compression
- Comprehensive Logging: Detailed logging with async processing and Loki integration
- Design Patterns: Implements Strategy, Factory, and Decorator patterns for maintainability
- High Performance: Built with Gin framework with proper dependency injection
- Handler Layer: HTTP request handling with OpenAI compatibility
- Logic Layer: Business logic implementation with comprehensive logging
- Strategy Layer: Pluggable prompt processing strategies (direct vs compression)
- Client Layer: External service communication (LLM, Semantic Search)
- Service Layer: Background services (logging, classification, Loki upload)
- Model Layer: Data structures and logging models
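The bootstrap layer ties these layers together through constructor injection. A minimal sketch of what that DI container might look like (the `ServiceContext` name matches the directory comment in the project structure below; every field and constructor here is an illustrative assumption, not the actual code):

```go
// Hypothetical sketch of internal/bootstrap; all names are assumptions.
package bootstrap

import (
	"github.com/zgsm-ai/chat-rag/internal/client"
	"github.com/zgsm-ai/chat-rag/internal/config"
)

// ServiceContext is a simple DI container: built once at startup and
// passed down to handlers and logic, so every layer shares the same
// configured clients instead of constructing its own.
type ServiceContext struct {
	Config         config.Config
	MainLLM        *client.LLMClient      // main chat-completions model
	SummaryLLM     *client.LLMClient      // summarization model for compression
	SemanticClient *client.SemanticClient // codebase semantic search
}

func NewServiceContext(c config.Config) *ServiceContext {
	return &ServiceContext{
		Config:         c,
		MainLLM:        client.NewLLMClient(c.MainModelEndpoint),
		SummaryLLM:     client.NewLLMClient(c.SummaryModelEndpoint),
		SemanticClient: client.NewSemanticClient(c.SemanticApiEndpoint),
	}
}
```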
- Strategy Pattern: Different prompt processing strategies based on token count
- Factory Pattern: Creates appropriate processors and clients
- Decorator Pattern: Middleware for logging and metrics
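As an illustration, the Strategy and Factory patterns might combine roughly as follows. The `PromptProcessor` interface name comes from the extension notes further down; the method signature, types, and `SelectProcessor` factory are assumptions, not the project's actual code:

```go
// Sketch only: a pluggable prompt-processing strategy plus a factory.
package strategy

import "context"

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// PromptProcessor is the strategy interface: each implementation decides
// how the incoming messages become the final prompt for the main model.
type PromptProcessor interface {
	Process(ctx context.Context, msgs []Message) ([]Message, error)
}

// DirectProcessor passes short prompts through unchanged.
type DirectProcessor struct{}

func (p *DirectProcessor) Process(_ context.Context, msgs []Message) ([]Message, error) {
	return msgs, nil
}

// CompressionProcessor would hold the semantic-search and summary-LLM
// clients and run retrieval + summarization (see the pipeline below).
type CompressionProcessor struct{}

func (p *CompressionProcessor) Process(ctx context.Context, msgs []Message) ([]Message, error) {
	// 1) semantic search on the latest user message
	// 2) summarize retrieved context + history with the summary model
	// 3) reassemble: system prompt + summary + latest user message
	return msgs, nil // placeholder
}

// SelectProcessor is the factory: it picks a strategy from the token count.
func SelectProcessor(tokens, threshold int, compressionEnabled bool) PromptProcessor {
	if compressionEnabled && tokens > threshold {
		return &CompressionProcessor{}
	}
	return &DirectProcessor{}
}
```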
- Go 1.22+
```bash
# Clone the repository
git clone https://github.com/zgsm-ai/chat-rag.git
cd chat-rag

# Bootstrap the project (installs tools, generates code, builds)
make bootstrap

# Or step by step:
make setup  # Generate API code and download deps
make build  # Build the application
```
Edit `etc/chat-api.yaml`:
```yaml
Name: chat-rag
Host: 0.0.0.0
Port: 8080

# Model endpoints
MainModelEndpoint: "http://localhost:8000/v1/chat/completions"
SummaryModelEndpoint: "http://localhost:8001/v1/chat/completions"

# Compression settings
TokenThreshold: 5000
EnableCompression: true

# Semantic search
SemanticApiEndpoint: "http://localhost:8002/codebase-indexer/api/v1/semantics"
TopK: 5

# Logging
LogFilePath: "logs/chat-rag.log"
LokiEndpoint: "http://localhost:3100/loki/api/v1/push"
LogBatchSize: 100
LogScanIntervalSec: 60

# Models
SummaryModel: "deepseek-chat"
```
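These keys presumably map onto a struct in `internal/config/config.go`; a plausible shape, inferred from the YAML above rather than from the real file:

```go
// Guessed mirror of etc/chat-api.yaml; see internal/config/config.go
// for the authoritative definition.
package config

type Config struct {
	Name string
	Host string
	Port int

	// Model endpoints
	MainModelEndpoint    string
	SummaryModelEndpoint string

	// Compression settings
	TokenThreshold    int
	EnableCompression bool

	// Semantic search
	SemanticApiEndpoint string
	TopK                int

	// Logging
	LogFilePath        string
	LokiEndpoint       string
	LogBatchSize       int
	LogScanIntervalSec int

	// Models
	SummaryModel string
}
```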
```bash
# Run with default config
make run

# Run with custom config
make run-config CONFIG=path/to/your/config.yaml

# Development mode with auto-reload (requires air)
make install-air
make dev
```
A basic, non-streaming request:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "stream": false
  }'
```
A request that supplies project context for semantic search:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {"role": "user", "content": "Explain this code function"}
    ],
    "client_id": "user123",
    "project_path": "/path/to/project",
    "stream": false
  }'
```
A streaming request:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {"role": "user", "content": "Write a Python function"}
    ],
    "stream": true
  }'
```
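With `stream: true` the endpoint should emit OpenAI-style Server-Sent Events (`data: {...}` chunks terminated by `data: [DONE]`). A minimal Go consumer, assuming that standard framing:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	body := strings.NewReader(`{
		"model": "gpt-3.5-turbo",
		"messages": [{"role": "user", "content": "Write a Python function"}],
		"stream": true
	}`)
	resp, err := http.Post("http://localhost:8080/v1/chat/completions", "application/json", body)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Each event arrives as a "data: <json chunk>" line.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "data: ") {
			continue // skip blank keep-alive lines
		}
		payload := strings.TrimPrefix(line, "data: ")
		if payload == "[DONE]" {
			break
		}
		fmt.Println(payload) // JSON chunk containing the next delta
	}
}
```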
1. Token Analysis: Count the tokens in the incoming messages
2. Threshold Check: If the count exceeds the configured threshold, trigger compression
3. Semantic Search: Query the codebase for relevant context using the latest user message
4. Summarization: Use the summary LLM to compress context + history + query
5. Final Assembly: Combine system prompt + summary + latest user message
6. LLM Generation: Send the assembled prompt to the main model and return the response
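In code, the flow might read roughly like the sketch below, reusing the hypothetical `ServiceContext` and `Message` types from the earlier sketches; `countTokens`, `Search`, and `Summarize` are likewise assumed helpers, not the project's real API:

```go
// Illustrative only; the real flow lives in internal/logic/ and internal/strategy/.
func compressIfNeeded(ctx context.Context, svc *ServiceContext, msgs []Message) []Message {
	// 1) token analysis + 2) threshold check
	if !svc.Config.EnableCompression || countTokens(msgs) <= svc.Config.TokenThreshold {
		return msgs
	}

	// 3) semantic search, driven by the latest user message
	query := msgs[len(msgs)-1].Content
	snippets, err := svc.SemanticClient.Search(ctx, query, svc.Config.TopK)
	if err != nil {
		return msgs // degrade gracefully: fall back to the uncompressed prompt
	}

	// 4) summarization: compress retrieved context + conversation history
	summary, err := svc.SummaryLLM.Summarize(ctx, snippets, msgs)
	if err != nil {
		return msgs
	}

	// 5) final assembly: system prompt + summary + latest user message;
	//    6) generation then proceeds against the main model as usual
	return []Message{
		msgs[0], // assuming msgs[0] is the system prompt
		{Role: "system", Content: "Context summary:\n" + summary},
		{Role: "user", Content: query},
	}
}
```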
1. Request Logging: Log each request with its metrics (non-blocking)
2. File Storage: Write logs to the local file system
3. Background Processing: Periodically scan and classify logs using an LLM
4. Loki Upload: Batch-upload classified logs to Loki with labels
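"Non-blocking" here plausibly means a buffered channel between the request path and a single writer goroutine. A minimal sketch of that idea (not the project's actual `internal/logger` code):

```go
// Hypothetical async logger: enqueue on the request path, write on a
// background goroutine so disk I/O never stalls a chat completion.
package logger

import (
	"encoding/json"
	"os"
)

type AsyncLogger struct {
	ch   chan any
	file *os.File
}

func NewAsyncLogger(path string) (*AsyncLogger, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		return nil, err
	}
	l := &AsyncLogger{ch: make(chan any, 1024), file: f}
	go l.drain()
	return l, nil
}

// Log never blocks: if the buffer is full, the entry is dropped
// rather than delaying the request.
func (l *AsyncLogger) Log(entry any) {
	select {
	case l.ch <- entry:
	default:
	}
}

func (l *AsyncLogger) drain() {
	enc := json.NewEncoder(l.file)
	for entry := range l.ch {
		_ = enc.Encode(entry) // one JSON line per request
	}
}
```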
```
chat-rag/
├── etc/                  # Configuration files
│   └── chat-api.yaml     # Service configuration
├── deploy/               # Deployment configurations
├── internal/             # Internal packages
│   ├── bootstrap/        # Service context (DI container)
│   ├── client/           # External service clients
│   │   ├── llm.go        # LangChain-Go LLM client
│   │   └── semantic.go   # Semantic search client
│   ├── config/           # Configuration structures
│   │   ├── config.go
│   │   └── loader.go
│   ├── handler/          # HTTP handlers
│   ├── logic/            # Business logic
│   ├── model/            # Data models
│   ├── service/          # Background services
│   ├── strategy/         # Prompt arrangement strategy implementations
│   ├── tokenizer/        # Token counting utilities
│   ├── types/            # Generated type definitions
│   ├── utils/            # Utility functions
│   └── logger/           # Logging utilities
├── logs/                 # Log files (created at runtime)
├── Makefile              # Build and development commands
├── main.go               # Application entry point
└── README.md             # This file
```
```bash
make help     # Show all available commands
make build    # Build the application
make run      # Run with default config
make test     # Run tests
make fmt      # Format code
make vet      # Vet code
make clean    # Clean build artifacts
make api-gen  # Regenerate API code
make deps     # Update dependencies
```
```bash
# Docker build
make docker-build

# Build and push Docker image
make docker-release VERSION=v1.0.0
```
- New API Endpoints: Update `api/chat.api` and run `make api-gen`
- New Strategies: Implement the `strategy.PromptProcessor` interface (see the sketch after this list)
- New Clients: Add to `internal/client/` with proper error handling
- New Services: Add to `internal/service/` with lifecycle management
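For the strategies point, a new implementation only needs to satisfy the interface. A hypothetical example, reusing the `Message`/`PromptProcessor` shapes sketched earlier, that truncates history instead of summarizing it:

```go
// TruncationProcessor: a hypothetical alternative strategy that trims
// history instead of summarizing it. Assumes MaxMessages >= 2 and that
// msgs[0] is the system prompt.
type TruncationProcessor struct {
	MaxMessages int
}

func (p *TruncationProcessor) Process(_ context.Context, msgs []Message) ([]Message, error) {
	if len(msgs) <= p.MaxMessages {
		return msgs, nil
	}
	// Keep the system prompt plus the most recent messages.
	kept := append([]Message{msgs[0]}, msgs[len(msgs)-p.MaxMessages+1:]...)
	return kept, nil
}
```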
| Option | Description | Default |
|--------|-------------|---------|
| `TokenThreshold` | Token count that triggers compression | 32000 |
| `TopK` | Number of semantic search results | 5 |
| `LogScanIntervalSec` | Log processing interval (seconds) | 10 |
| `SummaryModel` | Model used for summarization | deepseek-v3 |
- Request/response latencies
- Token counts (original vs compressed)
- Compression ratios
- Error rates
- Semantic search performance
- Model inference times
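One plausible shape for a per-request log entry carrying these metrics (field names are guesses; the real definitions live in `internal/model/`):

```go
// Hypothetical per-request log entry; see internal/model/ for the real one.
type RequestLog struct {
	RequestID        string  `json:"request_id"` // generated via uuid
	ClientID         string  `json:"client_id,omitempty"`
	LatencyMs        int64   `json:"latency_ms"`
	OriginalTokens   int     `json:"original_tokens"`
	CompressedTokens int     `json:"compressed_tokens"`
	CompressionRatio float64 `json:"compression_ratio"` // compressed / original
	SemanticSearchMs int64   `json:"semantic_search_ms"`
	ModelInferenceMs int64   `json:"model_inference_ms"`
	Error            string  `json:"error,omitempty"`
	Category         string  `json:"category,omitempty"` // set by background classification
}
```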
- `code_generation`: Creating new code or projects
- `bug_fixing`: Debugging or fixing issues
- `exploration`: Asking questions about code
- `documentation`: Querying documentation
- `optimization`: Performance improvements
- gin: HTTP web framework
- tiktoken-go: Token counting (with fallback)
- uuid: Request ID generation
- Main LLM: Primary model for chat completions
- Summary LLM: DeepSeek v3 for compression
- Semantic Search: Codebase indexer API
- Loki: Log aggregation and storage