A high-performance, maintainable microservice built with the Gin framework that provides an OpenAI-compatible `/v1/chat/completions` endpoint with intelligent prompt compression using RAG (Retrieval-Augmented Generation).
- OpenAI Compatibility: Fully compatible with OpenAI's chat completions API
- Intelligent Compression: Automatically compresses long prompts using semantic search and summarization
- Streaming Support: Supports both streaming and non-streaming responses
- Token Management: Built-in token counting and threshold-based compression
- Comprehensive Logging: Detailed logging with async processing and Loki integration
- Design Patterns: Implements Strategy, Factory, and Decorator patterns for maintainability
- High Performance: Built with Gin framework with proper dependency injection
- Handler Layer: HTTP request handling with OpenAI compatibility
- Logic Layer: Business logic implementation with comprehensive logging
- Strategy Layer: Pluggable prompt processing strategies (direct vs compression)
- Client Layer: External service communication (LLM, Semantic Search)
- Service Layer: Background services (logging, classification, Loki upload)
- Model Layer: Data structures and logging models
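The bootstrap layer ties these layers together through constructor injection. A minimal sketch of what that DI container might look like (the `ServiceContext` name matches the directory comment in the project structure below; every field and constructor here is an illustrative assumption, not the actual code):

```go
// Hypothetical sketch of internal/bootstrap; all names are assumptions.
package bootstrap

import (
	"github.com/zgsm-ai/chat-rag/internal/client"
	"github.com/zgsm-ai/chat-rag/internal/config"
)

// ServiceContext is a simple DI container: built once at startup and
// passed down to handlers and logic, so every layer shares the same
// configured clients instead of constructing its own.
type ServiceContext struct {
	Config         config.Config
	MainLLM        *client.LLMClient      // main chat-completions model
	SummaryLLM     *client.LLMClient      // summarization model for compression
	SemanticClient *client.SemanticClient // codebase semantic search
}

func NewServiceContext(c config.Config) *ServiceContext {
	return &ServiceContext{
		Config:         c,
		MainLLM:        client.NewLLMClient(c.MainModelEndpoint),
		SummaryLLM:     client.NewLLMClient(c.SummaryModelEndpoint),
		SemanticClient: client.NewSemanticClient(c.SemanticApiEndpoint),
	}
}
```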
- Strategy Pattern: Different prompt processing strategies based on token count
- Factory Pattern: Creates appropriate processors and clients
- Decorator Pattern: Middleware for logging and metrics
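As an illustration, the Strategy and Factory patterns might combine roughly as follows. The `PromptProcessor` interface name comes from the extension notes further down; the method signature, types, and `SelectProcessor` factory are assumptions, not the project's actual code:

```go
// Sketch only: a pluggable prompt-processing strategy plus a factory.
package strategy

import "context"

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// PromptProcessor is the strategy interface: each implementation decides
// how the incoming messages become the final prompt for the main model.
type PromptProcessor interface {
	Process(ctx context.Context, msgs []Message) ([]Message, error)
}

// DirectProcessor passes short prompts through unchanged.
type DirectProcessor struct{}

func (p *DirectProcessor) Process(_ context.Context, msgs []Message) ([]Message, error) {
	return msgs, nil
}

// CompressionProcessor would hold the semantic-search and summary-LLM
// clients and run retrieval + summarization (see the pipeline below).
type CompressionProcessor struct{}

func (p *CompressionProcessor) Process(ctx context.Context, msgs []Message) ([]Message, error) {
	// 1) semantic search on the latest user message
	// 2) summarize retrieved context + history with the summary model
	// 3) reassemble: system prompt + summary + latest user message
	return msgs, nil // placeholder
}

// SelectProcessor is the factory: it picks a strategy from the token count.
func SelectProcessor(tokens, threshold int, compressionEnabled bool) PromptProcessor {
	if compressionEnabled && tokens > threshold {
		return &CompressionProcessor{}
	}
	return &DirectProcessor{}
}
```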
- Go 1.22+
```bash
# Clone the repository
git clone https://github.com/zgsm-ai/chat-rag.git
cd chat-rag

# Bootstrap the project (installs tools, generates code, builds)
make bootstrap

# Or step by step:
make setup  # Generate API code and download deps
make build  # Build the application
```
Edit `etc/chat-api.yaml`:
```yaml
Name: chat-rag
Host: 0.0.0.0
Port: 8080

# Model endpoints
MainModelEndpoint: "http://localhost:8000/v1/chat/completions"
SummaryModelEndpoint: "http://localhost:8001/v1/chat/completions"

# Compression settings
TokenThreshold: 5000
EnableCompression: true

# Semantic search
SemanticApiEndpoint: "http://localhost:8002/codebase-indexer/api/v1/semantics"
TopK: 5

# Logging
LogFilePath: "logs/chat-rag.log"
LokiEndpoint: "http://localhost:3100/loki/api/v1/push"
LogBatchSize: 100
LogScanIntervalSec: 60

# Models
SummaryModel: "deepseek-chat"
```
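These keys presumably map onto a struct in `internal/config/config.go`; a plausible shape, inferred from the YAML above rather than from the real file:

```go
// Guessed mirror of etc/chat-api.yaml; see internal/config/config.go
// for the authoritative definition.
package config

type Config struct {
	Name string
	Host string
	Port int

	// Model endpoints
	MainModelEndpoint    string
	SummaryModelEndpoint string

	// Compression settings
	TokenThreshold    int
	EnableCompression bool

	// Semantic search
	SemanticApiEndpoint string
	TopK                int

	// Logging
	LogFilePath        string
	LokiEndpoint       string
	LogBatchSize       int
	LogScanIntervalSec int

	// Models
	SummaryModel string
}
```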
```bash
# Run with default config
make run

# Run with custom config
make run-config CONFIG=path/to/your/config.yaml

# Development mode with auto-reload (requires air)
make install-air
make dev
```
A basic, non-streaming request:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "stream": false
  }'
```
A request that supplies project context for semantic search:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {"role": "user", "content": "Explain this code function"}
    ],
    "client_id": "user123",
    "project_path": "/path/to/project",
    "stream": false
  }'
```
A streaming request:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {"role": "user", "content": "Write a Python function"}
    ],
    "stream": true
  }'
```
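With `stream: true` the endpoint should emit OpenAI-style Server-Sent Events (`data: {...}` chunks terminated by `data: [DONE]`). A minimal Go consumer, assuming that standard framing:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	body := strings.NewReader(`{
		"model": "gpt-3.5-turbo",
		"messages": [{"role": "user", "content": "Write a Python function"}],
		"stream": true
	}`)
	resp, err := http.Post("http://localhost:8080/v1/chat/completions", "application/json", body)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Each event arrives as a "data: <json chunk>" line.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "data: ") {
			continue // skip blank keep-alive lines
		}
		payload := strings.TrimPrefix(line, "data: ")
		if payload == "[DONE]" {
			break
		}
		fmt.Println(payload) // JSON chunk containing the next delta
	}
}
```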
1. Token Analysis: Count the tokens in the incoming messages
2. Threshold Check: If the count exceeds the configured threshold, trigger compression
3. Semantic Search: Query the codebase for relevant context using the latest user message
4. Summarization: Use the summary LLM to compress context + history + query
5. Final Assembly: Combine system prompt + summary + latest user message
6. LLM Generation: Send the assembled prompt to the main model and return the response
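In code, the flow might read roughly like the sketch below, reusing the hypothetical `ServiceContext` and `Message` types from the earlier sketches; `countTokens`, `Search`, and `Summarize` are likewise assumed helpers, not the project's real API:

```go
// Illustrative only; the real flow lives in internal/logic/ and internal/strategy/.
func compressIfNeeded(ctx context.Context, svc *ServiceContext, msgs []Message) []Message {
	// 1) token analysis + 2) threshold check
	if !svc.Config.EnableCompression || countTokens(msgs) <= svc.Config.TokenThreshold {
		return msgs
	}

	// 3) semantic search, driven by the latest user message
	query := msgs[len(msgs)-1].Content
	snippets, err := svc.SemanticClient.Search(ctx, query, svc.Config.TopK)
	if err != nil {
		return msgs // degrade gracefully: fall back to the uncompressed prompt
	}

	// 4) summarization: compress retrieved context + conversation history
	summary, err := svc.SummaryLLM.Summarize(ctx, snippets, msgs)
	if err != nil {
		return msgs
	}

	// 5) final assembly: system prompt + summary + latest user message;
	//    6) generation then proceeds against the main model as usual
	return []Message{
		msgs[0], // assuming msgs[0] is the system prompt
		{Role: "system", Content: "Context summary:\n" + summary},
		{Role: "user", Content: query},
	}
}
```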
1. Request Logging: Log each request with its metrics (non-blocking)
2. File Storage: Write logs to the local file system
3. Background Processing: Periodically scan and classify logs using an LLM
4. Loki Upload: Batch-upload classified logs to Loki with labels
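"Non-blocking" here plausibly means a buffered channel between the request path and a single writer goroutine. A minimal sketch of that idea (not the project's actual `internal/logger` code):

```go
// Hypothetical async logger: enqueue on the request path, write on a
// background goroutine so disk I/O never stalls a chat completion.
package logger

import (
	"encoding/json"
	"os"
)

type AsyncLogger struct {
	ch   chan any
	file *os.File
}

func NewAsyncLogger(path string) (*AsyncLogger, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		return nil, err
	}
	l := &AsyncLogger{ch: make(chan any, 1024), file: f}
	go l.drain()
	return l, nil
}

// Log never blocks: if the buffer is full, the entry is dropped
// rather than delaying the request.
func (l *AsyncLogger) Log(entry any) {
	select {
	case l.ch <- entry:
	default:
	}
}

func (l *AsyncLogger) drain() {
	enc := json.NewEncoder(l.file)
	for entry := range l.ch {
		_ = enc.Encode(entry) // one JSON line per request
	}
}
```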
```
chat-rag/
├── etc/                  # Configuration files
│   └── chat-api.yaml     # Service configuration
├── deploy/               # Deployment configurations
├── internal/             # Internal packages
│   ├── bootstrap/        # Service context (DI container)
│   ├── client/           # External service clients
│   │   ├── llm.go        # LangChain-Go LLM client
│   │   └── semantic.go   # Semantic search client
│   ├── config/           # Configuration structures
│   │   ├── config.go
│   │   └── loader.go
│   ├── handler/          # HTTP handlers
│   ├── logic/            # Business logic
│   ├── model/            # Data models
│   ├── service/          # Background services
│   ├── strategy/         # Prompt arrangement strategy implementations
│   ├── tokenizer/        # Token counting utilities
│   ├── types/            # Generated type definitions
│   ├── utils/            # Utility functions
│   └── logger/           # Logging utilities
├── logs/                 # Log files (created at runtime)
├── Makefile              # Build and development commands
├── main.go               # Application entry point
└── README.md             # This file
```
```bash
make help     # Show all available commands
make build    # Build the application
make run      # Run with default config
make test     # Run tests
make fmt      # Format code
make vet      # Vet code
make clean    # Clean build artifacts
make api-gen  # Regenerate API code
make deps     # Update dependencies
```
```bash
# Docker build
make docker-build

# Build and push Docker image
make docker-release VERSION=v1.0.0
```
- New API Endpoints: Update `api/chat.api` and run `make api-gen`
- New Strategies: Implement the `strategy.PromptProcessor` interface (see the sketch after this list)
- New Clients: Add to `internal/client/` with proper error handling
- New Services: Add to `internal/service/` with lifecycle management
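For the strategies point, a new implementation only needs to satisfy the interface. A hypothetical example, reusing the `Message`/`PromptProcessor` shapes sketched earlier, that truncates history instead of summarizing it:

```go
// TruncationProcessor: a hypothetical alternative strategy that trims
// history instead of summarizing it. Assumes MaxMessages >= 2 and that
// msgs[0] is the system prompt.
type TruncationProcessor struct {
	MaxMessages int
}

func (p *TruncationProcessor) Process(_ context.Context, msgs []Message) ([]Message, error) {
	if len(msgs) <= p.MaxMessages {
		return msgs, nil
	}
	// Keep the system prompt plus the most recent messages.
	kept := append([]Message{msgs[0]}, msgs[len(msgs)-p.MaxMessages+1:]...)
	return kept, nil
}
```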
| Option | Description | Default |
|--------|-------------|---------|
| `TokenThreshold` | Token count that triggers compression | 32000 |
| `TopK` | Number of semantic search results | 5 |
| `LogScanIntervalSec` | Log processing interval (seconds) | 10 |
| `SummaryModel` | Model used for summarization | deepseek-v3 |
- Request/response latencies
- Token counts (original vs compressed)
- Compression ratios
- Error rates
- Semantic search performance
- Model inference times
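One plausible shape for a per-request log entry carrying these metrics (field names are guesses; the real definitions live in `internal/model/`):

```go
// Hypothetical per-request log entry; see internal/model/ for the real one.
type RequestLog struct {
	RequestID        string  `json:"request_id"` // generated via uuid
	ClientID         string  `json:"client_id,omitempty"`
	LatencyMs        int64   `json:"latency_ms"`
	OriginalTokens   int     `json:"original_tokens"`
	CompressedTokens int     `json:"compressed_tokens"`
	CompressionRatio float64 `json:"compression_ratio"` // compressed / original
	SemanticSearchMs int64   `json:"semantic_search_ms"`
	ModelInferenceMs int64   `json:"model_inference_ms"`
	Error            string  `json:"error,omitempty"`
	Category         string  `json:"category,omitempty"` // set by background classification
}
```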
- `code_generation`: Creating new code or projects
- `bug_fixing`: Debugging or fixing issues
- `exploration`: Asking questions about code
- `documentation`: Querying documentation
- `optimization`: Performance improvements
- gin: HTTP web framework
- tiktoken-go: Token counting (with fallback)
- uuid: Request ID generation
- Main LLM: Primary model for chat completions
- Summary LLM: DeepSeek v3 for compression
- Semantic Search: Codebase indexer API
- Loki: Log aggregation and storage