Highly configurable API with multiple LLM inference backends and persistent storage.
- Start-up
- Generation Playground
- Assistant Playground
- Semantic Search Playground
- Tuning Playground
- Architecture
## Start-up

```bash
# Install & activate
uv sync && source .venv/bin/activate

# Download a model
python scripts/download_model.py transformers google/gemma-3-270m --output-dir ./models/gemma-3-270m

# Start the backend
python main.py --config config.yaml

# Start the frontend
cd frontend
npm install
npm run dev
```

- Backend API: http://localhost:8000 | Docs: http://localhost:8000/docs
- Frontend UI: http://localhost:5173
Development (`config.development.yaml`):

```yaml
database:
  url: "sqlite+aiosqlite:///./data/conversations.db"
  echo: true  # SQL query logging
logging:
  level: "DEBUG"
  enable_console: true
  enable_file: true
  json_format: false
```
Production (`config.production.yaml`):

```yaml
database:
  url: "${DATABASE_URL}"
  echo: false
  pool_size: 20
logging:
  level: "INFO"
  enable_console: false
  enable_file: true
  json_format: true
```
| Backend | Best For | Features |
|---|---|---|
| Transformers | Research, wide compatibility | Quantization, GPU acceleration |
| llama.cpp | CPU inference, low memory | GGUF format, hybrid CPU/GPU |
| vLLM | Production, high throughput | PagedAttention, tensor parallel |
| Ollama | Easy setup, model management | Built-in downloads, streaming |
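For a feel of what two of these rows mean in code, here is how the underlying libraries are typically driven directly; this is a sketch of the libraries' own APIs, not this project's wrapper classes, and the GGUF path is a placeholder.

```python
# Library-level sketch: Transformers vs. llama.cpp (via llama-cpp-python).
from transformers import pipeline

# Transformers: broad model support, quantization, GPU acceleration.
pipe = pipeline("text-generation", model="./models/gemma-3-270m")
print(pipe("Hello", max_new_tokens=32)[0]["generated_text"])

# llama.cpp: GGUF files on CPU, with optional hybrid offload of layers to GPU.
from llama_cpp import Llama

llm = Llama(model_path="./models/model.gguf", n_gpu_layers=20)  # placeholder path
print(llm("Hello", max_tokens=32)["choices"][0]["text"])
```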
## Generation Playground

- React UI (TypeScript, Vite, Tailwind CSS)
- Text generation (text-completion interface)
- Continue and expand writing with AI assistance
```bash
# Text generation
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "parameters": {"temperature": 0.7}}'

# Stateless chat
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hi"}]}'
```
## Assistant Playground

- Real-time conversations with message history
- Session-based conversations (persistent SQL database storage)
- Conversation management (list, view, and delete conversation sessions)
```bash
# Start a conversation
SESSION_ID=$(curl -s -X POST http://localhost:8000/conversations/start \
  -H "Content-Type: application/json" \
  -d '{"system_prompt": "You are helpful"}' | jq -r '.session_id')

# Send a message
curl -X POST http://localhost:8000/conversations/$SESSION_ID/message \
  -H "Content-Type: application/json" \
  -d '{"message": "Hello!", "parameters": {"max_tokens": 100}}'

# Get the history
curl http://localhost:8000/conversations/$SESSION_ID
```
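The same flow from Python, using the `requests` library; the endpoints and field names mirror the curl calls above, while the exact response shapes are assumptions.

```python
# Conversation flow against the endpoints shown above (sketch, not an official client).
import requests

BASE = "http://localhost:8000"

# Start a session and keep its id; history is stored server-side.
session_id = requests.post(
    f"{BASE}/conversations/start",
    json={"system_prompt": "You are helpful"},
).json()["session_id"]

# Send a message within the session.
reply = requests.post(
    f"{BASE}/conversations/{session_id}/message",
    json={"message": "Hello!", "parameters": {"max_tokens": 100}},
)
print(reply.json())

# Fetch the stored history.
print(requests.get(f"{BASE}/conversations/{session_id}").json())
```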
- Multiple Backends (Transformers, llama.cpp, vLLM, Ollama)
- Dual Modes (text completion + assistant chat)
- Logging (File rotation, JSON format, configurable levels)
- Plug & Play (Config-driven model/backend switching)
## Semantic Search Playground

- RAG (Retrieval-Augmented Generation) with semantic search over ChromaDB
- Document processing (PDF, Word, TXT file upload and chunking)
- Semantic search over documents and conversations
```bash
# Upload a document
curl -X POST http://localhost:8000/rag/upload \
  -F "file=@document.pdf"

# Search documents and conversations
curl -X POST http://localhost:8000/rag/search \
  -H "Content-Type: application/json" \
  -d '{"query": "machine learning concepts", "limit": 10}'

# List uploaded documents
curl http://localhost:8000/rag/documents
```
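Conceptually, upload and search map onto a ChromaDB collection as sketched below; the collection name and chunking parameters are illustrative assumptions, not this project's actual values.

```python
# Sketch of the vector-store side of /rag/upload and /rag/search using ChromaDB.
import chromadb

client = chromadb.PersistentClient(path="./data/chroma")  # vector store under ./data
collection = client.get_or_create_collection("documents")

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split extracted text into overlapping chunks for indexing."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

# Upload path: extract text from the PDF/Word/TXT file, chunk it, index it.
text = "..."  # placeholder for text extracted from an uploaded file
chunks = chunk(text)
collection.add(
    documents=chunks,
    ids=[f"document.pdf-{i}" for i in range(len(chunks))],
    metadatas=[{"filename": "document.pdf", "chunk_index": i} for i in range(len(chunks))],
)

# Search path: embed the query and return the nearest chunks.
results = collection.query(query_texts=["machine learning concepts"], n_results=10)
print(results["documents"][0])
```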
## Tuning Playground

Frontend:
- Dataset uploader
- Training config selector (LoRA vs. QLoRA vs. full fine-tuning)
- Progress dashboard

Backend:
- Job queue (FastAPI background tasks / Celery)
- Training pipeline (Hugging Face transformers, peft, accelerate)
- Model registry with versioning

Serving:
- Dynamically load fine-tuned models
- Let the user switch between base and tuned versions

A dedicated tab where users can:
- Upload a few examples
- Choose a tuning method
- Preview model behavior before and after tuning

In effect, this gives the user a sandbox to test model behavior; see the sketch below.
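A minimal fine-tuning and serving sketch with Hugging Face transformers and peft, assuming the quick-start Gemma model; the adapter path and target modules are illustrative, and the training loop itself is elided.

```python
# LoRA fine-tuning and serving sketch (transformers + peft); not this
# project's pipeline, just the library calls it would build on.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, PeftModel, get_peft_model

BASE = "./models/gemma-3-270m"              # model from the quick-start step
ADAPTER_DIR = "./models/gemma-3-270m-lora"  # hypothetical adapter output path

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# Wrap the base model so only small low-rank adapter matrices are trainable.
lora = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; model-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of all weights

# ... train with a standard transformers Trainer / accelerate loop ...
model.save_pretrained(ADAPTER_DIR)  # persists only the adapter weights

# Serving: reload the base model and attach the tuned adapter on demand,
# which is what makes switching between base and tuned versions cheap.
base = AutoModelForCausalLM.from_pretrained(BASE)
tuned = PeftModel.from_pretrained(base, ADAPTER_DIR)
```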
## Architecture

```
├── llm_playground/
│   ├── api/          # FastAPI endpoints
│   ├── backends/     # Transformers, llama.cpp, vLLM, Ollama
│   ├── core/         # Services, SQL conversation management, RAG
│   ├── models/       # Pydantic schemas, SQLAlchemy database models
│   └── config/       # Configuration management
├── scripts/          # Download models, run modes
├── data/             # SQLite database, ChromaDB vector store
└── logs/             # Application & access logs
```

```
frontend/
├── src/
│   ├── components/        # React components
│   │   ├── chat/          # Chat interface components
│   │   ├── conversations/ # Conversation management
│   │   ├── generation/    # Text generation interface
│   │   ├── layout/        # Layout components (Sidebar)
│   │   ├── rag/           # RAG/Knowledge base components
│   │   ├── settings/      # Settings and system info
│   │   └── ui/            # Reusable UI components
│   ├── lib/               # API client and utilities
│   ├── types/             # TypeScript type definitions
│   └── App.tsx            # Main application component
├── public/                # Static assets
└── package.json           # Dependencies and scripts
```
Storage defaults to SQLite:

```yaml
# config.yaml
database:
  url: "sqlite+aiosqlite:///./data/conversations.db"
  echo: false
```

To switch to PostgreSQL, point `DATABASE_URL` at the server:

```bash
export DATABASE_URL="postgresql+asyncpg://user:password@host:5432/db"
python main.py
```

```yaml
# config.production.yaml
database:
  url: "${DATABASE_URL}"
  pool_size: 20
  max_overflow: 30
```
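The `${DATABASE_URL}` placeholder has to be resolved when the config is loaded; here is a minimal sketch with PyYAML and `os.path.expandvars` (this project's actual loader may differ):

```python
# Expand ${VAR} references from the environment before parsing the YAML.
import os
import yaml

def load_config(path: str) -> dict:
    with open(path) as f:
        return yaml.safe_load(os.path.expandvars(f.read()))

cfg = load_config("config.production.yaml")
print(cfg["database"]["url"])  # postgresql+asyncpg://user:password@host:5432/db
```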
SQLite (default):
- Location: `./data/conversations.db` (project directory)
- Access: `sqlite3 ./data/conversations.db`
- Benefits: zero setup, portable, version-controllable

PostgreSQL:
- Location: external database server or Docker container
- Access: standard PostgreSQL tools (`psql`, pgAdmin)
- Benefits: concurrent users, ACID transactions, full-text search
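Either URL plugs into the same SQLAlchemy async engine; a minimal sketch (the project's own session handling is not shown):

```python
# Create the async engine from the configured database URL.
from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine

engine = create_async_engine(
    "sqlite+aiosqlite:///./data/conversations.db",  # or the postgresql+asyncpg URL,
    echo=False,                                     # with pool_size/max_overflow set
)
session_factory = async_sessionmaker(engine, expire_on_commit=False)
```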
```sql
-- Conversations table
CREATE TABLE conversations (
    session_id VARCHAR(36) PRIMARY KEY,
    created_at TIMESTAMP NOT NULL,
    last_updated TIMESTAMP NOT NULL,
    system_prompt TEXT,
    model_name VARCHAR(255),
    backend VARCHAR(50),
    last_parameters JSON,
    message_count INTEGER NOT NULL DEFAULT 0
);

-- Messages table
CREATE TABLE messages (
    id SERIAL PRIMARY KEY,
    session_id VARCHAR(36) REFERENCES conversations(session_id),
    role VARCHAR(20) NOT NULL,
    content TEXT NOT NULL,
    timestamp TIMESTAMP NOT NULL,
    sequence_number INTEGER NOT NULL
);

-- RAG Documents table
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    filename VARCHAR(255) NOT NULL,
    content_hash VARCHAR(64) NOT NULL UNIQUE,
    file_size INTEGER NOT NULL,
    document_metadata JSON,
    created_at TIMESTAMP NOT NULL,
    updated_at TIMESTAMP NOT NULL
);

-- RAG Document Chunks table
CREATE TABLE document_chunks (
    id SERIAL PRIMARY KEY,
    document_id INTEGER REFERENCES documents(id),
    chunk_index INTEGER NOT NULL,
    content TEXT NOT NULL,
    chunk_metadata JSON,
    created_at TIMESTAMP NOT NULL
);

-- RAG Conversation Chunks table (for searchable chat history)
CREATE TABLE conversation_chunks (
    id SERIAL PRIMARY KEY,
    session_id VARCHAR(36) REFERENCES conversations(session_id),
    chunk_index INTEGER NOT NULL,
    content TEXT NOT NULL,
    chunk_metadata JSON,
    created_at TIMESTAMP NOT NULL
);
```
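With the default SQLite store, this schema can be queried directly; for example, replaying a session's history with Python's built-in `sqlite3` module (the session id is a placeholder):

```python
# Replay a conversation's messages in order, straight from the database.
import sqlite3

conn = sqlite3.connect("./data/conversations.db")
rows = conn.execute(
    """
    SELECT role, content FROM messages
    WHERE session_id = ?
    ORDER BY sequence_number
    """,
    ("your-session-id",),  # placeholder: an id returned by /conversations/start
).fetchall()
for role, content in rows:
    print(f"{role}: {content}")
conn.close()
```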
Deployment options:

```bash
# Use docker-compose for the full stack
docker-compose up -d

# Use cloud PostgreSQL (AWS RDS, Google Cloud SQL, etc.)
export DATABASE_URL="postgresql+asyncpg://user:pass@db-host:5432/llm_playground"
python main.py --config config.production.yaml

# Single-server deployment (default SQLite)
python main.py --config config.yaml
```