Unified management and routing for llama.cpp, MLX, and vLLM models, with a web dashboard.
**Easy Model Management**
- Multiple Models Simultaneously: Run different models at the same time (7B for speed, 70B for quality)
- Smart Resource Management: Automatic idle timeout, LRU eviction, and configurable instance limits
- Web Dashboard: Modern React UI for managing instances, monitoring health, and viewing logs
**Flexible Integration**
- OpenAI API Compatible: Drop-in replacement - route requests to different models by instance name
- Multi-Backend Support: Native support for llama.cpp, MLX (Apple Silicon optimized), and vLLM
- Docker Ready: Run backends in containers with full GPU support
**Distributed Deployment**
- Remote Instances: Deploy instances on remote hosts
- Central Management: Manage everything from a single dashboard with automatic routing
**Quick Start**

- Install a backend (llama.cpp, MLX, or vLLM) - see Prerequisites below
- Download llamactl for your platform
- Run `llamactl` and open http://localhost:8080
- Create an instance and start inferencing!
**Prerequisites**

For the llama.cpp backend, you need `llama-server` from llama.cpp installed:

```bash
# Homebrew (macOS)
brew install llama.cpp
# Or build from source - see llama.cpp docs
# Or use Docker - no local installation required
```
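If you are unsure whether the binary is available, a quick shell check (plain POSIX tooling, nothing llamactl-specific) is:

```bash
# Confirm llama-server is on PATH before creating llama.cpp instances
command -v llama-server || echo "llama-server not found - install it or enable the Docker backend"
```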
For the MLX backend (macOS only), you need MLX-LM installed:

```bash
# Install via pip (requires Python 3.8+)
pip install mlx-lm
# Or in a virtual environment (recommended)
python -m venv mlx-env
source mlx-env/bin/activate
pip install mlx-lm
```

For the vLLM backend, you need vLLM installed:

```bash
# Install via pip (requires Python 3.8+, GPU required)
pip install vllm
# Or in a virtual environment (recommended)
python -m venv vllm-env
source vllm-env/bin/activate
pip install vllm
# Or use Docker - no local installation required
```

llamactl can also run backends in Docker containers, eliminating the need for a local backend installation:
```yaml
backends:
  llama-cpp:
    docker:
      enabled: true
  vllm:
    docker:
      enabled: true
```

**Installation**

```bash
# Linux/macOS - Get latest version and download
LATEST_VERSION=$(curl -s https://api.github.com/repos/lordmathis/llamactl/releases/latest | grep '"tag_name":' | sed -E 's/.*"([^"]+)".*/\1/')
curl -L https://github.com/lordmathis/llamactl/releases/download/${LATEST_VERSION}/llamactl-${LATEST_VERSION}-$(uname -s | tr '[:upper:]' '[:lower:]')-$(uname -m).tar.gz | tar -xz
sudo mv llamactl /usr/local/bin/
# Or download manually from the releases page:
# https://github.com/lordmathis/llamactl/releases/latest
# Windows - Download from releases page
```

Alternatively, build and run llamactl with Docker:

```bash
# Clone repository and build Docker images
git clone https://github.com/lordmathis/llamactl.git
cd llamactl
mkdir -p data/llamacpp data/vllm models
# Build and start llamactl with llama.cpp CUDA backend
docker-compose -f docker/docker-compose.yml up llamactl-llamacpp -d
# Build and start llamactl with vLLM CUDA backend
docker-compose -f docker/docker-compose.yml up llamactl-vllm -d
# Build from source using multi-stage build
docker build -f docker/Dockerfile.source -t llamactl:source .
```

Note: the Dockerfiles are configured for CUDA. Adapt the base images for other platforms (CPU, ROCm, etc.).
Building from source requires Go 1.24+ and Node.js 22+:

```bash
git clone https://github.com/lordmathis/llamactl.git
cd llamactl
cd webui && npm ci && npm run build && cd ..
go build -o llamactl ./cmd/server
```

Once llamactl is running, create your first instance from the web dashboard:

- Open http://localhost:8080
- Click "Create Instance"
- Choose backend type (llama.cpp, MLX, or vLLM)
- Configure your model and options (ports and API keys are auto-assigned)
- Start the instance and use it with any OpenAI-compatible client
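As a rough sketch of what that looks like from the client side: the request below assumes llamactl serves the standard OpenAI `/v1/chat/completions` path on its own port, that the `model` field selects the instance by name, and that inference keys are sent as a Bearer token. The instance name and key are placeholders; adjust for your setup.

```bash
# Hypothetical request: "my-model" is the name of a running llamactl instance,
# and YOUR_INFERENCE_KEY is an entry from auth.inference_keys (if auth is enabled)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_INFERENCE_KEY" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```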
**Configuration**

llamactl works out of the box with sensible defaults:

```yaml
server:
host: "0.0.0.0" # Server host to bind to
port: 8080 # Server port to bind to
allowed_origins: ["*"] # Allowed CORS origins (default: all)
allowed_headers: ["*"] # Allowed CORS headers (default: all)
enable_swagger: false # Enable Swagger UI for API docs
backends:
llama-cpp:
command: "llama-server"
args: []
environment: {} # Environment variables for the backend process
docker:
enabled: false
image: "ghcr.io/ggml-org/llama.cpp:server"
args: ["run", "--rm", "--network", "host", "--gpus", "all", "-v", "~/.local/share/llamactl/llama.cpp:/root/.cache/llama.cpp"]
environment: {} # Environment variables for the container
vllm:
command: "vllm"
args: ["serve"]
environment: {} # Environment variables for the backend process
docker:
enabled: false
image: "vllm/vllm-openai:latest"
args: ["run", "--rm", "--network", "host", "--gpus", "all", "--shm-size", "1g", "-v", "~/.local/share/llamactl/huggingface:/root/.cache/huggingface"]
environment: {} # Environment variables for the container
mlx:
command: "mlx_lm.server"
args: []
environment: {} # Environment variables for the backend process
instances:
port_range: [8000, 9000] # Port range for instances
data_dir: ~/.local/share/llamactl # Data directory (platform-specific, see below)
configs_dir: ~/.local/share/llamactl/instances # Instance configs directory
logs_dir: ~/.local/share/llamactl/logs # Logs directory
auto_create_dirs: true # Auto-create data/config/logs dirs if missing
max_instances: -1 # Max instances (-1 = unlimited)
max_running_instances: -1 # Max running instances (-1 = unlimited)
enable_lru_eviction: true # Enable LRU eviction for idle instances
default_auto_restart: true # Auto-restart new instances by default
default_max_restarts: 3 # Max restarts for new instances
default_restart_delay: 5 # Restart delay (seconds) for new instances
default_on_demand_start: true # Default on-demand start setting
on_demand_start_timeout: 120 # Default on-demand start timeout in seconds
timeout_check_interval: 5 # Idle instance timeout check in minutes
auth:
require_inference_auth: true # Require auth for inference endpoints
inference_keys: [] # Keys for inference endpoints
require_management_auth: true # Require auth for management endpoints
  management_keys: []              # Keys for management endpoints
```

For detailed configuration options, including environment variables, file locations, and advanced settings, see the Configuration Guide.
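The key lists in the `auth` section are plain strings. One possible way to generate them (an illustration only, assuming llamactl places no format constraints on key values):

```bash
# Generate random strings to use as entries in auth.inference_keys / auth.management_keys
openssl rand -hex 32   # e.g. an inference key
openssl rand -hex 32   # e.g. a management key
```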
MIT License - see LICENSE file.
