A hackable, modular, containerized inference server for deploying large language models in local or hybrid environments.
See the docs page for full documentation.
- Python 3.10+ (if running locally)
- Docker & Docker Compose, e.g. via Docker Desktop (recommended for containerized usage)
- Poetry (if installing locally)
- Clone the Repository
git clone https://github.com/tmcarmichael/fabricai-inference-server.git
cd fabricai-inference-server
- Download the Model
Suggested: TheBloke/Llama-2-13B-Ensemble-v5-GGUF (https://huggingface.co/TheBloke/Llama-2-13B-Ensemble-v5-GGUF)
Check hardware compatibility first (Hugging Face provides a compatibility check on the model page); if memory is tight, choose a 4-bit or 3-bit quantization instead.
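If you prefer to script the download, here is a minimal Python sketch using the huggingface_hub library. The filename is an assumption taken from the LLM_MODEL value used later in this guide; confirm the exact .gguf name on the model page.

# Sketch: download one GGUF quantization with huggingface_hub (pip install huggingface_hub).
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-13B-Ensemble-v5-GGUF",
    filename="llama-2-13b-ensemble-v5.Q4_K_M.gguf",  # confirm on the model page
    local_dir="/absolute/path/to/your/large-model",  # same directory you will set as LOCAL_MODEL_DIR
)
print(f"Model downloaded to: {local_path}")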
- Configure Model Path
Create a .env file at the project root:
cp .env.example .env
Edit .env to set:
LOCAL_MODEL_DIR=/absolute/path/to/your/large-model
LLM_MODEL=/models/llama-2-13b-ensemble-v5.Q4_K_M.gguf
LOCAL_MODEL_DIR is the host directory that contains your downloaded .gguf file; LLM_MODEL is the path to that file as seen inside the container (the host directory is mounted at /models).
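As an optional sanity check, the short sketch below (assuming python-dotenv is installed and that the .gguf file sits directly in LOCAL_MODEL_DIR) verifies that the paths in .env resolve before you start the stack:

import os
from pathlib import Path
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current directory
model_dir = Path(os.environ["LOCAL_MODEL_DIR"])
# LLM_MODEL is the in-container path; its basename should exist in LOCAL_MODEL_DIR on the host.
model_file = model_dir / Path(os.environ["LLM_MODEL"]).name
print(f"model dir exists: {model_dir.is_dir()}, model file exists: {model_file.is_file()}")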
- Run with Docker
Build & start:
docker-compose up --build
This spins up:
- fabricai-inference-server (FastAPI, uvicorn)
- Redis (for session/conversation memory)
- Test the Server
SSE Endpoint:
curl -N -X POST http://localhost:8000/v1/inference_sse \
-H "Content-Type: application/json" \
-d '{"prompt": "Hello from Docker!"}'
Status:
curl http://localhost:8000/v1/status
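Beyond curl, a small Python client can consume the SSE stream. This is a sketch that assumes the endpoint emits standard "data:" lines; the payload format inside each event depends on the server.

import httpx  # pip install httpx

payload = {"prompt": "Hello from Python!"}
with httpx.stream("POST", "http://localhost:8000/v1/inference_sse", json=payload, timeout=None) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        # Standard SSE frames look like "data: <chunk>"; blank keep-alive lines are skipped.
        if line.startswith("data:"):
            print(line[len("data:"):].strip(), flush=True)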
- [Optional] Local Development Environment without Docker
Install Poetry:
pip install --upgrade poetry
Install Dependencies:
poetry install
Start the Server:
poetry run uvicorn fabricai_inference_server.server:app --host 0.0.0.0 --port 8000
- [Optional] Event-based Streaming
Socket.IO Support: Connect via Socket.IO at ws://localhost:8000 and emit the "inference_prompt" event.
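For example, here is a minimal client sketch using the python-socketio package. The {"prompt": ...} payload shape and the name of the response event are assumptions, so check the server code for the exact contract.

import socketio  # pip install "python-socketio[client]"

sio = socketio.Client()

@sio.event
def connect():
    # Payload shape is an assumption; adjust to match the server's "inference_prompt" handler.
    sio.emit("inference_prompt", {"prompt": "Hello over Socket.IO!"})

@sio.on("*")
def catch_all(event, *args):
    # Catch-all handler (recent python-socketio versions), since the response event name
    # is not documented in this README.
    print(event, args)

# python-socketio clients connect with an http(s) URL; the WebSocket upgrade happens internally.
sio.connect("http://localhost:8000")
sio.sleep(15)
sio.disconnect()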