Single-node FastAPI service that keeps a realtime model hot while scheduling batch jobs to a GPU runtime container via Docker and Redis.
- Docker + Docker Compose with NVIDIA Container Toolkit
- Host Hugging Face cache directory exported via `HOST_HF_CACHE=/path/to/cache` (Compose binds it into containers at `/host_hf_cache`)
- `.env` based on `.env.vllm.example`
- Scoped auth keys file at `config/auth.keys.json` (copy `config/example.keys.json`)

```shell
cp .env.vllm.example .env
cp config/example.keys.json config/auth.keys.json
docker compose up --build
```
Services (listening on 0.0.0.0):

- API: http://&lt;host&gt;:8000
- UI: http://&lt;host&gt;:8000/dashboard
- Metrics: http://&lt;host&gt;:8000/metrics (requires `monitor` scope)
- Redis: redis://&lt;host&gt;:6379
| Variable | Purpose |
|---|---|
| `RUNTIME_TYPE` | `vllm` (default) or `sglang`; controls runtime image/args |
| `RUNTIME_IMAGE`, `RUNTIME_ARGS` | Docker image and CLI args passed to `llm_runtime` |
| `REALTIME_MODEL` | Model to keep hot for `/v1/completions` & `/v1/chat/completions` |
| `MAX_CONCURRENT_BATCH` | Upper bound on simultaneous batch requests |
| `AUTH_KEYS_FILE` | Container path to scoped key JSON (mounted via Compose) |
| `EUROEVAL_DEBUG_LOG` | Optional path for EuroEval adapter debug output |
Update `.env` and restart the orchestrator container to apply changes.
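For reference, a `.env` might look like the fragment below. The values are illustrative only (the container path for `AUTH_KEYS_FILE` is a guess; check `.env.vllm.example` for the real defaults):

```
RUNTIME_TYPE=vllm
RUNTIME_IMAGE=vllm/vllm-openai:latest
RUNTIME_ARGS=--max-model-len 8192
REALTIME_MODEL=synquid/gemma-3-27b-it-FP8
MAX_CONCURRENT_BATCH=4
AUTH_KEYS_FILE=/app/config/auth.keys.json
HOST_HF_CACHE=/data/hf_cache
```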
| Scope | Grants |
|---|---|
| `realtime` | `/v1/completions`, `/v1/chat/completions`, `/v1/models` |
| `batch` | `/v1/batch/*`, `/v1/jobs/*` |
| `eval` | `/eval/*` endpoints from `eval_manager` |
| `upload` | `/v1/upload/*` TUS endpoints |
| `monitor` | `/dashboard`, `/metrics`, queue/eval UI JSON |
Scopes load once at startup; restart the container after editing `config/auth.keys.json`.
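The key file's schema isn't documented in this README; one hypothetical shape, mapping tokens to scope lists, is shown below. Both tokens and the structure are invented for illustration, so consult `config/example.keys.json` for the actual format:

```json
{
  "rt-0123456789abcdef": ["realtime"],
  "ops-fedcba9876543210": ["batch", "monitor"]
}
```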
Realtime:

```shell
curl -sS -X POST \
  -H "Authorization: Bearer $REALTIME_TOKEN" \
  -H "Content-Type: application/json" \
  http://localhost:8000/v1/completions \
  -d '{"model":"synquid/gemma-3-27b-it-FP8","prompt":"hello","max_tokens":16}'
```
Batch:

```shell
curl -sS -X POST \
  -H "Authorization: Bearer $BATCH_TOKEN" \
  -H "Content-Type: application/json" \
  http://localhost:8000/v1/batch/completions \
  -d '{"model":"mistralai/Mistral-7B-Instruct-v0.1","prompt":"Write a haiku","max_tokens":32,"priority":1}'
```
Poll the `status_url` from the 202 response until `status` becomes `completed`.
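A minimal client-side polling sketch in Python. The `status_url` and `status` fields come from the 202 response described above; the intermediate status strings (`queued`, `running`) and the `failed` terminal state are assumptions, and `fetch_status` stands in for whatever HTTP client you use:

```python
import time

def poll_job(fetch_status, status_url, interval=2.0, timeout=600.0):
    """Poll status_url via fetch_status() until the job reaches a terminal state.

    fetch_status(url) is expected to return the job's status string,
    e.g. "queued", "running", "completed", or "failed" (assumed names).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(status_url)
        if status in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"job at {status_url} did not finish within {timeout}s")

# Example with a stubbed fetcher that completes on the third poll:
responses = iter(["queued", "running", "completed"])
print(poll_job(lambda url: next(responses), "/v1/jobs/123", interval=0.01))  # prints "completed"
```

In a real client, `fetch_status` would issue an authenticated GET to the `status_url` and extract the `status` field from the JSON body.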
- Orchestrator starts/stops the `llm_runtime` container as needed; use `docker logs orchestrator` and `docker logs llm_runtime` for diagnostics.
- Runtime crashes persist short logs under `/host_hf_cache/runtime_failures`.
- EuroEval adapter writes to `$EUROEVAL_DEBUG_LOG` (defaults to `./euroeval_debug.log`).
- Build custom runtime images with `runtime/Dockerfile` (e.g., patched NCCL) and point `RUNTIME_IMAGE` to the new tag.
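The custom-image step might look like the following (the registry and tag names are illustrative, not conventions from the repo):

```shell
# Build a patched runtime image from the repo's runtime/Dockerfile
docker build -f runtime/Dockerfile -t my-registry/llm-runtime:nccl-fix runtime/
# Point the orchestrator at the new tag
echo 'RUNTIME_IMAGE=my-registry/llm-runtime:nccl-fix' >> .env
```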
`eval-repo/run_eval.py` implements the evaluation protocol:
- `python eval-repo/run_eval.py list [--base-model]`
- `python eval-repo/run_eval.py prepare <eval_name> <model> [options]`
- `python eval-repo/run_eval.py score <eval_name>` (stdin JSON)
See eval-repo/PROTOCOL.md for request/response schema details.
- 503 with `Retry-After`: runtime is cold-starting; wait and retry.
- 401: token missing required scope (see table above).
- Batch job stuck: check Redis (`job:<id>` hash) and orchestrator logs for failure flags.
- No eval output: inspect `$EUROEVAL_DEBUG_LOG` and EuroEval stdout for tracebacks.
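The cold-start case above (503 with `Retry-After`) can be handled client-side with a small retry loop. A sketch in Python, where `send` stands in for your HTTP transport and is assumed to return `(status_code, headers, body)`:

```python
import time

def request_with_retry(send, max_attempts=5):
    """Retry a request while the runtime is cold-starting (503 + Retry-After).

    send() is expected to return (status_code, headers, body),
    with headers as a dict of response header names to values.
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 503:
            return status, body
        # Honor the server's Retry-After hint; fall back to 1 second.
        time.sleep(float(headers.get("Retry-After", 1)))
    raise RuntimeError("runtime did not warm up within the retry budget")

# Stubbed transport: two cold-start responses, then success.
responses = iter([
    (503, {"Retry-After": "0"}, ""),
    (503, {"Retry-After": "0"}, ""),
    (200, {}, "ok"),
])
status, body = request_with_retry(lambda: next(responses))
print(status, body)  # prints: 200 ok
```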