LLM Gateway Orchestrator (realtime + batch)

Single-node FastAPI service that keeps a realtime model hot while scheduling batch jobs to a GPU runtime container via Docker and Redis.

Requirements

Docker + Docker Compose with NVIDIA Container Toolkit
Host Hugging Face cache directory exported via HOST_HF_CACHE=/path/to/cache (Compose binds it into containers at /host_hf_cache)
.env based on .env.vllm.example
Scoped auth keys file at config/auth.keys.json (copy config/example.keys.json)
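
A quick preflight check can catch a missing prerequisite before `docker compose up`. The following is a minimal sketch, not part of the repository: the function name `missing_prerequisites` is hypothetical, and it only checks that the files and the cache directory listed above exist, not that their contents are valid.

```python
from pathlib import Path


def missing_prerequisites(repo_root, environ):
    """Return descriptions of prerequisites from the list above that are not met.

    `repo_root` is the checkout directory; `environ` is a mapping such as
    dict(os.environ). Both are passed in so the check is easy to test.
    """
    problems = []

    # HOST_HF_CACHE must name an existing directory on the host.
    cache = environ.get("HOST_HF_CACHE", "")
    if not cache:
        problems.append("HOST_HF_CACHE is not set")
    elif not Path(cache).is_dir():
        problems.append(f"HOST_HF_CACHE is not a directory: {cache}")

    # Files that are copied from their templates before the first start.
    for relative in (".env", "config/auth.keys.json"):
        if not (Path(repo_root) / relative).is_file():
            problems.append(f"missing file: {relative}")

    return problems
```

Calling `missing_prerequisites(Path.cwd(), dict(os.environ))` from the repository root returns an empty list when everything is in place.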

Quickstart

cp .env.vllm.example .env
cp config/example.keys.json config/auth.keys.json
docker compose up --build

Services (listening on 0.0.0.0): API at http://<host>:8000, UI at http://<host>:8000/dashboard and metrics at http://<host>:8000/metrics (both require the monitor scope), Redis at redis://<host>:6379.

Configuration

| Variable | Purpose |
| --- | --- |
| `RUNTIME_TYPE` | `vllm` (default) or `sglang`; controls runtime image/args |
| `RUNTIME_IMAGE`, `RUNTIME_ARGS` | Docker image and CLI args passed to `llm_runtime` |
| `REALTIME_MODEL` | Model to keep hot for `/v1/completions` & `/v1/chat/completions` |
| `MAX_CONCURRENT_BATCH` | Upper bound on simultaneous batch requests |
| `AUTH_KEYS_FILE` | Container path to scoped key JSON (mounted via Compose) |
| `EUROEVAL_DEBUG_LOG` | Optional path for EuroEval adapter debug output |

Update .env and restart the orchestrator container to apply changes.
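
Because changes only take effect after a restart, it can help to sanity-check `.env` first. The sketch below is an illustration under stated assumptions, not the orchestrator's own loader: `parse_env` and `config_problems` are hypothetical names, the parser handles only plain `KEY=VALUE` lines, and treating `MAX_CONCURRENT_BATCH` as a positive integer is an assumption based on its description.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def config_problems(values):
    """Return a list of problems found in the parsed settings."""
    problems = []

    # The table above lists vllm (default) and sglang as the valid choices.
    runtime = values.get("RUNTIME_TYPE", "vllm")
    if runtime not in ("vllm", "sglang"):
        problems.append(f"RUNTIME_TYPE must be vllm or sglang, got {runtime!r}")

    # Assumption: the batch limit, when set, should be a positive integer.
    limit = values.get("MAX_CONCURRENT_BATCH")
    if limit is not None:
        try:
            valid = int(limit) > 0
        except ValueError:
            valid = False
        if not valid:
            problems.append(f"MAX_CONCURRENT_BATCH must be a positive integer, got {limit!r}")

    return problems
```

Reading the file with `parse_env(open(".env").read())` and passing the result to `config_problems` yields an empty list when both settings look sane.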

Auth Scopes

| Scope | Grants |
| --- | --- |
| `realtime` | `/v1/completions`, `/v1/chat/completions`, `/v1/models` |
| `batch` | `/v1/batch/`, `/v1/jobs/` |
| `eval` | `/eval/*` endpoints from `eval_manager` |
| `upload` | `/v1/upload/*` TUS endpoints |
| `monitor` | `/dashboard`, `/metrics`, queue/eval UI JSON |

Scopes load once at startup; restart the container after editing config/auth.keys.json.
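
To work out which token a client call needs, the table can be expressed as a lookup. This is a sketch for illustration only: `SCOPE_PREFIXES` and `required_scope` are hypothetical names, simple prefix matching is an assumption about how paths map to scopes, and the queue/eval UI JSON routes are left out because their paths are not listed here.

```python
# Path prefixes taken from the scope table; prefix matching is an assumption.
SCOPE_PREFIXES = [
    ("realtime", ("/v1/completions", "/v1/chat/completions", "/v1/models")),
    ("batch", ("/v1/batch/", "/v1/jobs/")),
    ("eval", ("/eval/",)),
    ("upload", ("/v1/upload/",)),
    ("monitor", ("/dashboard", "/metrics")),
]


def required_scope(path):
    """Return the scope that covers `path`, or None if no listed prefix matches."""
    for scope, prefixes in SCOPE_PREFIXES:
        # str.startswith accepts a tuple and matches any of its entries.
        if path.startswith(prefixes):
            return scope
    return None
```

For example, `required_scope("/v1/batch/completions")` returns `"batch"`, so that request needs a token carrying the batch scope.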

API Usage

Realtime:

curl -sS -X POST \
  -H "Authorization: Bearer $REALTIME_TOKEN" \
  -H "Content-Type: application/json" \
  http://localhost:8000/v1/completions \
  -d '{"model":"synquid/gemma-3-27b-it-FP8","prompt":"hello","max_tokens":16}'

Batch:

curl -sS -X POST \
  -H "Authorization: Bearer $BATCH_TOKEN" \
  -H "Content-Type: application/json" \
  http://localhost:8000/v1/batch/completions \
  -d '{"model":"mistralai/Mistral-7B-Instruct-v0.1","prompt":"Write a haiku","max_tokens":32,"priority":1}'

Poll the status_url from the 202 response until status becomes completed.
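
The submit-and-poll flow can be scripted with the standard library. The following is a rough sketch under assumptions: the helper names are hypothetical, `status_url` is assumed to be directly fetchable with the same batch token, and only the `completed` status named above is treated as terminal, so a job that fails will run into the timeout rather than be reported early.

```python
import json
import time
import urllib.request


def post_json(url, token, payload):
    """POST a JSON body with a bearer token and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


def get_json(url, token):
    """GET a URL with a bearer token and return the decoded JSON reply."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


def wait_for_completion(status_url, fetch, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Call fetch(status_url) until the job reports status == "completed".

    Raises TimeoutError if the job has not completed within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(status_url)
        if job.get("status") == "completed":
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job at {status_url} did not complete in {timeout}s")
        sleep(interval)


# Usage, with BATCH_TOKEN holding a key that has the batch scope:
#   job = post_json("http://localhost:8000/v1/batch/completions", BATCH_TOKEN,
#                   {"model": "mistralai/Mistral-7B-Instruct-v0.1",
#                    "prompt": "Write a haiku", "max_tokens": 32, "priority": 1})
#   result = wait_for_completion(job["status_url"],
#                                lambda url: get_json(url, BATCH_TOKEN))
```

Passing the fetch function in as an argument keeps the polling loop independent of the HTTP client, which also makes it easy to exercise without a running server.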

Runtime & Logs

Orchestrator starts/stops the llm_runtime container as needed; use docker logs orchestrator and docker logs llm_runtime for diagnostics.
When the runtime crashes, a short log is saved under /host_hf_cache/runtime_failures (on the host: $HOST_HF_CACHE/runtime_failures).
EuroEval adapter writes to $EUROEVAL_DEBUG_LOG (defaults to ./euroeval_debug.log).
Build custom runtime images with runtime/Dockerfile (e.g., patched NCCL) and point RUNTIME_IMAGE to the new tag.
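
When looking into a crash, the newest files in the runtime_failures directory are usually the place to start. The helper below is a small sketch, not part of the repository: `recent_failure_logs` is a hypothetical name, and since the log file naming is not documented here it simply lists every file in the directory by modification time.

```python
from pathlib import Path


def recent_failure_logs(cache_dir, limit=5):
    """Return up to `limit` files from <cache_dir>/runtime_failures, newest first.

    `cache_dir` is the Hugging Face cache directory: the HOST_HF_CACHE path
    on the host, or /host_hf_cache inside a container.
    """
    failures = Path(cache_dir) / "runtime_failures"
    if not failures.is_dir():
        return []
    files = [entry for entry in failures.iterdir() if entry.is_file()]
    # Sort by modification time so the most recent crash comes first.
    files.sort(key=lambda entry: entry.stat().st_mtime, reverse=True)
    return files[:limit]
```

On the host, `recent_failure_logs(os.environ["HOST_HF_CACHE"])` returns the paths to open first.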

Evaluations

eval-repo/run_eval.py implements the evaluation protocol:

python eval-repo/run_eval.py list [--base-model]
python eval-repo/run_eval.py prepare <eval_name> <model> [options]
python eval-repo/run_eval.py score <eval_name>  # reads JSON from stdin

See eval-repo/PROTOCOL.md for request/response schema details.
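
The `score` subcommand reads its input from stdin, so it can be driven from another script with `subprocess`. This is a sketch under assumptions: `score_command` and `score` are hypothetical helper names, the script is assumed to be run from the repository root, and the payload is passed through untouched because its schema is defined in eval-repo/PROTOCOL.md rather than here.

```python
import json
import subprocess
import sys

RUN_EVAL = "eval-repo/run_eval.py"  # path relative to the repository root


def score_command(eval_name):
    """Build the argument list for the score subcommand."""
    return [sys.executable, RUN_EVAL, "score", eval_name]


def score(eval_name, payload):
    """Run the score subcommand with `payload` as JSON on stdin; return its stdout.

    Raises subprocess.CalledProcessError if the script exits with a non-zero code.
    """
    completed = subprocess.run(
        score_command(eval_name),
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Using `sys.executable` runs the script with the same interpreter as the caller, which avoids picking up a different `python` from the PATH.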

Troubleshooting

503 with Retry-After: runtime is cold-starting; wait and retry.
401: the token lacks the required scope (see the Auth Scopes table).
Batch job stuck: check Redis (job:<id> hash) and orchestrator logs for failure flags.
No eval output: inspect $EUROEVAL_DEBUG_LOG and EuroEval stdout for tracebacks.
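
For the 503 case, a client can wait for the interval the server suggests before retrying. The sketch below is one way to do that, with assumptions stated: `retry_after_seconds` and `call_with_retry` are hypothetical names, `send` is any caller-supplied function returning `(status, headers, body)`, `headers` is a plain dict keyed exactly as `Retry-After`, and only the numeric form of the header is parsed (an HTTP-date value falls back to the default delay).

```python
import time


def retry_after_seconds(headers, default=5.0):
    """Read Retry-After as a number of seconds; fall back to `default`."""
    value = headers.get("Retry-After")
    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        # Header missing, or not a plain number of seconds.
        return default


def call_with_retry(send, attempts=5, sleep=time.sleep):
    """Call send() up to `attempts` times, waiting between 503 responses.

    Returns the first non-503 response, or the last response once the
    attempts are used up. `attempts` should be at least 1.
    """
    for attempt in range(attempts):
        status, headers, body = send()
        if status != 503 or attempt == attempts - 1:
            return status, headers, body
        sleep(retry_after_seconds(headers))
```

Capping the number of attempts keeps a client from waiting forever if the runtime never becomes ready; at that point the orchestrator and llm_runtime logs are the next place to look.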

Repository: alexandrainst/llm-gateway