A tiny, production-style stack that serves NVIDIA Nemotron-Nano-9B-v2 via vLLM (OpenAI-compatible API on :9090) and a Pro-Mode wrapper (best-of-N + early-stop + synthesis) on :9099, all in Docker with GPU acceleration on Windows 11 + WSL2 and a 24 GB RTX GPU.
- Model server: vLLM (`/v1/*`) on port 9090
- Wrapper API: FastAPI `/pro-mode` on port 9099
- GPU: CUDA (tested on RTX 3090, 24 GB)
- Reasoning toggle: `/think` vs `/no_think`
- Trace option: return per-candidate thoughts/answers/scores
See `example_request.md` for a runnable curl and sample outputs (with and without trace).
```
nemotron-stack/
├─ docker-compose.yml    # vLLM + pro-mode services (GPU enabled)
├─ .env                  # optional HF token
└─ promode/
   ├─ Dockerfile
   ├─ requirements.txt
   └─ server.py          # serial best-of-N + judge + synthesis
```
- vLLM hosts `nvidia/NVIDIA-Nemotron-Nano-9B-v2` with Mamba/SSM settings tuned for quality and 24 GB GPUs.
- Pro-Mode (FastAPI) calls vLLM's OpenAI-compatible `/chat/completions`, generating candidates one by one (serially), early-stops when a score threshold is met, then synthesizes a final answer with a clean "expert editor" prompt (a sketch of this loop follows below).
- You can return the intermediate chain-of-thought per candidate (when `return_trace=true`), while keeping the final answer concise.
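A minimal sketch of that loop (the real logic lives in `promode/server.py`; the helper names, prompts, and defaults here are illustrative, not the actual implementation):

```python
# Illustrative sketch of the Pro-Mode loop -- promode/server.py is the real implementation.
# Assumes vLLM's OpenAI-compatible server on :9090; prompts and helpers are made up.
import re
import requests

VLLM = "http://localhost:9090/v1/chat/completions"
MODEL = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"

def chat(system: str, user: str, seed: int, max_tokens: int = 512) -> str:
    """One non-streaming /chat/completions call against vLLM."""
    body = {
        "model": MODEL,
        "messages": [{"role": "system", "content": system},
                     {"role": "user", "content": user}],
        "max_tokens": max_tokens, "temperature": 0.8, "top_p": 0.9, "seed": seed,
    }
    r = requests.post(VLLM, json=body, timeout=600)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

def judge(prompt: str, answer: str) -> float:
    """Cheap self-judge: ask for a 1-10 score and parse the first number."""
    text = chat("/no_think",
                "Rate this answer to the task from 1 to 10. Reply with the number only.\n"
                f"Task: {prompt}\nAnswer: {answer}", seed=0, max_tokens=8)
    m = re.search(r"\d+(?:\.\d+)?", text)
    return float(m.group()) if m else 0.0

def pro_mode(prompt: str, num_gens: int = 8, pass_score: float = 8.5, seed0: int = 42) -> str:
    scored = []
    for i in range(num_gens):                    # serial fan-out: one candidate at a time
        answer = chat("/think", prompt, seed=seed0 + i)
        score = judge(prompt, answer)
        scored.append((score, answer))
        if score >= pass_score:                  # early-stop once the judge is satisfied
            break
    top = [a for _, a in sorted(scored, key=lambda t: t[0], reverse=True)[:5]]
    return chat("You are an expert editor. Merge the candidate answers into one final answer. "
                "Do not mention the candidates or the synthesis process. /no_think",
                prompt + "\n\nCandidate answers:\n" + "\n---\n".join(top),
                seed=seed0, max_tokens=512)
```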
- Windows 11 with WSL2 (Ubuntu recommended)
- NVIDIA driver (recent), Docker Desktop with WSL2 integration and GPU support enabled
- `docker compose` available in your WSL shell

Sanity check GPU inside Docker:

```bash
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```
- Clone this repo (or drop these files into a folder).
- Optional: put your HF token in `.env`: `HUGGING_FACE_HUB_TOKEN=`
- Bring it up:

```bash
docker compose up -d --build
docker compose logs -f vllm
```
First run will download weights to a cached volume.
- Health checks:

```bash
curl http://localhost:9090/health     # vLLM
curl http://localhost:9099/health     # Pro-Mode
curl http://localhost:9090/v1/models
```
- Call vLLM directly:

```bash
curl http://localhost:9090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    "messages": [
      {"role": "system", "content": "/think"},
      {"role": "user", "content": "Write a limerick about GPUs."}
    ],
    "max_tokens": 128
  }'
```
- Call Pro-Mode (serial best-of-N + judge + synthesis):

```bash
curl -s http://localhost:9099/pro-mode \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Outline a weekend hiking plan in Snowdonia with safety tips.",
    "num_gens": 6,
    "think": true,
    "max_tokens": 400,
    "temperature": 0.8,
    "top_p": 0.9,
    "pass_score": 9.0,
    "do_synthesis": true,
    "judge": true,
    "return_trace": true
  }'
```
A prettier example (and sample outputs) lives in `example_request.md`.
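Since the server speaks the OpenAI API, the same "call vLLM directly" request can also be made from Python with the official `openai` client pointed at the local base URL (the API key is just a placeholder for an unauthenticated local server):

```python
# Same request as the "call vLLM directly" curl, via the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9090/v1", api_key="local")  # key unused locally

resp = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    messages=[
        {"role": "system", "content": "/think"},
        {"role": "user", "content": "Write a limerick about GPUs."},
    ],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```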
The vllm service is configured for Mamba/SSM models and 24 GB GPUs:
- `--dtype float16` (stable on RTX 3090 / Ampere)
- `--mamba-ssm-cache-dtype float32` (quality knob for Nemotron-Nano v2)
- `--max-model-len` defaults small for stability; raise later as needed
- `--gpu-memory-utilization` tuned (0.85–0.92) to leave room for KV cache
- Optional `--cpu-offload-gb` to push part of the cache to CPU RAM if you need longer contexts on a 24 GB card
The compose also mounts caches:
- `hf-cache` → `/root/.cache/huggingface` (weights)
- `torch-cache` → `/root/.cache/torch` (compiled kernels)
- `triton-cache` → `/root/.triton` (compiler cache)
These volumes persist between restarts so you don’t redownload/recompile.
- `POST /pro-mode` → structured JSON result
- `GET /health` → `{"status":"ok"}`
```jsonc
{
  "prompt": "your task",
  "num_gens": 8,          // number of candidates to try (serial)
  "think": true,          // use Nemotron's /think toggle for candidates
  "max_tokens": 512,
  "temperature": 0.8,
  "top_p": 0.9,
  "pass_score": 8.5,      // early-stop threshold (judge score)
  "do_synthesis": true,   // run final merge
  "judge": true,          // self-judge for early-stop + ranking
  "seed0": 42,            // seeds increment per run
  "return_trace": false   // include candidates & thoughts if true
}
```
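For reference, a minimal Python client that sends this request body and reads the response fields documented further below (only `requests` is required; the URL assumes the compose setup above):

```python
# Minimal /pro-mode client; field names follow the request/response schemas in this README.
import requests

payload = {
    "prompt": "Outline a weekend hiking plan in Snowdonia with safety tips.",
    "num_gens": 6,
    "think": True,
    "max_tokens": 400,
    "pass_score": 9.0,
    "do_synthesis": True,
    "judge": True,
    "return_trace": True,
}
resp = requests.post("http://localhost:9099/pro-mode", json=payload, timeout=1200)
resp.raise_for_status()
result = resp.json()

print(result["final_answer"])
print("used:", result["used_candidates"], "scores:", result["scores"])
if "trace" in result:                      # populated only when return_trace=true
    for c in result["trace"]["candidates"]:
        print(c["seed"], c["score"])
```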
- Fan-out serially: generate one candidate at a time (streaming disabled for simplicity).
- Judge + early-stop: after each candidate, a small rubric scores it 1–10; if the score is `>= pass_score`, stop early.
- Synthesis: take the top K (default 5) answers (not thoughts) and call a final editor-style merge:
  - System: "You are an expert editor… Do not mention the candidates or the synthesis process."
  - We force `/no_think` in synthesis so the final answer is concise.
  - A tiny sanitizer trims any leftover meta preface (sketched after this list).
- Trace (optional): when `return_trace=true` you get, per candidate: `seed`, `thoughts` (from `<think>…</think>`), `answer` (cleaned), `score`, plus synthesis `used_indices`, `raw` (full text), and `thoughts` if present.
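A rough sketch of the splitting and sanitizing described above (hypothetical helpers; `promode/server.py` may use different regexes and names):

```python
# Hypothetical helpers for the trace fields: split <think> blocks and trim meta prefaces.
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_candidate(raw: str) -> tuple[str | None, str]:
    """Return (thoughts, answer): thoughts come from <think>...</think>, answer is the rest."""
    m = THINK_RE.search(raw)
    thoughts = m.group(1).strip() if m else None
    answer = THINK_RE.sub("", raw).strip()
    return thoughts, answer

def sanitize(final: str) -> str:
    """Trim a leftover meta preface such as 'Here is the merged answer:' from the synthesis."""
    return re.sub(r"^\s*here('s| is) (the |a )?(final|merged|synthesized)[^:\n]*:\s*", "",
                  final, flags=re.IGNORECASE).strip()
```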
```jsonc
{
  "final_answer": "...",
  "used_candidates": 3,
  "scores": [9.1, 8.6, 9.0],
  "candidates": ["raw text with <think>...</think>", "..."],   // only if return_trace=true
  "trace": {
    "candidates": [{ "seed": 42, "thoughts": "...", "answer": "...", "score": 9.1 }, ...],
    "synthesis": { "used_indices": [0, 2], "thoughts": null, "raw": "..." }
  }
}
```

Start conservative; then scale up:
- `--dtype float16`
- `--max-model-len 8192 → 16384 → 32768` (only raise it if your prompts need it)
- `--max-num-seqs 1 → 2`
- `--gpu-memory-utilization 0.85–0.92`
- `--mamba-ssm-cache-dtype float32` (recommended by the model card)
- Optional: `--cpu-offload-gb 6–10` to gain KV headroom (latency trade-off)
- Bring chunked prefill back later by removing `--no-enable-chunked-prefill` once stable
Why KV cache allocation fails (common error): weights (~16.6 GiB) + activations can leave ~0 GiB for the cache at high contexts. Reduce `--max-model-len` or add `--cpu-offload-gb`.
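A back-of-the-envelope version of that budget, using the ~16.6 GiB weight figure above and a utilization value inside the 0.85–0.92 range (the activation headroom is an assumption for illustration):

```python
# Rough VRAM budget on a 24 GiB card (illustrative numbers only).
total_vram_gib = 24.0
gpu_mem_util = 0.90             # --gpu-memory-utilization
weights_gib = 16.6              # Nemotron-Nano-9B-v2 weights at float16 (figure from above)
activation_headroom_gib = 3.0   # assumed working space for activations / CUDA graphs

vllm_budget = total_vram_gib * gpu_mem_util             # what vLLM is allowed to claim
cache_headroom = vllm_budget - weights_gib - activation_headroom_gib
print(f"cache headroom: ~{cache_headroom:.1f} GiB")     # ~2.0 GiB -> lower --max-model-len or offload
```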
- Watch GPU (inside WSL or container):

```bash
watch -n 0.5 nvidia-smi
# or
nvidia-smi --query-gpu=memory.total,memory.used,utilization.gpu --format=csv -l 1
```

- Server ready?

```bash
curl http://localhost:9090/health
curl http://localhost:9090/v1/models
```

- Typical fixes:
  - "No available memory for the cache blocks": lower `--max-model-len`, raise `--gpu-memory-utilization`, or add `--cpu-offload-gb`.
  - Connection reset during startup: the model is still loading; wait for "serving" lines in `docker logs -f vllm`.
  - Slow first run: that's weight download + kernel compile. Volumes cache both for next time.
- Serial best-of-N gives you an anytime algorithm with early-stop → often fewer total tokens vs parallel fan-out.
- Judge is cheap but useful: it gates synthesis and selects top-K.
- Editor-style synthesis yields a crisp final answer (no “I merged X & Y…” narration), while trace preserves all thoughts.
- Switch synthesis style to allow an explanation (set a flag to use `/think` and a "resolve & explain" prompt).
- Add mini-batch parallelism for candidates with `asyncio.gather` if you want lower latency (see the sketch below).
- Tighten the synthesis temperature to 0.1 for maximum determinism.
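A sketch of the `asyncio.gather` idea from the list above, assuming an async HTTP client such as `httpx` (the batching policy and names are illustrative; the shipped wrapper generates serially):

```python
# Illustrative mini-batch fan-out with asyncio.gather; the shipped wrapper is serial.
import asyncio
import httpx

VLLM = "http://localhost:9090/v1/chat/completions"
MODEL = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"

async def one_candidate(client: httpx.AsyncClient, prompt: str, seed: int) -> str:
    body = {
        "model": MODEL,
        "messages": [{"role": "system", "content": "/think"},
                     {"role": "user", "content": prompt}],
        "max_tokens": 400, "temperature": 0.8, "top_p": 0.9, "seed": seed,
    }
    r = await client.post(VLLM, json=body, timeout=600)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

async def batch_candidates(prompt: str, seeds: list[int]) -> list[str]:
    # Generate a small batch concurrently; judging and early-stop can then run per batch.
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*(one_candidate(client, prompt, s) for s in seeds))

# Example: asyncio.run(batch_candidates("Write a limerick about GPUs.", [42, 43]))
```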
- Model: NVIDIA/Nemotron-Nano-9B-v2
- Inference: vLLM (OpenAI-compatible server)
- Wrapper concept inspired by Matt Shumer’s GPT Pro-Mode (multi-sample + synth).
This repo’s code is under your chosen license (add one). Respect the licenses/usage terms of the underlying model (NVIDIA) and tools (vLLM, FastAPI, etc.).
- Try the `example_request.md` curl and compare the "clean final answer" vs. the "trace with thoughts."
- Plug the `/pro-mode` endpoint into your agent's planner/coder loop (use `pass_score` and `num_gens` as quality/cost knobs).