A tiny, production-style stack that serves NVIDIA Nemotron-Nano-9B-v2 via vLLM (OpenAI-compatible API on :9090) and a Pro-Mode wrapper (best-of-N + early-stop + synthesis) on :9099, all in Docker with GPU acceleration on Windows 11 + WSL2 and a 24 GB RTX GPU.
- Model server: vLLM (`/v1/*`) on port 9090
- Wrapper API: FastAPI `/pro-mode` on port 9099
- GPU: CUDA (tested on RTX 3090, 24 GB)
- Reasoning toggle: `/think` vs `/no_think`
- Trace option: return per-candidate thoughts/answers/scores
See `example_request.md` for a runnable curl and sample outputs (with and without trace).
```
nemotron-stack/
├─ docker-compose.yml    # vLLM + pro-mode services (GPU enabled)
├─ .env                  # optional HF token
└─ promode/
   ├─ Dockerfile
   ├─ requirements.txt
   └─ server.py          # serial best-of-N + judge + synthesis
```
- vLLM hosts `nvidia/NVIDIA-Nemotron-Nano-9B-v2` with Mamba/SSM settings tuned for quality and 24 GB GPUs.
- Pro-Mode (FastAPI) calls vLLM's OpenAI-compatible `/chat/completions`, generating candidates one by one (serially), early-stops when a score threshold is met, then synthesizes a final answer with a clean "expert editor" prompt (a sketch of this loop follows below).
- You can return the intermediate chain-of-thought per candidate (when `return_trace=true`), while keeping the final answer concise.
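A minimal sketch of that loop (the real logic lives in `promode/server.py`; the helper names, prompts, and defaults here are illustrative, not the actual implementation):

```python
# Illustrative sketch of the Pro-Mode loop -- promode/server.py is the real implementation.
# Assumes vLLM's OpenAI-compatible server on :9090; prompts and helpers are made up.
import re
import requests

VLLM = "http://localhost:9090/v1/chat/completions"
MODEL = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"

def chat(system: str, user: str, seed: int, max_tokens: int = 512) -> str:
    """One non-streaming /chat/completions call against vLLM."""
    body = {
        "model": MODEL,
        "messages": [{"role": "system", "content": system},
                     {"role": "user", "content": user}],
        "max_tokens": max_tokens, "temperature": 0.8, "top_p": 0.9, "seed": seed,
    }
    r = requests.post(VLLM, json=body, timeout=600)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

def judge(prompt: str, answer: str) -> float:
    """Cheap self-judge: ask for a 1-10 score and parse the first number."""
    text = chat("/no_think",
                "Rate this answer to the task from 1 to 10. Reply with the number only.\n"
                f"Task: {prompt}\nAnswer: {answer}", seed=0, max_tokens=8)
    m = re.search(r"\d+(?:\.\d+)?", text)
    return float(m.group()) if m else 0.0

def pro_mode(prompt: str, num_gens: int = 8, pass_score: float = 8.5, seed0: int = 42) -> str:
    scored = []
    for i in range(num_gens):                    # serial fan-out: one candidate at a time
        answer = chat("/think", prompt, seed=seed0 + i)
        score = judge(prompt, answer)
        scored.append((score, answer))
        if score >= pass_score:                  # early-stop once the judge is satisfied
            break
    top = [a for _, a in sorted(scored, key=lambda t: t[0], reverse=True)[:5]]
    return chat("You are an expert editor. Merge the candidate answers into one final answer. "
                "Do not mention the candidates or the synthesis process. /no_think",
                prompt + "\n\nCandidate answers:\n" + "\n---\n".join(top),
                seed=seed0, max_tokens=512)
```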
- Windows 11 with WSL2 (Ubuntu recommended)
- NVIDIA driver (recent), Docker Desktop with WSL2 integration and GPU support enabled
- `docker compose` available in your WSL shell

Sanity check GPU inside Docker:

```bash
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```
- Clone this repo (or drop these files into a folder).
- Optional: put your HF token in `.env`: `HUGGING_FACE_HUB_TOKEN=`
- Bring it up:

```bash
docker compose up -d --build
docker compose logs -f vllm
```
First run will download weights to a cached volume.
- Health checks:

```bash
curl http://localhost:9090/health     # vLLM
curl http://localhost:9099/health     # Pro-Mode
curl http://localhost:9090/v1/models
```
- Call vLLM directly:

```bash
curl http://localhost:9090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    "messages": [
      {"role": "system", "content": "/think"},
      {"role": "user", "content": "Write a limerick about GPUs."}
    ],
    "max_tokens": 128
  }'
```
- Call Pro-Mode (serial best-of-N + judge + synthesis):

```bash
curl -s http://localhost:9099/pro-mode \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Outline a weekend hiking plan in Snowdonia with safety tips.",
    "num_gens": 6,
    "think": true,
    "max_tokens": 400,
    "temperature": 0.8,
    "top_p": 0.9,
    "pass_score": 9.0,
    "do_synthesis": true,
    "judge": true,
    "return_trace": true
  }'
```
A prettier example (and sample outputs) lives in `example_request.md`.
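Since the server speaks the OpenAI API, the same "call vLLM directly" request can also be made from Python with the official `openai` client pointed at the local base URL (the API key is just a placeholder for an unauthenticated local server):

```python
# Same request as the "call vLLM directly" curl, via the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9090/v1", api_key="local")  # key unused locally

resp = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    messages=[
        {"role": "system", "content": "/think"},
        {"role": "user", "content": "Write a limerick about GPUs."},
    ],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```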
The vllm service is configured for Mamba/SSM models and 24 GB GPUs:
- `--dtype float16` (stable on RTX 3090 / Ampere)
- `--mamba-ssm-cache-dtype float32` (quality knob for Nemotron-Nano v2)
- `--max-model-len` defaults small for stability; raise later as needed
- `--gpu-memory-utilization` tuned (0.85–0.92) to leave room for KV cache
- Optional `--cpu-offload-gb` to push part of the cache to CPU RAM if you need longer contexts on a 24 GB card
The compose also mounts caches:
- `hf-cache` → `/root/.cache/huggingface` (weights)
- `torch-cache` → `/root/.cache/torch` (compiled kernels)
- `triton-cache` → `/root/.triton` (compiler cache)
These volumes persist between restarts so you don’t redownload/recompile.
- `POST /pro-mode` → structured JSON result
- `GET /health` → `{"status":"ok"}`
```jsonc
{
  "prompt": "your task",
  "num_gens": 8,          // number of candidates to try (serial)
  "think": true,          // use Nemotron's /think toggle for candidates
  "max_tokens": 512,
  "temperature": 0.8,
  "top_p": 0.9,
  "pass_score": 8.5,      // early-stop threshold (judge score)
  "do_synthesis": true,   // run final merge
  "judge": true,          // self-judge for early-stop + ranking
  "seed0": 42,            // seeds increment per run
  "return_trace": false   // include candidates & thoughts if true
}
```
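For reference, a minimal Python client that sends this request body and reads the response fields documented further below (only `requests` is required; the URL assumes the compose setup above):

```python
# Minimal /pro-mode client; field names follow the request/response schemas in this README.
import requests

payload = {
    "prompt": "Outline a weekend hiking plan in Snowdonia with safety tips.",
    "num_gens": 6,
    "think": True,
    "max_tokens": 400,
    "pass_score": 9.0,
    "do_synthesis": True,
    "judge": True,
    "return_trace": True,
}
resp = requests.post("http://localhost:9099/pro-mode", json=payload, timeout=1200)
resp.raise_for_status()
result = resp.json()

print(result["final_answer"])
print("used:", result["used_candidates"], "scores:", result["scores"])
if "trace" in result:                      # populated only when return_trace=true
    for c in result["trace"]["candidates"]:
        print(c["seed"], c["score"])
```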
- Fan-out serially: generate one candidate at a time (streaming disabled for simplicity).
- Judge + early-stop: after each candidate, a small rubric scores it 1–10; if the score is `>= pass_score`, stop early.
- Synthesis: take the top K (default 5) answers (not thoughts) and call a final editor-style merge:
  - System: "You are an expert editor… Do not mention the candidates or the synthesis process."
  - We force `/no_think` in synthesis so the final answer is concise.
  - A tiny sanitizer trims any leftover meta preface (sketched after this list).
- Trace (optional): when `return_trace=true` you get, per candidate: `seed`, `thoughts` (from `<think>…</think>`), `answer` (cleaned), `score`, plus synthesis `used_indices`, `raw` (full text), and `thoughts` if present.
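A rough sketch of the splitting and sanitizing described above (hypothetical helpers; `promode/server.py` may use different regexes and names):

```python
# Hypothetical helpers for the trace fields: split <think> blocks and trim meta prefaces.
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_candidate(raw: str) -> tuple[str | None, str]:
    """Return (thoughts, answer): thoughts come from <think>...</think>, answer is the rest."""
    m = THINK_RE.search(raw)
    thoughts = m.group(1).strip() if m else None
    answer = THINK_RE.sub("", raw).strip()
    return thoughts, answer

def sanitize(final: str) -> str:
    """Trim a leftover meta preface such as 'Here is the merged answer:' from the synthesis."""
    return re.sub(r"^\s*here('s| is) (the |a )?(final|merged|synthesized)[^:\n]*:\s*", "",
                  final, flags=re.IGNORECASE).strip()
```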
```jsonc
{
  "final_answer": "...",
  "used_candidates": 3,
  "scores": [9.1, 8.6, 9.0],
  "candidates": ["raw text with <think>...</think>", "..."],   // only if return_trace=true
  "trace": {
    "candidates": [{ "seed": 42, "thoughts": "...", "answer": "...", "score": 9.1 }, ...],
    "synthesis": { "used_indices": [0, 2], "thoughts": null, "raw": "..." }
  }
}
```

Start conservative; then scale up:
- `--dtype float16`
- `--max-model-len 8192 → 16384 → 32768` (only raise it if your prompts need it)
- `--max-num-seqs 1 → 2`
- `--gpu-memory-utilization 0.85–0.92`
- `--mamba-ssm-cache-dtype float32` (recommended by the model card)
- Optional: `--cpu-offload-gb 6–10` to gain KV headroom (latency trade-off)
- Bring chunked prefill back later by removing `--no-enable-chunked-prefill` once stable
Why KV cache allocation fails (common error): weights (~16.6 GiB) + activations can leave ~0 GiB for the cache at high contexts. Reduce `--max-model-len` or add `--cpu-offload-gb`.
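A back-of-the-envelope version of that budget, using the ~16.6 GiB weight figure above and a utilization value inside the 0.85–0.92 range (the activation headroom is an assumption for illustration):

```python
# Rough VRAM budget on a 24 GiB card (illustrative numbers only).
total_vram_gib = 24.0
gpu_mem_util = 0.90             # --gpu-memory-utilization
weights_gib = 16.6              # Nemotron-Nano-9B-v2 weights at float16 (figure from above)
activation_headroom_gib = 3.0   # assumed working space for activations / CUDA graphs

vllm_budget = total_vram_gib * gpu_mem_util             # what vLLM is allowed to claim
cache_headroom = vllm_budget - weights_gib - activation_headroom_gib
print(f"cache headroom: ~{cache_headroom:.1f} GiB")     # ~2.0 GiB -> lower --max-model-len or offload
```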
- Watch GPU (inside WSL or container):

```bash
watch -n 0.5 nvidia-smi
# or
nvidia-smi --query-gpu=memory.total,memory.used,utilization.gpu --format=csv -l 1
```

- Server ready?

```bash
curl http://localhost:9090/health
curl http://localhost:9090/v1/models
```

- Typical fixes:
  - "No available memory for the cache blocks": lower `--max-model-len`, raise `--gpu-memory-utilization`, or add `--cpu-offload-gb`.
  - Connection reset during startup: the model is still loading; wait for "serving" lines in `docker logs -f vllm`.
  - Slow first run: that's weight download + kernel compile. Volumes cache both for next time.
- Serial best-of-N gives you an anytime algorithm with early-stop → often fewer total tokens vs parallel fan-out.
- Judge is cheap but useful: it gates synthesis and selects top-K.
- Editor-style synthesis yields a crisp final answer (no “I merged X & Y…” narration), while trace preserves all thoughts.
- Switch synthesis style to allow an explanation (set a flag to use `/think` and a "resolve & explain" prompt).
- Add mini-batch parallelism for candidates with `asyncio.gather` if you want lower latency (see the sketch below).
- Tighten the synthesis temperature to 0.1 for maximum determinism.
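A sketch of the `asyncio.gather` idea from the list above, assuming an async HTTP client such as `httpx` (the batching policy and names are illustrative; the shipped wrapper generates serially):

```python
# Illustrative mini-batch fan-out with asyncio.gather; the shipped wrapper is serial.
import asyncio
import httpx

VLLM = "http://localhost:9090/v1/chat/completions"
MODEL = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"

async def one_candidate(client: httpx.AsyncClient, prompt: str, seed: int) -> str:
    body = {
        "model": MODEL,
        "messages": [{"role": "system", "content": "/think"},
                     {"role": "user", "content": prompt}],
        "max_tokens": 400, "temperature": 0.8, "top_p": 0.9, "seed": seed,
    }
    r = await client.post(VLLM, json=body, timeout=600)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

async def batch_candidates(prompt: str, seeds: list[int]) -> list[str]:
    # Generate a small batch concurrently; judging and early-stop can then run per batch.
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*(one_candidate(client, prompt, s) for s in seeds))

# Example: asyncio.run(batch_candidates("Write a limerick about GPUs.", [42, 43]))
```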
- Model: NVIDIA/Nemotron-Nano-9B-v2
- Inference: vLLM (OpenAI-compatible server)
- Wrapper concept inspired by Matt Shumer’s GPT Pro-Mode (multi-sample + synth).
This repo’s code is under your chosen license (add one). Respect the licenses/usage terms of the underlying model (NVIDIA) and tools (vLLM, FastAPI, etc.).
- Try the `example_request.md` curl and compare the "clean final answer" vs. the "trace with thoughts."
- Plug the `/pro-mode` endpoint into your agent's planner/coder loop (use `pass_score` and `num_gens` as quality/cost knobs).