
Pro-Mode Nemotron Nano (9B-v2) on RTX 3090 using vLLM — Docker/WSL2

A tiny, production-style stack that serves NVIDIA Nemotron-Nano-9B-v2 via vLLM (OpenAI-compatible API on :9090) and a Pro-Mode wrapper (best-of-N + early-stop + synthesis) on :9099 — all in Docker with GPU acceleration on Windows 11 + WSL2 and a 24 GB RTX GPU (e.g., RTX 3090).

  • Model server: vLLM (/v1/*) on port 9090
  • Wrapper API: FastAPI /pro-mode on port 9099
  • GPU: CUDA (tested on RTX 3090, 24 GB)
  • Reasoning toggle: /think vs /no_think
  • Trace option: return per-candidate thoughts/answers/scores

See example_request.md for a runnable curl and sample outputs (with and without trace).


What’s inside

nemotron-stack/
├─ docker-compose.yml        # vLLM + pro-mode services (GPU enabled)
├─ .env                      # optional HF token
└─ promode/
   ├─ Dockerfile
   ├─ requirements.txt
   └─ server.py              # serial best-of-N + judge + synthesis

Architecture

  • vLLM hosts nvidia/NVIDIA-Nemotron-Nano-9B-v2 with Mamba/SSM settings tuned for quality on 24 GB GPUs.
  • Pro-Mode (FastAPI) calls vLLM’s OpenAI-compatible /chat/completions, generating candidates one at a time (serially), early-stopping when a score threshold is met, then synthesizing a final answer with a clean “expert editor” prompt (see the sketch after this list).
  • You can return the intermediate chain-of-thought per candidate (when return_trace=true), while keeping the final answer concise.
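
To make the flow concrete, here is a minimal standalone sketch of the same loop, assuming the openai Python client pointed at the vLLM server on :9090. Function names, prompt wording, and defaults below are illustrative; the real implementation lives in promode/server.py.

# pro_mode_sketch.py -- illustrative only; the real logic lives in promode/server.py
import re
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:9090/v1", api_key="not-needed")
MODEL = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"

def chat(system: str, user: str, **kw) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
        **kw,
    )
    return resp.choices[0].message.content or ""

def judge_score(prompt: str, answer: str) -> float:
    """Ask the model to grade a candidate 1-10; a stand-in for the real rubric."""
    raw = chat("/no_think",
               "Rate this answer to the task on a 1-10 scale. Reply with a number only.\n\n"
               f"Task: {prompt}\n\nAnswer: {answer}",
               max_tokens=8, temperature=0.0)
    m = re.search(r"\d+(\.\d+)?", raw)
    return float(m.group()) if m else 0.0

def pro_mode(prompt: str, num_gens: int = 6, pass_score: float = 9.0) -> str:
    candidates = []  # (score, answer), generated one at a time (serial fan-out)
    for i in range(num_gens):
        raw = chat("/think", prompt, max_tokens=400, temperature=0.8,
                   top_p=0.9, seed=42 + i)                # seeds increment per run
        answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.S).strip()
        score = judge_score(prompt, answer)
        candidates.append((score, answer))
        if score >= pass_score:                           # early stop
            break
    top = [a for _, a in sorted(candidates, reverse=True)[:5]]  # top-K answers
    # Final editor-style merge, forced to /no_think so the answer stays concise.
    return chat("/no_think",
                "You are an expert editor. Merge the candidate answers below into one "
                "best answer. Do not mention the candidates or the synthesis process.\n\n"
                + "\n\n---\n\n".join(top),
                max_tokens=400, temperature=0.2)

if __name__ == "__main__":
    print(pro_mode("Write a limerick about GPUs."))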

Prerequisites

  • Windows 11 with WSL2 (Ubuntu recommended)
  • NVIDIA driver (recent), Docker Desktop with WSL2 integration and GPU support enabled
  • docker compose available in your WSL shell

Sanity check GPU inside Docker:

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

Quick start

  1. Clone this repo (or drop these files into a folder).

  2. Optional: put your HF token in .env

    HUGGING_FACE_HUB_TOKEN=
    
  3. Bring it up:

    docker compose up -d --build
    docker compose logs -f vllm

    First run will download weights to a cached volume.

  4. Health checks

    curl http://localhost:9090/health        # vLLM
    curl http://localhost:9099/health        # Pro-Mode
    curl http://localhost:9090/v1/models

  5. Call vLLM directly

    curl http://localhost:9090/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model":"nvidia/NVIDIA-Nemotron-Nano-9B-v2",
        "messages":[
          {"role":"system","content":"/think"},
          {"role":"user","content":"Write a limerick about GPUs."}
        ],
        "max_tokens":128
      }'

  6. Call Pro-Mode (serial best-of-N + judge + synthesis)

    curl -s http://localhost:9099/pro-mode \
      -H "Content-Type: application/json" \
      -d '{
        "prompt":"Outline a weekend hiking plan in Snowdonia with safety tips.",
        "num_gens": 6,
        "think": true,
        "max_tokens": 400,
        "temperature": 0.8,
        "top_p": 0.9,
        "pass_score": 9.0,
        "do_synthesis": true,
        "judge": true,
        "return_trace": true
      }'

A prettier example (and sample outputs) lives in example_request.md.


Docker Compose notes

The vllm service is configured for Mamba/SSM models and 24 GB GPUs:

  • --dtype float16 (stable on RTX 3090 / Ampere)
  • --mamba-ssm-cache-dtype float32 (quality knob for Nemotron-Nano v2)
  • --max-model-len defaults small for stability; raise later as needed
  • --gpu-memory-utilization tuned (0.85–0.92) to leave room for KV cache
  • optional --cpu-offload-gb to push part of the cache to CPU RAM if you need longer contexts on a 24 GB card

The compose also mounts caches:

  • hf-cache → /root/.cache/huggingface (weights)
  • torch-cache → /root/.cache/torch (compiled kernels)
  • triton-cache → /root/.triton (compiler cache)

These volumes persist between restarts so you don’t redownload/recompile.


Pro-Mode API (server.py)

Endpoint

  • POST /pro-mode → structured JSON result
  • GET /health → {"status":"ok"}

Request schema (key fields)

{
  "prompt": "your task",
  "num_gens": 8,            // number of candidates to try (serial)
  "think": true,            // use Nemotron's /think toggle for candidates
  "max_tokens": 512,
  "temperature": 0.8,
  "top_p": 0.9,
  "pass_score": 8.5,        // early-stop threshold (judge score)
  "do_synthesis": true,     // run final merge
  "judge": true,            // self-judge for early-stop + ranking
  "seed0": 42,              // seeds increment per run
  "return_trace": false     // include candidates & thoughts if true
}
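
The same request in Python (a minimal client sketch, assuming the requests library; payload values are just examples):

import requests

payload = {
    "prompt": "Outline a weekend hiking plan in Snowdonia with safety tips.",
    "num_gens": 6,
    "think": True,
    "max_tokens": 400,
    "pass_score": 9.0,
    "do_synthesis": True,
    "judge": True,
    "return_trace": True,
}

resp = requests.post("http://localhost:9099/pro-mode", json=payload, timeout=600)
resp.raise_for_status()
result = resp.json()

print(result["final_answer"])
# With return_trace=true, per-candidate details are also available:
for cand in result.get("trace", {}).get("candidates", []):
    print(cand["seed"], cand["score"])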

What it does

  • Fan-out serially: generate one candidate at a time (streaming disabled for simplicity).

  • Judge + early-stop: after each candidate, a small rubric scores 1..10. If >= pass_score, stop early.

  • Synthesis: take the top K (default 5) answers (not thoughts) and call a final editor-style merge:

    • System: “You are an expert editor… Do not mention the candidates or the synthesis process.”
    • We force /no_think in synthesis so the final answer is concise.
    • A tiny sanitizer trims any leftover meta preface.
  • Trace (optional): when return_trace=true you get, per candidate (see the parsing sketch after this list):

    • seed, thoughts (from <think>…</think>), answer (cleaned), score,
    • plus synthesis used_indices, raw (full text), and thoughts if present.
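
A small parsing sketch shows where thoughts and answer come from (illustrative only; the actual cleanup and sanitizer live in promode/server.py):

import re

def split_thoughts(raw: str) -> tuple[str | None, str]:
    """Split a raw completion into (thoughts, answer).

    Thoughts come from the <think>...</think> block when /think is enabled;
    the answer is whatever remains once that block is stripped out.
    """
    m = re.search(r"<think>(.*?)</think>", raw, flags=re.S)
    thoughts = m.group(1).strip() if m else None
    answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.S).strip()
    return thoughts, answer

thoughts, answer = split_thoughts("<think>plan the limerick...</think>GPUs hum in the night...")
print(thoughts)  # "plan the limerick..."
print(answer)    # "GPUs hum in the night..."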

Response shape (abridged)

{
  "final_answer": "...",
  "used_candidates": 3,
  "scores": [9.1, 8.6, 9.0],
  "candidates": ["raw text with <think>...</think>", "..."],   // only if return_trace=true
  "trace": {
    "candidates": [{ "seed": 42, "thoughts": "...", "answer": "...", "score": 9.1 }, ...],
    "synthesis": { "used_indices": [0,2], "thoughts": null, "raw": "..." }
  }
}

Tuning guide (RTX 3090, 24 GB)

Start conservative; then scale up:

  • --dtype float16
  • --max-model-len 8192 → 16384 → 32768 (only raise if your prompts need it)
  • --max-num-seqs 1 → 2
  • --gpu-memory-utilization 0.85–0.92
  • --mamba-ssm-cache-dtype float32 (recommended by the model card)
  • Optional: --cpu-offload-gb 6–10 to gain KV headroom (latency trade-off)
  • Bring chunked prefill back later by removing --no-enable-chunked-prefill once stable

Why KV cache fails (common error): Weights (~16.6 GiB) + activations can leave ~0 GiB for KV at high contexts. Reduce --max-model-len or add --cpu-offload-gb.
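
A rough back-of-envelope illustrates the failure mode. Only the 24 GB card size, the ~16.6 GiB weight figure, and the utilization range come from this README; the activation estimate is an assumption:

# Illustrative VRAM arithmetic for a 24 GiB card (activation figure is an assumption)
total_vram  = 24.0   # GiB, RTX 3090
utilization = 0.90   # --gpu-memory-utilization
weights     = 16.6   # GiB, float16 weights (figure quoted above)
activations = 3.0    # GiB, assumed working space; varies with batch and context

budget      = total_vram * utilization        # ~21.6 GiB handed to vLLM
kv_headroom = budget - weights - activations  # ~2.0 GiB left for the KV/SSM cache

print(f"Cache headroom: {kv_headroom:.1f} GiB")
# Long contexts need more than this: lower --max-model-len or add --cpu-offload-gb.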


Monitoring & troubleshooting

  • Watch GPU (inside WSL or container):

    watch -n 0.5 nvidia-smi
    # or
    nvidia-smi --query-gpu=memory.total,memory.used,utilization.gpu --format=csv -l 1
  • Server ready?

    curl http://localhost:9090/health
    curl http://localhost:9090/v1/models
  • Typical fixes

    • “No available memory for the cache blocks”: lower --max-model-len, raise --gpu-memory-utilization, or add --cpu-offload-gb.
    • Connection reset during startup: the model is still loading; wait for “serving” lines in docker logs -f vllm.
    • Slow first run: that’s weight download + kernel compile. Volumes cache both for next time.

Design choices (why this)

  • Serial best-of-N gives you an anytime algorithm with early-stop → often fewer total tokens vs parallel fan-out.
  • Judge is cheap but useful: it gates synthesis and selects top-K.
  • Editor-style synthesis yields a crisp final answer (no “I merged X & Y…” narration), while trace preserves all thoughts.

Customize

  • Switch synthesis style to allow an explanation (set a flag to use /think and a “resolve & explain” prompt).
  • Add mini-batch parallelism for candidates with asyncio.gather if you want lower latency (see the sketch after this list).
  • Tighten the synthesis temperature to 0.1 for maximum determinism.
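
For the parallelism bullet above, a minimal sketch of a concurrent mini-batch, assuming httpx for async HTTP (batch size and seeds are arbitrary):

import asyncio
import httpx

VLLM_URL = "http://localhost:9090/v1/chat/completions"
MODEL = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"

async def one_candidate(client: httpx.AsyncClient, prompt: str, seed: int) -> str:
    resp = await client.post(VLLM_URL, json={
        "model": MODEL,
        "messages": [{"role": "system", "content": "/think"},
                     {"role": "user", "content": prompt}],
        "max_tokens": 400,
        "temperature": 0.8,
        "seed": seed,
    }, timeout=300)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

async def mini_batch(prompt: str, seeds: list[int]) -> list[str]:
    # Generate a small batch of candidates concurrently instead of serially.
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*(one_candidate(client, prompt, s) for s in seeds))

if __name__ == "__main__":
    answers = asyncio.run(mini_batch("Write a limerick about GPUs.", [42, 43, 44]))
    print(len(answers), "candidates")

Note that concurrent candidates trade away per-candidate early-stop, and --max-num-seqs must be raised above 1 for vLLM to actually batch them.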

Credits

  • Model: NVIDIA/Nemotron-Nano-9B-v2
  • Inference: vLLM (OpenAI-compatible server)
  • Wrapper concept inspired by Matt Shumer’s GPT Pro-Mode (multi-sample + synth).

License

This repo’s code is under your chosen license (add one). Respect the licenses/usage terms of the underlying model (NVIDIA) and tools (vLLM, FastAPI, etc.).


Next steps

  • Try the example_request.md curl and compare the “clean final answer” vs. “trace with thoughts.”
  • Plug the /pro-mode endpoint into your agent’s planner/coder loop (use pass_score and num_gens as quality/cost knobs).
