Spin up production-ready API endpoints for open-source Large Language Models in seconds on Runpod.
Every link in this repository opens the Runpod console with a fully configured template selected: just choose a GPU, press Deploy, and start prompting.
- Llama 3.1 8B Instruct - Deploy on Runpod
- Qwen3-30B-A3B-FP8 (SGLang) - Deploy on Runpod
- Qwen3-32B-FP8 (SGLang) - Deploy on Runpod
- Qwen3-235B-A22B-FP8 (SGLang) - Deploy on Runpod
- DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic - Deploy on Runpod
More models will be added soon.
| Resource | Minimum | Recommended |
|---|---|---|
| Pod Disk | 40 GB | 60 GB+ (larger models) |
| VRAM | 24 GB (8B) / 48 GB (70B) | See GPU table |
| Architecture | Ampere+ (A40/A100/H100) | Ada/Hopper for FP8 |
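If you want to double-check a running pod against these numbers, here is a quick sketch you can run from the pod's terminal (it assumes the usual Runpod volume mount at `/workspace`; adjust the path if your template differs):

```bash
# Show the GPU model and total VRAM available to the pod
nvidia-smi --query-gpu=name,memory.total --format=csv

# Show free space on the pod disk (Runpod volumes are commonly mounted at /workspace)
df -h /workspace
```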
We maintain custom Docker images for each inference engine alongside the Runpod templates.

- Images live in the `docker/` folder (`docker/vllm-base`, `docker/llamacpp`, etc.).
- They extend official base images (e.g. `vllm/vllm-openai`) with:
  - faster model-download tooling (`hf_transfer`; see the sketch after this section)
  - hardened startup scripts (health checks, graceful shutdown)
  - extra libraries for specialised models (audio, vision)
- Each sub-folder contains its own README with build instructions:

```bash
# example
cd docker/vllm-base
docker build -t myorg/vllm-base:latest .
```

Using these images keeps pods reproducible and lets us apply optimisations once, then reuse them across all templates.
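The download acceleration mentioned above can also be used by hand inside a pod. A minimal sketch, assuming the `hf_transfer` package is installed (it ships in our images) and that `huggingface-cli` from `huggingface_hub` is available:

```bash
# Enable the Rust-based accelerated downloader in huggingface_hub
export HF_HUB_ENABLE_HF_TRANSFER=1

# Pre-fetch model weights into the local Hugging Face cache
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct
```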
- Click a One-Click Link above.
- Log in or create a Runpod account.
- Select a GPU that meets the "Minimum" column in the table above.
- (Optional) Add `HUGGING_FACE_HUB_TOKEN` as an environment variable for gated models.
- Press Deploy Pod.
- Wait for the weights to download and the server to start (~1-2 min).
- Your endpoint will be: `https://<POD_ID>-8000.proxy.runpod.net`
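Since the weight download can take a couple of minutes, it can help to poll the endpoint until the server answers before sending real traffic. A minimal sketch, assuming the engine exposes the OpenAI-compatible `/v1/models` route (vLLM and SGLang both do):

```bash
ENDPOINT="https://<POD_ID>-8000.proxy.runpod.net"

# Poll until the OpenAI-compatible server starts answering
until curl -sf "$ENDPOINT/v1/models" > /dev/null; do
  echo "Waiting for the server to come up..."
  sleep 5
done
echo "Server is ready."
```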
```bash
ENDPOINT="https://<POD_ID>-8000.proxy.runpod.net"
curl -X POST "$ENDPOINT/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 100
  }'
```
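The pods speak the standard OpenAI chat-completions protocol, so token streaming works too. A sketch of the same request with streaming enabled (`"stream": true` is part of the OpenAI-compatible API that vLLM and SGLang implement; `-N` disables curl's output buffering so tokens appear as they arrive):

```bash
curl -N -X POST "$ENDPOINT/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 100,
    "stream": true
  }'
```

Note that the `model` field must match the model the pod is serving; `curl "$ENDPOINT/v1/models"` lists it.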
Built with ❤ to make self-hosting state-of-the-art models effortless.