Commit 3eb08ed

[DOC] Add Kubernetes deployment guide with CPUs (vllm-project#14865)
1 parent 5eeadc2 commit 3eb08ed

File tree

* docs/source/conf.py
* docs/source/deployment/k8s.md

2 files changed: 103 additions & 3 deletions

docs/source/conf.py

Lines changed: 1 addition & 0 deletions
@@ -85,6 +85,7 @@
html_js_files = ["custom.js"]
html_css_files = ["custom.css"]

+myst_heading_anchors = 2
myst_url_schemes = {
    'http': None,
    'https': None,

The new `myst_heading_anchors = 2` setting has MyST auto-generate anchor slugs for headings up to level 2, which the `#deployment-with-cpus` and `#deployment-with-gpus` links added below rely on.

docs/source/deployment/k8s.md

Lines changed: 102 additions & 3 deletions
@@ -4,6 +4,9 @@

Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using native Kubernetes.

* [Deployment with CPUs](#deployment-with-cpus)
* [Deployment with GPUs](#deployment-with-gpus)

Alternatively, you can deploy vLLM to Kubernetes using any of the following:
* [Helm](frameworks/helm.md)
* [InftyAI/llmaz](integrations/llmaz.md)
@@ -14,11 +17,107 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following:

* [vllm-project/aibrix](https://github.com/vllm-project/aibrix)
* [vllm-project/production-stack](integrations/production-stack.md)

## Deployment with CPUs

:::{note}
The use of CPUs here is for demonstration and testing purposes only; performance will not be on par with GPUs.
:::

First, create a Kubernetes PVC and Secret for downloading and storing the Hugging Face model:

```bash
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
data:
  # Secret "data" values must be base64-encoded; the token is read from
  # the HF_TOKEN variable in the local shell.
  token: $(echo -n "$HF_TOKEN" | base64)
EOF
```
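As a quick sanity check (a minimal sketch, assuming the resource names from the manifest above and that `HF_TOKEN` is set in your shell), confirm that both objects exist and that the stored token decodes back correctly:

```bash
# Both resources should be listed; the PVC may stay Pending until a pod mounts it.
kubectl get pvc vllm-models
kubectl get secret hf-token-secret

# Decode the stored token and compare it with the value of $HF_TOKEN.
kubectl get secret hf-token-secret -o jsonpath='{.data.token}' | base64 -d
```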

Next, start the vLLM server as a Kubernetes Deployment and Service:

```bash
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: vllm
  template:
    metadata:
      labels:
        app.kubernetes.io/name: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve meta-llama/Llama-3.2-1B-Instruct"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: llama-storage
          mountPath: /root/.cache/huggingface
      volumes:
      - name: llama-storage
        persistentVolumeClaim:
          claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app.kubernetes.io/name: vllm
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
EOF
```
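Before checking the logs, you can optionally block until the rollout finishes; a minimal sketch, assuming the Deployment name and label from the manifest above:

```bash
# Wait until the Deployment's pods report ready (or the timeout expires).
kubectl rollout status deployment/vllm-server --timeout=10m

# Inspect the pods behind the Service.
kubectl get pods -l app.kubernetes.io/name=vllm
```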

We can verify that the vLLM server has started successfully by checking its logs (downloading the model may take a couple of minutes):

```console
kubectl logs -l app.kubernetes.io/name=vllm
...
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```
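Once the server is up, one way to smoke-test it from your workstation is to port-forward the Service and call vLLM's OpenAI-compatible API; a minimal sketch, assuming the Service name, port, and model from the manifests above:

```bash
# Forward local port 8000 to the vllm-server Service in the background.
kubectl port-forward service/vllm-server 8000:8000 &

# Ask the served model for a short completion.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "prompt": "Kubernetes is",
    "max_tokens": 16
  }'
```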

## Deployment with GPUs

**Pre-requisite**: Ensure that you have a running [Kubernetes cluster with GPUs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/).

1. Create a PVC, Secret and Deployment for vLLM
