Commit 138709f

[Doc] Update CPU doc (#20676)

Authored by bigPYJ1151, gemini-code-assist[bot], and hmellor

Signed-off-by: jiang1.li <jiang1.li@intel.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

1 parent 0bbac1c · commit 138709f

File tree: 5 files changed, +100 −85 lines

docs/getting_started/installation/cpu.md

Lines changed: 35 additions & 70 deletions
@@ -76,78 +76,56 @@ Currently, there are no pre-built CPU wheels.
 
 ### Build image from source
 
-??? console "Commands"
+=== "Intel/AMD x86"
 
-    ```bash
-    docker build -f docker/Dockerfile.cpu \
-        --tag vllm-cpu-env \
-        --target vllm-openai .
-
-    # Launching OpenAI server
-    docker run --rm \
-        --privileged=true \
-        --shm-size=4g \
-        -p 8000:8000 \
-        -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
-        -e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
-        vllm-cpu-env \
-        --model=meta-llama/Llama-3.2-1B-Instruct \
-        --dtype=bfloat16 \
-        other vLLM OpenAI server arguments
-    ```
+    --8<-- "docs/getting_started/installation/cpu/x86.inc.md:build-image-from-source"
 
-!!! tip
-    For ARM or Apple silicon, use `docker/Dockerfile.arm`
+=== "ARM AArch64"
 
-!!! tip
-    For IBM Z (s390x), use `docker/Dockerfile.s390x` and in `docker run` use flag `--dtype float`
+    --8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-image-from-source"
 
-## Supported features
+=== "Apple silicon"
 
-vLLM CPU backend supports the following vLLM features:
+    --8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-image-from-source"
 
-- Tensor Parallel
-- Model Quantization (`INT8 W8A8, AWQ, GPTQ`)
-- Chunked-prefill
-- Prefix-caching
-- FP8-E5M2 KV cache
+=== "IBM Z (S390X)"
+    --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:build-image-from-source"
 
 ## Related runtime environment variables
 
 - `VLLM_CPU_KVCACHE_SPACE`: specify the KV Cache size (e.g, `VLLM_CPU_KVCACHE_SPACE=40` means 40 GiB space for KV cache), larger setting will allow vLLM running more requests in parallel. This parameter should be set based on the hardware configuration and memory management pattern of users. Default value is `0`.
 - `VLLM_CPU_OMP_THREADS_BIND`: specify the CPU cores dedicated to the OpenMP threads. For example, `VLLM_CPU_OMP_THREADS_BIND=0-31` means there will be 32 OpenMP threads bound on 0-31 CPU cores. `VLLM_CPU_OMP_THREADS_BIND=0-31|32-63` means there will be 2 tensor parallel processes, 32 OpenMP threads of rank0 are bound on 0-31 CPU cores, and the OpenMP threads of rank1 are bound on 32-63 CPU cores. By setting to `auto`, the OpenMP threads of each rank are bound to the CPU cores in each NUMA node. By setting to `all`, the OpenMP threads of each rank uses all CPU cores available on the system. Default value is `auto`.
 - `VLLM_CPU_NUM_OF_RESERVED_CPU`: specify the number of CPU cores which are not dedicated to the OpenMP threads for each rank. The variable only takes effect when VLLM_CPU_OMP_THREADS_BIND is set to `auto`. Default value is `0`.
-- `VLLM_CPU_MOE_PREPACK`: whether to use prepack for MoE layer. This will be passed to `ipex.llm.modules.GatedMLPMOE`. Default is `1` (True). On unsupported CPUs, you might need to set this to `0` (False).
-- `VLLM_CPU_SGL_KERNEL` (Experimental): whether to use small-batch optimized kernels for linear layer and MoE layer, especially for low-latency requirements like online serving. The kernels require AMX instruction set, BFloat16 weight type and weight shapes divisible by 32. Default is `0` (False).
+- `VLLM_CPU_MOE_PREPACK` (x86 only): whether to use prepack for MoE layer. This will be passed to `ipex.llm.modules.GatedMLPMOE`. Default is `1` (True). On unsupported CPUs, you might need to set this to `0` (False).
+- `VLLM_CPU_SGL_KERNEL` (x86 only, Experimental): whether to use small-batch optimized kernels for linear layer and MoE layer, especially for low-latency requirements like online serving. The kernels require AMX instruction set, BFloat16 weight type and weight shapes divisible by 32. Default is `0` (False).
 
-## Performance tips
+## FAQ
 
-- We highly recommend to use TCMalloc for high performance memory allocation and better cache locality. For example, on Ubuntu 22.4, you can run:
+### Which `dtype` should be used?
 
-    ```bash
-    sudo apt-get install libtcmalloc-minimal4 # install TCMalloc library
-    find / -name *libtcmalloc* # find the dynamic link library path
-    export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD # prepend the library to LD_PRELOAD
-    python examples/offline_inference/basic/basic.py # run vLLM
-    ```
+- Currently vLLM CPU uses model default settings as `dtype`. However, due to unstable float16 support in torch CPU, it is recommended to explicitly set `dtype=bfloat16` if there are any performance or accuracy problem.
 
-- When using the online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserving CPU 30 and 31 for the framework and using CPU 0-29 for OpenMP:
+### How to launch a vLLM service on CPU?
+
+- When using the online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserving CPU 31 for the framework and using CPU 0-30 for inference threads:
 
 ```bash
 export VLLM_CPU_KVCACHE_SPACE=40
-export VLLM_CPU_OMP_THREADS_BIND=0-29
-vllm serve facebook/opt-125m
+export VLLM_CPU_OMP_THREADS_BIND=0-30
+vllm serve facebook/opt-125m --dtype=bfloat16
 ```
 
 or using default auto thread binding:
 
 ```bash
 export VLLM_CPU_KVCACHE_SPACE=40
-export VLLM_CPU_NUM_OF_RESERVED_CPU=2
-vllm serve facebook/opt-125m
+export VLLM_CPU_NUM_OF_RESERVED_CPU=1
+vllm serve facebook/opt-125m --dtype=bfloat16
 ```
 
-- If using vLLM CPU backend on a machine with hyper-threading, it is recommended to bind only one OpenMP thread on each physical CPU core using `VLLM_CPU_OMP_THREADS_BIND` or using auto thread binding feature by default. On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:
+### How to decide `VLLM_CPU_OMP_THREADS_BIND`?
+
+- Bind each OpenMP thread to a dedicated physical CPU core respectively, or use auto thread binding feature by default. On a hyper-threading enabled platform with 16 logical CPU cores / 8 physical CPU cores:
 
 ??? console "Commands"
 
@@ -178,34 +156,21 @@ vllm serve facebook/opt-125m
 $ python examples/offline_inference/basic/basic.py
 ```
 
-- If using vLLM CPU backend on a multi-socket machine with NUMA, be aware to set CPU cores using `VLLM_CPU_OMP_THREADS_BIND` to avoid cross NUMA node memory access.
-
-## Other considerations
-
-- The CPU backend significantly differs from the GPU backend since the vLLM architecture was originally optimized for GPU use. A number of optimizations are needed to enhance its performance.
+- When deploy vLLM CPU backend on a multi-socket machine with NUMA and enable tensor parallel or pipeline parallel, each NUMA node is treated as a TP/PP rank. So be aware to set CPU cores of a single rank on a same NUMA node to avoid cross NUMA node memory access.
 
-- Decouple the HTTP serving components from the inference components. In a GPU backend configuration, the HTTP serving and tokenization tasks operate on the CPU, while inference runs on the GPU, which typically does not pose a problem. However, in a CPU-based setup, the HTTP serving and tokenization can cause significant context switching and reduced cache efficiency. Therefore, it is strongly recommended to segregate these two components for improved performance.
+### How to decide `VLLM_CPU_KVCACHE_SPACE`?
 
-- On CPU based setup with NUMA enabled, the memory access performance may be largely impacted by the [topology](https://github.com/intel/intel-extension-for-pytorch/blob/main/docs/tutorials/performance_tuning/tuning_guide.md#non-uniform-memory-access-numa). For NUMA architecture, Tensor Parallel is a option for better performance.
+- This value is 4GB by default. Larger space can support more concurrent requests, longer context length. However, users should take care of memory capacity of each NUMA node. The memory usage of each TP rank is the sum of `weight shard size` and `VLLM_CPU_KVCACHE_SPACE`, if it exceeds the capacity of a single NUMA node, the TP worker will be killed with `exitcode 9` due to out-of-memory.
 
-- Tensor Parallel is supported for serving and offline inferencing. In general each NUMA node is treated as one GPU card. Below is the example script to enable Tensor Parallel = 2 for serving:
+### Which quantization configs does vLLM CPU support?
 
-    ```bash
-    VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" \
-    vllm serve meta-llama/Llama-2-7b-chat-hf \
-        -tp=2 \
-        --distributed-executor-backend mp
-    ```
-
-    or using default auto thread binding:
-
-    ```bash
-    VLLM_CPU_KVCACHE_SPACE=40 \
-    vllm serve meta-llama/Llama-2-7b-chat-hf \
-        -tp=2 \
-        --distributed-executor-backend mp
-    ```
+- vLLM CPU supports quantizations:
+    - AWQ (x86 only)
+    - GPTQ (x86 only)
+    - compressed-tensor INT8 W8A8 (x86, s390x)
 
-- For each thread id list in `VLLM_CPU_OMP_THREADS_BIND`, users should guarantee threads in the list belong to a same NUMA node.
+### (x86 only) What is the purpose of `VLLM_CPU_MOE_PREPACK` and `VLLM_CPU_SGL_KERNEL`?
 
-- Meanwhile, users should also take care of memory capacity of each NUMA node. The memory usage of each TP rank is the sum of `weight shard size` and `VLLM_CPU_KVCACHE_SPACE`, if it exceeds the capacity of a single NUMA node, TP worker will be killed due to out-of-memory.
+- Both of them requires `amx` CPU flag.
+    - `VLLM_CPU_MOE_PREPACK` can provides better performance for MoE models
+    - `VLLM_CPU_SGL_KERNEL` can provides better performance for MoE models and small-batch scenarios.
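
Taken together, the FAQ entries in this file amount to a launch recipe like the following. This is a minimal sketch, not part of the commit: the dual-socket layout (cores 0-31 on NUMA node 0, 32-63 on node 1), the model name, and the 40 GiB KV cache size are illustrative assumptions; only flags already documented above are used.

```bash
# Check the AMX flag required by VLLM_CPU_MOE_PREPACK / VLLM_CPU_SGL_KERNEL.
lscpu | grep -q 'amx' && export VLLM_CPU_SGL_KERNEL=1

# Give each TP rank its own NUMA node: rank0 -> cores 0-31, rank1 -> cores 32-63.
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_OMP_THREADS_BIND="0-31|32-63"

# Two tensor-parallel ranks, one per NUMA node.
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --dtype=bfloat16 \
    -tp=2 \
    --distributed-executor-backend mp
```

With the default `auto` binding, the pipe-separated list can be dropped and `VLLM_CPU_NUM_OF_RESERVED_CPU` used instead to keep a core free for the serving frontend.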

docs/getting_started/installation/cpu/arm.inc.md

Lines changed: 16 additions & 1 deletion
@@ -32,7 +32,22 @@ Testing has been conducted on AWS Graviton3 instances for compatibility.
 
 # --8<-- [end:pre-built-images]
 # --8<-- [start:build-image-from-source]
-
+```bash
+docker build -f docker/Dockerfile.arm \
+    --tag vllm-cpu-env .
+
+# Launching OpenAI server
+docker run --rm \
+    --privileged=true \
+    --shm-size=4g \
+    -p 8000:8000 \
+    -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
+    -e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
+    vllm-cpu-env \
+    --model=meta-llama/Llama-3.2-1B-Instruct \
+    --dtype=bfloat16 \
+    other vLLM OpenAI server arguments
+```
 # --8<-- [end:build-image-from-source]
 # --8<-- [start:extra-information]
 # --8<-- [end:extra-information]
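
Once a container started with the `docker run` command above is serving, it can be smoke-tested with a standard OpenAI-style completion request. This is a sketch, not part of the commit; it assumes the server is reachable on `localhost:8000` and that the model name matches the `--model` argument.

```bash
# Send a small completion request to the OpenAI-compatible endpoint
# exposed on port 8000 by the container above.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Llama-3.2-1B-Instruct",
          "prompt": "San Francisco is a",
          "max_tokens": 16,
          "temperature": 0
        }'
```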

docs/getting_started/installation/cpu/build.inc.md

Lines changed: 5 additions & 2 deletions
@@ -2,7 +2,7 @@ First, install recommended compiler. We recommend to use `gcc/g++ >= 12.3.0` as
 
 ```bash
 sudo apt-get update -y
-sudo apt-get install -y gcc-12 g++-12 libnuma-dev python3-dev
+sudo apt-get install -y --no-install-recommends ccache git curl wget ca-certificates gcc-12 g++-12 libtcmalloc-minimal4 libnuma-dev ffmpeg libsm6 libxext6 libgl1 jq lsof
 sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
 ```
 
@@ -17,7 +17,7 @@ Third, install Python packages for vLLM CPU backend building:
 
 ```bash
 pip install --upgrade pip
-pip install "cmake>=3.26.1" wheel packaging ninja "setuptools-scm>=8" numpy
+pip install -v -r requirements/cpu-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
 pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
 ```
 
@@ -33,4 +33,7 @@ If you want to develop vllm, install it in editable mode instead.
 VLLM_TARGET_DEVICE=cpu python setup.py develop
 ```
 
+!!! note
+    If you are building vLLM from source and not using the pre-built images, remember to set `LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD"` on x86 machines before running vLLM.
+
 # --8<-- [end:extra-information]
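
A minimal sketch of how the note above is typically applied on an x86 host running a source build; the library path is the usual Ubuntu x86-64 location (the same one used in the note) and may differ on other distributions.

```bash
# Install TCMalloc and locate the shared library if the path below differs.
sudo apt-get install -y libtcmalloc-minimal4
find / -name "*libtcmalloc*" 2>/dev/null

# Preload TCMalloc, then run vLLM as usual.
export LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD"
python examples/offline_inference/basic/basic.py
```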

docs/getting_started/installation/cpu/s390x.inc.md

Lines changed: 17 additions & 0 deletions
@@ -61,6 +61,23 @@ Execute the following commands to build and install vLLM from the source.
 # --8<-- [end:pre-built-images]
 # --8<-- [start:build-image-from-source]
 
+```bash
+docker build -f docker/Dockerfile.s390x \
+    --tag vllm-cpu-env .
+
+# Launching OpenAI server
+docker run --rm \
+    --privileged=true \
+    --shm-size=4g \
+    -p 8000:8000 \
+    -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
+    -e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
+    vllm-cpu-env \
+    --model=meta-llama/Llama-3.2-1B-Instruct \
+    --dtype=float \
+    other vLLM OpenAI server arguments
+```
+
 # --8<-- [end:build-image-from-source]
 # --8<-- [start:extra-information]
 # --8<-- [end:extra-information]
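
As with the other images, a quick way to confirm that the container above has finished loading the model is to list the served models. This is a sketch, assuming the server is reachable on `localhost:8000`.

```bash
# The response should list meta-llama/Llama-3.2-1B-Instruct once the server is ready.
curl http://localhost:8000/v1/models
```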
docs/getting_started/installation/cpu/x86.inc.md

Lines changed: 27 additions & 12 deletions
@@ -1,19 +1,15 @@
 # --8<-- [start:installation]
 
-vLLM initially supports basic model inferencing and serving on x86 CPU platform, with data types FP32, FP16 and BF16.
-
-!!! warning
-    There are no pre-built wheels or images for this device, so you must build vLLM from source.
+vLLM supports basic model inferencing and serving on x86 CPU platform, with data types FP32, FP16 and BF16.
 
 # --8<-- [end:installation]
 # --8<-- [start:requirements]
 
 - OS: Linux
-- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended)
-- Instruction Set Architecture (ISA): AVX512 (optional, recommended)
+- CPU flags: `avx512f`, `avx512_bf16` (Optional), `avx512_vnni` (Optional)
 
 !!! tip
-    [Intel Extension for PyTorch (IPEX)](https://github.com/intel/intel-extension-for-pytorch) extends PyTorch with up-to-date features optimizations for an extra performance boost on Intel hardware.
+    Use `lscpu` to check the CPU flags.
 
 # --8<-- [end:requirements]
 # --8<-- [start:set-up-using-python]
@@ -26,18 +22,37 @@ vLLM initially supports basic model inferencing and serving on x86 CPU platform,
 
 --8<-- "docs/getting_started/installation/cpu/build.inc.md"
 
-!!! note
-    - AVX512_BF16 is an extension ISA provides native BF16 data type conversion and vector product instructions, which brings some performance improvement compared with pure AVX512. The CPU backend build script will check the host CPU flags to determine whether to enable AVX512_BF16.
-    - If you want to force enable AVX512_BF16 for the cross-compilation, please set environment variable `VLLM_CPU_AVX512BF16=1` before the building.
-
 # --8<-- [end:build-wheel-from-source]
 # --8<-- [start:pre-built-images]
 
-See [https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo](https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo)
+[https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo](https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo)
+
+!!! warning
+    If deploying the pre-built images on machines only contain `avx512f`, `Illegal instruction` error may be raised. It is recommended to build images for these machines with `--build-arg VLLM_CPU_AVX512BF16=false` and `--build-arg VLLM_CPU_AVX512VNNI=false`.
 
 # --8<-- [end:pre-built-images]
 # --8<-- [start:build-image-from-source]
 
+```bash
+docker build -f docker/Dockerfile.cpu \
+    --build-arg VLLM_CPU_AVX512BF16=false (default)|true \
+    --build-arg VLLM_CPU_AVX512VNNI=false (default)|true \
+    --tag vllm-cpu-env \
+    --target vllm-openai .
+
+# Launching OpenAI server
+docker run --rm \
+    --privileged=true \
+    --shm-size=4g \
+    -p 8000:8000 \
+    -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
+    -e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
+    vllm-cpu-env \
+    --model=meta-llama/Llama-3.2-1B-Instruct \
+    --dtype=bfloat16 \
+    other vLLM OpenAI server arguments
+```
+
 # --8<-- [end:build-image-from-source]
 # --8<-- [start:extra-information]
 # --8<-- [end:extra-information]
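
A minimal sketch of how the `lscpu` tip and the build arguments above fit together when targeting an older machine that only reports `avx512f`; the grep pattern is illustrative, and the build args mirror the warning in the pre-built-images section.

```bash
# Inspect the AVX-512 flags listed in the requirements section.
lscpu | grep -o 'avx512[a-z_0-9]*' | sort -u

# On machines that only report avx512f, disable the optional extensions at
# build time to avoid the "Illegal instruction" failure mentioned above.
docker build -f docker/Dockerfile.cpu \
    --build-arg VLLM_CPU_AVX512BF16=false \
    --build-arg VLLM_CPU_AVX512VNNI=false \
    --tag vllm-cpu-env \
    --target vllm-openai .
```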
