Commit d596339

[doc] improve readability for long commands (vllm-project#19920)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
1 parent a137d4e commit d596339

File tree

3 files changed: +48 -9 lines

- docs/contributing/profiling.md
- docs/getting_started/installation/cpu.md
- docs/usage/troubleshooting.md


docs/contributing/profiling.md

Lines changed: 33 additions & 5 deletions
````diff
@@ -30,13 +30,21 @@ Refer to <gh-file:examples/offline_inference/simple_profiling.py> for an example
 #### OpenAI Server
 
 ```bash
-VLLM_TORCH_PROFILER_DIR=./vllm_profile python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B
+VLLM_TORCH_PROFILER_DIR=./vllm_profile \
+python -m vllm.entrypoints.openai.api_server \
+--model meta-llama/Meta-Llama-3-70B
 ```
 
 benchmark_serving.py:
 
 ```bash
-python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3-70B --dataset-name sharegpt --dataset-path sharegpt.json --profile --num-prompts 2
+python benchmarks/benchmark_serving.py \
+--backend vllm \
+--model meta-llama/Meta-Llama-3-70B \
+--dataset-name sharegpt \
+--dataset-path sharegpt.json \
+--profile \
+--num-prompts 2
 ```
 
 ## Profile with NVIDIA Nsight Systems
````
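The commands reformatted in the hunk above drive vLLM's torch profiler via `VLLM_TORCH_PROFILER_DIR`. As a rough illustration of the workflow (not part of this commit; it assumes the server listens on the default port 8000 and that your vLLM build exposes `/start_profile` and `/stop_profile` endpoints when the profiler directory is set):

```bash
# Launch the server with the profiler directory set (command from the diff above).
VLLM_TORCH_PROFILER_DIR=./vllm_profile \
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-70B &
# (wait for the model to finish loading before profiling)

# Assumed endpoints: start a trace, exercise the server, then stop the trace.
curl -X POST http://localhost:8000/start_profile
# ...send a few completion requests here...
curl -X POST http://localhost:8000/stop_profile

# Traces are written into the directory given by VLLM_TORCH_PROFILER_DIR.
ls ./vllm_profile
```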
````diff
@@ -64,7 +72,16 @@ For basic usage, you can just append `nsys profile -o report.nsys-rep --trace-fo
 The following is an example using the `benchmarks/benchmark_latency.py` script:
 
 ```bash
-nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node python benchmarks/benchmark_latency.py --model meta-llama/Llama-3.1-8B-Instruct --num-iters-warmup 5 --num-iters 1 --batch-size 16 --input-len 512 --output-len 8
+nsys profile -o report.nsys-rep \
+--trace-fork-before-exec=true \
+--cuda-graph-trace=node \
+python benchmarks/benchmark_latency.py \
+--model meta-llama/Llama-3.1-8B-Instruct \
+--num-iters-warmup 5 \
+--num-iters 1 \
+--batch-size 16 \
+--input-len 512 \
+--output-len 8
 ```
 
 #### OpenAI Server
````
````diff
@@ -73,10 +90,21 @@ To profile the server, you will want to prepend your `vllm serve` command with `
 
 ```bash
 # server
-nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node --delay 30 --duration 60 vllm serve meta-llama/Llama-3.1-8B-Instruct
+nsys profile -o report.nsys-rep \
+--trace-fork-before-exec=true \
+--cuda-graph-trace=node \
+--delay 30 \
+--duration 60 \
+vllm serve meta-llama/Llama-3.1-8B-Instruct
 
 # client
-python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 1 --dataset-name random --random-input 1024 --random-output 512
+python benchmarks/benchmark_serving.py \
+--backend vllm \
+--model meta-llama/Llama-3.1-8B-Instruct \
+--num-prompts 1 \
+--dataset-name random \
+--random-input 1024 \
+--random-output 512
 ```
 
 In practice, you should set the `--duration` argument to a large value. Whenever you want the server to stop profiling, run:
````
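Once a run like the server/client pair above finishes, the resulting `report.nsys-rep` can be inspected in the Nsight Systems GUI or summarized from the command line. A minimal sketch (the `nsys stats` subcommand ships with recent Nsight Systems releases; check your installed version):

```bash
# Print summary tables (CUDA kernel/API time, memory operations, etc.) from the report.
nsys stats report.nsys-rep
```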

docs/getting_started/installation/cpu.md

Lines changed: 11 additions & 3 deletions
````diff
@@ -79,7 +79,9 @@ Currently, there are no pre-built CPU wheels.
 ??? Commands
 
 ```console
-$ docker build -f docker/Dockerfile.cpu --tag vllm-cpu-env --target vllm-openai .
+$ docker build -f docker/Dockerfile.cpu \
+--tag vllm-cpu-env \
+--target vllm-openai .
 
 # Launching OpenAI server
 $ docker run --rm \
````
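As a quick sanity check after the build above (an illustration, not part of the commit), you can confirm the tagged image exists before launching the server:

```bash
# The name matches the --tag value used in the docker build command above.
docker images vllm-cpu-env
```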
````diff
@@ -188,13 +190,19 @@ vllm serve facebook/opt-125m
 - Tensor Parallel is supported for serving and offline inferencing. In general each NUMA node is treated as one GPU card. Below is the example script to enable Tensor Parallel = 2 for serving:
 
 ```console
-VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
+VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" \
+vllm serve meta-llama/Llama-2-7b-chat-hf \
+-tp=2 \
+--distributed-executor-backend mp
 ```
 
 or using default auto thread binding:
 
 ```console
-VLLM_CPU_KVCACHE_SPACE=40 vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
+VLLM_CPU_KVCACHE_SPACE=40 \
+vllm serve meta-llama/Llama-2-7b-chat-hf \
+-tp=2 \
+--distributed-executor-backend mp
 ```
 
 - For each thread id list in `VLLM_CPU_OMP_THREADS_BIND`, users should guarantee threads in the list belong to a same NUMA node.
````
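To pick thread-id ranges for `VLLM_CPU_OMP_THREADS_BIND` that stay within a single NUMA node, it helps to look at the host topology first. A small sketch using standard Linux tools (assumes the `numactl` package is installed; the "0-31|32-63" split above corresponds to a two-node machine with 32 CPUs per node):

```bash
# Show how many NUMA nodes exist and which CPU ids belong to each one.
lscpu | grep -i numa

# More detailed per-node CPU and memory layout.
numactl --hardware
```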

docs/usage/troubleshooting.md

Lines changed: 4 additions & 1 deletion
````diff
@@ -134,7 +134,10 @@ NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py
 If you are testing with multi-nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address of the master node, reachable from all nodes. Then, run:
 
 ```console
-NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py
+NCCL_DEBUG=TRACE torchrun --nnodes 2 \
+--nproc-per-node=2 \
+--rdzv_backend=c10d \
+--rdzv_endpoint=$MASTER_ADDR test.py
 ```
 
 If the script runs successfully, you should see the message `sanity check is successful!`.
````
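For a concrete two-node run (an illustrative sketch; substitute your own master-node address), the same command is executed on every node, with `MASTER_ADDR` pointing at the master node's IP that is reachable from all of them:

```bash
# Run this on each of the two nodes; 192.0.2.10 is a placeholder for the master node's IP.
export MASTER_ADDR=192.0.2.10

NCCL_DEBUG=TRACE torchrun --nnodes 2 \
--nproc-per-node=2 \
--rdzv_backend=c10d \
--rdzv_endpoint=$MASTER_ADDR test.py
```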
