@@ -30,13 +30,21 @@ Refer to <gh-file:examples/offline_inference/simple_profiling.py> for an example
#### OpenAI Server

```bash
- VLLM_TORCH_PROFILER_DIR=./vllm_profile python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B
+ VLLM_TORCH_PROFILER_DIR=./vllm_profile \
+     python -m vllm.entrypoints.openai.api_server \
+     --model meta-llama/Meta-Llama-3-70B
```

benchmark_serving.py:

```bash
- python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3-70B --dataset-name sharegpt --dataset-path sharegpt.json --profile --num-prompts 2
+ python benchmarks/benchmark_serving.py \
+     --backend vllm \
+     --model meta-llama/Meta-Llama-3-70B \
+     --dataset-name sharegpt \
+     --dataset-path sharegpt.json \
+     --profile \
+     --num-prompts 2
```

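The `--profile` flag above asks the running server to start and stop the torch profiler around the benchmark run. If you want to control the capture window by hand instead, a minimal sketch is shown below; it assumes the server was started with `VLLM_TORCH_PROFILER_DIR` set, listens on the default `localhost:8000`, and exposes `/start_profile` / `/stop_profile` routes (the endpoint names are an assumption, not taken from this diff).

```bash
# Hedged sketch: drive the torch profiler manually instead of via --profile.
# Assumes the server exposes /start_profile and /stop_profile and listens on
# localhost:8000; both are assumptions, not shown in this diff.
curl -X POST http://localhost:8000/start_profile

# ... send the requests you want captured ...

curl -X POST http://localhost:8000/stop_profile
# Traces are written under ./vllm_profile (the VLLM_TORCH_PROFILER_DIR above).
```
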
## Profile with NVIDIA Nsight Systems
@@ -64,7 +72,16 @@ For basic usage, you can just append `nsys profile -o report.nsys-rep --trace-fo
The following is an example using the `benchmarks/benchmark_latency.py` script:

```bash
- nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node python benchmarks/benchmark_latency.py --model meta-llama/Llama-3.1-8B-Instruct --num-iters-warmup 5 --num-iters 1 --batch-size 16 --input-len 512 --output-len 8
+ nsys profile -o report.nsys-rep \
+     --trace-fork-before-exec=true \
+     --cuda-graph-trace=node \
+     python benchmarks/benchmark_latency.py \
+     --model meta-llama/Llama-3.1-8B-Instruct \
+     --num-iters-warmup 5 \
+     --num-iters 1 \
+     --batch-size 16 \
+     --input-len 512 \
+     --output-len 8
```

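Once the run finishes, the capture in `report.nsys-rep` can be inspected without leaving the terminal. A minimal sketch, assuming `nsys` (and optionally the `nsys-ui` GUI) is installed:

```bash
# Hedged sketch: quick look at the capture produced above.
# `nsys stats` prints summary tables (e.g. CUDA kernel and memcpy summaries).
nsys stats report.nsys-rep

# Open the full timeline in the Nsight Systems GUI, if it is installed.
nsys-ui report.nsys-rep
```
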
#### OpenAI Server
@@ -73,10 +90,21 @@ To profile the server, you will want to prepend your `vllm serve` command with `

```bash
# server
- nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node --delay 30 --duration 60 vllm serve meta-llama/Llama-3.1-8B-Instruct
+ nsys profile -o report.nsys-rep \
+     --trace-fork-before-exec=true \
+     --cuda-graph-trace=node \
+     --delay 30 \
+     --duration 60 \
+     vllm serve meta-llama/Llama-3.1-8B-Instruct

# client
- python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 1 --dataset-name random --random-input 1024 --random-output 512
+ python benchmarks/benchmark_serving.py \
+     --backend vllm \
+     --model meta-llama/Llama-3.1-8B-Instruct \
+     --num-prompts 1 \
+     --dataset-name random \
+     --random-input 1024 \
+     --random-output 512
```

In practice, you should set the `--duration` argument to a large value. Whenever you want the server to stop profiling, run:
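One way to do this from a second shell is with `nsys`'s session-management subcommands; a hedged sketch follows (the session name is machine-specific, so list the active sessions first):

```bash
# Hedged sketch: stop a long-running capture from another shell.
# Find the active session name first; it differs per machine and launch.
nsys sessions list
nsys stop --session=<session-name-from-the-list>
```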