[Benchmark] Refactor perf script to use benchmark cli (#1524)
### What this PR does / why we need it?
Since the `vllm bench` CLI is now mature enough for production use (it
supports more datasets), we no longer need to copy vLLM code. With
vLLM installed, we can use the benchmark CLI directly.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
This document outlines the benchmarking methodology for vllm-ascend, aimed at evaluating the performance under a variety of workloads. The primary goal is to help developers assess whether their pull requests improve or degrade vllm-ascend's performance. To maintain alignment with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) script provided by the vllm project.
+
This document outlines the benchmarking methodology for vllm-ascend, aimed at evaluating the performance under a variety of workloads. The primary goal is to help developers assess whether their pull requests improve or degrade vllm-ascend's performance.
# Overview
**Benchmarking Coverage**: We measure latency, throughput, and fixed-QPS serving on the Atlas800I A2 (see [quick_start](../docs/source/quick_start.md) for the list of supported devices), with different models (coming soon).
- Input length: randomly sample 200 prompts from the ShareGPT dataset (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
- **Average QPS (queries per second)**: 1, 4, 16 and inf. QPS = inf means all requests arrive at once. For other QPS values, the arrival time of each query is determined by a random Poisson process (with fixed random seed).
- Evaluation metrics: throughput, TTFT (time to first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
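The arrival pattern and the latency summaries described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: `poisson_arrival_times` and `summarize` are hypothetical helper names, the seed value is arbitrary, and the p99 here uses a simple nearest-rank rule, which may differ slightly from the percentile method the benchmark CLI applies.

```python
import math
import random
import statistics


def poisson_arrival_times(num_requests, qps, seed=0):
    """Return arrival offsets (seconds) for num_requests requests.

    qps = math.inf means every request arrives at t = 0.
    """
    if qps == math.inf:
        return [0.0] * num_requests
    rng = random.Random(seed)  # fixed seed keeps the schedule reproducible
    t = 0.0
    arrivals = []
    for _ in range(num_requests):
        # Gaps between events of a Poisson process with rate `qps`
        # are exponentially distributed with mean 1 / qps.
        t += rng.expovariate(qps)
        arrivals.append(t)
    return arrivals


def summarize(samples):
    """Mean, median and nearest-rank p99 of a non-empty list of latencies."""
    ordered = sorted(samples)
    p99_index = math.ceil(99 * len(ordered) / 100) - 1
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p99": ordered[p99_index],
    }


if __name__ == "__main__":
    for qps in (1, 4, 16, math.inf):
        arrivals = poisson_arrival_times(200, qps, seed=0)
        print(f"QPS={qps}: last request arrives at {arrivals[-1]:.1f}s")
```

With a finite QPS the 200 requests spread over roughly `200 / qps` seconds on average, which is why lower QPS levels dominate the total benchmark duration.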
**Benchmarking Duration**: about 800 seconds for a single model.
@@ -38,20 +38,129 @@ Before running the benchmarks, ensure the following:
```shell
pip install -r benchmarks/requirements-bench.txt
```
-
- For performance benchmarks, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) to `dummy`. It constructs random weights based on the passed model without downloading the weights from the internet, which can greatly reduce the benchmark time. Feel free to add your own models and parameters in the JSON to run your customized benchmarks.
+
- For performance benchmarks, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) to `dummy`. It constructs random weights based on the passed model without downloading the weights from the internet, which can greatly reduce the benchmark time.
+
- If you want to run a customized benchmark, feel free to add your own models and parameters in the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests) files. Take `Qwen2.5-VL-7B-Instruct` as an example:
This JSON is parsed by the benchmark script into server parameters and client parameters. The configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. The test includes both server and client parameters; for more details on the parameters, see the vllm benchmark [cli](https://github.com/vllm-project/vllm/tree/main/vllm/benchmarks).
- **Test Overview**
  - Test Name: serving_qwen2_5vl_7B_tp1
  - Queries Per Second (QPS): the test is run at four QPS levels: 1, 4, 16, and inf (all requests sent at once, typically used for stress testing).
  - Server Parameters
    - Model: Qwen/Qwen2.5-VL-7B-Instruct
    - Tensor Parallelism: 1 (no tensor parallelism; the model runs on a single device)
    - Swap Space: 16 GiB (CPU swap space per device, used to hold data swapped out of device memory)
    - disable_log_stats: disables logging of performance statistics.
    - disable_log_requests: disables logging of individual requests.
    - Max Model Length: 16,384 tokens (maximum context length supported by the model)
  - Client Parameters
    - Model: Qwen/Qwen2.5-VL-7B-Instruct (same as the server)
    - Backend: openai-chat (the client uses the OpenAI-compatible chat API format)
    - Dataset Source: Hugging Face (hf)
    - Dataset Split: train
    - Endpoint: /v1/chat/completions (the REST API endpoint to which chat requests are sent)
    - Dataset Path: lmarena-ai/vision-arena-bench-v0.1 (the benchmark dataset used for evaluation, hosted on Hugging Face)
    - Number of Prompts: 200 (the total number of prompts used during the test)
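As a rough sketch of how a test case like the one described above might be split into server and client command-line arguments: the key names below (`test_name`, `qps_list`, `server_parameters`, `client_parameters` and the individual parameter keys) are assumptions inferred from the parameter list, and `to_cli_args` and `client_args_per_qps` are hypothetical helpers, not the benchmark script's actual code. The mapping of an empty string to a bare flag and the use of `--request-rate` for the QPS level are also assumptions made for illustration.

```python
import json

# Hypothetical test case. Key names are assumptions inferred from the
# parameter list; check the JSON files under benchmarks/tests for the
# real schema.
TEST_CASE = json.loads("""
{
  "test_name": "serving_qwen2_5vl_7B_tp1",
  "qps_list": [1, 4, 16, "inf"],
  "server_parameters": {
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "tensor_parallel_size": 1,
    "swap_space": 16,
    "disable_log_stats": "",
    "disable_log_requests": "",
    "max_model_len": 16384
  },
  "client_parameters": {
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "backend": "openai-chat",
    "dataset_name": "hf",
    "hf_split": "train",
    "endpoint": "/v1/chat/completions",
    "dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
    "num_prompts": 200
  }
}
""")


def to_cli_args(params):
    """Turn {"max_model_len": 16384} into ["--max-model-len", "16384"].

    An empty-string value is treated as a bare boolean flag (an assumption).
    """
    args = []
    for key, value in params.items():
        args.append("--" + key.replace("_", "-"))
        if value != "":
            args.append(str(value))
    return args


def client_args_per_qps(test_case):
    """Return one client argument list per QPS level, keyed by the level."""
    base = to_cli_args(test_case["client_parameters"])
    return {
        str(qps): base + ["--request-rate", str(qps)]
        for qps in test_case["qps_list"]
    }


if __name__ == "__main__":
    print("server:", " ".join(to_cli_args(TEST_CASE["server_parameters"])))
    for qps, args in client_args_per_qps(TEST_CASE).items():
        print(f"client (QPS={qps}):", " ".join(args))
```

Keeping server and client parameters in separate objects means one server launch can be reused across all QPS levels, with only the client's request rate changing between runs.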
## Run benchmarks
### Use benchmark script
The provided scripts automatically execute performance tests for serving, throughput, and latency. To start the benchmarking process, run the following command in the vllm-ascend root directory: