Commit 848e041

Using EvalScope evaluation (#611)

### What this PR does / why we need it?

Use EvalScope to run an evaluation (accuracy eval and stress test):

- https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-usage
- https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tested locally.

---------

Signed-off-by: RongRongStudio <82669040+RongRongStudio@users.noreply.github.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>

1 parent 4a0ce36 commit 848e041

File tree

2 files changed (+175, -1)

docs/source/developer_guide/evaluation/index.md

Lines changed: 2 additions & 1 deletion

```diff
@@ -5,4 +5,5 @@
 :maxdepth: 1
 using_opencompass
 using_lm_eval
-:::
+using_evalscope
+:::
```
docs/source/developer_guide/evaluation/using_evalscope.md

Lines changed: 173 additions & 0 deletions
# Using EvalScope

This document guides you through model inference stress testing and accuracy testing using [EvalScope](https://github.com/modelscope/evalscope).
## 1. Online serving

You can run a docker container to start the vLLM server on a single NPU:
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```
If your service starts successfully, you will see info like the following:

```
INFO:     Started server process [6873]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
```
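
From a second terminal, you can wait for the API to become reachable before sending requests; a minimal readiness check, assuming the `-p 8000:8000` mapping above:

```bash
# Poll the OpenAI-compatible /v1/models endpoint until the server answers
until curl -sf http://localhost:8000/v1/models > /dev/null; do
    echo "Waiting for the vLLM server..."
    sleep 5
done
echo "Server is ready."
```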
Once your server is started, you can query the model with input prompts in a new terminal:

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 7,
        "temperature": 0
    }'
```
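
The stress test in section 4 below targets the chat completions endpoint, which you can exercise the same way; a minimal sketch reusing the model served above:

```bash
# Same server, but via the message-based /v1/chat/completions endpoint
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "The future of AI is"}],
        "max_tokens": 7,
        "temperature": 0
    }'
```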
## 2. Install EvalScope using pip

You can install EvalScope with pip in a dedicated virtual environment:

```bash
python3 -m venv .venv-evalscope
source .venv-evalscope/bin/activate
pip install gradio plotly evalscope
```
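
Afterwards you can confirm the CLI is available in the virtual environment:

```bash
# Lists the available subcommands (eval, perf, ...) if the install succeeded
evalscope --help
```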
## 3. Run gsm8k accuracy test using EvalScope

You can run the gsm8k accuracy test with `evalscope eval`:

```bash
evalscope eval \
    --model Qwen/Qwen2.5-7B-Instruct \
    --api-url http://localhost:8000/v1 \
    --api-key EMPTY \
    --eval-type service \
    --datasets gsm8k \
    --limit 10
```
After 1-2 minutes, the output is shown below:

```shell
+---------------------+-----------+-----------------+----------+-------+---------+---------+
| Model               | Dataset   | Metric          | Subset   |   Num |   Score | Cat.0   |
+=====================+===========+=================+==========+=======+=========+=========+
| Qwen2.5-7B-Instruct | gsm8k     | AverageAccuracy | main     |    10 |     0.8 | default |
+---------------------+-----------+-----------------+----------+-------+---------+---------+
```
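
Note that `--limit 10` evaluates only the first 10 gsm8k samples, so the score above is a smoke test rather than a benchmark. For a meaningful number, raise or drop the limit; a sketch of a full run (assuming that omitting `--limit` evaluates the whole dataset, per the EvalScope docs):

```bash
# Same service evaluation over the full gsm8k test split; takes much longer
evalscope eval \
    --model Qwen/Qwen2.5-7B-Instruct \
    --api-url http://localhost:8000/v1 \
    --api-key EMPTY \
    --eval-type service \
    --datasets gsm8k
```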
For more details, see [EvalScope doc - Model API Service Evaluation](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation).
## 4. Run model inference stress testing using EvalScope

### Install EvalScope[perf] using pip

```shell
# Quote the extras specifier so shells like zsh don't expand the brackets as a glob
pip install "evalscope[perf]" -U
```
### Basic usage

You can run a perf test with `evalscope perf`:

```bash
evalscope perf \
    --url "http://localhost:8000/v1/chat/completions" \
    --parallel 5 \
    --model Qwen/Qwen2.5-7B-Instruct \
    --number 20 \
    --api openai \
    --dataset openqa \
    --stream
```
### Output results

After 1-2 minutes, the output is shown below:

```shell
Benchmarking summary:
+-----------------------------------+---------------------------------------------------------------+
| Key                               | Value                                                         |
+===================================+===============================================================+
| Time taken for tests (s)          | 38.3744                                                       |
+-----------------------------------+---------------------------------------------------------------+
| Number of concurrency             | 5                                                             |
+-----------------------------------+---------------------------------------------------------------+
| Total requests                    | 20                                                            |
+-----------------------------------+---------------------------------------------------------------+
| Succeed requests                  | 20                                                            |
+-----------------------------------+---------------------------------------------------------------+
| Failed requests                   | 0                                                             |
+-----------------------------------+---------------------------------------------------------------+
| Output token throughput (tok/s)   | 132.6926                                                      |
+-----------------------------------+---------------------------------------------------------------+
| Total token throughput (tok/s)    | 158.8819                                                      |
+-----------------------------------+---------------------------------------------------------------+
| Request throughput (req/s)        | 0.5212                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average latency (s)               | 8.3612                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average time to first token (s)   | 0.1035                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average time per output token (s) | 0.0329                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average input tokens per request  | 50.25                                                         |
+-----------------------------------+---------------------------------------------------------------+
| Average output tokens per request | 254.6                                                         |
+-----------------------------------+---------------------------------------------------------------+
| Average package latency (s)       | 0.0324                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average package per request       | 254.6                                                         |
+-----------------------------------+---------------------------------------------------------------+
| Expected number of requests       | 20                                                            |
+-----------------------------------+---------------------------------------------------------------+
| Result DB path                    | outputs/20250423_002442/Qwen2.5-7B-Instruct/benchmark_data.db |
+-----------------------------------+---------------------------------------------------------------+

Percentile results:
+------------+----------+---------+-------------+--------------+---------------+----------------------+
| Percentile | TTFT (s) | ITL (s) | Latency (s) | Input tokens | Output tokens | Throughput(tokens/s) |
+------------+----------+---------+-------------+--------------+---------------+----------------------+
| 10%        | 0.0962   | 0.031   | 4.4571      | 42           | 135           | 29.9767              |
| 25%        | 0.0971   | 0.0318  | 6.3509      | 47           | 193           | 30.2157              |
| 50%        | 0.0987   | 0.0321  | 9.3387      | 49           | 285           | 30.3969              |
| 66%        | 0.1017   | 0.0324  | 9.8519      | 52           | 302           | 30.5182              |
| 75%        | 0.107    | 0.0328  | 10.2391     | 55           | 313           | 30.6124              |
| 80%        | 0.1221   | 0.0329  | 10.8257     | 58           | 330           | 30.6759              |
| 90%        | 0.1245   | 0.0333  | 13.0472     | 62           | 404           | 30.9644              |
| 95%        | 0.1247   | 0.0336  | 14.2936     | 66           | 432           | 31.6691              |
| 98%        | 0.1247   | 0.0353  | 14.2936     | 66           | 432           | 31.6691              |
| 99%        | 0.1247   | 0.0627  | 14.2936     | 66           | 432           | 31.6691              |
+------------+----------+---------+-------------+--------------+---------------+----------------------+
```
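
The per-request measurements behind these tables are persisted at the `Result DB path` from the summary. Assuming it is a standard SQLite file (as the `.db` extension suggests) and `sqlite3` is installed, you can inspect it directly:

```bash
# List the tables captured during the benchmark run (path from the summary above)
sqlite3 outputs/20250423_002442/Qwen2.5-7B-Instruct/benchmark_data.db ".tables"
```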
For more details, see [EvalScope doc - Model Inference Stress Testing](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-usage).
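
To see how latency and throughput trade off under load, you can repeat the run at several concurrency levels; a minimal sweep reusing only the flags shown above (scaling `--number` with `--parallel` keeps each worker equally loaded):

```bash
# Re-run the perf test at increasing concurrency; each run prints its own summary
for p in 1 5 10 20; do
    evalscope perf \
        --url "http://localhost:8000/v1/chat/completions" \
        --parallel $p \
        --model Qwen/Qwen2.5-7B-Instruct \
        --number $((4 * p)) \
        --api openai \
        --dataset openqa \
        --stream
done
```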
