# Using EvalScope

This document will guide you through model inference stress testing and accuracy testing using [EvalScope](https://github.com/modelscope/evalscope).

## 1. Online serving

You can run a docker container to start the vLLM server on a single NPU:

```{code-block} bash
   :substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```
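
While the model weights are loading, you can optionally confirm that the NPU is visible from inside the running container. This is a minimal sketch that assumes the container name `vllm-ascend` from the command above:

```bash
# Run from a second terminal on the host.
# Assumes the container started above is still named "vllm-ascend".
docker exec vllm-ascend npu-smi info
```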

If your service starts successfully, you will see the info shown below:

```
INFO: Started server process [6873]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
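
If you script this step, you can poll vLLM's `/health` endpoint instead of watching the log. A minimal sketch:

```bash
# Block until the OpenAI-compatible server reports healthy.
until curl -sf http://localhost:8000/health; do
  echo "Waiting for the vLLM server..."
  sleep 5
done
echo "Server is ready."
```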

Once your server is started, you can query the model with input prompts in a new terminal:

```
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "prompt": "The future of AI is",
    "max_tokens": 7,
    "temperature": 0
  }'
```
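
Because the stress test in section 4 targets the `/v1/chat/completions` endpoint, you may also want to verify it directly. A minimal sketch using the same model (the prompt is just an example):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 64,
    "temperature": 0
  }'
```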

## 2. Install EvalScope using pip

You can install EvalScope in a virtual environment using pip:

```bash
python3 -m venv .venv-evalscope
source .venv-evalscope/bin/activate
pip install gradio plotly evalscope
```
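
You can verify the installation before running any tests:

```bash
# Confirm the package is installed and the CLI entry point works.
pip show evalscope
evalscope --help
```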

## 3. Run gsm8k accuracy test using EvalScope

You can use `evalscope eval` to run the gsm8k accuracy test:

```
evalscope eval \
  --model Qwen/Qwen2.5-7B-Instruct \
  --api-url http://localhost:8000/v1 \
  --api-key EMPTY \
  --eval-type service \
  --datasets gsm8k \
  --limit 10
```
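
Note that `--limit 10` evaluates only the first 10 samples, which is useful as a quick smoke test but not a reliable accuracy number. For a fuller run you can raise the limit, or drop the flag entirely to evaluate the whole test set; a sketch (runtime depends on your hardware):

```bash
# Evaluate 200 gsm8k samples; omit --limit to run the full test set.
evalscope eval \
  --model Qwen/Qwen2.5-7B-Instruct \
  --api-url http://localhost:8000/v1 \
  --api-key EMPTY \
  --eval-type service \
  --datasets gsm8k \
  --limit 200
```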

After 1-2 minutes, the output is as shown below:

```shell
+---------------------+-----------+-----------------+----------+-------+---------+---------+
| Model               | Dataset   | Metric          | Subset   |   Num |   Score | Cat.0   |
+=====================+===========+=================+==========+=======+=========+=========+
| Qwen2.5-7B-Instruct | gsm8k     | AverageAccuracy | main     |    10 |     0.8 | default |
+---------------------+-----------+-----------------+----------+-------+---------+---------+
```

See more details in: [EvalScope doc - Model API Service Evaluation](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation).

## 4. Run model inference stress testing using EvalScope

### Install EvalScope[perf] using pip

```shell
pip install "evalscope[perf]" -U
```

### Basic usage

You can use `evalscope perf` to run a perf test:

```
evalscope perf \
  --url "http://localhost:8000/v1/chat/completions" \
  --parallel 5 \
  --model Qwen/Qwen2.5-7B-Instruct \
  --number 20 \
  --api openai \
  --dataset openqa \
  --stream
```
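
To see how throughput and latency scale with load, a common pattern is to repeat the run while sweeping `--parallel`. This loop is a sketch, not a built-in EvalScope feature:

```bash
# Sweep concurrency levels; scale --number so each level
# issues enough requests for stable percentile statistics.
for p in 1 5 10 20; do
  evalscope perf \
    --url "http://localhost:8000/v1/chat/completions" \
    --parallel "$p" \
    --model Qwen/Qwen2.5-7B-Instruct \
    --number $((p * 20)) \
    --api openai \
    --dataset openqa \
    --stream
done
```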

### Output results

After 1-2 minutes, the output is as shown below:

```shell
Benchmarking summary:
+-----------------------------------+---------------------------------------------------------------+
| Key                               | Value                                                         |
+===================================+===============================================================+
| Time taken for tests (s)          | 38.3744                                                       |
+-----------------------------------+---------------------------------------------------------------+
| Number of concurrency             | 5                                                             |
+-----------------------------------+---------------------------------------------------------------+
| Total requests                    | 20                                                            |
+-----------------------------------+---------------------------------------------------------------+
| Succeed requests                  | 20                                                            |
+-----------------------------------+---------------------------------------------------------------+
| Failed requests                   | 0                                                             |
+-----------------------------------+---------------------------------------------------------------+
| Output token throughput (tok/s)   | 132.6926                                                      |
+-----------------------------------+---------------------------------------------------------------+
| Total token throughput (tok/s)    | 158.8819                                                      |
+-----------------------------------+---------------------------------------------------------------+
| Request throughput (req/s)        | 0.5212                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average latency (s)               | 8.3612                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average time to first token (s)   | 0.1035                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average time per output token (s) | 0.0329                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average input tokens per request  | 50.25                                                         |
+-----------------------------------+---------------------------------------------------------------+
| Average output tokens per request | 254.6                                                         |
+-----------------------------------+---------------------------------------------------------------+
| Average package latency (s)       | 0.0324                                                        |
+-----------------------------------+---------------------------------------------------------------+
| Average package per request       | 254.6                                                         |
+-----------------------------------+---------------------------------------------------------------+
| Expected number of requests       | 20                                                            |
+-----------------------------------+---------------------------------------------------------------+
| Result DB path                    | outputs/20250423_002442/Qwen2.5-7B-Instruct/benchmark_data.db |
+-----------------------------------+---------------------------------------------------------------+

Percentile results:
+------------+----------+---------+-------------+--------------+---------------+----------------------+
| Percentile | TTFT (s) | ITL (s) | Latency (s) | Input tokens | Output tokens | Throughput(tokens/s) |
+------------+----------+---------+-------------+--------------+---------------+----------------------+
| 10%        | 0.0962   | 0.031   | 4.4571      | 42           | 135           | 29.9767              |
| 25%        | 0.0971   | 0.0318  | 6.3509      | 47           | 193           | 30.2157              |
| 50%        | 0.0987   | 0.0321  | 9.3387      | 49           | 285           | 30.3969              |
| 66%        | 0.1017   | 0.0324  | 9.8519      | 52           | 302           | 30.5182              |
| 75%        | 0.107    | 0.0328  | 10.2391     | 55           | 313           | 30.6124              |
| 80%        | 0.1221   | 0.0329  | 10.8257     | 58           | 330           | 30.6759              |
| 90%        | 0.1245   | 0.0333  | 13.0472     | 62           | 404           | 30.9644              |
| 95%        | 0.1247   | 0.0336  | 14.2936     | 66           | 432           | 31.6691              |
| 98%        | 0.1247   | 0.0353  | 14.2936     | 66           | 432           | 31.6691              |
| 99%        | 0.1247   | 0.0627  | 14.2936     | 66           | 432           | 31.6691              |
+------------+----------+---------+-------------+--------------+---------------+----------------------+
```
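
The per-request samples behind these summaries are stored in the SQLite database shown under `Result DB path`. If you want the raw data, you can open it with the `sqlite3` CLI (assuming it is installed); the schema is EvalScope-internal, so list the tables before writing queries against it:

```bash
# Inspect the benchmark result database from the run above.
sqlite3 outputs/20250423_002442/Qwen2.5-7B-Instruct/benchmark_data.db '.tables'
```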

See more details in: [EvalScope doc - Model Inference Stress Testing](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-usage).