Commit 990780b

fix markdown
Signed-off-by: wangli <wangli858794774@gmail.com>
1 parent 58d10b8 commit 990780b

28 files changed: +125 -86 lines changed

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 0 additions & 1 deletion
@@ -25,4 +25,3 @@ CI passed with new added/existing test.
 If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
 If tests were not added, please describe why they were not added and/or why it was difficult to add.
 -->
-

CODE_OF_CONDUCT.md

Lines changed: 0 additions & 1 deletion
@@ -125,4 +125,3 @@ Community Impact Guidelines were inspired by
 For answers to common questions about this code of conduct, see the
 [Contributor Covenant FAQ](https://www.contributor-covenant.org/faq). Translations are available at
 [Contributor Covenant translations](https://www.contributor-covenant.org/translations).
-

README.md

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -36,10 +36,10 @@ By using vLLM Ascend plugin, popular open-source models, including Transformer-l
3636
- Hardware: Atlas 800I A2 Inference series, Atlas A2 Training series
3737
- OS: Linux
3838
- Software:
39-
* Python >= 3.9, < 3.12
40-
* CANN >= 8.1.RC1
41-
* PyTorch >= 2.5.1, torch-npu >= 2.5.1.post1.dev20250619
42-
* vLLM (the same version as vllm-ascend)
39+
- Python >= 3.9, < 3.12
40+
- CANN >= 8.1.RC1
41+
- PyTorch >= 2.5.1, torch-npu >= 2.5.1.post1.dev20250619
42+
- vLLM (the same version as vllm-ascend)
4343

4444
## Getting Started
4545

README.zh.md

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -37,10 +37,10 @@ vLLM 昇腾插件 (`vllm-ascend`) 是一个由社区维护的让vLLM在Ascend NP
3737
- 硬件:Atlas 800I A2 Inference系列、Atlas A2 Training系列
3838
- 操作系统:Linux
3939
- 软件:
40-
* Python >= 3.9, < 3.12
41-
* CANN >= 8.1.RC1
42-
* PyTorch >= 2.5.1, torch-npu >= 2.5.1.post1.dev20250619
43-
* vLLM (与vllm-ascend版本一致)
40+
- Python >= 3.9, < 3.12
41+
- CANN >= 8.1.RC1
42+
- PyTorch >= 2.5.1, torch-npu >= 2.5.1.post1.dev20250619
43+
- vLLM (与vllm-ascend版本一致)
4444

4545
## 开始使用
4646

benchmarks/README.md

Lines changed: 53 additions & 44 deletions
@@ -4,41 +4,41 @@ This document outlines the benchmarking methodology for vllm-ascend, aimed at ev
 # Overview
 **Benchmarking Coverage**: We measure latency, throughput, and fixed-QPS serving on the Atlas800I A2 (see [quick_start](../docs/source/quick_start.md) to learn more supported devices list), with different models(coming soon).
 - Latency tests
-- Input length: 32 tokens.
-- Output length: 128 tokens.
-- Batch size: fixed (8).
-- Models: Qwen2.5-7B-Instruct, Qwen3-8B.
-- Evaluation metrics: end-to-end latency (mean, median, p99).
+- Input length: 32 tokens.
+- Output length: 128 tokens.
+- Batch size: fixed (8).
+- Models: Qwen2.5-7B-Instruct, Qwen3-8B.
+- Evaluation metrics: end-to-end latency (mean, median, p99).

 - Throughput tests
-- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
-- Output length: the corresponding output length of these 200 prompts.
-- Batch size: dynamically determined by vllm to achieve maximum throughput.
-- Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
-- Evaluation metrics: throughput.
+- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
+- Output length: the corresponding output length of these 200 prompts.
+- Batch size: dynamically determined by vllm to achieve maximum throughput.
+- Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
+- Evaluation metrics: throughput.
 - Serving tests
-- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
-- Output length: the corresponding output length of these 200 prompts.
-- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
-- **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
-- Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
-- Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
+- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
+- Output length: the corresponding output length of these 200 prompts.
+- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
+- **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
+- Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
+- Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).

 **Benchmarking Duration**: about 800 senond for single model.

-
 # Quick Use
 ## Prerequisites
 Before running the benchmarks, ensure the following:

 - vllm and vllm-ascend are installed and properly set up in an NPU environment, as these scripts are specifically designed for NPU devices.

 - Install necessary dependencies for benchmarks:
-```
+
+```
 pip install -r benchmarks/requirements-bench.txt
 ```
-
-- For performance benchmark, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) as `dummy`, It will construct random weights based on the passed model without downloading the weights from internet, which can greatly reduce the benchmark time.
+
+- For performance benchmark, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) as `dummy`, It will construct random weights based on the passed model without downloading the weights from internet, which can greatly reduce the benchmark time.
 - If you want to run benchmark customized, feel free to add your own models and parameters in the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests), let's take `Qwen2.5-VL-7B-Instruct`as an example:

 ```shell
@@ -72,54 +72,56 @@ Before running the benchmarks, ensure the following:
 }
 ]
 ```
-this Json will be structured and parsed into server parameters and client parameters by the benchmark script. This configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. The test includes both server and client parameters, for more parameters details, see vllm benchmark [cli](https://github.com/vllm-project/vllm/tree/main/vllm/benchmarks).
-
-- **Test Overview**
-- Test Name: serving_qwen2_5vl_7B_tp1
+
+this Json will be structured and parsed into server parameters and client parameters by the benchmark script. This configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. The test includes both server and client parameters, for more parameters details, see vllm benchmark [cli](https://github.com/vllm-project/vllm/tree/main/vllm/benchmarks).

-- Queries Per Second (QPS): The test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).
+- **Test Overview**
+- Test Name: serving_qwen2_5vl_7B_tp1

-- Server Parameters
-- Model: Qwen/Qwen2.5-VL-7B-Instruct
+- Queries Per Second (QPS): The test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).

-- Tensor Parallelism: 1 (no model parallelism is used; the model runs on a single device or node)
+- Server Parameters
+- Model: Qwen/Qwen2.5-VL-7B-Instruct

-- Swap Space: 16 GB (used to handle memory overflow by swapping to disk)
+- Tensor Parallelism: 1 (no model parallelism is used; the model runs on a single device or node)

-- disable_log_stats: disables logging of performance statistics.
+- Swap Space: 16 GB (used to handle memory overflow by swapping to disk)

-- disable_log_requests: disables logging of individual requests.
+- disable_log_stats: disables logging of performance statistics.

-- Trust Remote Code: enabled (allows execution of model-specific custom code)
+- disable_log_requests: disables logging of individual requests.

-- Max Model Length: 16,384 tokens (maximum context length supported by the model)
+- Trust Remote Code: enabled (allows execution of model-specific custom code)

-- Client Parameters
+- Max Model Length: 16,384 tokens (maximum context length supported by the model)

-- Model: Qwen/Qwen2.5-VL-7B-Instruct (same as the server)
+- Client Parameters

-- Backend: openai-chat (suggests the client uses the OpenAI-compatible chat API format)
+- Model: Qwen/Qwen2.5-VL-7B-Instruct (same as the server)

-- Dataset Source: Hugging Face (hf)
+- Backend: openai-chat (suggests the client uses the OpenAI-compatible chat API format)

-- Dataset Split: train
+- Dataset Source: Hugging Face (hf)

-- Endpoint: /v1/chat/completions (the REST API endpoint to which chat requests are sent)
+- Dataset Split: train

-- Dataset Path: lmarena-ai/vision-arena-bench-v0.1 (the benchmark dataset used for evaluation, hosted on Hugging Face)
-
-- Number of Prompts: 200 (the total number of prompts used during the test)
+- Endpoint: /v1/chat/completions (the REST API endpoint to which chat requests are sent)

+- Dataset Path: lmarena-ai/vision-arena-bench-v0.1 (the benchmark dataset used for evaluation, hosted on Hugging Face)

+- Number of Prompts: 200 (the total number of prompts used during the test)

 ## Run benchmarks

 ### Use benchmark script
 The provided scripts automatically execute performance tests for serving, throughput, and latency. To start the benchmarking process, run command in the vllm-ascend root directory:
+
 ```
 bash benchmarks/scripts/run-performance-benchmarks.sh
 ```
+
 Once the script completes, you can find the results in the benchmarks/results folder. The output files may resemble the following:
+
 ```
 .
 |-- serving_qwen2_5_7B_tp1_qps_1.json
@@ -129,6 +131,7 @@ Once the script completes, you can find the results in the benchmarks/results fo
 |-- latency_qwen2_5_7B_tp1.json
 |-- throughput_qwen2_5_7B_tp1.json
 ```
+
 These files contain detailed benchmarking results for further analysis.

 ### Use benchmark cli
@@ -137,11 +140,14 @@ For more flexible and customized use, benchmark cli is also provided to run onli
 Similarly, let’s take `Qwen2.5-VL-7B-Instruct` benchmark as an example:
 #### Online serving
 1. Launch the server:
+
 ```shell
 vllm serve Qwen2.5-VL-7B-Instruct --max-model-len 16789
 ```
+
 2. Running performance tests using cli
-```shell
+
+```shell
 vllm bench serve --model Qwen2.5-VL-7B-Instruct\
 --endpoint-type "openai-chat" --dataset-name hf \
 --hf-split train --endpoint "/v1/chat/completions" \
@@ -152,14 +158,17 @@ Similarly, let’s take `Qwen2.5-VL-7B-Instruct` benchmark as an example:

 #### Offline
 - **Throughput**
+
 ```shell
 vllm bench throughput --output-json results/throughput_qwen2_5_7B_tp1.json \
 --model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 --load-format dummy \
 --dataset-path /github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json \
 --num-prompts 200 --backend vllm
 ```
+
 - **Latency**
-```shell
+
+```shell
 vllm bench latency --output-json results/latency_qwen2_5_7B_tp1.json \
 --model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 \
 --load-format dummy --num-iters-warmup 5 --num-iters 15
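
For reference, the serving test case walked through in the benchmarks/README.md text above corresponds to a JSON entry shaped roughly like the sketch below. This is a reconstruction from the server and client parameters described in that text; the exact key names and flag values are assumptions and may differ from the actual files under benchmarks/tests.

```json
[
    {
        "test_name": "serving_qwen2_5vl_7B_tp1",
        "qps_list": [1, 4, 16, "inf"],
        "server_parameters": {
            "model": "Qwen/Qwen2.5-VL-7B-Instruct",
            "tensor_parallel_size": 1,
            "swap_space": 16,
            "disable_log_stats": "",
            "disable_log_requests": "",
            "trust_remote_code": "",
            "max_model_len": 16384
        },
        "client_parameters": {
            "model": "Qwen/Qwen2.5-VL-7B-Instruct",
            "backend": "openai-chat",
            "dataset_name": "hf",
            "hf_split": "train",
            "endpoint": "/v1/chat/completions",
            "dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
            "num_prompts": 200
        }
    }
]
```

As the README text notes, custom benchmarks are added by appending entries of this shape to the JSON files under benchmarks/tests.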

benchmarks/scripts/perf_result_template.md

Lines changed: 1 addition & 1 deletion
@@ -28,4 +28,4 @@
 - Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-VL-7B-Instruct
 - Evaluation metrics: throughput.

-{throughput_tests_markdown_table}
+{throughput_tests_markdown_table}

docs/README.md

Lines changed: 0 additions & 1 deletion
@@ -20,4 +20,3 @@ python -m http.server -d _build/html/
 ```

 Launch your browser and open http://localhost:8000/.
-

docs/source/community/governance.md

Lines changed: 5 additions & 5 deletions
@@ -1,7 +1,7 @@
 # Governance

 ## Mission
-As a vital component of vLLM, the vLLM Ascend project is dedicated to providing an easy, fast, and cheap LLM Serving for Everyone on Ascend NPU, and to actively contribute to the enrichment of vLLM.
+As a vital component of vLLM, the vLLM Ascend project is dedicated to providing an easy, fast, and cheap LLM Serving for Everyone on Ascend NPU, and to actively contribute to the enrichment of vLLM.

 ## Principles
 vLLM Ascend follows the vLLM community's code of conduct:[vLLM - CODE OF CONDUCT](https://github.com/vllm-project/vllm/blob/main/CODE_OF_CONDUCT.md)
@@ -13,7 +13,7 @@ vLLM Ascend is an open-source project under the vLLM community, where the author

 **Responsibility:** Help new contributors on boarding, handle and respond to community questions, review RFCs, code

-**Requirements:** Complete at least 1 contribution. Contributor is someone who consistently and actively participates in a project, included but not limited to issue/review/commits/community involvement.
+**Requirements:** Complete at least 1 contribution. Contributor is someone who consistently and actively participates in a project, included but not limited to issue/review/commits/community involvement.

 Contributors will be empowered [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) Github repo `Triage` permissions (`Can read and clone this repository. Can also manage issues and pull requests`) to help community developers collaborate more efficiently.

@@ -22,9 +22,9 @@ vLLM Ascend is an open-source project under the vLLM community, where the author
 **Responsibility:** Develop the project's vision and mission. Maintainers are responsible for driving the technical direction of the entire project and ensuring its overall success, possessing code merge permissions. They formulate the roadmap, review contributions from community members, continuously contribute code, and actively engage in community activities (such as regular meetings/events).

 **Requirements:** Deep understanding of vLLM and vLLM Ascend codebases, with a commitment to sustained code contributions. Competency in design/development/PR review workflows.
-- **Review Quality:** Actively participate in community code reviews, ensuring high-quality code integration.
-- **Quality Contribution:** Successfully develop and deliver at least one major feature while maintaining consistent high-quality contributions.
-- **Community Involvement:** Actively address issues, respond to forum inquiries, participate in discussions, and engage in community-driven tasks.
+- **Review Quality:** Actively participate in community code reviews, ensuring high-quality code integration.
+- **Quality Contribution:** Successfully develop and deliver at least one major feature while maintaining consistent high-quality contributions.
+- **Community Involvement:** Actively address issues, respond to forum inquiries, participate in discussions, and engage in community-driven tasks.

 Requires approval from existing Maintainers. The vLLM community has the final decision-making authority.

docs/source/community/user_stories/llamafactory.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@

 [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) is an easy-to-use and efficient platform for training and fine-tuning large language models. With LLaMA-Factory, you can fine-tune hundreds of pre-trained models locally without writing any code.

-LLaMA-Facotory users need to evaluate and inference the model after fine-tuning the model.
+LLaMA-Facotory users need to evaluate and inference the model after fine-tuning the model.

 **The Business Challenge**

docs/source/developer_guide/contribution/index.md

Lines changed: 1 addition & 1 deletion
@@ -13,6 +13,7 @@ But you can still set up dev env on Linux/Windows/macOS for linting and basic
 test as following commands:

 #### Run lint locally
+
 ```bash
 # Choose a base dir (~/vllm-project/) and set up venv
 cd ~/vllm-project/
@@ -103,7 +104,6 @@ If the PR spans more than one category, please include all relevant prefixes.
 You may find more information about contributing to vLLM Ascend backend plugin on [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing/overview.html).
 If you find any problem when contributing, you can feel free to submit a PR to improve the doc to help other developers.

-
 :::{toctree}
 :caption: Index
 :maxdepth: 1
