**.github/PULL_REQUEST_TEMPLATE.md** (0 additions, 1 deletion)

@@ -25,4 +25,3 @@ CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.

**benchmarks/README.md** (53 additions, 44 deletions)

@@ -4,41 +4,41 @@ This document outlines the benchmarking methodology for vllm-ascend, aimed at ev
# Overview
**Benchmarking Coverage**: We measure latency, throughput, and fixed-QPS serving on the Atlas800I A2 (see [quick_start](../docs/source/quick_start.md) for the list of supported devices), with different models (coming soon).

- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).

- Output length: the corresponding output length of these 200 prompts.

- Batch size: dynamically determined by vllm and the arrival pattern of the requests.

- **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed); see the sketch after this list.

- Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
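
For illustration, the following minimal sketch (not the benchmark client's actual code) shows how Poisson-distributed arrival times with a fixed random seed can be generated for a given average QPS:

```python
import numpy as np

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Cumulative arrival times (seconds) for a Poisson process with average rate `qps`."""
    rng = np.random.default_rng(seed)  # fixed seed keeps the request schedule reproducible
    inter_arrival = rng.exponential(1.0 / qps, size=num_requests)  # exponential gaps <=> Poisson arrivals
    return np.cumsum(inter_arrival)

# Example: schedule 200 requests at an average of 4 QPS.
arrivals = poisson_arrival_times(200, qps=4.0)
```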
**Benchmarking Duration**: about 800 seconds for a single model.
# Quick Use
## Prerequisites
Before running the benchmarks, ensure the following:
- vllm and vllm-ascend are installed and properly set up in an NPU environment, as these scripts are specifically designed for NPU devices.
- Install necessary dependencies for benchmarks:
```
pip install -r benchmarks/requirements-bench.txt
```

- For performance benchmark, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) as `dummy`. This constructs random weights based on the passed model without downloading the weights from the internet, which can greatly reduce the benchmark time.
- If you want to run a customized benchmark, feel free to add your own models and parameters in the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests); let's take `Qwen2.5-VL-7B-Instruct` as an example:
```shell
@@ -72,54 +72,56 @@ Before running the benchmarks, ensure the following:
}
]
```

This JSON will be structured and parsed into server parameters and client parameters by the benchmark script. This configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. The test includes both server and client parameters; for more parameter details, see the vllm benchmark [cli](https://github.com/vllm-project/vllm/tree/main/vllm/benchmarks).
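
As a rough sketch of that parsing step, the snippet below turns one test-case entry into server and client flag lists. The key names and values here are illustrative assumptions based on the description above, not copied from the real files under `benchmarks/tests`, and this is not the actual benchmark script.

```python
# Illustrative only: key names are assumed, not taken from benchmarks/tests.
case = {
    "test_name": "serving_qwen2_5vl_7B_tp1",
    "qps_list": [1, 4, 16, "inf"],
    "server_parameters": {"model": "Qwen/Qwen2.5-VL-7B-Instruct", "tensor_parallel_size": 1, "swap_space": 16},
    "client_parameters": {"backend": "openai-chat", "endpoint": "/v1/chat/completions", "num_prompts": 200},
}

def params_to_flags(params: dict) -> list[str]:
    """Turn {"tensor_parallel_size": 1} into ["--tensor-parallel-size", "1"]."""
    flags: list[str] = []
    for key, value in params.items():
        flags.extend([f"--{key.replace('_', '-')}", str(value)])
    return flags

server_flags = params_to_flags(case["server_parameters"])   # flags for launching the server
client_flags = params_to_flags(case["client_parameters"])   # flags for the benchmark client
print(case["test_name"], server_flags, client_flags)
```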

**Test Overview**

- Test Name: serving_qwen2_5vl_7B_tp1

- Queries Per Second (QPS): The test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).

- Server Parameters

- Model: Qwen/Qwen2.5-VL-7B-Instruct

- Tensor Parallelism: 1 (no model parallelism is used; the model runs on a single device or node)

- Swap Space: 16 GB (used to handle memory overflow by swapping to disk)

- disable_log_stats: disables logging of performance statistics.

- disable_log_requests: disables logging of individual requests.

- Max Model Length: 16,384 tokens (maximum context length supported by the model)

- Client Parameters

- Model: Qwen/Qwen2.5-VL-7B-Instruct (same as the server)

- Backend: openai-chat (suggests the client uses the OpenAI-compatible chat API format)

- Dataset Source: Hugging Face (hf)

- Dataset Split: train

- Endpoint: /v1/chat/completions (the REST API endpoint to which chat requests are sent; see the request sketch after this list)

- Dataset Path: lmarena-ai/vision-arena-bench-v0.1 (the benchmark dataset used for evaluation, hosted on Hugging Face)

- Number of Prompts: 200 (the total number of prompts used during the test)
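
To make the client-side parameters concrete, here is a minimal request against the OpenAI-compatible chat endpoint. This is only an illustrative sketch: the base URL, prompt, and `max_tokens` value are placeholder assumptions, not values taken from the benchmark configuration.

```python
import requests

# Assumes a vllm server for Qwen/Qwen2.5-VL-7B-Instruct is already listening locally;
# adjust the base URL to match your deployment.
base_url = "http://localhost:8000"
payload = {
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [{"role": "user", "content": "Describe this image dataset in one sentence."}],
    "max_tokens": 64,
}
resp = requests.post(f"{base_url}/v1/chat/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["message"]["content"])
```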
## Run benchmarks
### Use benchmark script
The provided scripts automatically execute performance tests for serving, throughput, and latency. To start the benchmarking process, run the following command in the vllm-ascend root directory:

**docs/source/community/governance.md** (5 additions, 5 deletions)

@@ -1,7 +1,7 @@
# Governance
## Mission

As a vital component of vLLM, the vLLM Ascend project is dedicated to providing easy, fast, and cheap LLM serving for everyone on Ascend NPU, and to actively contributing to the enrichment of vLLM.
## Principles
vLLM Ascend follows the vLLM community's code of conduct: [vLLM - CODE OF CONDUCT](https://github.com/vllm-project/vllm/blob/main/CODE_OF_CONDUCT.md)
@@ -13,7 +13,7 @@ vLLM Ascend is an open-source project under the vLLM community, where the author
**Responsibility:** Help new contributors with onboarding, handle and respond to community questions, review RFCs and code.

**Requirements:** Complete at least 1 contribution. A Contributor is someone who consistently and actively participates in the project, including but not limited to issues/reviews/commits/community involvement.
Contributors will be granted `Triage` permissions (`Can read and clone this repository. Can also manage issues and pull requests`) on the [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) GitHub repo to help community developers collaborate more efficiently.
@@ -22,9 +22,9 @@ vLLM Ascend is an open-source project under the vLLM community, where the author
**Responsibility:** Develop the project's vision and mission. Maintainers are responsible for driving the technical direction of the entire project and ensuring its overall success, possessing code merge permissions. They formulate the roadmap, review contributions from community members, continuously contribute code, and actively engage in community activities (such as regular meetings/events).
**Requirements:** Deep understanding of vLLM and vLLM Ascend codebases, with a commitment to sustained code contributions. Competency in design/development/PR review workflows.

- **Review Quality:** Actively participate in community code reviews, ensuring high-quality code integration.

- **Quality Contribution:** Successfully develop and deliver at least one major feature while maintaining consistent high-quality contributions.

- **Community Involvement:** Actively address issues, respond to forum inquiries, participate in discussions, and engage in community-driven tasks.
Requires approval from existing Maintainers. The vLLM community has the final decision-making authority.

**docs/source/community/user_stories/llamafactory.md** (1 addition, 1 deletion)

@@ -4,7 +4,7 @@
[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) is an easy-to-use and efficient platform for training and fine-tuning large language models. With LLaMA-Factory, you can fine-tune hundreds of pre-trained models locally without writing any code.

LLaMA-Factory users need to evaluate and run inference on the model after fine-tuning it.

**docs/source/developer_guide/contribution/index.md** (1 addition, 1 deletion)

@@ -13,6 +13,7 @@ But you can still set up dev env on Linux/Windows/macOS for linting and basic
tests with the following commands:
#### Run lint locally
```bash
# Choose a base dir (~/vllm-project/) and set up venv
cd ~/vllm-project/
@@ -103,7 +104,6 @@ If the PR spans more than one category, please include all relevant prefixes.
You may find more information about contributing to the vLLM Ascend backend plugin on [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing/overview.html).
If you find any problems while contributing, feel free to submit a PR to improve the doc and help other developers.