**.github/PULL_REQUEST_TEMPLATE.md** (0 additions, 1 deletion)

@@ -25,4 +25,3 @@ CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.

**benchmarks/README.md** (53 additions, 44 deletions)

@@ -4,41 +4,41 @@ This document outlines the benchmarking methodology for vllm-ascend, aimed at ev
# Overview
**Benchmarking Coverage**: We measure latency, throughput, and fixed-QPS serving on the Atlas800I A2 (see [quick_start](../docs/source/quick_start.md) for the list of supported devices), with different models (coming soon).

- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).

- Output length: the corresponding output length of these 200 prompts.

- Batch size: dynamically determined by vllm and the arrival pattern of the requests.

- **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed); see the sketch after this list.

- Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
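
For illustration, the following minimal sketch (not the benchmark client's actual code) shows how Poisson-distributed arrival times with a fixed random seed can be generated for a given average QPS:

```python
import numpy as np

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Cumulative arrival times (seconds) for a Poisson process with average rate `qps`."""
    rng = np.random.default_rng(seed)  # fixed seed keeps the request schedule reproducible
    inter_arrival = rng.exponential(1.0 / qps, size=num_requests)  # exponential gaps <=> Poisson arrivals
    return np.cumsum(inter_arrival)

# Example: schedule 200 requests at an average of 4 QPS.
arrivals = poisson_arrival_times(200, qps=4.0)
```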
**Benchmarking Duration**: about 800 seconds for a single model.
# Quick Use
## Prerequisites
Before running the benchmarks, ensure the following:
- vllm and vllm-ascend are installed and properly set up in an NPU environment, as these scripts are specifically designed for NPU devices.
- Install necessary dependencies for benchmarks:
```
pip install -r benchmarks/requirements-bench.txt
```

- For performance benchmark, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) as `dummy`. This constructs random weights based on the passed model without downloading the weights from the internet, which can greatly reduce the benchmark time.
- If you want to run a customized benchmark, feel free to add your own models and parameters in the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests); let's take `Qwen2.5-VL-7B-Instruct` as an example:
```shell
@@ -72,54 +72,56 @@ Before running the benchmarks, ensure the following:
}
]
```

This JSON will be structured and parsed into server parameters and client parameters by the benchmark script. This configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. The test includes both server and client parameters; for more parameter details, see the vllm benchmark [cli](https://github.com/vllm-project/vllm/tree/main/vllm/benchmarks).
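
As a rough sketch of that parsing step, the snippet below turns one test-case entry into server and client flag lists. The key names and values here are illustrative assumptions based on the description above, not copied from the real files under `benchmarks/tests`, and this is not the actual benchmark script.

```python
# Illustrative only: key names are assumed, not taken from benchmarks/tests.
case = {
    "test_name": "serving_qwen2_5vl_7B_tp1",
    "qps_list": [1, 4, 16, "inf"],
    "server_parameters": {"model": "Qwen/Qwen2.5-VL-7B-Instruct", "tensor_parallel_size": 1, "swap_space": 16},
    "client_parameters": {"backend": "openai-chat", "endpoint": "/v1/chat/completions", "num_prompts": 200},
}

def params_to_flags(params: dict) -> list[str]:
    """Turn {"tensor_parallel_size": 1} into ["--tensor-parallel-size", "1"]."""
    flags: list[str] = []
    for key, value in params.items():
        flags.extend([f"--{key.replace('_', '-')}", str(value)])
    return flags

server_flags = params_to_flags(case["server_parameters"])   # flags for launching the server
client_flags = params_to_flags(case["client_parameters"])   # flags for the benchmark client
print(case["test_name"], server_flags, client_flags)
```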

**Test Overview**

- Test Name: serving_qwen2_5vl_7B_tp1

- Queries Per Second (QPS): The test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).

- Server Parameters

- Model: Qwen/Qwen2.5-VL-7B-Instruct

- Tensor Parallelism: 1 (no model parallelism is used; the model runs on a single device or node)

- Swap Space: 16 GB (used to handle memory overflow by swapping to disk)

- disable_log_stats: disables logging of performance statistics.

- disable_log_requests: disables logging of individual requests.

- Max Model Length: 16,384 tokens (maximum context length supported by the model)

- Client Parameters

- Model: Qwen/Qwen2.5-VL-7B-Instruct (same as the server)

- Backend: openai-chat (suggests the client uses the OpenAI-compatible chat API format)

- Dataset Source: Hugging Face (hf)

- Dataset Split: train

- Endpoint: /v1/chat/completions (the REST API endpoint to which chat requests are sent; see the request sketch after this list)

- Dataset Path: lmarena-ai/vision-arena-bench-v0.1 (the benchmark dataset used for evaluation, hosted on Hugging Face)

- Number of Prompts: 200 (the total number of prompts used during the test)
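
To make the client-side parameters concrete, here is a minimal request against the OpenAI-compatible chat endpoint. This is only an illustrative sketch: the base URL, prompt, and `max_tokens` value are placeholder assumptions, not values taken from the benchmark configuration.

```python
import requests

# Assumes a vllm server for Qwen/Qwen2.5-VL-7B-Instruct is already listening locally;
# adjust the base URL to match your deployment.
base_url = "http://localhost:8000"
payload = {
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",
    "messages": [{"role": "user", "content": "Describe this image dataset in one sentence."}],
    "max_tokens": 64,
}
resp = requests.post(f"{base_url}/v1/chat/completions", json=payload, timeout=60)
print(resp.json()["choices"][0]["message"]["content"])
```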
## Run benchmarks
### Use benchmark script
The provided scripts automatically execute performance tests for serving, throughput, and latency. To start the benchmarking process, run the following command in the vllm-ascend root directory:

**docs/source/community/governance.md** (5 additions, 5 deletions)

@@ -1,7 +1,7 @@
# Governance
## Mission

As a vital component of vLLM, the vLLM Ascend project is dedicated to providing easy, fast, and cheap LLM serving for everyone on Ascend NPU, and to actively contributing to the enrichment of vLLM.
## Principles
vLLM Ascend follows the vLLM community's code of conduct: [vLLM - CODE OF CONDUCT](https://github.com/vllm-project/vllm/blob/main/CODE_OF_CONDUCT.md)
@@ -13,7 +13,7 @@ vLLM Ascend is an open-source project under the vLLM community, where the author
**Responsibility:** Help new contributors with onboarding, handle and respond to community questions, review RFCs and code.

**Requirements:** Complete at least 1 contribution. A Contributor is someone who consistently and actively participates in the project, including but not limited to issues/reviews/commits/community involvement.
Contributors will be granted `Triage` permissions (`Can read and clone this repository. Can also manage issues and pull requests`) on the [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) GitHub repo to help community developers collaborate more efficiently.
@@ -22,9 +22,9 @@ vLLM Ascend is an open-source project under the vLLM community, where the author
**Responsibility:** Develop the project's vision and mission. Maintainers are responsible for driving the technical direction of the entire project and ensuring its overall success, possessing code merge permissions. They formulate the roadmap, review contributions from community members, continuously contribute code, and actively engage in community activities (such as regular meetings/events).
**Requirements:** Deep understanding of vLLM and vLLM Ascend codebases, with a commitment to sustained code contributions. Competency in design/development/PR review workflows.

- **Review Quality:** Actively participate in community code reviews, ensuring high-quality code integration.

- **Quality Contribution:** Successfully develop and deliver at least one major feature while maintaining consistent high-quality contributions.

- **Community Involvement:** Actively address issues, respond to forum inquiries, participate in discussions, and engage in community-driven tasks.
Requires approval from existing Maintainers. The vLLM community has the final decision-making authority.

**docs/source/community/user_stories/llamafactory.md** (1 addition, 1 deletion)

@@ -4,7 +4,7 @@
[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) is an easy-to-use and efficient platform for training and fine-tuning large language models. With LLaMA-Factory, you can fine-tune hundreds of pre-trained models locally without writing any code.

LLaMA-Factory users need to evaluate and run inference on the model after fine-tuning it.

**docs/source/developer_guide/contribution/index.md** (1 addition, 1 deletion)

@@ -13,6 +13,7 @@ But you can still set up dev env on Linux/Windows/macOS for linting and basic
tests with the following commands:
#### Run lint locally
```bash
# Choose a base dir (~/vllm-project/) and set up venv
cd ~/vllm-project/
@@ -103,7 +104,6 @@ If the PR spans more than one category, please include all relevant prefixes.
You may find more information about contributing to the vLLM Ascend backend plugin on [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing/overview.html).
If you find any problems while contributing, feel free to submit a PR to improve the doc and help other developers.