
[2/N] Lint enhancements #1735

Open · wants to merge 6 commits into base: main

2 changes: 1 addition & 1 deletion .github/format_pr_body.sh
@@ -1,3 +1,4 @@
#!/bin/bash
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
@@ -15,7 +16,6 @@
# This file is a part of the vllm-ascend project.
# Adapted from vllm/.github/scripts/cleanup_pr_body.sh

#!/bin/bash

set -eux

13 changes: 6 additions & 7 deletions .pre-commit-config.yaml
@@ -4,7 +4,7 @@ default_install_hook_types:
default_stages:
- pre-commit # Run locally
- manual # Run in CI
exclude: 'examples/.*' # Exclude examples from all hooks by default
exclude: '^(examples/.*|vllm-empty/.*)$' # Exclude examples from all hooks by default
repos:
- repo: https://github.com/codespell-project/codespell
rev: v2.4.1
@@ -92,12 +92,11 @@ repos:
language: system
types: [python]
stages: [manual] # Only run in CI
# FIXME: enable shellcheck
# - id: shellcheck
# name: Lint shell scripts
# entry: tools/shellcheck.sh
# language: script
# types: [shell]
- id: shellcheck
name: Lint shell scripts
entry: tools/shellcheck.sh
language: script
types: [shell]
- id: png-lint
name: Lint PNG exports from excalidraw
entry: tools/png-lint.sh
54 changes: 27 additions & 27 deletions benchmarks/README.md
@@ -4,25 +4,25 @@ This document outlines the benchmarking methodology for vllm-ascend, aimed at ev
# Overview
**Benchmarking Coverage**: We measure latency, throughput, and fixed-QPS serving on the Atlas800I A2 (see [quick_start](../docs/source/quick_start.md) for the full list of supported devices), with different models (coming soon).
- Latency tests
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: Qwen2.5-7B-Instruct, Qwen3-8B.
- Evaluation metrics: end-to-end latency (mean, median, p99).
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: Qwen2.5-7B-Instruct, Qwen3-8B.
- Evaluation metrics: end-to-end latency (mean, median, p99).

- Throughput tests
- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm to achieve maximum throughput.
- Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
- Evaluation metrics: throughput.
- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm to achieve maximum throughput.
- Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
- Evaluation metrics: throughput.
- Serving tests
- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
- **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
- Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
- Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
  - **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed); a small sketch of this sampling follows the list below.
- Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
- Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
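
As a minimal sketch of the Poisson arrival pattern described above (assuming exponentially distributed inter-arrival gaps and a fixed seed; the function name, QPS value, and seed are illustrative, not taken from the benchmark scripts):

```python
import numpy as np

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Sample absolute arrival times (seconds) for a target average QPS.

    qps == float("inf") reproduces the "all requests come at once" case.
    """
    if qps == float("inf"):
        return np.zeros(num_requests)
    rng = np.random.default_rng(seed)  # fixed seed -> reproducible arrival pattern
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)  # Poisson process: i.i.d. exponential gaps
    return np.cumsum(gaps)  # cumulative sums give absolute arrival times

# Example: 200 prompts at an average of 4 QPS, as in the serving tests.
print(poisson_arrival_times(200, 4.0)[:5])
```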

**Benchmarking Duration**: about 800 seconds for a single model.

@@ -75,10 +75,10 @@ Before running the benchmarks, ensure the following:

This JSON will be structured and parsed into server parameters and client parameters by the benchmark script. This configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. The test includes both server and client parameters; for more parameter details, see the vllm benchmark [cli](https://github.com/vllm-project/vllm/tree/main/vllm/benchmarks). A parsing sketch follows the parameter breakdown below.

- **Test Overview**
- Test Name: serving_qwen2_5vl_7B_tp1
- **Test Overview**
- Test Name: serving_qwen2_5vl_7B_tp1

- Queries Per Second (QPS): The test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).
- Queries Per Second (QPS): The test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).

- Server Parameters
- Model: Qwen/Qwen2.5-VL-7B-Instruct
@@ -95,21 +95,21 @@ this Json will be structured and parsed into server parameters and client parame

- Max Model Length: 16,384 tokens (maximum context length supported by the model)

- Client Parameters
- Client Parameters

- Model: Qwen/Qwen2.5-VL-7B-Instruct (same as the server)
- Model: Qwen/Qwen2.5-VL-7B-Instruct (same as the server)

- Backend: openai-chat (suggests the client uses the OpenAI-compatible chat API format)
- Backend: openai-chat (suggests the client uses the OpenAI-compatible chat API format)

- Dataset Source: Hugging Face (hf)
- Dataset Source: Hugging Face (hf)

- Dataset Split: train
- Dataset Split: train

- Endpoint: /v1/chat/completions (the REST API endpoint to which chat requests are sent)
- Endpoint: /v1/chat/completions (the REST API endpoint to which chat requests are sent)

- Dataset Path: lmarena-ai/vision-arena-bench-v0.1 (the benchmark dataset used for evaluation, hosted on Hugging Face)
- Dataset Path: lmarena-ai/vision-arena-bench-v0.1 (the benchmark dataset used for evaluation, hosted on Hugging Face)

- Number of Prompts: 200 (the total number of prompts used during the test)
- Number of Prompts: 200 (the total number of prompts used during the test)
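
To make the structure above concrete, here is a minimal sketch of how such a test case might be split into server and client parameters. The field names (`server_parameters`, `client_parameters`, `qps_list`) and the flag-building helper are illustrative assumptions, not the benchmark script's actual schema or implementation:

```python
import json

# Hypothetical test case mirroring the fields described above.
test_case = {
    "test_name": "serving_qwen2_5vl_7B_tp1",
    "qps_list": [1, 4, 16, "inf"],
    "server_parameters": {
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "max_model_len": 16384,
    },
    "client_parameters": {
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "backend": "openai-chat",
        "dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
        "hf_split": "train",
        "endpoint": "/v1/chat/completions",
        "num_prompts": 200,
    },
}

def to_cli_flags(params: dict) -> list:
    """Turn a parameter dict into CLI-style flags (illustration only)."""
    flags = []
    for key, value in params.items():
        flags += [f"--{key.replace('_', '-')}", str(value)]
    return flags

print(json.dumps(test_case["server_parameters"], indent=2))
print(to_cli_flags(test_case["client_parameters"]))
```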

## Run benchmarks

14 changes: 3 additions & 11 deletions benchmarks/ops/ben_vocabparallelembedding.py
@@ -50,17 +50,9 @@ def get_masked_input_and_mask_ref(
) -> Tuple[torch.Tensor, torch.Tensor]:
"""Reference implementation for verification"""
org_vocab_mask = (input_ >= org_vocab_start_index) & (input_ < org_vocab_end_index)
added_vocab_mask = (input_ >= added_vocab_start_index) & (
input_ < added_vocab_end_index
)
added_offset = (
added_vocab_start_index
- (org_vocab_end_index - org_vocab_start_index)
- num_org_vocab_padding
)
valid_offset = (org_vocab_start_index * org_vocab_mask) + (
added_offset * added_vocab_mask
)
added_vocab_mask = (input_ >= added_vocab_start_index) & (input_ < added_vocab_end_index)
added_offset = added_vocab_start_index - (org_vocab_end_index - org_vocab_start_index) - num_org_vocab_padding
valid_offset = (org_vocab_start_index * org_vocab_mask) + (added_offset * added_vocab_mask)
vocab_mask = org_vocab_mask | added_vocab_mask
masked_input = vocab_mask * (input_ - valid_offset)
return masked_input, ~vocab_mask
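
For readers who have not seen the vocab-parallel masking trick before, here is a small self-contained sketch of what the reference helper above computes. The partition boundaries are toy values chosen for illustration:

```python
import torch

# Toy partition: this rank owns original-vocab ids [0, 4) and added-vocab ids [8, 10),
# with no padding rows between the two regions.
org_start, org_end = 0, 4
added_start, added_end = 8, 10
num_padding = 0

input_ = torch.tensor([0, 3, 5, 8, 9])

org_mask = (input_ >= org_start) & (input_ < org_end)
added_mask = (input_ >= added_start) & (input_ < added_end)
added_offset = added_start - (org_end - org_start) - num_padding
valid_offset = (org_start * org_mask) + (added_offset * added_mask)
vocab_mask = org_mask | added_mask

masked_input = vocab_mask * (input_ - valid_offset)
print(masked_input)   # tensor([0, 3, 0, 4, 5]): local row indices in this rank's embedding table
print(~vocab_mask)    # tensor([False, False,  True, False, False]): token 5 belongs to another rank
```
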
50 changes: 16 additions & 34 deletions benchmarks/scripts/convert_json_to_markdown.py
@@ -59,9 +59,7 @@ def results_to_json(latency, throughput, serving):


if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Process the results of the benchmark tests."
)
parser = argparse.ArgumentParser(description="Process the results of the benchmark tests.")
parser.add_argument(
"--results_folder",
type=str,
@@ -80,11 +78,11 @@ def results_to_json(latency, throughput, serving):
default="./perf_result_template.md",
help="The template file for the markdown report.",
)
parser.add_argument("--tag", default="main", help="Tag to be used for release message.")
parser.add_argument(
"--tag", default="main", help="Tag to be used for release message."
)
parser.add_argument(
"--commit_id", default="", help="Commit ID to be used for release message."
"--commit_id",
default="",
help="Commit ID to be used for release message.",
)

args = parser.parse_args()
@@ -116,9 +114,7 @@ def results_to_json(latency, throughput, serving):
# get different percentiles
for perc in [10, 25, 50, 75, 90, 99]:
# Multiply 1000 to convert the time unit from s to ms
raw_result.update(
{f"P{perc}": 1000 * raw_result["percentiles"][str(perc)]}
)
raw_result.update({f"P{perc}": 1000 * raw_result["percentiles"][str(perc)]})
raw_result["avg_latency"] = raw_result["avg_latency"] * 1000

# add the result to raw_result
@@ -142,38 +138,24 @@ def results_to_json(latency, throughput, serving):
serving_results = pd.DataFrame.from_dict(serving_results)
throughput_results = pd.DataFrame.from_dict(throughput_results)

raw_results_json = results_to_json(
latency_results, throughput_results, serving_results
)
raw_results_json = results_to_json(latency_results, throughput_results, serving_results)

# remapping the key, for visualization purpose
if not latency_results.empty:
latency_results = latency_results[list(latency_column_mapping.keys())].rename(
columns=latency_column_mapping
)
latency_results = latency_results[list(latency_column_mapping.keys())].rename(columns=latency_column_mapping)
if not serving_results.empty:
serving_results = serving_results[list(serving_column_mapping.keys())].rename(
columns=serving_column_mapping
)
serving_results = serving_results[list(serving_column_mapping.keys())].rename(columns=serving_column_mapping)
if not throughput_results.empty:
throughput_results = throughput_results[
list(throughput_results_column_mapping.keys())
].rename(columns=throughput_results_column_mapping)
throughput_results = throughput_results[list(throughput_results_column_mapping.keys())].rename(
columns=throughput_results_column_mapping
)

processed_results_json = results_to_json(
latency_results, throughput_results, serving_results
)
processed_results_json = results_to_json(latency_results, throughput_results, serving_results)

# get markdown tables
latency_md_table = tabulate(
latency_results, headers="keys", tablefmt="pipe", showindex=False
)
serving_md_table = tabulate(
serving_results, headers="keys", tablefmt="pipe", showindex=False
)
throughput_md_table = tabulate(
throughput_results, headers="keys", tablefmt="pipe", showindex=False
)
latency_md_table = tabulate(latency_results, headers="keys", tablefmt="pipe", showindex=False)
serving_md_table = tabulate(serving_results, headers="keys", tablefmt="pipe", showindex=False)
throughput_md_table = tabulate(throughput_results, headers="keys", tablefmt="pipe", showindex=False)

# document the result
print(output_folder)
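
As a quick illustration of what the reformatted `tabulate` calls emit, here is a minimal sketch with toy numbers and a toy column mapping (not real benchmark output):

```python
import pandas as pd
from tabulate import tabulate

# Toy latency results plus a column remapping, mirroring the script's flow.
column_mapping = {"test_name": "Test name", "avg_latency": "Mean latency (ms)", "P99": "P99 (ms)"}
latency_results = pd.DataFrame([{"test_name": "latency_qwen3_8B_tp1", "avg_latency": 123.4, "P99": 210.0}])
latency_results = latency_results[list(column_mapping.keys())].rename(columns=column_mapping)

# tablefmt="pipe" produces a GitHub-flavoured Markdown table.
print(tabulate(latency_results, headers="keys", tablefmt="pipe", showindex=False))
```
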
3 changes: 1 addition & 2 deletions benchmarks/scripts/run-performance-benchmarks.sh
@@ -25,8 +25,7 @@ ensure_sharegpt_downloaded() {
if [ ! -f "$FILE" ]; then
echo "$FILE not found, downloading from hf-mirror ..."
mkdir -p "$DIR"
wget -O "$FILE" https://hf-mirror.com/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
if [ $? -ne 0 ]; then
if ! wget -O "$FILE" https://hf-mirror.com/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json; then
echo "Download failed!" >&2
return 1
fi
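
For comparison, the same "check the command directly instead of inspecting `$?`" guard can be expressed in Python; the mirror URL is the one used in the script above, everything else is an illustrative sketch:

```python
import sys
import urllib.request
from pathlib import Path

URL = ("https://hf-mirror.com/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/"
       "resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json")

def ensure_sharegpt_downloaded(path: Path) -> None:
    """Download the ShareGPT split if missing; fail loudly on error."""
    if path.is_file():
        return
    path.parent.mkdir(parents=True, exist_ok=True)
    try:
        urllib.request.urlretrieve(URL, path)  # plays the role of the wget call above
    except OSError as err:
        print(f"Download failed: {err}", file=sys.stderr)
        sys.exit(1)

ensure_sharegpt_downloaded(Path("ShareGPT_V3_unfiltered_cleaned_split.json"))
```
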
43 changes: 6 additions & 37 deletions benchmarks/scripts/run_accuracy.py
@@ -206,39 +206,16 @@ def generate_md(model_name, tasks_list, args, datasets):
else:
n_shot = "0"
flag = ACCURACY_FLAG.get(task_name, "")
row = (
f"| {task_name:<37} "
f"| {flt:<6} "
f"| {n_shot:6} "
f"| {metric:<6} "
f"| {flag}{value:>5.4f} "
f"| ± {stderr:>5.4f} |"
)
row = f"| {task_name:<37} | {flt:<6} | {n_shot:6} | {metric:<6} | {flag}{value:>5.4f} | ± {stderr:>5.4f} |"
if not task_name.startswith("-"):
rows.append(row)
rows_sub.append(
"<details>"
+ "\n"
+ "<summary>"
+ task_name
+ " details"
+ "</summary>"
+ "\n" * 2
+ header
"<details>" + "\n" + "<summary>" + task_name + " details" + "</summary>" + "\n" * 2 + header
)
rows_sub.append(row)
rows_sub.append("</details>")
# Combine all Markdown sections
md = (
preamble
+ "\n"
+ header
+ "\n"
+ "\n".join(rows)
+ "\n"
+ "\n".join(rows_sub)
+ "\n"
)
md = preamble + "\n" + header + "\n" + "\n".join(rows) + "\n" + "\n".join(rows_sub) + "\n"
print(md)
return md

@@ -268,9 +245,7 @@ def main(args):
# Evaluate model on each dataset
for dataset in datasets:
accuracy_expected = EXPECTED_VALUE[args.model][dataset]
p = multiprocessing.Process(
target=run_accuracy_test, args=(result_queue, args.model, dataset)
)
p = multiprocessing.Process(target=run_accuracy_test, args=(result_queue, args.model, dataset))
p.start()
p.join()
if p.is_alive():
@@ -281,11 +256,7 @@
time.sleep(10)
result = result_queue.get()
print(result)
if (
accuracy_expected - RTOL
< result[dataset][FILTER[dataset]]
< accuracy_expected + RTOL
):
if accuracy_expected - RTOL < result[dataset][FILTER[dataset]] < accuracy_expected + RTOL:
ACCURACY_FLAG[dataset] = "✅"
else:
ACCURACY_FLAG[dataset] = "❌"
@@ -297,9 +268,7 @@
if __name__ == "__main__":
multiprocessing.set_start_method("spawn", force=True)
# Initialize argument parser
parser = argparse.ArgumentParser(
description="Run model accuracy evaluation and generate report"
)
parser = argparse.ArgumentParser(description="Run model accuracy evaluation and generate report")
parser.add_argument("--output", type=str, required=True)
parser.add_argument("--model", type=str, required=True)
parser.add_argument("--vllm_ascend_version", type=str, required=False)
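
The single-line rewrites above compress a pattern that is easier to see spelled out: run each evaluation in a spawned process, hand the result back through a queue, and flag it against the expected score within ±RTOL. A minimal sketch with placeholder numbers and a fake evaluation (the real `run_accuracy_test` performs the actual model evaluation):

```python
import multiprocessing

RTOL = 0.03  # placeholder tolerance

def run_accuracy_test(result_queue, model, dataset):
    # Stand-in for the real evaluation; pushes a fake score onto the queue.
    result_queue.put({dataset: {"acc": 0.812}})

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn", force=True)
    expected = 0.80
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=run_accuracy_test, args=(queue, "Qwen/Qwen3-8B", "gsm8k"))
    p.start()
    p.join()
    result = queue.get()
    flag = "✅" if expected - RTOL < result["gsm8k"]["acc"] < expected + RTOL else "❌"
    print(flag, result)
```
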
6 changes: 3 additions & 3 deletions docs/source/community/governance.md
@@ -22,9 +22,9 @@ vLLM Ascend is an open-source project under the vLLM community, where the author
**Responsibility:** Develop the project's vision and mission. Maintainers are responsible for driving the technical direction of the entire project and ensuring its overall success, possessing code merge permissions. They formulate the roadmap, review contributions from community members, continuously contribute code, and actively engage in community activities (such as regular meetings/events).

**Requirements:** Deep understanding of ‌vLLM‌ and ‌vLLM Ascend‌ codebases, with a commitment to sustained code contributions. Competency in ‌design/development/PR review workflows‌.
- **Review Quality‌:** Actively participate in community code reviews, ensuring high-quality code integration.
- **Quality Contribution‌:** Successfully develop and deliver at least one major feature while maintaining consistent high-quality contributions.
- **Community Involvement‌:** Actively address issues, respond to forum inquiries, participate in discussions, and engage in community-driven tasks.
- **Review Quality‌:** Actively participate in community code reviews, ensuring high-quality code integration.
- **Quality Contribution‌:** Successfully develop and deliver at least one major feature while maintaining consistent high-quality contributions.
- **Community Involvement‌:** Actively address issues, respond to forum inquiries, participate in discussions, and engage in community-driven tasks.

Requires approval from existing Maintainers. The vLLM community has the final decision-making authority.

14 changes: 7 additions & 7 deletions docs/source/user_guide/feature_guide/sleep_mode.md
@@ -13,15 +13,15 @@ With `enable_sleep_mode=True`, the way we manage memory(malloc, free) in vllm wi
The engine(v0/v1) supports two sleep levels to manage memory during idle periods:

- Level 1 Sleep
- Action: Offloads model weights and discards the KV cache.
- Memory: Model weights are moved to CPU memory; KV cache is forgotten.
- Use Case: Suitable when reusing the same model later.
- Note: Ensure sufficient CPU memory is available to hold the model weights.
- Action: Offloads model weights and discards the KV cache.
- Memory: Model weights are moved to CPU memory; KV cache is forgotten.
- Use Case: Suitable when reusing the same model later.
- Note: Ensure sufficient CPU memory is available to hold the model weights.

- Level 2 Sleep
- Action: Discards both model weights and KV cache.
- Memory: The content of both the model weights and kv cache is forgotten.
- Use Case: Ideal when switching to a different model or updating the current one.
- Action: Discards both model weights and KV cache.
- Memory: The content of both the model weights and kv cache is forgotten.
- Use Case: Ideal when switching to a different model or updating the current one.

Since this feature uses the low-level [AscendCL](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html) API, to use sleep mode you should follow the [installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) and build from source. If you are using v0.7.3, remember to set `export COMPILE_CUSTOM_KERNELS=1`; for the latest versions (v0.9.x+), the environment variable `COMPILE_CUSTOM_KERNELS` is set to 1 by default when building from source.
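
A hedged usage sketch of the two levels (assuming the standard vLLM offline `LLM` API with `enable_sleep_mode`; check the vLLM and vllm-ascend docs for the exact signatures available in your version):

```python
from vllm import LLM

# Sleep mode must be requested when the engine is constructed.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_sleep_mode=True)
print(llm.generate("Hello, my name is"))

# Level 1: offload weights to CPU memory and drop the KV cache; cheap to resume with the same model.
llm.sleep(level=1)
llm.wake_up()
print(llm.generate("Sleep mode is"))

# Level 2 (llm.sleep(level=2)) additionally discards the weights, so it only makes sense
# right before switching to a different model or loading updated weights.
```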

2 changes: 1 addition & 1 deletion examples/run_dp_server.sh
@@ -27,4 +27,4 @@ vllm serve /data/weights/Qwen2.5-0.5B-Instruct \
--max-model-len 2000 \
--max-num-batched-tokens 2000 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--gpu-memory-utilization 0.9