
[2/N] Lint enhancements #1735

Open · wants to merge 6 commits into base: main

2 changes: 1 addition & 1 deletion .github/format_pr_body.sh
@@ -1,3 +1,4 @@
#!/bin/bash
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
@@ -15,7 +16,6 @@
# This file is a part of the vllm-ascend project.
# Adapted from vllm/.github/scripts/cleanup_pr_body.sh

#!/bin/bash

set -eux

13 changes: 6 additions & 7 deletions .pre-commit-config.yaml
@@ -4,7 +4,7 @@ default_install_hook_types:
default_stages:
- pre-commit # Run locally
- manual # Run in CI
exclude: 'examples/.*' # Exclude examples from all hooks by default
exclude: '^(examples/.*|vllm-empty/.*)$' # Exclude examples from all hooks by default
repos:
- repo: https://github.com/codespell-project/codespell
rev: v2.4.1
@@ -92,12 +92,11 @@ repos:
language: system
types: [python]
stages: [manual] # Only run in CI
# FIXME: enable shellcheck
# - id: shellcheck
# name: Lint shell scripts
# entry: tools/shellcheck.sh
# language: script
# types: [shell]
- id: shellcheck
name: Lint shell scripts
entry: tools/shellcheck.sh
language: script
types: [shell]
- id: png-lint
name: Lint PNG exports from excalidraw
entry: tools/png-lint.sh
54 changes: 27 additions & 27 deletions benchmarks/README.md
@@ -4,25 +4,25 @@ This document outlines the benchmarking methodology for vllm-ascend, aimed at ev
# Overview
**Benchmarking Coverage**: We measure latency, throughput, and fixed-QPS serving on the Atlas800I A2 (see [quick_start](../docs/source/quick_start.md) for the full list of supported devices), with different models (coming soon).
- Latency tests
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: Qwen2.5-7B-Instruct, Qwen3-8B.
- Evaluation metrics: end-to-end latency (mean, median, p99).
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: Qwen2.5-7B-Instruct, Qwen3-8B.
- Evaluation metrics: end-to-end latency (mean, median, p99).

- Throughput tests
- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm to achieve maximum throughput.
- Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
- Evaluation metrics: throughput.
- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm to achieve maximum throughput.
- Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
- Evaluation metrics: throughput.
- Serving tests
- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
- **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
- Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
- Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
  - **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed); a small sketch of this sampling follows the list below.
- Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
- Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
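
As a minimal sketch of the Poisson arrival pattern described above (assuming exponentially distributed inter-arrival gaps and a fixed seed; the function name, QPS value, and seed are illustrative, not taken from the benchmark scripts):

```python
import numpy as np

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Sample absolute arrival times (seconds) for a target average QPS.

    qps == float("inf") reproduces the "all requests come at once" case.
    """
    if qps == float("inf"):
        return np.zeros(num_requests)
    rng = np.random.default_rng(seed)  # fixed seed -> reproducible arrival pattern
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)  # Poisson process: i.i.d. exponential gaps
    return np.cumsum(gaps)  # cumulative sums give absolute arrival times

# Example: 200 prompts at an average of 4 QPS, as in the serving tests.
print(poisson_arrival_times(200, 4.0)[:5])
```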

**Benchmarking Duration**: about 800 seconds for a single model.

@@ -75,10 +75,10 @@ Before running the benchmarks, ensure the following:

This JSON will be structured and parsed into server parameters and client parameters by the benchmark script. This configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. The test includes both server and client parameters; for more parameter details, see the vllm benchmark [cli](https://github.com/vllm-project/vllm/tree/main/vllm/benchmarks). A parsing sketch follows the parameter breakdown below.

- **Test Overview**
- Test Name: serving_qwen2_5vl_7B_tp1
- **Test Overview**
- Test Name: serving_qwen2_5vl_7B_tp1

- Queries Per Second (QPS): The test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).
- Queries Per Second (QPS): The test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).

- Server Parameters
- Model: Qwen/Qwen2.5-VL-7B-Instruct
@@ -95,21 +95,21 @@ this Json will be structured and parsed into server parameters and client parame

- Max Model Length: 16,384 tokens (maximum context length supported by the model)

- Client Parameters
- Client Parameters

- Model: Qwen/Qwen2.5-VL-7B-Instruct (same as the server)
- Model: Qwen/Qwen2.5-VL-7B-Instruct (same as the server)

- Backend: openai-chat (suggests the client uses the OpenAI-compatible chat API format)
- Backend: openai-chat (suggests the client uses the OpenAI-compatible chat API format)

- Dataset Source: Hugging Face (hf)
- Dataset Source: Hugging Face (hf)

- Dataset Split: train
- Dataset Split: train

- Endpoint: /v1/chat/completions (the REST API endpoint to which chat requests are sent)
- Endpoint: /v1/chat/completions (the REST API endpoint to which chat requests are sent)

- Dataset Path: lmarena-ai/vision-arena-bench-v0.1 (the benchmark dataset used for evaluation, hosted on Hugging Face)
- Dataset Path: lmarena-ai/vision-arena-bench-v0.1 (the benchmark dataset used for evaluation, hosted on Hugging Face)

- Number of Prompts: 200 (the total number of prompts used during the test)
- Number of Prompts: 200 (the total number of prompts used during the test)
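
To make the structure above concrete, here is a minimal sketch of how such a test case might be split into server and client parameters. The field names (`server_parameters`, `client_parameters`, `qps_list`) and the flag-building helper are illustrative assumptions, not the benchmark script's actual schema or implementation:

```python
import json

# Hypothetical test case mirroring the fields described above.
test_case = {
    "test_name": "serving_qwen2_5vl_7B_tp1",
    "qps_list": [1, 4, 16, "inf"],
    "server_parameters": {
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "max_model_len": 16384,
    },
    "client_parameters": {
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "backend": "openai-chat",
        "dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
        "hf_split": "train",
        "endpoint": "/v1/chat/completions",
        "num_prompts": 200,
    },
}

def to_cli_flags(params: dict) -> list:
    """Turn a parameter dict into CLI-style flags (illustration only)."""
    flags = []
    for key, value in params.items():
        flags += [f"--{key.replace('_', '-')}", str(value)]
    return flags

print(json.dumps(test_case["server_parameters"], indent=2))
print(to_cli_flags(test_case["client_parameters"]))
```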

## Run benchmarks

14 changes: 3 additions & 11 deletions benchmarks/ops/ben_vocabparallelembedding.py
@@ -50,17 +50,9 @@ def get_masked_input_and_mask_ref(
) -> Tuple[torch.Tensor, torch.Tensor]:
"""Reference implementation for verification"""
org_vocab_mask = (input_ >= org_vocab_start_index) & (input_ < org_vocab_end_index)
added_vocab_mask = (input_ >= added_vocab_start_index) & (
input_ < added_vocab_end_index
)
added_offset = (
added_vocab_start_index
- (org_vocab_end_index - org_vocab_start_index)
- num_org_vocab_padding
)
valid_offset = (org_vocab_start_index * org_vocab_mask) + (
added_offset * added_vocab_mask
)
added_vocab_mask = (input_ >= added_vocab_start_index) & (input_ < added_vocab_end_index)
added_offset = added_vocab_start_index - (org_vocab_end_index - org_vocab_start_index) - num_org_vocab_padding
valid_offset = (org_vocab_start_index * org_vocab_mask) + (added_offset * added_vocab_mask)
vocab_mask = org_vocab_mask | added_vocab_mask
masked_input = vocab_mask * (input_ - valid_offset)
return masked_input, ~vocab_mask
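
For readers who have not seen the vocab-parallel masking trick before, here is a small self-contained sketch of what the reference helper above computes. The partition boundaries are toy values chosen for illustration:

```python
import torch

# Toy partition: this rank owns original-vocab ids [0, 4) and added-vocab ids [8, 10),
# with no padding rows between the two regions.
org_start, org_end = 0, 4
added_start, added_end = 8, 10
num_padding = 0

input_ = torch.tensor([0, 3, 5, 8, 9])

org_mask = (input_ >= org_start) & (input_ < org_end)
added_mask = (input_ >= added_start) & (input_ < added_end)
added_offset = added_start - (org_end - org_start) - num_padding
valid_offset = (org_start * org_mask) + (added_offset * added_mask)
vocab_mask = org_mask | added_mask

masked_input = vocab_mask * (input_ - valid_offset)
print(masked_input)   # tensor([0, 3, 0, 4, 5]): local row indices in this rank's embedding table
print(~vocab_mask)    # tensor([False, False,  True, False, False]): token 5 belongs to another rank
```
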
50 changes: 16 additions & 34 deletions benchmarks/scripts/convert_json_to_markdown.py
@@ -59,9 +59,7 @@ def results_to_json(latency, throughput, serving):


if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Process the results of the benchmark tests."
)
parser = argparse.ArgumentParser(description="Process the results of the benchmark tests.")
parser.add_argument(
"--results_folder",
type=str,
@@ -80,11 +78,11 @@ def results_to_json(latency, throughput, serving):
default="./perf_result_template.md",
help="The template file for the markdown report.",
)
parser.add_argument("--tag", default="main", help="Tag to be used for release message.")
parser.add_argument(
"--tag", default="main", help="Tag to be used for release message."
)
parser.add_argument(
"--commit_id", default="", help="Commit ID to be used for release message."
"--commit_id",
default="",
help="Commit ID to be used for release message.",
)

args = parser.parse_args()
@@ -116,9 +114,7 @@ def results_to_json(latency, throughput, serving):
# get different percentiles
for perc in [10, 25, 50, 75, 90, 99]:
# Multiply 1000 to convert the time unit from s to ms
raw_result.update(
{f"P{perc}": 1000 * raw_result["percentiles"][str(perc)]}
)
raw_result.update({f"P{perc}": 1000 * raw_result["percentiles"][str(perc)]})
raw_result["avg_latency"] = raw_result["avg_latency"] * 1000

# add the result to raw_result
@@ -142,38 +138,24 @@ def results_to_json(latency, throughput, serving):
serving_results = pd.DataFrame.from_dict(serving_results)
throughput_results = pd.DataFrame.from_dict(throughput_results)

raw_results_json = results_to_json(
latency_results, throughput_results, serving_results
)
raw_results_json = results_to_json(latency_results, throughput_results, serving_results)

# remapping the key, for visualization purpose
if not latency_results.empty:
latency_results = latency_results[list(latency_column_mapping.keys())].rename(
columns=latency_column_mapping
)
latency_results = latency_results[list(latency_column_mapping.keys())].rename(columns=latency_column_mapping)
if not serving_results.empty:
serving_results = serving_results[list(serving_column_mapping.keys())].rename(
columns=serving_column_mapping
)
serving_results = serving_results[list(serving_column_mapping.keys())].rename(columns=serving_column_mapping)
if not throughput_results.empty:
throughput_results = throughput_results[
list(throughput_results_column_mapping.keys())
].rename(columns=throughput_results_column_mapping)
throughput_results = throughput_results[list(throughput_results_column_mapping.keys())].rename(
columns=throughput_results_column_mapping
)

processed_results_json = results_to_json(
latency_results, throughput_results, serving_results
)
processed_results_json = results_to_json(latency_results, throughput_results, serving_results)

# get markdown tables
latency_md_table = tabulate(
latency_results, headers="keys", tablefmt="pipe", showindex=False
)
serving_md_table = tabulate(
serving_results, headers="keys", tablefmt="pipe", showindex=False
)
throughput_md_table = tabulate(
throughput_results, headers="keys", tablefmt="pipe", showindex=False
)
latency_md_table = tabulate(latency_results, headers="keys", tablefmt="pipe", showindex=False)
serving_md_table = tabulate(serving_results, headers="keys", tablefmt="pipe", showindex=False)
throughput_md_table = tabulate(throughput_results, headers="keys", tablefmt="pipe", showindex=False)

# document the result
print(output_folder)
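
As a quick illustration of what the reformatted `tabulate` calls emit, here is a minimal sketch with toy numbers and a toy column mapping (not real benchmark output):

```python
import pandas as pd
from tabulate import tabulate

# Toy latency results plus a column remapping, mirroring the script's flow.
column_mapping = {"test_name": "Test name", "avg_latency": "Mean latency (ms)", "P99": "P99 (ms)"}
latency_results = pd.DataFrame([{"test_name": "latency_qwen3_8B_tp1", "avg_latency": 123.4, "P99": 210.0}])
latency_results = latency_results[list(column_mapping.keys())].rename(columns=column_mapping)

# tablefmt="pipe" produces a GitHub-flavoured Markdown table.
print(tabulate(latency_results, headers="keys", tablefmt="pipe", showindex=False))
```
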
3 changes: 1 addition & 2 deletions benchmarks/scripts/run-performance-benchmarks.sh
@@ -25,8 +25,7 @@ ensure_sharegpt_downloaded() {
if [ ! -f "$FILE" ]; then
echo "$FILE not found, downloading from hf-mirror ..."
mkdir -p "$DIR"
wget -O "$FILE" https://hf-mirror.com/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
if [ $? -ne 0 ]; then
if ! wget -O "$FILE" https://hf-mirror.com/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json; then
echo "Download failed!" >&2
return 1
fi
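
For comparison, the same "check the command directly instead of inspecting `$?`" guard can be expressed in Python; the mirror URL is the one used in the script above, everything else is an illustrative sketch:

```python
import sys
import urllib.request
from pathlib import Path

URL = ("https://hf-mirror.com/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/"
       "resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json")

def ensure_sharegpt_downloaded(path: Path) -> None:
    """Download the ShareGPT split if missing; fail loudly on error."""
    if path.is_file():
        return
    path.parent.mkdir(parents=True, exist_ok=True)
    try:
        urllib.request.urlretrieve(URL, path)  # plays the role of the wget call above
    except OSError as err:
        print(f"Download failed: {err}", file=sys.stderr)
        sys.exit(1)

ensure_sharegpt_downloaded(Path("ShareGPT_V3_unfiltered_cleaned_split.json"))
```
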
43 changes: 6 additions & 37 deletions benchmarks/scripts/run_accuracy.py
@@ -206,39 +206,16 @@ def generate_md(model_name, tasks_list, args, datasets):
else:
n_shot = "0"
flag = ACCURACY_FLAG.get(task_name, "")
row = (
f"| {task_name:<37} "
f"| {flt:<6} "
f"| {n_shot:6} "
f"| {metric:<6} "
f"| {flag}{value:>5.4f} "
f"| ± {stderr:>5.4f} |"
)
row = f"| {task_name:<37} | {flt:<6} | {n_shot:6} | {metric:<6} | {flag}{value:>5.4f} | ± {stderr:>5.4f} |"
if not task_name.startswith("-"):
rows.append(row)
rows_sub.append(
"<details>"
+ "\n"
+ "<summary>"
+ task_name
+ " details"
+ "</summary>"
+ "\n" * 2
+ header
"<details>" + "\n" + "<summary>" + task_name + " details" + "</summary>" + "\n" * 2 + header
)
rows_sub.append(row)
rows_sub.append("</details>")
# Combine all Markdown sections
md = (
preamble
+ "\n"
+ header
+ "\n"
+ "\n".join(rows)
+ "\n"
+ "\n".join(rows_sub)
+ "\n"
)
md = preamble + "\n" + header + "\n" + "\n".join(rows) + "\n" + "\n".join(rows_sub) + "\n"
print(md)
return md

@@ -268,9 +245,7 @@ def main(args):
# Evaluate model on each dataset
for dataset in datasets:
accuracy_expected = EXPECTED_VALUE[args.model][dataset]
p = multiprocessing.Process(
target=run_accuracy_test, args=(result_queue, args.model, dataset)
)
p = multiprocessing.Process(target=run_accuracy_test, args=(result_queue, args.model, dataset))
p.start()
p.join()
if p.is_alive():
@@ -281,11 +256,7 @@
time.sleep(10)
result = result_queue.get()
print(result)
if (
accuracy_expected - RTOL
< result[dataset][FILTER[dataset]]
< accuracy_expected + RTOL
):
if accuracy_expected - RTOL < result[dataset][FILTER[dataset]] < accuracy_expected + RTOL:
ACCURACY_FLAG[dataset] = "✅"
else:
ACCURACY_FLAG[dataset] = "❌"
@@ -297,9 +268,7 @@
if __name__ == "__main__":
multiprocessing.set_start_method("spawn", force=True)
# Initialize argument parser
parser = argparse.ArgumentParser(
description="Run model accuracy evaluation and generate report"
)
parser = argparse.ArgumentParser(description="Run model accuracy evaluation and generate report")
parser.add_argument("--output", type=str, required=True)
parser.add_argument("--model", type=str, required=True)
parser.add_argument("--vllm_ascend_version", type=str, required=False)
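
The single-line rewrites above compress a pattern that is easier to see spelled out: run each evaluation in a spawned process, hand the result back through a queue, and flag it against the expected score within ±RTOL. A minimal sketch with placeholder numbers and a fake evaluation (the real `run_accuracy_test` performs the actual model evaluation):

```python
import multiprocessing

RTOL = 0.03  # placeholder tolerance

def run_accuracy_test(result_queue, model, dataset):
    # Stand-in for the real evaluation; pushes a fake score onto the queue.
    result_queue.put({dataset: {"acc": 0.812}})

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn", force=True)
    expected = 0.80
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=run_accuracy_test, args=(queue, "Qwen/Qwen3-8B", "gsm8k"))
    p.start()
    p.join()
    result = queue.get()
    flag = "✅" if expected - RTOL < result["gsm8k"]["acc"] < expected + RTOL else "❌"
    print(flag, result)
```
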
6 changes: 3 additions & 3 deletions docs/source/community/governance.md
@@ -22,9 +22,9 @@ vLLM Ascend is an open-source project under the vLLM community, where the author
**Responsibility:** Develop the project's vision and mission. Maintainers are responsible for driving the technical direction of the entire project and ensuring its overall success, possessing code merge permissions. They formulate the roadmap, review contributions from community members, continuously contribute code, and actively engage in community activities (such as regular meetings/events).

**Requirements:** Deep understanding of ‌vLLM‌ and ‌vLLM Ascend‌ codebases, with a commitment to sustained code contributions. Competency in ‌design/development/PR review workflows‌.
- **Review Quality‌:** Actively participate in community code reviews, ensuring high-quality code integration.
- **Quality Contribution‌:** Successfully develop and deliver at least one major feature while maintaining consistent high-quality contributions.
- **Community Involvement‌:** Actively address issues, respond to forum inquiries, participate in discussions, and engage in community-driven tasks.
- **Review Quality‌:** Actively participate in community code reviews, ensuring high-quality code integration.
- **Quality Contribution‌:** Successfully develop and deliver at least one major feature while maintaining consistent high-quality contributions.
- **Community Involvement‌:** Actively address issues, respond to forum inquiries, participate in discussions, and engage in community-driven tasks.

Requires approval from existing Maintainers. The vLLM community has the final decision-making authority.

14 changes: 7 additions & 7 deletions docs/source/user_guide/feature_guide/sleep_mode.md
@@ -13,15 +13,15 @@ With `enable_sleep_mode=True`, the way we manage memory(malloc, free) in vllm wi
The engine(v0/v1) supports two sleep levels to manage memory during idle periods:

- Level 1 Sleep
- Action: Offloads model weights and discards the KV cache.
- Memory: Model weights are moved to CPU memory; KV cache is forgotten.
- Use Case: Suitable when reusing the same model later.
- Note: Ensure sufficient CPU memory is available to hold the model weights.
- Action: Offloads model weights and discards the KV cache.
- Memory: Model weights are moved to CPU memory; KV cache is forgotten.
- Use Case: Suitable when reusing the same model later.
- Note: Ensure sufficient CPU memory is available to hold the model weights.

- Level 2 Sleep
- Action: Discards both model weights and KV cache.
- Memory: The content of both the model weights and kv cache is forgotten.
- Use Case: Ideal when switching to a different model or updating the current one.
- Action: Discards both model weights and KV cache.
- Memory: The content of both the model weights and kv cache is forgotten.
- Use Case: Ideal when switching to a different model or updating the current one.

Since this feature uses the low-level [AscendCL](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/82RC1alpha002/API/appdevgapi/appdevgapi_07_0000.html) API, to use sleep mode you should follow the [installation guide](https://vllm-ascend.readthedocs.io/en/latest/installation.html) and build from source. If you are using v0.7.3, remember to set `export COMPILE_CUSTOM_KERNELS=1`; for the latest versions (v0.9.x+), the environment variable `COMPILE_CUSTOM_KERNELS` is set to 1 by default when building from source.
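
A hedged usage sketch of the two levels (assuming the standard vLLM offline `LLM` API with `enable_sleep_mode`; check the vLLM and vllm-ascend docs for the exact signatures available in your version):

```python
from vllm import LLM

# Sleep mode must be requested when the engine is constructed.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_sleep_mode=True)
print(llm.generate("Hello, my name is"))

# Level 1: offload weights to CPU memory and drop the KV cache; cheap to resume with the same model.
llm.sleep(level=1)
llm.wake_up()
print(llm.generate("Sleep mode is"))

# Level 2 (llm.sleep(level=2)) additionally discards the weights, so it only makes sense
# right before switching to a different model or loading updated weights.
```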

2 changes: 1 addition & 1 deletion examples/run_dp_server.sh
@@ -27,4 +27,4 @@ vllm serve /data/weights/Qwen2.5-0.5B-Instruct \
--max-model-len 2000 \
--max-num-batched-tokens 2000 \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--gpu-memory-utilization 0.9