[v0.9.1][DP][V1] Fix rank set in DP scenario & Bump torch-npu version to 2.5.1.post1.dev20250528 #1247

Merged
3 commits merged on Jun 17, 2025

2 changes: 2 additions & 0 deletions .github/workflows/accuracy_test.yaml
@@ -173,6 +173,8 @@ jobs:

- name: Install vllm-project/vllm-ascend
working-directory: ./vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -r requirements-dev.txt
pip install -e .
6 changes: 6 additions & 0 deletions .github/workflows/image_openeuler.yml
@@ -19,6 +19,12 @@ on:
- '.github/workflows/image_openeuler.yml'
- 'Dockerfile.openEuler'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
# Publish image when tagging, the Dockerfile in tag will be build as tag image
branches:
6 changes: 6 additions & 0 deletions .github/workflows/image_ubuntu.yml
@@ -19,6 +19,12 @@ on:
- '.github/workflows/image_ubuntu.yml'
- 'Dockerfile'
- 'vllm_ascend/**'
- 'setup.py'
- 'pyproject.toml'
- 'requirements.txt'
- 'cmake/**'
- 'CMakeLists.txt'
- 'csrc/**'
push:
# Publish image when tagging, the Dockerfile in tag will be build as tag image
branches:
2 changes: 2 additions & 0 deletions .github/workflows/nightly_benchmarks.yaml
@@ -111,6 +111,8 @@ jobs:
VLLM_TARGET_DEVICE=empty pip install -e .

- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -e .
pip install -r benchmarks/requirements-bench.txt
2 changes: 2 additions & 0 deletions .github/workflows/vllm_ascend_test.yaml
@@ -167,6 +167,8 @@ jobs:
VLLM_TARGET_DEVICE=empty pip install -e .

- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -r requirements-dev.txt
pip install -v -e .
2 changes: 2 additions & 0 deletions .github/workflows/vllm_ascend_test_long_term.yaml
@@ -85,6 +85,8 @@ jobs:
VLLM_TARGET_DEVICE=empty pip install -e .

- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -r requirements-dev.txt
pip install -v -e .
2 changes: 2 additions & 0 deletions .github/workflows/vllm_ascend_test_pd.yaml
@@ -94,6 +94,8 @@ jobs:
VLLM_TARGET_DEVICE=empty pip install -e .

- name: Install vllm-project/vllm-ascend
env:
PIP_EXTRA_INDEX_URL: https://mirrors.huaweicloud.com/ascend/repos/pypi
run: |
pip install -r requirements-dev.txt
pip install -v -e .
3 changes: 2 additions & 1 deletion Dockerfile
@@ -46,7 +46,8 @@ RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -v -e /vllm-workspace/vllm

# Install vllm-ascend
# Append `libascend_hal.so` path (devlib) to LD_LIBRARY_PATH
RUN source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
3 changes: 2 additions & 1 deletion Dockerfile.openEuler
@@ -43,7 +43,8 @@ RUN VLLM_TARGET_DEVICE="empty" python3 -m pip install -e /vllm-workspace/vllm/ -
python3 -m pip cache purge

# Install vllm-ascend
RUN source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
RUN export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi && \
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/Ascend/ascend-toolkit/latest/`uname -i`-linux/devlib && \
python3 -m pip install -v -e /vllm-workspace/vllm-ascend/ --extra-index https://download.pytorch.org/whl/cpu/ && \
2 changes: 1 addition & 1 deletion README.md
@@ -38,7 +38,7 @@ By using vLLM Ascend plugin, popular open-source models, including Transformer-l
- Software:
* Python >= 3.9, < 3.12
* CANN >= 8.1.RC1
* PyTorch >= 2.5.1, torch-npu >= 2.5.1
* PyTorch >= 2.5.1, torch-npu >= 2.5.1.post1.dev20250528
* vLLM (the same version as vllm-ascend)

## Getting Started
2 changes: 1 addition & 1 deletion README.zh.md
@@ -39,7 +39,7 @@ vLLM 昇腾插件 (`vllm-ascend`) 是一个由社区维护的让vLLM在Ascend NP
- 软件:
* Python >= 3.9, < 3.12
* CANN >= 8.1.RC1
* PyTorch >= 2.5.1, torch-npu >= 2.5.1
* PyTorch >= 2.5.1, torch-npu >= 2.5.1.post1.dev20250528
* vLLM (与vllm-ascend版本一致)

## 开始使用
11 changes: 6 additions & 5 deletions docs/source/installation.md
@@ -9,11 +9,11 @@ This document describes how to install vllm-ascend manually.
- A hardware with Ascend NPU. It's usually the Atlas 800 A2 series.
- Software:

| Software | Supported version | Note |
|-----------|-------------------|----------------------------------------|
| CANN | >= 8.1.RC1 | Required for vllm-ascend and torch-npu |
| torch-npu | >= 2.5.1 | Required for vllm-ascend |
| torch | >= 2.5.1 | Required for torch-npu and vllm |
| Software | Supported version | Note |
|---------------|----------------------------------|-------------------------------------------|
| CANN | >= 8.1.RC1 | Required for vllm-ascend and torch-npu |
| torch-npu | >= 2.5.1.post1.dev20250528 | Required for vllm-ascend |
| torch | >= 2.5.1 | Required for torch-npu and vllm |

You have two ways to install:
- **Using pip**: first prepare env manually or via CANN image, then install `vllm-ascend` using pip.
@@ -156,6 +156,7 @@ cd ..
# Install vLLM Ascend
git clone --depth 1 --branch |vllm_ascend_version| https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
export PIP_EXTRA_INDEX_URL=https://mirrors.huaweicloud.com/ascend/repos/pypi
pip install -v -e .
cd ..
```
6 changes: 5 additions & 1 deletion requirements.txt
@@ -10,7 +10,6 @@ pyyaml
scipy
setuptools>=64
setuptools-scm>=8
torch-npu==2.5.1
torch>=2.5.1
torchvision<0.21.0
wheel
@@ -21,3 +20,8 @@ quart

# Required for N-gram speculative decoding
numba

# Install torch_npu
--pre
--extra-index-url https://mirrors.huaweicloud.com/ascend/repos/pypi
torch-npu==2.5.1.post1.dev20250528
2 changes: 1 addition & 1 deletion setup.py
@@ -152,7 +152,7 @@ def configure(self, ext: CMakeExtension) -> None:
# if pybind11 is installed via pip
pybind11_cmake_path = (subprocess.check_output(
[python_executable, "-m", "pybind11",
"--cmake"]).decode().strip())
"--cmakedir"]).decode().strip())
except subprocess.CalledProcessError as e:
# else specify pybind11 path installed from source code on CI container
raise RuntimeError(f"CMake configuration failed: {e}")
66 changes: 66 additions & 0 deletions tests/multicard/test_data_parallel.py
@@ -0,0 +1,66 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
"""
Compare the outputs of vLLM with and without data parallelism.
Run `pytest tests/multicard/test_data_parallel.py`.
"""

import os

import pytest

from tests.conftest import VllmRunner
from tests.model_utils import check_outputs_equal

MODELS = ["Qwen/Qwen2.5-0.5B-Instruct"]


@pytest.mark.skipif(True, reason="OPEN ME when dp is supported on A2")
@pytest.mark.skipif(os.getenv("VLLM_USE_V1") == "0",
reason="Data parallel only support on v1")
@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("max_tokens", [32])
def test_data_parallel_correctness(
model: str,
max_tokens: int,
) -> None:
example_prompts = [
"Hello, my name is", "The president of the United States is",
"The capital of France is", "The future of AI is"
]

with VllmRunner(model_name=model,
max_model_len=1024,
max_num_seqs=16,
data_parallel_size=2,
distributed_executor_backend="mp") as vllm_model:
vllm_dp_outputs = vllm_model.generate_greedy(example_prompts,
max_tokens)

with VllmRunner(
model_name=model,
max_model_len=1024,
max_num_seqs=16,
) as vllm_model:
vllm_outputs = vllm_model.generate_greedy(example_prompts, max_tokens)

check_outputs_equal(
outputs_0_lst=vllm_outputs,
outputs_1_lst=vllm_dp_outputs,
name_0="vllm_outputs",
name_1="vllm_dp_outputs",
)
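
For readers skimming the new test, a minimal sketch of the property it asserts: greedy outputs must be identical with and without data parallelism. The `fake_generate` helper below is purely illustrative and stands in for `generate_greedy`; it is not part of this PR or of the vLLM API.

```python
# Illustrative sketch only: the invariant test_data_parallel_correctness checks.
# Greedy decoding is deterministic, so sharding the same prompts across DP
# replicas must not change any generated text.

prompts = [
    "Hello, my name is", "The president of the United States is",
    "The capital of France is", "The future of AI is"
]

def fake_generate(prompts: list[str], data_parallel_size: int = 1) -> list[str]:
    # Stand-in for VllmRunner.generate_greedy(): deterministic per prompt and
    # independent of how prompts are distributed across DP ranks.
    return [f"{p} ..." for p in prompts]

assert fake_generate(prompts, data_parallel_size=2) == fake_generate(prompts)
```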
18 changes: 1 addition & 17 deletions vllm_ascend/patch/platform/patch_common/patch_distributed.py
@@ -21,10 +21,9 @@
import vllm.distributed
import vllm.envs as envs
from torch.distributed import ProcessGroup
from vllm.config import ParallelConfig, VllmConfig
from vllm.config import ParallelConfig
from vllm.distributed.utils import \
stateless_init_torch_distributed_process_group
from vllm.v1.engine.core import DPEngineCoreProc


def ascend_destroy_model_parallel():
@@ -79,21 +78,6 @@ def stateless_init_dp_group(self) -> "ProcessGroup":
return dp_group


def _init_data_parallel(self, vllm_config: VllmConfig):
# Configure NPUs and stateless process group for data parallel.
dp_rank = vllm_config.parallel_config.data_parallel_rank
dp_size = vllm_config.parallel_config.data_parallel_size
local_dp_rank = vllm_config.parallel_config.data_parallel_rank_local

assert dp_size > 1
assert 0 <= local_dp_rank <= dp_rank < dp_size

self.local_dp_rank = local_dp_rank
self.dp_group = vllm_config.parallel_config.stateless_init_dp_group()
self.current_wave = 0


vllm.distributed.parallel_state.destroy_model_parallel = ascend_destroy_model_parallel
DPEngineCoreProc._init_data_parallel = _init_data_parallel
ParallelConfig.get_next_dp_init_port = parallel_config_get_dp_port
ParallelConfig.stateless_init_dp_group = stateless_init_dp_group
8 changes: 1 addition & 7 deletions vllm_ascend/worker/worker_v1.py
@@ -75,12 +75,6 @@ def __init__(
distributed_init_method=distributed_init_method,
is_driver_worker=is_driver_worker)

# NOTE(Yizhou): Since we do not set ASCEND_RT_VISIBLE_DEVICES in
# vllm_ascend, we need to set the device id manually.
local_dp_rank = self.vllm_config.parallel_config.data_parallel_rank_local
world_size = self.vllm_config.parallel_config.world_size
self.local_rank_across_dp = local_dp_rank * world_size + self.local_rank

# Try to import mindie_turbo to accelerate vLLM inference.
try_register_lib(
"mindie_turbo",
@@ -124,7 +118,7 @@ def initialize_cache(self, num_gpu_blocks: int,

def init_device(self):
if self.device_config.device.type == "npu":
self.device = torch.device(f"npu:{self.local_rank_across_dp}")
self.device = torch.device(f"npu:{self.local_rank}")
NPUPlatform.set_device(self.device)
NPUPlatform.empty_cache()
self.init_npu_memory = NPUPlatform.mem_get_info()[0]
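
For context on the rank fix, a small sketch (not part of the diff) contrasting the old and new NPU device-index selection. The names `local_dp_rank`, `world_size`, and `local_rank` follow the removed code above; the example values are illustrative only.

```python
# Sketch only: old vs. new device-index choice in NPUWorker.init_device().
# The removed NOTE explained that ASCEND_RT_VISIBLE_DEVICES was not set, so the
# old code offset the device id by the DP-local rank; the fix uses local_rank
# directly, matching the deletion of that manual offset.

def old_device_index(local_dp_rank: int, world_size: int, local_rank: int) -> int:
    # Removed logic: local_rank_across_dp = local_dp_rank * world_size + local_rank
    return local_dp_rank * world_size + local_rank

def new_device_index(local_rank: int) -> int:
    # New logic: torch.device(f"npu:{self.local_rank}")
    return local_rank

# e.g. DP rank 1 with world_size=2 and local_rank=0:
assert old_device_index(local_dp_rank=1, world_size=2, local_rank=0) == 2
assert new_device_index(local_rank=0) == 0
```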