Skip to content

[v0.9.1][DP][V1] Fix rank set in DP scenario & Bump torch-npu version to 2.5.1.post1.dev20250528 #1247

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jun 17, 2025

Conversation

MengqingCao
Copy link
Collaborator

What this PR does / why we need it?

Cherry-pick form #1235

  1. Fix rank set in DP scenario. The new poc version of torch-npu support setting ASCEND_RT_VISIBLE_DEVICES dynamically, thus we could use the rank set in DPEngineCoreProc directly instead of calculating local rank across dp by hand in the patched _init_data_parallel

Closes: #1170

  1. Bump torch-npu version to 2.5.1.post1.dev20250528

Closes: #1242
Closes: #1232

How was this patch tested?

CI passed with new added test.

@github-actions github-actions bot added documentation Improvements or additions to documentation ci/build module:tests labels Jun 16, 2025
@MengqingCao MengqingCao changed the title https://github.com/vllm-project/vllm-ascend/pull/1235 [DP][V1] Fix rank set in DP scenario & Bump torch-npu version to 2.5.1.post1.dev20250528 Jun 16, 2025
@Yikun Yikun changed the title [DP][V1] Fix rank set in DP scenario & Bump torch-npu version to 2.5.1.post1.dev20250528 [v0.9.1][DP][V1] Fix rank set in DP scenario & Bump torch-npu version to 2.5.1.post1.dev20250528 Jun 16, 2025
Copy link
Collaborator

@Yikun Yikun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a soft cherrypick

@MengqingCao
Copy link
Collaborator Author

This should merged after #1234

@MengqingCao
Copy link
Collaborator Author

CI failed due to not compatible with V0.9.0, will fix this later

@ganyi1996ppo
Copy link
Collaborator

Looks good

@MengqingCao MengqingCao force-pushed the dpfix091 branch 2 times, most recently from 5c6c5bc to b45f0ab Compare June 17, 2025 09:51
@MengqingCao
Copy link
Collaborator Author

@ganyi1996ppo DP will raise timeout error on A2 with this pr, thus we just skip the new added ut currently. And this could fix DP on A3, could you merge this now? The CI except for dp have all passed in https://github.com/vllm-project/vllm-ascend/actions/runs/15700818646/job/44238373529, p.s., tests/multicard/test_dynamic_npugraph_batchsize.py::test_models[True-0.0-64-2-Qwen/Qwen2.5-0.5B-Instruct] failed due to the failure of tests/multicard/test_data_parallel.py::test_data_parallel_correctness[32-Qwen/Qwen2.5-0.5B-Instruct], which dosen't tear down correctly

@ganyi1996ppo
Copy link
Collaborator

@ganyi1996ppo DP will raise timeout error on A2 with this pr, thus we just skip the new added ut currently. And this could fix DP on A3, could you merge this now? The CI except for dp have all passed in https://github.com/vllm-project/vllm-ascend/actions/runs/15700818646/job/44238373529, p.s., tests/multicard/test_dynamic_npugraph_batchsize.py::test_models[True-0.0-64-2-Qwen/Qwen2.5-0.5B-Instruct] failed due to the failure of tests/multicard/test_data_parallel.py::test_data_parallel_correctness[32-Qwen/Qwen2.5-0.5B-Instruct], which dosen't tear down correctly

Why a2 failed on the dp case? Dose this failure related to torch_npu?

Copy link

This pull request has conflicts, please resolve those before we can evaluate the pull request.

wxsIcey and others added 3 commits June 17, 2025 11:30
Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
@MengqingCao
Copy link
Collaborator Author

MengqingCao commented Jun 17, 2025

Why a2 failed on the dp case? Dose this failure related to torch_npu?

I'll look into this and fix it in next pr. Let's merge this for fixing on A3 firstly

Sorry for the wrong info, there is no bug on A2.

This failure is caused by the wrong method of enabling dp in this pr. And this has been fixed in #1273 cc @ganyi1996ppo

@ganyi1996ppo ganyi1996ppo merged commit d798125 into vllm-project:v0.9.1-dev Jun 17, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci/build documentation Improvements or additions to documentation module:tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants