[bugfix] Add ep initialization check and change the return check to is_driver_worker #896
Conversation
port = int(os.environ.get("MASTER_PORT", answer)) # type: ignore
port = int(os.environ.get("VLLM_DP_MASTER_PORT", answer)) # type: ignore
Would using envs.VLLM_DP_MASTER_PORT be better here?
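For illustration only, a minimal sketch of the suggested change, assuming vllm.envs exposes VLLM_DP_MASTER_PORT (as in recent vllm releases); `answer` stands in for the fallback port already computed in the surrounding code:

```python
# Hedged sketch, not the exact diff: prefer vllm's env accessor over a raw
# os.environ lookup for the data-parallel master port.
import vllm.envs as envs


def resolve_dp_master_port(answer: int) -> int:
    # VLLM_DP_MASTER_PORT is assumed to be falsy (0/unset) when not configured,
    # in which case we fall back to the locally computed port.
    return int(envs.VLLM_DP_MASTER_PORT or answer)
```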
fixed
force-pushed from 9df5c0f to 293eefe
from torch.distributed import ProcessGroup
from torch.distributed.distributed_c10d import (Backend, PrefixStore,
                                                _get_default_timeout,
                                                is_nccl_available)
from torch.distributed.rendezvous import rendezvous
from vllm.config import ParallelConfig

_DP_GROUP = None
There is already a process group for DP in vllm now, so why do we add this here?
This is used to determine whether to execute the dummy_run for the prefill process. The native stateless process group does not provide a global variable we can query.
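To make the pattern concrete, a minimal sketch (the helper names init_dp_group/get_dp_group are illustrative, not necessarily those in the diff) of caching the stateless DP process group in a module-level global so later code, such as the prefill dummy_run decision, can query it:

```python
from typing import Optional

from torch.distributed import ProcessGroup

# Module-level handle; stays None when data parallelism is not enabled.
_DP_GROUP: Optional[ProcessGroup] = None


def init_dp_group(dp_group: ProcessGroup) -> None:
    # Cache the stateless DP process group once it has been created.
    global _DP_GROUP
    _DP_GROUP = dp_group


def get_dp_group() -> Optional[ProcessGroup]:
    # Callers (e.g. the dummy-run check in the model runner) treat None as
    # "no DP group available".
    return _DP_GROUP
```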
@@ -21,12 +21,18 @@ def get_etp_group() -> GroupCoordinator:
    return _ETP


def model_parallel_initialized():
    return (_ETP is not None and _EP is not None)
I think we could use EP without ETP, so this check would break that scenario.
No. Even if ETP is not enabled, the communication groups are still created.
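As a sketch of why the joint check is safe under that assumption (the initializer name init_ascend_model_parallel is an assumption here, and _ETP/_EP come from the module being discussed): both coordinators are created in the same initialization path, so testing both globals simply asks whether that path has run.

```python
def model_parallel_initialized() -> bool:
    # Both groups are created together during initialization, even when expert
    # tensor parallelism is effectively disabled (ETP size 1), so this is
    # equivalent to "the expert-parallel groups are ready".
    return _ETP is not None and _EP is not None


def ensure_model_parallel_initialized(ep_size: int, etp_size: int = 1) -> None:
    # Illustrative guard: skip re-initialization when the groups already exist.
    if model_parallel_initialized():
        return
    init_ascend_model_parallel(ep_size, etp_size)  # assumed initializer name
```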
@NINGBENZHE Thanks for the details! FYI, vllm-project/vllm#18763 has been merged now. But I think the global dp group should maybe still be kept in
@zzzzwwjj please take a look at this PR; it fixes a bug in graph mode.
    intermediate_tensors=intermediate_tensors,
    inputs_embeds=inputs_embeds)
if self.enable_torchair_graph_mode and attn_state == AscendAttentionState.DecodeOnly:
    attn_metadata = self.attn_metadata_builder.dummy_build(
I can't seem to find where this dummy_build is implemented.
This PR: #839
        num_reqs=num_tokens, num_actual_tokens=1)
torch._dynamo.mark_static(input_ids)
torch._dynamo.mark_static(positions)
torch._dynamo.mark_static(attn_metadata.decode.block_table)
How do you implement this dummy block_table? Did you request the free block ids from the scheduler?
This PR: #839
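Purely as an illustration (the real implementation lives in #839 and may well differ), a dummy decode block table does not need real block ids from the scheduler; a zero-filled tensor of the expected shape is enough for a throwaway run whose outputs are discarded:

```python
import torch


def make_dummy_block_table(num_reqs: int, max_blocks_per_req: int,
                           device: torch.device = torch.device("cpu")) -> torch.Tensor:
    # Hypothetical helper: block id 0 points at allocated KV-cache memory, so
    # zeros of shape [num_reqs, max_blocks_per_req] are valid inputs for a dummy
    # decode step (on Ascend the device would be the NPU).
    return torch.zeros(num_reqs, max_blocks_per_req,
                       dtype=torch.int32, device=device)
```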
    attn_metadata.attn_state == AscendAttentionState.DecodeOnly)

if self.dp_group:
    while not self.has_prefilled and self.enable_torchair_graph_mode and attn_metadata.attn_state == AscendAttentionState.DecodeOnly:
This part is quite complicated. It is aimed at mixed running of decode and prefill across DP ranks, right? Can you extract this part out with a simpler strategy? It may trigger unexpected errors when someone tries to update this code.
Fixed. It will be removed along with its related calls once the official solution is implemented.
@@ -624,6 +629,9 @@ def _process_reqs(
input_ids = torch.cat([input_ids, padding])
positions = torch.cat([positions, padding])

if self.enable_torchair_graph_mode:
    self.sync_prefill_when_enable_graph(attn_metadata)
Can you add more comments here? This prefill/decode sync code will be removed in the future in favor of a more robust implementation.
This implementation relies on PR #839 and is more of a workaround for an urgent task. As just discussed with @NINGBENZHE, some of the main code pieces will be wrapped into functions and removed once a more robust implementation is delivered. This should help with code maintenance and keep us from diverging too far from the vllm main tree.
@wangxiyuan @Yikun Please be aware of this.
@@ -685,6 +693,41 @@ def _process_reqs(
return (attn_metadata, hidden_states, spec_decode_metadata, positions,
        total_num_scheduled_tokens, sample_indices)


def sync_prefill_when_enable_graph(self, attn_metadata):
This part will be removed after chunked MC2 support lands, cc @zzzzwwjj
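As a rough, hedged sketch of the synchronization idea behind sync_prefill_when_enable_graph (an interpretation of the workaround, not the diff itself): every DP rank reports whether it is still running a prefill step, and decode-only ranks in torchair graph mode keep issuing eager dummy prefills until no rank reports prefilling, so the expert-parallel collectives stay aligned.

```python
import torch
import torch.distributed as dist


def dp_ranks_done_prefilling(is_prefill_step: bool,
                             dp_group: dist.ProcessGroup) -> bool:
    # Each DP rank contributes 1 while it is still in a prefill step; MAX over
    # the group is therefore 1 as long as any rank is still prefilling.
    flag = torch.tensor([int(is_prefill_step)], dtype=torch.int32)
    dist.all_reduce(flag, op=dist.ReduceOp.MAX, group=dp_group)
    return flag.item() == 0
```

A decode-only rank that gets False back would then run an eager dummy prefill (a _dummy_run-style call) instead of the compiled decode graph for that step; once the check returns True, has_prefilled can be latched and the graph path used unconditionally.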
DP support will soon be implemented in the next PR, so I think this workaround could temporarily not be merged to main?
When will the next PR be merged?
After #839 is merged.
Would it be possible to merge this first, since another feature depends on the current functionality? When you submit the next PR, you could remove the related content.
If this PR is so urgent, perhaps you can add an option so these changes are disabled by default?
@wangxiyuan can you look at this PR and decide whether it needs to be merged?
The code is written in model_runner_v1 and comments have been added.
The DP dummy-run support on torchair will be handled in another PR; this PR will not be merged into vllm-ascend.
force-pushed from 2a308fa to 552daf0
@@ -66,8 +66,7 @@ def fused_experts_with_mc2(
local_rank = torch.distributed.get_rank(group=ep_group)
all_to_all_group_size = torch.distributed.get_world_size(ep_group)

world_szie = torch.distributed.get_world_size()
tp_size = world_szie // all_to_all_group_size
tp_size = get_etp_group().world_size
@ganyi1996ppo please double check this change.
looks fine
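For context, a small sketch contrasting the two ways of obtaining tp_size (an illustration of the change, not the diff itself): the old code derived it from the global world size, which assumes the world factors only into EP × ETP, whereas the new code asks the ETP coordinator directly.

```python
import torch.distributed as dist


def derive_tp_size_old(ep_group: dist.ProcessGroup) -> int:
    # Old approach (modulo the "world_szie" typo): infer the expert TP width by
    # dividing the global world size by the EP group size.
    world_size = dist.get_world_size()
    return world_size // dist.get_world_size(ep_group)

# New approach in the diff: read the size straight from the ETP coordinator,
# i.e. tp_size = get_etp_group().world_size, which does not depend on how the
# rest of the world is partitioned.
```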
The DP-related changes have been removed. I'm fine with this PR. Thanks for the fix.
Signed-off-by: ningbenzhe1 <ningbenzhe@huawei.com>
Please update the PR title and message to be more meaningful.
Fixed.
What this PR does / why we need it?
Fixes a bug where graph mode treats prefill (P) and decode (D) the same, along with some other bugs.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Tested via the end-to-end tests.