Support Data Parallel MOE on HPU #1022

Merged: 20 commits merged into habana_main from dev/xinyu/dpmoe-pr on Jun 23, 2025

Conversation

@xinyu-intel commented on Apr 8, 2025

Based on #947

Test command:

PT_HPU_LAZY_MODE=1 VLLM_WORKER_MULTIPROC_METHOD=spawn VLLM_USE_V1=0 VLLM_SKIP_WARMUP=true python examples/offline_inference/data_parallel.py --model="ibm-research/PowerMoE-3b" --dp-size=2 --tp-size=2
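
For context, the example script spawns one process per DP rank, each running its own engine with TP inside the rank. A minimal sketch of that pattern (a hypothetical simplification, not this PR's exact code; the `VLLM_DP_*` environment variable names are assumptions taken from the upstream vLLM example and may differ between versions):

```python
# Minimal sketch of the data-parallel pattern in
# examples/offline_inference/data_parallel.py (simplified; not the PR's
# exact code). The VLLM_DP_* variable names follow the upstream example
# and may vary by vLLM version.
import os
from multiprocessing import get_context

from vllm import LLM, SamplingParams


def dp_worker(dp_rank: int, dp_size: int, master_ip: str, master_port: int):
    # Each DP rank runs in its own process with its own engine instance.
    os.environ["VLLM_DP_RANK"] = str(dp_rank)
    os.environ["VLLM_DP_SIZE"] = str(dp_size)
    os.environ["VLLM_DP_MASTER_IP"] = master_ip
    os.environ["VLLM_DP_MASTER_PORT"] = str(master_port)

    prompts = ["Hello, my name is", "The future of AI is"] * 4
    # Shard prompts across DP ranks. Every rank must keep stepping the
    # engine so the MoE all-to-all collectives are entered by all ranks.
    shard = prompts[dp_rank::dp_size]

    llm = LLM(model="ibm-research/PowerMoE-3b", tensor_parallel_size=2)
    for out in llm.generate(shard, SamplingParams(max_tokens=16)):
        print(f"DP rank {dp_rank}: {out.outputs[0].text!r}")


if __name__ == "__main__":
    dp_size = 2
    ctx = get_context("spawn")  # matches VLLM_WORKER_MULTIPROC_METHOD=spawn
    procs = [ctx.Process(target=dp_worker, args=(r, dp_size, "127.0.0.1", 12345))
             for r in range(dp_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```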

@xinyu-intel force-pushed the dev/xinyu/dpmoe-pr branch 7 times, most recently from 1b3558b to 820ad1d on April 11, 2025 10:46
Base automatically changed from private/kzawora/rebase_mar_24 to habana_main on April 18, 2025 17:21
@xinyu-intel force-pushed the dev/xinyu/dpmoe-pr branch 3 times, most recently from b718e34 to 455cf52 on April 29, 2025 10:42
@xinyu-intel (author)

/run-gaudi-tests

1 similar comment
@michalkuligowski

/run-gaudi-tests

(4 commits added, each Signed-off-by: Xinyu Chen <xichen@habana.ai>)
@jikunshang

/run-gaudi-tests

@jikunshang

/run-gaudi-tests

@xuechendi

@xinyu-intel, please add docstrings to the hacked code in llm_engine.py so that when the Habana team rebases, they can avoid inadvertently breaking the DP path.
Also, please rebase this PR.
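
For illustration, the kind of guard comment being requested might look like this (hypothetical wording; the actual hooks are the ones this PR adds to llm_engine.py):

```python
# NOTE(HPU data-parallel MoE, PR #1022): with DP > 1, every DP rank must
# execute an engine step even when it has no local requests, so that the
# MoE all-to-all/collective ops are entered by all ranks in lockstep.
# Skipping or reordering this logic during a rebase will hang DP runs.
```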

@xuechendi

@xinyu-intel, please also add a UT; I think once this PR is merged, it will be quite easy to break during rebases.

@jikunshang

> @xinyu-intel, please also add a UT; I think once this PR is merged, it will be quite easy to break during rebases.

It's hard to add a UT: there is a known hang issue in the mixed-batch scenario.
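
Once the mixed-batch hang is resolved, even a coarse smoke test would guard the DP path during rebases. A hypothetical pytest sketch that just drives the example script end to end, reusing the test command above:

```python
# Hypothetical smoke test: run the DP example end to end and assert a
# clean exit. Env vars mirror the test command earlier in this thread.
import os
import subprocess
import sys


def test_dp_moe_smoke():
    env = dict(
        os.environ,
        PT_HPU_LAZY_MODE="1",
        VLLM_WORKER_MULTIPROC_METHOD="spawn",
        VLLM_USE_V1="0",
        VLLM_SKIP_WARMUP="true",
    )
    proc = subprocess.run(
        [sys.executable, "examples/offline_inference/data_parallel.py",
         "--model=ibm-research/PowerMoE-3b", "--dp-size=2", "--tp-size=2"],
        env=env,
        timeout=1800,  # generous budget; DP bring-up on HPU is slow
    )
    assert proc.returncode == 0
```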

@xuechendi left a comment

LGTM

@xuechendi

/run-gaudi-tests

@xuechendi enabled auto-merge (squash) on June 19, 2025 01:56
@jikunshang dismissed madamczyk-intel's stale review on June 23, 2025 02:14:

Upstream code does not support DP for v0; we implement it in this PR.

@xuechendi merged commit 316f3dd into habana_main on Jun 23, 2025
52 checks passed
@xuechendi deleted the dev/xinyu/dpmoe-pr branch on June 23, 2025 02:14
5 participants