draft of dual stream launch #842

ganyi1996ppo · 2025-05-14T02:23:06Z

What this PR does / why we need it?

The MoE layer in sparse models such as DeepSeek tends to introduce significant communication overhead relative to the computation. Because the communication result is consumed directly by the routed experts, compute resources sit idle for a long time even when the shared expert executes in parallel. As the DeepEP repo indicates, slicing the input into several microbatches and overlapping one microbatch's attention computation with another microbatch's communication can reduce this overhead significantly and thus improve performance. Here we propose a draft implementation of dual stream launch.

As the figure shows, we slice the input into 2 microbatches in model_runner and maintain a thread pool that launches each stream on its own thread. Execution of the second stream is triggered only after the first stream has finished its first attention task.
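The launch order described above can be sketched roughly as follows. This is a minimal CPU-only illustration, not the code in this PR: `split_microbatches`, `dual_stream_launch`, and the `attn` / `moe_comm` log entries are hypothetical names, and a real implementation would issue work on device streams instead of appending to a log.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def split_microbatches(tokens, num_microbatches=2):
    """Slice the input into contiguous microbatches (assumes len(tokens) >= 2)."""
    size = (len(tokens) + num_microbatches - 1) // num_microbatches
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def dual_stream_launch(tokens, num_layers=2):
    """Run two microbatches on two threads; stream 1 starts only after
    stream 0 has finished its first attention step."""
    mb0, mb1 = split_microbatches(tokens, 2)
    log, lock = [], threading.Lock()
    first_attn_done = threading.Event()

    def run(stream_id, microbatch, gate, signal):
        # Stream 1 blocks here until stream 0 signals its first attention.
        if gate is not None:
            gate.wait()
        for layer in range(num_layers):
            with lock:
                log.append((stream_id, "attn", layer))      # stand-in for attention
            if signal is not None and layer == 0:
                signal.set()                                # release the other stream
            with lock:
                log.append((stream_id, "moe_comm", layer))  # stand-in for MoE communication
        return len(microbatch)

    # One worker thread per stream, as in the thread pool described above.
    with ThreadPoolExecutor(max_workers=2) as pool:
        f0 = pool.submit(run, 0, mb0, None, first_attn_done)
        f1 = pool.submit(run, 1, mb1, first_attn_done, None)
        processed = f0.result() + f1.result()
    return log, processed
```

Calling `dual_stream_launch(list(range(5)))` returns the interleaved log and the number of tokens processed. The only ordering this sketch guarantees is that stream 0's first `attn` entry precedes every entry from stream 1; the rest of the interleaving depends on thread scheduling.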

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>

draft of duel stream launch

196ae89

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>

ganyi1996ppo changed the title ~~draft of duel stream launch~~ draft of dual stream launch May 18, 2025
