Support multistream of MLA vector operations #1135

Merged

ganyi1996ppo merged 1 commit into vllm-project:main from sdmyzlp:br_multi_stream_mla

Jun 12, 2025

Contributor

sdmyzlp commented Jun 9, 2025

What this PR does / why we need it?

Move all vector operations to a secondary stream, with the expected overlapping being:

              | q_rmsnorm |                  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV | index | index |    matmul W_UQ     | split | matmul W_KV_T |

Currently, the IndexByTensor operators introduced by the computation of cos and sin can't be offloaded to the secondary stream due to a known bug in the graph fusion optimization pass. We therefore keep them on the main stream and only require that they be computed before matmul W_UQ, so that they do not hinder later overlapping. The problem may be solved by a later optimization (#993), which hoists the computation of cos and sin up to the first layer.
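
To make the intended overlap concrete, here is a rough two-stream timeline model. It is only a sketch: the `Op` class, the `simulate` and `build` helpers, the per-op costs, and the dependency edges are all assumptions made for illustration, not measurements or code from this PR. The only facts taken from the description are which ops are meant for the secondary stream and that the two index ops stay on the main stream before `matmul W_UQ`.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    stream: str        # "main" or "secondary"
    cost: int          # hypothetical duration, arbitrary units
    deps: tuple = ()   # names of ops that must finish first (assumed)


def simulate(ops):
    """Issue ops in list order; each stream runs its ops in order."""
    stream_free = {}
    finish = {}
    for op in ops:
        ready = max((finish[d] for d in op.deps), default=0)
        start = max(ready, stream_free.get(op.stream, 0))
        finish[op.name] = start + op.cost
        stream_free[op.stream] = finish[op.name]
    return finish, max(finish.values())


def build(multistream):
    # Vector ops go to the secondary stream only when multistream is on.
    vec = "secondary" if multistream else "main"
    cos_sin = ("index cos", "index sin")
    return [
        Op("matmul W_DQ", "main", 2),
        Op("q_rmsnorm", vec, 1, ("matmul W_DQ",)),
        Op("matmul W_DKV", "main", 2),
        Op("index cos", "main", 1),   # kept on the main stream
        Op("index sin", "main", 1),   # kept on the main stream
        Op("kv_norm_rope_cache", vec, 2, ("matmul W_DKV",) + cos_sin),
        Op("matmul W_UQ", "main", 2, ("q_rmsnorm",)),
        Op("split", "main", 1, ("matmul W_UQ",)),
        Op("q_rope", vec, 1, ("split",) + cos_sin),
        Op("matmul W_KV_T", "main", 2, ("split",)),
    ]


_, single_stream_time = simulate(build(multistream=False))
_, two_stream_time = simulate(build(multistream=True))
```

With these made-up costs the vector ops hide behind the matmuls, so the two-stream total is shorter than the serial one. The real gain depends on actual kernel durations on the device.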

Does this PR introduce any user-facing change?

Controlled by torchair_graph_config.enable_multistream_mla, which defaults to False.
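
As a sketch of how the flag might be set and read, assuming the options are passed as a nested dict: the `additional_config` wrapper, the `enabled` key, and the `multistream_mla_enabled` helper are assumptions for illustration. Only the `torchair_graph_config.enable_multistream_mla` name and its False default come from this PR.

```python
def multistream_mla_enabled(additional_config):
    """Hypothetical helper: read the flag, falling back to False."""
    graph_cfg = (additional_config or {}).get("torchair_graph_config", {})
    return bool(graph_cfg.get("enable_multistream_mla", False))


# Assumed shape of the config dict; only the flag name is from the PR.
additional_config = {
    "torchair_graph_config": {
        "enabled": True,                 # assumption: graph mode switched on
        "enable_multistream_mla": True,  # opt in to the secondary stream
    }
}
```

Leaving the key out keeps the previous single-stream behaviour, which matches the stated default.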

How was this patch tested?

Tested on a 1x16 910 node with a tailored 2-layer DSKv2 model.

sdmyzlp changed the title ~~Support multistream MLA~~ Support multistream of MLA vector operations

github-actions bot added documentation module:tests module:ops module:core module:quantization merge-conflicts labels

github-actions bot commented Jun 9, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

sdmyzlp force-pushed the br_multi_stream_mla branch 2 times, most recently from 546e6c7 to 5b20d64 Compare

June 9, 2025 13:08

github-actions bot removed the merge-conflicts label

sdmyzlp force-pushed the br_multi_stream_mla branch 2 times, most recently from d4eac5b to 4967ade Compare

June 9, 2025 14:22

github-actions bot added the merge-conflicts label

github-actions bot commented Jun 9, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

sdmyzlp force-pushed the br_multi_stream_mla branch from 4967ade to be59b9b Compare

June 9, 2025 22:57

github-actions bot removed the merge-conflicts label

github-actions bot commented Jun 10, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

github-actions bot added the merge-conflicts label

sdmyzlp force-pushed the br_multi_stream_mla branch from be59b9b to dcea3aa Compare

June 10, 2025 02:02

github-actions bot removed the merge-conflicts label

sdmyzlp force-pushed the br_multi_stream_mla branch from dcea3aa to 980b356 Compare

June 10, 2025 23:35

github-actions bot added merge-conflicts labels

github-actions bot commented Jun 11, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

sdmyzlp force-pushed the br_multi_stream_mla branch from 980b356 to b7e7700 Compare

June 11, 2025 02:26

github-actions bot removed merge-conflicts module:ops module:quantization labels

sdmyzlp force-pushed the br_multi_stream_mla branch from b7e7700 to fd0f6fa Compare

June 11, 2025 02:49

wangxiyuan approved these changes

github-actions bot commented Jun 11, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

github-actions bot added the merge-conflicts label


          Offload vector operations of MLA to another stream

c89bf7b

With the expected overlapping being:
```
              | q_rmsnorm |  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV |    matmul W_UQ     | split | matmul W_KV_T |
```
Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted
to False.

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>

sdmyzlp force-pushed the br_multi_stream_mla branch from fd0f6fa to c89bf7b Compare

June 11, 2025 06:21

github-actions bot removed the merge-conflicts label

Collaborator

wangxiyuan commented Jun 12, 2025

There is a refactoring PR (#1169). I suggest blocking this PR until that one is merged.

ganyi1996ppo reviewed

vllm_ascend/attention/mla_v1.py

                               decode_k_pe, decode_k_nope = self.exec_kv(
                                   hidden_states_or_kv_c_normed, cos, sin, kv_cache,
                                   attn_metadata.slot_mapping)
+                              with npu_stream_switch("mla_secondary",

Collaborator

ganyi1996ppo Jun 12, 2025

So this npu_stream_switch can be used inside torchair, right? If so, can we further optimize the DBO path of DeepSeek to get an additional performance boost?

Contributor Author

sdmyzlp Jun 12, 2025

I discussed this with torchair colleagues: the with torch.npu.stream(...): scheme is not supported by torchair graph mode, while npu_stream_switch only works under graph mode. So I recommend leaving this to later PRs, which would:

provide a joint version of npu_stream_switch encapsulating both cases, and
add graph mode support for DBO.

ganyi1996ppo reviewed

vllm_ascend/models/deepseek_v2.py

-                          hidden_states_or_q_c = self.q_a_layernorm(ckq)
+                          use_multistream_mla = (self.enable_multistream_mla
+                                                 and attn_metadata is not None
+                                                 and attn_metadata.num_decodes > 0)

Collaborator

ganyi1996ppo Jun 12, 2025

So, multistream mla will not be triggered if there are only prefill requests?

Contributor Author

sdmyzlp Jun 12, 2025

Yes. The current multistream utilities only support graph mode: npu_stream_switch returns contextlib.nullcontext() in non-graph mode, which is the case for prefill.
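
A minimal sketch of that fallback pattern, for readers unfamiliar with it. The names `stream_switch_sketch`, `_recording_stream_switch`, and `run` are hypothetical stand-ins; the real npu_stream_switch is not reproduced here, and the graph-mode branch only records events instead of switching streams.

```python
import contextlib


@contextlib.contextmanager
def _recording_stream_switch(tag, events):
    # Hypothetical stand-in for a graph-mode stream switch: it only
    # records when the scope is entered and left.
    events.append(("enter", tag))
    try:
        yield
    finally:
        events.append(("exit", tag))


def stream_switch_sketch(tag, *, graph_mode, events):
    # In non-graph mode (e.g. prefill) hand back a no-op context manager,
    # so call sites can use the same `with` block in both modes.
    if not graph_mode:
        return contextlib.nullcontext()
    return _recording_stream_switch(tag, events)


def run(graph_mode):
    events = []
    with stream_switch_sketch("mla_secondary", graph_mode=graph_mode,
                              events=events):
        events.append(("op", "q_rmsnorm"))
    return events
```

The op runs in both cases; only the graph-mode call wraps it in a stream scope, which is why prefill-only batches see no multistream behaviour.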

sdmyzlp requested a review from ganyi1996ppo

June 12, 2025 08:55

Contributor Author

sdmyzlp commented Jun 12, 2025

There is a refactoring PR (#1169). I suggest blocking this PR until that one is merged.

@zzzzwwjj I discussed this with xiyuan; this PR doesn't seem to conflict much with #1169. Please take a look, thanks~

ganyi1996ppo approved these changes

ganyi1996ppo merged commit e72f94e into vllm-project:main

18 checks passed

wangxiyuan added a commit to wangxiyuan/vllm-ascend that referenced this pull request


          Revert "Support multistream of MLA vector operations (vllm-project#1135)"

92155a3

This reverts commit e72f94e.

wangxiyuan added a commit to wangxiyuan/vllm-ascend that referenced this pull request


          Revert "Support multistream of MLA vector operations (vllm-project#1135)"

0e6099e

This reverts commit e72f94e.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request


          Support multistream of MLA vector operations (vllm-project#1135)

9d56758

### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlapping being:
```
              | q_rmsnorm |                  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV | index | index |    matmul W_UQ     | split | matmul W_KV_T |
```

Currently, the `IndexByTensor` operators introduced by the computation of
`cos` and `sin` can't be offloaded to the secondary stream due to a
known bug in the graph fusion optimization pass. We therefore keep them
on the main stream and only require that they be computed before
`matmul W_UQ`, to avoid hindering later overlapping. The problem may be
solved by a later optimization (vllm-project#993), which hoists the
computation of `cos` and `sin` up to the first layer.

### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted
to False.

### How was this patch tested?
Tested on a 1x16 910 node with a tailored 2-layer DSKv2 model.

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>

momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request


          Support multistream of MLA vector operations (vllm-project#1135)

6fec4ef

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>

momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request


          Support multistream of MLA vector operations (vllm-project#1135)

634325c

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>

Yikun mentioned this pull request

vLLM Ascend Roadmap Q2 2025 #448

Closed

40 tasks

Labels

documentation module:core module:tests