Skip to content

Support multistream of MLA vector operations #1135

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 12, 2025

Conversation

sdmyzlp
Copy link
Contributor

@sdmyzlp sdmyzlp commented Jun 9, 2025

What this PR does / why we need it?

Move all vector operations to a secondary stream, with the expected overlaping being:

              | q_rmsnorm |                  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV | index | index |    matmul W_UQ     | split | matmul W_KV_T |

Currently, the IndexByTensor operators introduced by computation of cos and sin can't be offloaded to the secondary stream due to a known bug of graph fusion optimization pass. So we instead keep it in the main stream, only requires it be computed before matmul W_UQ to avoid hindering later overlapping. The problem may be solved by later optimization (#993), which hoists the computation of cos and sin up to the first layer.

Does this PR introduce any user-facing change?

Controlled by torchair_graph_config.enable_multistream_mla, defaulted to False.

How was this patch tested?

Tested on 1x16 910 node, with tailored 2 layer DSKv2.

@sdmyzlp sdmyzlp changed the title Support multistream MLA Support multistream of MLA vector operations Jun 9, 2025
Copy link

github-actions bot commented Jun 9, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@sdmyzlp sdmyzlp force-pushed the br_multi_stream_mla branch 2 times, most recently from 546e6c7 to 5b20d64 Compare June 9, 2025 13:08
@sdmyzlp sdmyzlp force-pushed the br_multi_stream_mla branch 2 times, most recently from d4eac5b to 4967ade Compare June 9, 2025 14:22
Copy link

github-actions bot commented Jun 9, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Copy link

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Copy link

This pull request has conflicts, please resolve those before we can evaluate the pull request.

1 similar comment
Copy link

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@sdmyzlp sdmyzlp force-pushed the br_multi_stream_mla branch from b7e7700 to fd0f6fa Compare June 11, 2025 02:49
Copy link

This pull request has conflicts, please resolve those before we can evaluate the pull request.

With the expected overlaping being:
```
              | q_rmsnorm |  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV |    matmul W_UQ     | split | matmul W_KV_T |
```
Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted
to False.

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
@wangxiyuan
Copy link
Collaborator

there is a pr for refactor. I suggest to block this PR until that one is merged #1169

decode_k_pe, decode_k_nope = self.exec_kv(
hidden_states_or_kv_c_normed, cos, sin, kv_cache,
attn_metadata.slot_mapping)
with npu_stream_switch("mla_secondary",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this npu_stream_switch can be used inside the torchair right? if that's so, can we further optimize the dbo path of deepseek to gain more performance boost.

Copy link
Contributor Author

@sdmyzlp sdmyzlp Jun 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed with torchair colleagues, the with torch.npu.stream(...): scheme is not supported by torchair graph mode while npu_stream_switch only works under graph mode, so I recommend leaving this to later PRs:

  1. to provide a joint version of npu_stream_switch encapsulating both cases, as well as,
  2. adding graph mode support for dbo

hidden_states_or_q_c = self.q_a_layernorm(ckq)
use_multistream_mla = (self.enable_multistream_mla
and attn_metadata is not None
and attn_metadata.num_decodes > 0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, multistream mla will not be triggered if there are only prefill requests?

Copy link
Contributor Author

@sdmyzlp sdmyzlp Jun 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, current multistream utilities only support graph mode, i.e. npu_stream_switch returns contextlib.nullcontext() on non-graph mode, which is the case for prefill.

@sdmyzlp sdmyzlp requested a review from ganyi1996ppo June 12, 2025 08:55
@sdmyzlp
Copy link
Contributor Author

sdmyzlp commented Jun 12, 2025

there is a pr for refactor. I suggest to block this PR until that one is merged #1169

@zzzzwwjj Discussed with xiyuan, this PR seems to not have much confliction with 1169, please have a check, thanks~

@ganyi1996ppo ganyi1996ppo merged commit e72f94e into vllm-project:main Jun 12, 2025
18 checks passed
wangxiyuan added a commit to wangxiyuan/vllm-ascend that referenced this pull request Jun 12, 2025
wangxiyuan added a commit to wangxiyuan/vllm-ascend that referenced this pull request Jun 13, 2025
…)"

This reverts commit e72f94e.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlaping being:
```
              | q_rmsnorm |                  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV | index | index |    matmul W_UQ     | split | matmul W_KV_T |
```

Currently, the `IndexByTensor` operators introduced by computation of
`cos` and `sin` can't be offloaded to the secondary stream due to a
known bug of graph fusion optimization pass. So we instead keep it in
the main stream, only requires it be computed before `matmul W_UQ` to
avoid hindering later overlapping. The problem may be solved by later
optimization (vllm-project#993), which hoists the computation of `cos` and `sin` up
to the first layer.

### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted
to False.

### How was this patch tested?
Tested on 1x16 910 node, with tailored 2 layer DSKv2.

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlaping being:
```
              | q_rmsnorm |                  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV | index | index |    matmul W_UQ     | split | matmul W_KV_T |
```

Currently, the `IndexByTensor` operators introduced by computation of
`cos` and `sin` can't be offloaded to the secondary stream due to a
known bug of graph fusion optimization pass. So we instead keep it in
the main stream, only requires it be computed before `matmul W_UQ` to
avoid hindering later overlapping. The problem may be solved by later
optimization (vllm-project#993), which hoists the computation of `cos` and `sin` up
to the first layer.

### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted
to False.

### How was this patch tested?
Tested on 1x16 910 node, with tailored 2 layer DSKv2.

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
### What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected
overlaping being:
```
              | q_rmsnorm |                  | kv_norm_rope_cache |       | q_rope |
| matmul W_DQ | matmul W_DKV | index | index |    matmul W_UQ     | split | matmul W_KV_T |
```

Currently, the `IndexByTensor` operators introduced by computation of
`cos` and `sin` can't be offloaded to the secondary stream due to a
known bug of graph fusion optimization pass. So we instead keep it in
the main stream, only requires it be computed before `matmul W_UQ` to
avoid hindering later overlapping. The problem may be solved by later
optimization (vllm-project#993), which hoists the computation of `cos` and `sin` up
to the first layer.

### Does this PR introduce _any_ user-facing change?
Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted
to False.

### How was this patch tested?
Tested on 1x16 910 node, with tailored 2 layer DSKv2.

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
@Yikun Yikun mentioned this pull request Jun 28, 2025
40 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation module:core module:tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants