-
Notifications
You must be signed in to change notification settings - Fork 279
Support multistream of MLA vector operations #1135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
546e6c7
to
5b20d64
Compare
d4eac5b
to
4967ade
Compare
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
4967ade
to
be59b9b
Compare
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
be59b9b
to
dcea3aa
Compare
dcea3aa
to
980b356
Compare
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
1 similar comment
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
980b356
to
b7e7700
Compare
b7e7700
to
fd0f6fa
Compare
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
With the expected overlaping being: ``` | q_rmsnorm | | kv_norm_rope_cache | | q_rope | | matmul W_DQ | matmul W_DKV | matmul W_UQ | split | matmul W_KV_T | ``` Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted to False. Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
fd0f6fa
to
c89bf7b
Compare
there is a pr for refactor. I suggest to block this PR until that one is merged #1169 |
decode_k_pe, decode_k_nope = self.exec_kv( | ||
hidden_states_or_kv_c_normed, cos, sin, kv_cache, | ||
attn_metadata.slot_mapping) | ||
with npu_stream_switch("mla_secondary", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this npu_stream_switch
can be used inside the torchair right? if that's so, can we further optimize the dbo path of deepseek to gain more performance boost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed with torchair colleagues, the with torch.npu.stream(...):
scheme is not supported by torchair graph mode while npu_stream_switch
only works under graph mode, so I recommend leaving this to later PRs:
- to provide a joint version of
npu_stream_switch
encapsulating both cases, as well as, - adding graph mode support for dbo
hidden_states_or_q_c = self.q_a_layernorm(ckq) | ||
use_multistream_mla = (self.enable_multistream_mla | ||
and attn_metadata is not None | ||
and attn_metadata.num_decodes > 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, multistream mla will not be triggered if there are only prefill requests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, current multistream utilities only support graph mode, i.e. npu_stream_switch
returns contextlib.nullcontext()
on non-graph mode, which is the case for prefill.
…)" This reverts commit e72f94e.
…)" This reverts commit e72f94e. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it? Move all vector operations to a secondary stream, with the expected overlaping being: ``` | q_rmsnorm | | kv_norm_rope_cache | | q_rope | | matmul W_DQ | matmul W_DKV | index | index | matmul W_UQ | split | matmul W_KV_T | ``` Currently, the `IndexByTensor` operators introduced by computation of `cos` and `sin` can't be offloaded to the secondary stream due to a known bug of graph fusion optimization pass. So we instead keep it in the main stream, only requires it be computed before `matmul W_UQ` to avoid hindering later overlapping. The problem may be solved by later optimization (vllm-project#993), which hoists the computation of `cos` and `sin` up to the first layer. ### Does this PR introduce _any_ user-facing change? Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted to False. ### How was this patch tested? Tested on 1x16 910 node, with tailored 2 layer DSKv2. Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
### What this PR does / why we need it? Move all vector operations to a secondary stream, with the expected overlaping being: ``` | q_rmsnorm | | kv_norm_rope_cache | | q_rope | | matmul W_DQ | matmul W_DKV | index | index | matmul W_UQ | split | matmul W_KV_T | ``` Currently, the `IndexByTensor` operators introduced by computation of `cos` and `sin` can't be offloaded to the secondary stream due to a known bug of graph fusion optimization pass. So we instead keep it in the main stream, only requires it be computed before `matmul W_UQ` to avoid hindering later overlapping. The problem may be solved by later optimization (vllm-project#993), which hoists the computation of `cos` and `sin` up to the first layer. ### Does this PR introduce _any_ user-facing change? Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted to False. ### How was this patch tested? Tested on 1x16 910 node, with tailored 2 layer DSKv2. Signed-off-by: sdmyzlp <lrwei2@petalmail.com> Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
### What this PR does / why we need it? Move all vector operations to a secondary stream, with the expected overlaping being: ``` | q_rmsnorm | | kv_norm_rope_cache | | q_rope | | matmul W_DQ | matmul W_DKV | index | index | matmul W_UQ | split | matmul W_KV_T | ``` Currently, the `IndexByTensor` operators introduced by computation of `cos` and `sin` can't be offloaded to the secondary stream due to a known bug of graph fusion optimization pass. So we instead keep it in the main stream, only requires it be computed before `matmul W_UQ` to avoid hindering later overlapping. The problem may be solved by later optimization (vllm-project#993), which hoists the computation of `cos` and `sin` up to the first layer. ### Does this PR introduce _any_ user-facing change? Controlled by `torchair_graph_config.enable_multistream_mla`, defaulted to False. ### How was this patch tested? Tested on 1x16 910 node, with tailored 2 layer DSKv2. Signed-off-by: sdmyzlp <lrwei2@petalmail.com> Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
What this PR does / why we need it?
Move all vector operations to a secondary stream, with the expected overlaping being:
Currently, the
IndexByTensor
operators introduced by computation ofcos
andsin
can't be offloaded to the secondary stream due to a known bug of graph fusion optimization pass. So we instead keep it in the main stream, only requires it be computed beforematmul W_UQ
to avoid hindering later overlapping. The problem may be solved by later optimization (#993), which hoists the computation ofcos
andsin
up to the first layer.Does this PR introduce any user-facing change?
Controlled by
torchair_graph_config.enable_multistream_mla
, defaulted to False.How was this patch tested?
Tested on 1x16 910 node, with tailored 2 layer DSKv2.