[MLA][Graph] Improve assertion on Graph mode with MLA #933


Merged (2 commits) on Jun 10, 2025

Conversation

MengqingCao
Collaborator

What this PR does / why we need it?

Improve assertion on Graph mode with MLA.

When running deepseek with graph mode, the fused MLA op only supports numHeads / numKvHeads ∈ {32, 64, 128}, so we improve the assertion message here to keep users from being confused by this.

Does this PR introduce any user-facing change?

Adjusting the tp size is required when running deepseek-v3/r1 with graph mode; deepseek-v2-lite is not supported in graph mode.
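Concretely, the constraint determines which tensor parallel sizes are usable. A minimal sketch of the arithmetic, assuming DeepSeek-V3/R1's 128 attention heads and a single KV head under MLA (`valid_tp_sizes` is a hypothetical helper for illustration, not part of vllm-ascend):

```python
# Ratios accepted by the fused MLA op in graph mode, per this PR.
_ALLOWED_NUM_QUERIES_PER_KV = {32, 64, 128}

def valid_tp_sizes(num_heads: int, num_kv_heads: int = 1) -> list[int]:
    """Return the TP sizes for which num_queries_per_kv stays in the allowed set."""
    sizes = []
    for tp in range(1, num_heads + 1):
        if num_heads % tp:
            continue  # heads must divide evenly across TP ranks
        if (num_heads // tp) // num_kv_heads in _ALLOWED_NUM_QUERIES_PER_KV:
            sizes.append(tp)
    return sizes

# DeepSeek-V3/R1 have 128 attention heads; with MLA the KV heads collapse to 1,
# so only tp sizes that leave 32, 64, or 128 heads per rank are usable.
print(valid_tp_sizes(128))  # -> [1, 2, 4]
# DeepSeek-V2-Lite has only 16 heads, so no tp size satisfies the constraint.
print(valid_tp_sizes(16))   # -> []
```

This is why the user-facing change above asks V3/R1 users to adjust tp and why V2-Lite is excluded outright.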

How was this patch tested?

Tested locally, as the CI machine could not run V3 due to HBM limits.

@linfeng-yuan
Contributor

linfeng-yuan commented May 22, 2025

LGTM. Note that the MLA kernel may support numHeads / numKvHeads < 16 in the future.

@MengqingCao
Collaborator Author

MengqingCao commented May 22, 2025 via email

@Yikun
Collaborator

Yikun commented May 27, 2025

LGTM. Note that the MLA kernel may support numHeads / numKvHeads < 16 in the future.

Maybe we should add a note or TODO in the code and document the current limits in the FAQ or elsewhere in the docs.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label May 28, 2025
@MengqingCao
Collaborator Author

Maybe we should add a note or TODO in the code and document the current limits in the FAQ or elsewhere in the docs.

Done, PTAL, thanks!


github-actions bot commented Jun 5, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
@wangxiyuan wangxiyuan merged commit 8dd686d into vllm-project:main Jun 10, 2025
22 of 23 checks passed
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
@MengqingCao MengqingCao deleted the ge branch June 28, 2025 01:42
@guanyuzhu

assert self.num_queries_per_kv in _ALLOWED_NUM_QUERIES_PER_KV, \
    ("When both MLA and Graph mode are enabled, the number of queries per kv"
     " must be one of {32, 64, 128}. DeepSeek-V2-Lite is therefore not"
     " supported, as it has only 16 attention heads. If you are using"
     " DeepSeek-V3 or DeepSeek-R1, make sure that after the tensor parallel"
     " split, num_heads / num_kv_heads is in {32, 64, 128}.")

What determines this constraint? Why must it be 32/64/128, and can other values such as 28 be used (or why is that not recommended)? A user-defined model with num_heads=28 and num_kv_heads=1 fails to run here.
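The quoted check can be reproduced standalone to see why a 28-head configuration is rejected. An illustrative sketch (`check_mla_graph_support` is a hypothetical name, not the actual vllm-ascend code path):

```python
# Standalone reproduction of the quoted assertion. Names mirror the snippet,
# but this is an illustration, not the real vllm-ascend module.
_ALLOWED_NUM_QUERIES_PER_KV = (32, 64, 128)

def check_mla_graph_support(num_heads: int, num_kv_heads: int) -> None:
    """Raise AssertionError unless num_heads / num_kv_heads is an allowed ratio."""
    num_queries_per_kv = num_heads // num_kv_heads
    assert num_queries_per_kv in _ALLOWED_NUM_QUERIES_PER_KV, (
        f"num_queries_per_kv={num_queries_per_kv} is not in "
        f"{_ALLOWED_NUM_QUERIES_PER_KV}; adjust the tensor parallel size so "
        "that num_heads / num_kv_heads lands on 32, 64, or 128."
    )

check_mla_graph_support(128, 1)      # DeepSeek-V3 at tp=1: passes
try:
    check_mla_graph_support(28, 1)   # the 28-head model from the question
except AssertionError as exc:
    print(f"rejected: {exc}")
```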

@MengqingCao
Collaborator Author


This constraint comes from the CANN op, and it will be lifted after #1653 and #1508.

@guanyuzhu

1. Although the restriction to multiples of 32 has been lifted, is it still recommended to set these parameters to multiples of 32? I have seen analyses suggesting that multiples of 32 allow hardware-aligned access and full utilization of hardware resources.
2. Just to confirm: after lifting this restriction, configurations with 28 heads should also work, right? (Given: head_size=576, num_heads=32, num_kv_heads=1)

shiyuan680 pushed a commit to raindaywhu/vllm-ascend that referenced this pull request Jul 7, 2025