[MLA][Graph] Improve assertion on Graph mode with MLA #933


Merged (2 commits) on Jun 10, 2025

Conversation

MengqingCao
Collaborator

What this PR does / why we need it?

Improve assertion on Graph mode with MLA.

When running deepseek with graph mode, the fused MLA op only supports numHeads / numKvHeads ∈ {32, 64, 128}, so we improve the assertion message here to keep users from being confused by this.

Does this PR introduce any user-facing change?

Adjusting the tp size is required when running deepseek-v3/r1 with graph mode; deepseek-v2-lite is not supported in graph mode.
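Concretely, the constraint determines which tensor parallel sizes are usable. A minimal sketch of the arithmetic, assuming DeepSeek-V3/R1's 128 attention heads and a single KV head under MLA (`valid_tp_sizes` is a hypothetical helper for illustration, not part of vllm-ascend):

```python
# Ratios accepted by the fused MLA op in graph mode, per this PR.
_ALLOWED_NUM_QUERIES_PER_KV = {32, 64, 128}

def valid_tp_sizes(num_heads: int, num_kv_heads: int = 1) -> list[int]:
    """Return the TP sizes for which num_queries_per_kv stays in the allowed set."""
    sizes = []
    for tp in range(1, num_heads + 1):
        if num_heads % tp:
            continue  # heads must divide evenly across TP ranks
        if (num_heads // tp) // num_kv_heads in _ALLOWED_NUM_QUERIES_PER_KV:
            sizes.append(tp)
    return sizes

# DeepSeek-V3/R1 have 128 attention heads; with MLA the KV heads collapse to 1,
# so only tp sizes that leave 32, 64, or 128 heads per rank are usable.
print(valid_tp_sizes(128))  # -> [1, 2, 4]
# DeepSeek-V2-Lite has only 16 heads, so no tp size satisfies the constraint.
print(valid_tp_sizes(16))   # -> []
```

This is why the user-facing change above asks V3/R1 users to adjust tp and why V2-Lite is excluded outright.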

How was this patch tested?

Tested locally, as the CI machine could not run V3 due to HBM limits.

@linfeng-yuan
Contributor

linfeng-yuan commented May 22, 2025

LGTM. Note that the MLA kernel may support numHeads / numKvHeads < 16 in the future.

@MengqingCao
Collaborator Author

MengqingCao commented May 22, 2025 via email

@Yikun
Collaborator

Yikun commented May 27, 2025

LGTM. Note that the MLA kernel may support numHeads / numKvHeads < 16 in the future.

Maybe we should add a note or TODO in the code and document the current limits in the FAQ or elsewhere in the docs.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label May 28, 2025
@MengqingCao
Collaborator Author

Maybe we should add a note or TODO in the code and document the current limits in the FAQ or elsewhere in the docs.

Done, PTAL, thanks!


github-actions bot commented Jun 5, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
@wangxiyuan wangxiyuan merged commit 8dd686d into vllm-project:main Jun 10, 2025
22 of 23 checks passed
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
momo609 pushed a commit to momo609/vllm-ascend that referenced this pull request Jun 17, 2025
@MengqingCao MengqingCao deleted the ge branch June 28, 2025 01:42
@guanyuzhu

assert self.num_queries_per_kv in _ALLOWED_NUM_QUERIES_PER_KV, \
    ("When both MLA and Graph mode are enabled, the number of queries per kv"
     " must be one of {32, 64, 128}. DeepSeek-V2-Lite is therefore not"
     " supported, as it has only 16 attention heads. If you are using"
     " DeepSeek-V3 or DeepSeek-R1, make sure that after the tensor parallel"
     " split, num_heads / num_kv_heads is in {32, 64, 128}.")

What determines this constraint? Why must it be 32/64/128, and can other values such as 28 be used (or why is that not recommended)? A user-defined model with num_heads=28 and num_kv_heads=1 fails to run here.
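The quoted check can be reproduced standalone to see why a 28-head configuration is rejected. An illustrative sketch (`check_mla_graph_support` is a hypothetical name, not the actual vllm-ascend code path):

```python
# Standalone reproduction of the quoted assertion. Names mirror the snippet,
# but this is an illustration, not the real vllm-ascend module.
_ALLOWED_NUM_QUERIES_PER_KV = (32, 64, 128)

def check_mla_graph_support(num_heads: int, num_kv_heads: int) -> None:
    """Raise AssertionError unless num_heads / num_kv_heads is an allowed ratio."""
    num_queries_per_kv = num_heads // num_kv_heads
    assert num_queries_per_kv in _ALLOWED_NUM_QUERIES_PER_KV, (
        f"num_queries_per_kv={num_queries_per_kv} is not in "
        f"{_ALLOWED_NUM_QUERIES_PER_KV}; adjust the tensor parallel size so "
        "that num_heads / num_kv_heads lands on 32, 64, or 128."
    )

check_mla_graph_support(128, 1)      # DeepSeek-V3 at tp=1: passes
try:
    check_mla_graph_support(28, 1)   # the 28-head model from the question
except AssertionError as exc:
    print(f"rejected: {exc}")
```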

@MengqingCao
Collaborator Author


This constraint comes from the CANN op, and it will be lifted after #1653 and #1508.

@guanyuzhu

1. Although the restriction to multiples of 32 has been lifted, is it still recommended to set these parameters to multiples of 32? I have seen analyses suggesting that multiples of 32 allow hardware-aligned access and full utilization of hardware resources.
2. Just to confirm: after lifting this restriction, configurations with 28 heads should also work, right? (Given: head_size=576, num_heads=32, num_kv_heads=1)

shiyuan680 pushed a commit to raindaywhu/vllm-ascend that referenced this pull request Jul 7, 2025