Support DeepSeekV3-style block FP8 quantization with CT #21337

Open · wants to merge 2 commits into base: main
Conversation

@mgoin mgoin commented Jul 21, 2025

Purpose

Redo of #20279

Relies on recent support in compressed-tensors (neuralmagic/compressed-tensors#372) and llm-compressor (vllm-project/llm-compressor#1607) to produce the models.

This PR implements W8A8 FP8 block quantization support for compressed-tensors models. It focuses on the DeepSeekV3-style format, which uses 128x128 weight blocks and 1x128 activation blocks (effectively per-token-group quantization).

Most of the logic is ported directly from fp8.py and I hope to refactor the utilities to be shared eventually.
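
As a rough illustration of what this scheme computes, here is a minimal sketch of 128x128 per-block weight quantization and 1x128 per-token-group activation quantization. The function names and naive loops are illustrative only; they are not the actual vLLM or compressed-tensors kernels.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def quantize_weight_blocks(w: torch.Tensor, block: int = 128):
    """Quantize a 2-D weight into block x block tiles, one scale per tile."""
    n, k = w.shape
    n_blk, k_blk = -(-n // block), -(-k // block)  # ceil division
    scales = torch.empty(n_blk, k_blk, dtype=torch.float32)
    q = torch.empty(n, k, dtype=torch.float8_e4m3fn)
    for i in range(n_blk):
        for j in range(k_blk):
            tile = w[i * block:(i + 1) * block, j * block:(j + 1) * block]
            s = tile.abs().amax().clamp(min=1e-12) / FP8_MAX
            scales[i, j] = s
            q[i * block:(i + 1) * block,
              j * block:(j + 1) * block] = (tile / s).to(torch.float8_e4m3fn)
    return q, scales

def quantize_activations_per_token_group(x: torch.Tensor, group: int = 128):
    """Quantize activations per token, per group of 128 channels (1x128 blocks)."""
    tokens, k = x.shape  # assumes k is divisible by `group`
    xg = x.view(tokens, k // group, group)
    s = xg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    q = (xg / s).to(torch.float8_e4m3fn).view(tokens, k)
    return q, s.squeeze(-1)  # one scale per token per group
```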

Test Plan

Manual testing with newly produced models. I'll add lm-eval tests in a follow-up PR.

Test Result

Dense

CT result:

lm_eval --model vllm --model_args pretrained=mgoin/Qwen3-0.6B-FP8-BLOCK --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
vllm (pretrained=mgoin/Qwen3-0.6B-FP8-BLOCK,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3973|±  |0.0135|
|     |       |strict-match    |     5|exact_match|↑  |0.3995|±  |0.0135|

Ref:

lm_eval --model vllm --model_args pretrained=Qwen/Qwen3-0.6B-FP8 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
vllm (pretrained=Qwen/Qwen3-0.6B-FP8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3973|±  |0.0135|
|     |       |strict-match    |     5|exact_match|↑  |0.3995|±  |0.0135|

MoE

CT result:

vllm (pretrained=mgoin/Qwen3-30B-A3B-FP8-BLOCK,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8158|±  |0.0107|
|     |       |strict-match    |     5|exact_match|↑  |0.8923|±  |0.0085|

Ref:

vllm (pretrained=Qwen/Qwen3-30B-A3B-FP8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8158|±  |0.0107|
|     |       |strict-match    |     5|exact_match|↑  |0.8923|±  |0.0085|

Signed-off-by: mgoin <mgoin64@gmail.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the deepseek Related to DeepSeek models label Jul 21, 2025

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for DeepSeekV3-style block FP8 quantization using compressed-tensors. The changes are extensive, touching several files related to quantization and introducing new logic for handling block-quantized weights, especially in MoE layers. The PR also adds support for new hardware features like DeepGEMM on Blackwell.

While the overall approach seems correct, I've found a critical issue in the control flow of process_weights_after_loading in compressed_tensors_moe.py. The logic for block-quantized and non-block-quantized paths is mixed, leading to duplicated operations and potential runtime errors. This needs to be refactored to ensure correctness.
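
To make the suggestion concrete, below is a minimal sketch of the kind of separation being proposed; the method names are hypothetical and do not correspond to the actual compressed_tensors_moe.py code.

```python
def process_weights_after_loading(self, layer) -> None:
    # Keep the two paths fully disjoint so neither falls through
    # into operations intended for the other.
    if self.weight_block_size is not None:
        # Block-quantized path: weights already carry per-block scales,
        # so only block-specific layout/kernel preparation happens here.
        self._process_block_quantized_weights(layer)  # hypothetical helper
        return
    # Non-block path: fuse per-tensor/per-channel scales and requantize
    # the weights as before.
    self._process_per_tensor_weights(layer)  # hypothetical helper
```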

Other changes, such as refactoring to decouple layers and using more specific type hints, are good improvements to the codebase.

mergify bot commented Jul 22, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @mgoin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 22, 2025
@mergify mergify bot removed the needs-rebase label Jul 22, 2025
@mgoin mgoin added and then removed the ready (ONLY add when PR is ready to merge/full CI is needed) label Jul 23, 2025