Support DeepSeekV3-style block FP8 quantization with CT #21337

Open · wants to merge 2 commits into base: main
Conversation

@mgoin mgoin commented Jul 21, 2025

Purpose

Redo of #20279

Relies on recent support in compressed-tensors (neuralmagic/compressed-tensors#372) and llm-compressor (vllm-project/llm-compressor#1607) to produce the models.

This PR implements W8A8 FP8 block quantization support for compressed-tensors models. It focuses on the DeepSeekV3-style format, which uses 128x128 weight blocks and 1x128 activation blocks (effectively per-token-group quantization).

Most of the logic is ported directly from fp8.py and I hope to refactor the utilities to be shared eventually.
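
As a rough illustration of what this scheme computes, here is a minimal sketch of 128x128 per-block weight quantization and 1x128 per-token-group activation quantization. The function names and naive loops are illustrative only; they are not the actual vLLM or compressed-tensors kernels.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def quantize_weight_blocks(w: torch.Tensor, block: int = 128):
    """Quantize a 2-D weight into block x block tiles, one scale per tile."""
    n, k = w.shape
    n_blk, k_blk = -(-n // block), -(-k // block)  # ceil division
    scales = torch.empty(n_blk, k_blk, dtype=torch.float32)
    q = torch.empty(n, k, dtype=torch.float8_e4m3fn)
    for i in range(n_blk):
        for j in range(k_blk):
            tile = w[i * block:(i + 1) * block, j * block:(j + 1) * block]
            s = tile.abs().amax().clamp(min=1e-12) / FP8_MAX
            scales[i, j] = s
            q[i * block:(i + 1) * block,
              j * block:(j + 1) * block] = (tile / s).to(torch.float8_e4m3fn)
    return q, scales

def quantize_activations_per_token_group(x: torch.Tensor, group: int = 128):
    """Quantize activations per token, per group of 128 channels (1x128 blocks)."""
    tokens, k = x.shape  # assumes k is divisible by `group`
    xg = x.view(tokens, k // group, group)
    s = xg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    q = (xg / s).to(torch.float8_e4m3fn).view(tokens, k)
    return q, s.squeeze(-1)  # one scale per token per group
```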

Test Plan

Manual testing with newly produced models. I'll add lm-eval tests in a follow-up PR.

Test Result

Dense

CT result:

lm_eval --model vllm --model_args pretrained=mgoin/Qwen3-0.6B-FP8-BLOCK --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
vllm (pretrained=mgoin/Qwen3-0.6B-FP8-BLOCK,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3973|±  |0.0135|
|     |       |strict-match    |     5|exact_match|↑  |0.3995|±  |0.0135|

Ref:

lm_eval --model vllm --model_args pretrained=Qwen/Qwen3-0.6B-FP8 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
vllm (pretrained=Qwen/Qwen3-0.6B-FP8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3973|±  |0.0135|
|     |       |strict-match    |     5|exact_match|↑  |0.3995|±  |0.0135|

MoE

CT result:

vllm (pretrained=mgoin/Qwen3-30B-A3B-FP8-BLOCK,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8158|±  |0.0107|
|     |       |strict-match    |     5|exact_match|↑  |0.8923|±  |0.0085|

Ref:

vllm (pretrained=Qwen/Qwen3-30B-A3B-FP8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8158|±  |0.0107|
|     |       |strict-match    |     5|exact_match|↑  |0.8923|±  |0.0085|

Signed-off-by: mgoin <mgoin64@gmail.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the deepseek Related to DeepSeek models label Jul 21, 2025

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for DeepSeekV3-style block FP8 quantization using compressed-tensors. The changes are extensive, touching several files related to quantization and introducing new logic for handling block-quantized weights, especially in MoE layers. The PR also adds support for new hardware features like DeepGEMM on Blackwell.

While the overall approach seems correct, I've found a critical issue in the control flow of process_weights_after_loading in compressed_tensors_moe.py. The logic for block-quantized and non-block-quantized paths is mixed, leading to duplicated operations and potential runtime errors. This needs to be refactored to ensure correctness.
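
To make the suggestion concrete, below is a minimal sketch of the kind of separation being proposed; the method names are hypothetical and do not correspond to the actual compressed_tensors_moe.py code.

```python
def process_weights_after_loading(self, layer) -> None:
    # Keep the two paths fully disjoint so neither falls through
    # into operations intended for the other.
    if self.weight_block_size is not None:
        # Block-quantized path: weights already carry per-block scales,
        # so only block-specific layout/kernel preparation happens here.
        self._process_block_quantized_weights(layer)  # hypothetical helper
        return
    # Non-block path: fuse per-tensor/per-channel scales and requantize
    # the weights as before.
    self._process_per_tensor_weights(layer)  # hypothetical helper
```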

Other changes, such as refactoring to decouple layers and using more specific type hints, are good improvements to the codebase.

mergify bot commented Jul 22, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @mgoin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 22, 2025
@mergify mergify bot removed the needs-rebase label Jul 22, 2025
@mgoin mgoin added and then removed the ready (ONLY add when PR is ready to merge/full CI is needed) label Jul 23, 2025