
Support DeepSeekV3-style block FP8 quantization #372

Open
mgoin wants to merge 6 commits into main

Conversation

mgoin (Member) commented Jun 30, 2025

Quite a few things are packed into this one, but the goal is to support the 128x128 weight and 1x128 input quantization adopted by DeepSeek-V3 and Qwen3 models. See examples: https://huggingface.co/deepseek-ai/DeepSeek-V3 and https://huggingface.co/Qwen/Qwen3-0.6B-FP8

  • Added BLOCK static quantization paths for weight quantization.
  • Added GROUP dynamic quantization paths for per-token-group input quantization. I feel this is more understandable than the "1x128" block input quantization DeepSeek uses.
  • I’ve updated all of the places where block_structure was previously treated as an “NxM” string so that it now uses a Python list of two integers (e.g. [128, 128]). I added a pydantic validator that converts the string automatically for old checkpoints that still use it.

Here is the scheme I am proposing to support this:

# Block-wise FP8 (deepseekv3-style quantization):
# static 128x128 per-block weights and
# dynamic per-token-group activations
FP8_BLOCK = dict(
    weights=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        strategy=QuantizationStrategy.BLOCK,
        symmetric=True,
        dynamic=False,
        block_structure=[128, 128],
    ),
    input_activations=QuantizationArgs(
        num_bits=8,
        type=QuantizationType.FLOAT,
        strategy=QuantizationStrategy.GROUP,
        symmetric=True,
        dynamic=True,
        observer=None,
        group_size=128,
    ),
)
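
For reference, here is a minimal numeric sketch of what these two halves do, in plain PyTorch. This is illustrative only (not the code added in this PR), and it assumes shapes that divide evenly by 128 and float8_e4m3fn as the target dtype:

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def quantize_weight_block(w: torch.Tensor, block: int = 128):
    # Static 128x128 block scales: one scale per (block x block) weight tile.
    n, k = w.shape
    tiles = w.reshape(n // block, block, k // block, block)
    scale = tiles.abs().amax(dim=(1, 3)).clamp(min=1e-12) / FP8_MAX  # [n/block, k/block]
    w_q = (tiles / scale[:, None, :, None]).clamp(-FP8_MAX, FP8_MAX)
    return w_q.reshape(n, k).to(torch.float8_e4m3fn), scale


def quantize_input_group(x: torch.Tensor, group_size: int = 128):
    # Dynamic per-token-group scales: one scale per 128 contiguous features of each
    # token, computed at runtime (hence dynamic=True and observer=None above).
    t, k = x.shape
    groups = x.reshape(t, k // group_size, group_size)
    scale = groups.abs().amax(dim=-1).clamp(min=1e-12) / FP8_MAX  # [t, k/group_size]
    x_q = (groups / scale[..., None]).clamp(-FP8_MAX, FP8_MAX)
    return x_q.reshape(t, k).to(torch.float8_e4m3fn), scale


w_q, w_scale = quantize_weight_block(torch.randn(256, 512))
x_q, x_scale = quantize_input_group(torch.randn(4, 512))
print(w_scale.shape, x_scale.shape)  # torch.Size([2, 4]) torch.Size([4, 4])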

mgoin added 2 commits June 30, 2025 19:26
@@ -169,7 +169,7 @@ class QuantizationArgs(BaseModel, use_enum_values=True):
     symmetric: bool = True
     group_size: Optional[int] = None
     strategy: Optional[QuantizationStrategy] = None
-    block_structure: Optional[str] = None
+    block_structure: Optional[List[int]] = None
Contributor:

Suggested change:
-    block_structure: Optional[List[int]] = None
+    block_structure: Optional[Tuple[int, int]] = None

mgoin (Member, Author):

I lean towards keeping it a list, since I don't think we can distinguish between a list and a tuple in the final JSON config.
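
For context, a minimal sketch of the kind of coercion under discussion, assuming pydantic v2 (the class name is a hypothetical stand-in, not necessarily the exact validator added in this PR). It also shows why the Tuple annotation doesn't buy much: a tuple round-trips through JSON as an array, i.e. a list.

from typing import List, Optional, Union

from pydantic import BaseModel, field_validator


class ArgsSketch(BaseModel):
    # hypothetical stand-in for QuantizationArgs, for illustration only
    block_structure: Optional[List[int]] = None

    @field_validator("block_structure", mode="before")
    @classmethod
    def coerce_block_structure(cls, value: Union[None, str, list, tuple]):
        if value is None:
            return value
        if isinstance(value, str):  # legacy checkpoints: "128x128"
            return [int(dim) for dim in value.lower().split("x")]
        return list(value)  # tuples come back from JSON as lists anyway


print(ArgsSketch(block_structure="128x128").block_structure)   # [128, 128]
print(ArgsSketch(block_structure=(128, 128)).block_structure)  # [128, 128]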

         QuantizationStrategy.TOKEN,
         QuantizationStrategy.TENSOR,
         QuantizationStrategy.TENSOR_GROUP,
+        QuantizationStrategy.GROUP,
     ):
Contributor:
This is mostly an aesthetic choice, but it might have practical consequences if vLLM wants to support fused input-weight quantization, e.g. if input_quant_strategy == group and weight_quant_strategy == group.

Contributor:
Might want to add some validation on quant_scheme related to this as well
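
One possible shape for that validation, as an illustrative sketch only (pydantic v2 assumed; the names are hypothetical, not this repo's code): require block_structure when the strategy is BLOCK and group_size when it is GROUP, so a malformed scheme fails at load time.

from enum import Enum
from typing import List, Optional

from pydantic import BaseModel, model_validator


class Strategy(str, Enum):
    BLOCK = "block"
    GROUP = "group"


class SchemeArgsSketch(BaseModel):
    # hypothetical stand-in for the real args model, for illustration only
    strategy: Strategy
    group_size: Optional[int] = None
    block_structure: Optional[List[int]] = None

    @model_validator(mode="after")
    def check_strategy_fields(self):
        if self.strategy == Strategy.BLOCK and not self.block_structure:
            raise ValueError("BLOCK strategy requires block_structure, e.g. [128, 128]")
        if self.strategy == Strategy.GROUP and not self.group_size:
            raise ValueError("GROUP strategy requires a positive group_size")
        return self


SchemeArgsSketch(strategy=Strategy.BLOCK, block_structure=[128, 128])  # passes
# SchemeArgsSketch(strategy=Strategy.GROUP)  # would raise: group_size is missing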

mgoin added 4 commits July 1, 2025 00:44
dsikka (Collaborator) left a comment:

Can you produce a test model under nm-testing and add it to this PR?

@@ -111,11 +111,15 @@ def dequantize(
     elif scale.ndim == 2:
         if scale.shape[1] == 1:
             args = QuantizationArgs(strategy=QuantizationStrategy.CHANNEL)
-        else:
+        elif scale.shape[0] == 1:
             group_size = int(x_q.shape[1] / scale.shape[1])
             args = QuantizationArgs(
                 strategy=QuantizationStrategy.GROUP, group_size=group_size
             )
Collaborator:
Can we add a docstring explaining why this falls into the else condition?
Otherwise, I think this has grown complicated enough to easily fall prey to a bug

Contributor:

@dsikka Do you have a sense of why this logic exists at all, and in what cases it's used? I'm not sure that inferring the quant strategy is really safe practice.
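
A sketch of the kind of docstring/branching being asked for above, based only on the hunk shown in this thread; the final block branch is an assumption about the part of the diff that the excerpt truncates.

def infer_strategy_from_scale(x_q_shape, scale_shape):
    """Guess the quantization strategy from a 2-D scale tensor's shape.

    [rows, 1]   -> one scale per output channel              -> CHANNEL
    [1, groups] -> one scale per group of input features     -> GROUP
    anything else -> assumed to be a 2-D grid of tile scales -> BLOCK
    """
    rows, cols = scale_shape
    if cols == 1:
        return "channel", None
    if rows == 1:
        return "group", x_q_shape[1] // cols
    return "block", [x_q_shape[0] // rows, x_q_shape[1] // cols]


print(infer_strategy_from_scale((256, 512), (256, 1)))  # ('channel', None)
print(infer_strategy_from_scale((1, 512), (1, 4)))      # ('group', 128)
print(infer_strategy_from_scale((256, 512), (2, 4)))    # ('block', [128, 128])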

@kylesayrs kylesayrs self-assigned this Jul 8, 2025
@@ -154,6 +154,7 @@ def pack_fp4_to_uint8(x: torch.Tensor) -> torch.Tensor:
     [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=torch.float32
 )
+
Contributor:

nit: remove


@kylesayrs kylesayrs assigned shanjiaz and unassigned kylesayrs Jul 9, 2025