
Conversation

@dsikka (Collaborator) commented Oct 28, 2025

Summary

  • Add the option to define a scale or zero-point dtype when defining your quantization schemes (see the sketch after this list).
  • If defined, the scale dtype is used to round the scale when generating qparams and to cast the scale at compression time.
  • This lets us remove the is_fp4 requirement and some of the fp4-specific functionality that was tied closely to global scale generation.

- We are not applying this logic for now, but would like to discuss it with the team and gather thoughts on the following:

  1. We set the zp_dtype to None if running symmetric quantization.
  2. We set the scale_dtype to None if running dynamic or local quantization.
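
For concreteness, here is a minimal sketch of how a scheme with the new dtype fields might be declared in Python. The field names mirror the serialized configs below, but treating them as direct QuantizationArgs constructor arguments is an assumption about this PR, not its confirmed API:

# Hypothetical sketch only: assumes scale_dtype / zp_dtype are accepted by
# QuantizationArgs exactly as they appear in the serialized configs below.
from compressed_tensors.quantization import QuantizationArgs

w4a16_asym_weights = QuantizationArgs(
    num_bits=4,
    type="int",
    symmetric=False,
    strategy="group",
    group_size=128,
    scale_dtype="bfloat16",  # used to round the scale at qparam generation and cast it at compression
    zp_dtype="int8",         # proposed to be None automatically when symmetric=True
)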

Question:

  • The zp_dtype for int4 is int8, but we pack the zero points to int32, and that is what gets saved to disk in the checkpoint. Does it make sense to add specific logic that sets zp_dtype to int32 when the config is saved, since that is what actually ends up in the checkpoint? I am leaning towards yes, as we want the ct config to best reflect what is in the checkpoint (see the packing sketch below).
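
As background for the packing question above, here is a small, generic bit-packing illustration of why the on-disk zero-point tensor ends up as int32: eight 4-bit values fit in one 32-bit word. This is a toy sketch of the general idea, not the compressor's actual packing routine:

import torch

def pack_int4_to_int32(values: torch.Tensor) -> torch.Tensor:
    # Toy packer: takes int4-range values stored in int8 ([-8, 7]) and packs
    # eight of them into each int32 word by keeping only the low 4 bits.
    assert values.numel() % 8 == 0
    nibbles = (values.to(torch.int32) & 0xF).view(-1, 8)
    shifts = torch.arange(0, 32, 4, dtype=torch.int32)  # bit offsets 0, 4, ..., 28
    return (nibbles << shifts).sum(dim=1, dtype=torch.int32)

zero_points = torch.randint(-8, 8, (16,), dtype=torch.int8)
packed = pack_int4_to_int32(zero_points)
print(zero_points.numel(), "int4 values ->", packed.numel(), "int32 words")  # 16 -> 2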

Example Updates:

KV Cache Scheme:

"kv_cache_scheme": {
  "actorder": null,
  "block_structure": null,
  "dynamic": false,
  "group_size": null,
  "num_bits": 8,
  "observer": "minmax",
  "observer_kwargs": {},
  "scale_dtype": "bfloat16",
  "strategy": "tensor",
  "symmetric": true,
  "type": "float",
  "zp_dtype": null
}

NVFP4:

"quantization_config": {
    "config_groups": {
      "group_0": {
        "format": "nvfp4-pack-quantized",
        "input_activations": {
          "actorder": null,
          "block_structure": null,
          "dynamic": "local",
          "group_size": 16,
          "num_bits": 4,
          "observer": "static_minmax",
          "observer_kwargs": {},
          "scale_dtype": null,
          "strategy": "tensor_group",
          "symmetric": true,
          "type": "float",
          "zp_dtype": null
        },
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": 16,
          "num_bits": 4,
          "observer": "static_minmax",
          "observer_kwargs": {},
          "scale_dtype": "float8_e4m3fn",
          "strategy": "tensor_group",
          "symmetric": true,
          "type": "float",
          "zp_dtype": null
        }
      }
    },

FP8 Dynamic:

"quantization_config": {
    "config_groups": {
      "group_0": {
        "format": "float-quantized",
        "input_activations": {
          "actorder": null,
          "block_structure": null,
          "dynamic": true,
          "group_size": null,
          "num_bits": 8,
          "observer": null,
          "observer_kwargs": {},
          "scale_dtype": null,
          "strategy": "token",
          "symmetric": true,
          "type": "float",
          "zp_dtype": null
        },
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": null,
          "num_bits": 8,
          "observer": "minmax",
          "observer_kwargs": {},
          "scale_dtype": "bfloat16",
          "strategy": "channel",
          "symmetric": true,
          "type": "float",
          "zp_dtype": null
        }
      }
    },

W4A16 + Asym:

 "quantization_config": {
    "config_groups": {
      "group_0": {
        "format": "pack-quantized",
        "input_activations": null,
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": 128,
          "num_bits": 4,
          "observer": "minmax",
          "observer_kwargs": {},
          "scale_dtype": "torch.bfloat16",
          "strategy": "group",
          "symmetric": false,
          "type": "int",
          "zp_dtype": "torch.int8"
        }
      }
    },

@dsikka dsikka marked this pull request as ready for review October 29, 2025 21:32
@dsikka (Collaborator, Author) commented Oct 29, 2025

Dipika TODO: should try a W4A16 run with a zero point to make sure it is saved correctly.

@HDCharles (Collaborator) commented:

I'm unsure about zp_dtype = None meaning symmetric quantization if we're going to leave symmetric as its own field. It feels like either symmetric should be deprecated or zp_dtype should be ignored when symmetric is true.

I strongly dislike scale_dtype = None meaning dynamic quantization; that seems entirely unintuitive. While zp_dtype = None could be understood as "there is no zp" -> symmetric quant, scale_dtype = None has no such logical progression to dynamic quant. It also has the same issue as above of duplicating the information in the dynamic field.

@dsikka (Collaborator, Author) commented Oct 30, 2025

> I'm unsure about zp_dtype = None meaning symmetric quantization if we're going to leave symmetric as its own field. It feels like either symmetric should be deprecated or zp_dtype should be ignored when symmetric is true.
>
> I strongly dislike scale_dtype = None meaning dynamic quantization; that seems entirely unintuitive. While zp_dtype = None could be understood as "there is no zp" -> symmetric quant, scale_dtype = None has no such logical progression to dynamic quant. It also has the same issue as above of duplicating the information in the dynamic field.

The point is to make it clear in the metadata what is compressed on disk. With symmetric quantization the zp is never saved or set in the checkpoint, and with dynamic quantization the same is true of the scale, so having those dtypes set in the config would be extremely confusing.

You can also run dynamic generations with any fp dtype depending on how you load your model, since the scale will just match the dtype of the activations. So having it defined in the config doesn't make a lot of sense.

In the case of the zp_dtype, it is ignored if symmetric and is set to None in the config.
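
To illustrate the dynamic case being described, here is a minimal toy sketch of per-token dynamic fp8 scale computation (the function name and details are illustrative, not the library's code): the scale is computed on the fly and simply inherits whatever dtype the activations happen to be in, so a fixed scale_dtype in the config would not describe anything stored on disk.

import torch

def dynamic_per_token_fp8_scales(x: torch.Tensor) -> torch.Tensor:
    # Toy per-token scale for fp8 (e4m3) dynamic quantization: computed at
    # runtime, never saved, and carrying the activations' own dtype.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    amax = x.abs().amax(dim=-1, keepdim=True)
    return (amax / fp8_max).clamp(min=torch.finfo(x.dtype).eps)

print(dynamic_per_token_fp8_scales(torch.randn(4, 64, dtype=torch.bfloat16)).dtype)  # torch.bfloat16
print(dynamic_per_token_fp8_scales(torch.randn(4, 64, dtype=torch.float16)).dtype)   # torch.float16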

@dsikka dsikka requested a review from kylesayrs November 3, 2025 19:56
@dsikka dsikka requested a review from kylesayrs November 5, 2025 21:34
if device is not None:
    weight_packed = weight_packed.to(device)
compressed_dict["weight_packed"] = weight_packed
compressed_dict["weight_scale"] = scale.to(quantization_args.scale_dtype)
Collaborator:

Shouldn't this be round_to_quantized_type, with the eps replacement? That way you guarantee that the value is properly clamped and non-zero

@dsikka (Collaborator, Author), Nov 5, 2025:

This is already being applied when the scale is generated in calculate_qparams. We clamp to the fp8 range but maintain the dense dtype. We then cast to fp8 during compression.
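
A small sketch of the two-step flow described in this reply, with hypothetical helper names; it illustrates "clamp in the dense dtype when generating qparams, cast to the configured scale_dtype at compression", not the PR's exact code:

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def clamp_scale_in_dense_dtype(scale: torch.Tensor) -> torch.Tensor:
    # qparam generation: keep the dense dtype (bf16/fp32) but restrict values
    # to a positive, fp8-representable range (at least eps, at most fp8 max).
    return scale.clamp(min=torch.finfo(scale.dtype).eps, max=FP8_MAX)

def cast_scale_at_compression(scale: torch.Tensor, scale_dtype=torch.float8_e4m3fn) -> torch.Tensor:
    # compression: only now cast to the configured scale_dtype.
    return scale.to(scale_dtype)

dense_scale = clamp_scale_in_dense_dtype(torch.rand(8, dtype=torch.bfloat16) * 1000)
print(dense_scale.dtype)                             # torch.bfloat16
print(cast_scale_at_compression(dense_scale).dtype)  # torch.float8_e4m3fn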

# to further scale the local `scale` parameter
if global_scale is not None:
-    scale = scale.to(global_scale.dtype) / global_scale
+    scale = scale / global_scale
Collaborator:

Scale is still being implicitly cast to global_scale.dtype, right?

@dsikka (Collaborator, Author), Nov 5, 2025:

We apply the global_scale in calculate_qparams, so the scale should be fp32 here.

    return torch.clamp(tensor, finfo.min, finfo.max).to(dtype)
else:
    iinfo = torch.iinfo(dtype)
    return torch.round(torch.clamp(tensor, iinfo.min, iinfo.max))
Collaborator:

Do you need a final cast?

Suggested change:
-    return torch.round(torch.clamp(tensor, iinfo.min, iinfo.max))
+    return torch.round(torch.clamp(tensor, iinfo.min, iinfo.max)).to(dtype)

@dsikka (Collaborator, Author):

We use torch.round for all of our ints. I’m maintaining existing functionality

return torch.round(torch.clamp(tensor, iinfo.min, iinfo.max))
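
As a quick aside on the behavior discussed here: torch.round preserves the input's floating dtype, so without the suggested final .to(dtype) the returned values are integral but the tensor is still a float tensor. A tiny standalone check:

import torch

t = torch.tensor([3.7, -9.2, 200.0], dtype=torch.bfloat16)
iinfo = torch.iinfo(torch.int8)
rounded = torch.round(torch.clamp(t, iinfo.min, iinfo.max))
print(rounded)        # values land on 4, -9, 127 (the last clamped to the int8 max)
print(rounded.dtype)  # torch.bfloat16 -- integral values, but not an integer tensor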


def _round_args(tensor: torch.Tensor, args: QuantizationArgs):
Collaborator:

Why don't we need to clamp in this case? Maybe there's a way to combine this with _round_dtype?

@dsikka (Collaborator, Author):

On main, we use this with an outside clamp method, which is what I've kept for now:

)
scales = torch.where(
    scales == 0,
    torch.tensor(eps, dtype=scales.dtype, device=device),
Collaborator:

Consider folding this torch.tensor into _get_dtype_eps

@dsikka (Collaborator, Author):

I am not going to do that here. We can consider this in a follow-up
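
For reference, a rough sketch of what that follow-up might look like. _get_dtype_eps is named in the review but its real signature is not visible here, so this tensor-returning variant is purely hypothetical:

import torch

def _get_dtype_eps_tensor(dtype: torch.dtype, device: torch.device) -> torch.Tensor:
    # Hypothetical helper: wrap the dtype's eps in a tensor so callers don't
    # build torch.tensor(...) inline at every call site.
    return torch.tensor(torch.finfo(dtype).eps, dtype=dtype, device=device)

# Caller side, replacing the inline torch.tensor(...) in the snippet above:
scales = torch.rand(4, dtype=torch.bfloat16)
scales = torch.where(scales == 0, _get_dtype_eps_tensor(scales.dtype, scales.device), scales)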

@dsikka dsikka requested a review from kylesayrs November 5, 2025 23:38
