
Conversation

@dsikka (Collaborator) commented Oct 28, 2025

Summary

  • Add the option to define a scale or zero-point dtype when defining your quantization schemes (see the sketch after this list).
  • If defined, the scale dtype is used to round the scale when generating qparams and to cast the scale at compression time.
  • This lets us remove the is_fp4 requirement and some of the fp4-specific functionality that was tied closely to global scale generation.

- We are not applying this logic for now, but would like to discuss it with the team and gather thoughts on the following:

  1. We set the zp_dtype to None if running symmetric quantization.
  2. We set the scale_dtype to None if running dynamic or local quantization.
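
For concreteness, here is a minimal sketch of how a scheme with the new dtype fields might be declared in Python. The field names mirror the serialized configs below, but treating them as direct QuantizationArgs constructor arguments is an assumption about this PR, not its confirmed API:

# Hypothetical sketch only: assumes scale_dtype / zp_dtype are accepted by
# QuantizationArgs exactly as they appear in the serialized configs below.
from compressed_tensors.quantization import QuantizationArgs

w4a16_asym_weights = QuantizationArgs(
    num_bits=4,
    type="int",
    symmetric=False,
    strategy="group",
    group_size=128,
    scale_dtype="bfloat16",  # used to round the scale at qparam generation and cast it at compression
    zp_dtype="int8",         # proposed to be None automatically when symmetric=True
)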

Question:

  • The zp_dtype for int4 is int8, but we pack the zero points to int32, and that is what gets saved to disk in the checkpoint. Does it make sense to add specific logic that sets zp_dtype to int32 when the config is saved, since that is what actually ends up in the checkpoint? I am leaning towards yes, as we want the ct config to best reflect what is in the checkpoint (see the packing sketch below).
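
As background for the packing question above, here is a small, generic bit-packing illustration of why the on-disk zero-point tensor ends up as int32: eight 4-bit values fit in one 32-bit word. This is a toy sketch of the general idea, not the compressor's actual packing routine:

import torch

def pack_int4_to_int32(values: torch.Tensor) -> torch.Tensor:
    # Toy packer: takes int4-range values stored in int8 ([-8, 7]) and packs
    # eight of them into each int32 word by keeping only the low 4 bits.
    assert values.numel() % 8 == 0
    nibbles = (values.to(torch.int32) & 0xF).view(-1, 8)
    shifts = torch.arange(0, 32, 4, dtype=torch.int32)  # bit offsets 0, 4, ..., 28
    return (nibbles << shifts).sum(dim=1, dtype=torch.int32)

zero_points = torch.randint(-8, 8, (16,), dtype=torch.int8)
packed = pack_int4_to_int32(zero_points)
print(zero_points.numel(), "int4 values ->", packed.numel(), "int32 words")  # 16 -> 2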

Example Updates:

KV Cache Scheme:

"kv_cache_scheme": {
  "actorder": null,
  "block_structure": null,
  "dynamic": false,
  "group_size": null,
  "num_bits": 8,
  "observer": "minmax",
  "observer_kwargs": {},
  "scale_dtype": "bfloat16",
  "strategy": "tensor",
  "symmetric": true,
  "type": "float",
  "zp_dtype": null
}

NVFP4:

"quantization_config": {
    "config_groups": {
      "group_0": {
        "format": "nvfp4-pack-quantized",
        "input_activations": {
          "actorder": null,
          "block_structure": null,
          "dynamic": "local",
          "group_size": 16,
          "num_bits": 4,
          "observer": "static_minmax",
          "observer_kwargs": {},
          "scale_dtype": null,
          "strategy": "tensor_group",
          "symmetric": true,
          "type": "float",
          "zp_dtype": null
        },
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": 16,
          "num_bits": 4,
          "observer": "static_minmax",
          "observer_kwargs": {},
          "scale_dtype": "float8_e4m3fn",
          "strategy": "tensor_group",
          "symmetric": true,
          "type": "float",
          "zp_dtype": null
        }
      }
    },

FP8 Dynamic:

"quantization_config": {
    "config_groups": {
      "group_0": {
        "format": "float-quantized",
        "input_activations": {
          "actorder": null,
          "block_structure": null,
          "dynamic": true,
          "group_size": null,
          "num_bits": 8,
          "observer": null,
          "observer_kwargs": {},
          "scale_dtype": null,
          "strategy": "token",
          "symmetric": true,
          "type": "float",
          "zp_dtype": null
        },
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": null,
          "num_bits": 8,
          "observer": "minmax",
          "observer_kwargs": {},
          "scale_dtype": "bfloat16",
          "strategy": "channel",
          "symmetric": true,
          "type": "float",
          "zp_dtype": null
        }
      }
    },

W4A16 + Asym:

 "quantization_config": {
    "config_groups": {
      "group_0": {
        "format": "pack-quantized",
        "input_activations": null,
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": 128,
          "num_bits": 4,
          "observer": "minmax",
          "observer_kwargs": {},
          "scale_dtype": "torch.bfloat16",
          "strategy": "group",
          "symmetric": false,
          "type": "int",
          "zp_dtype": "torch.int8"
        }
      }
    },

@dsikka dsikka marked this pull request as ready for review October 29, 2025 21:32
@dsikka (Collaborator, Author) commented Oct 29, 2025

Dipika TODO: should try a W4A16 run with a zero point to make sure it is saved correctly.

@HDCharles (Collaborator) commented:

I'm unsure about zp_dtype = None meaning symmetric quantization if we're going to leave symmetric as its own field. It feels like either symmetric should be deprecated or zp_dtype should be ignored when symmetric is true.

I strongly dislike scale_dtype = None meaning dynamic quantization; that seems entirely unintuitive. While zp_dtype = None could be understood as "there is no zp" -> symmetric quant, scale_dtype = None has no such logical progression to dynamic quant. It also has the same issue as above of duplicating the information in the dynamic field.

@dsikka (Collaborator, Author) commented Oct 30, 2025

> I'm unsure about zp_dtype = None meaning symmetric quantization if we're going to leave symmetric as its own field. It feels like either symmetric should be deprecated or zp_dtype should be ignored when symmetric is true.
>
> I strongly dislike scale_dtype = None meaning dynamic quantization; that seems entirely unintuitive. While zp_dtype = None could be understood as "there is no zp" -> symmetric quant, scale_dtype = None has no such logical progression to dynamic quant. It also has the same issue as above of duplicating the information in the dynamic field.

The point is to make it clear in the metadata what is compressed on disk. With symmetric quantization the zp is never saved or set in the checkpoint, and with dynamic quantization the same is true of the scale, so having those dtypes set in the config would be extremely confusing.

You can also run dynamic generations with any fp dtype depending on how you load your model, since the scale will just match the dtype of the activations. So having it defined in the config doesn't make a lot of sense.

In the case of the zp_dtype, it is ignored if symmetric and is set to None in the config.
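
To illustrate the dynamic case being described, here is a minimal toy sketch of per-token dynamic fp8 scale computation (the function name and details are illustrative, not the library's code): the scale is computed on the fly and simply inherits whatever dtype the activations happen to be in, so a fixed scale_dtype in the config would not describe anything stored on disk.

import torch

def dynamic_per_token_fp8_scales(x: torch.Tensor) -> torch.Tensor:
    # Toy per-token scale for fp8 (e4m3) dynamic quantization: computed at
    # runtime, never saved, and carrying the activations' own dtype.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    amax = x.abs().amax(dim=-1, keepdim=True)
    return (amax / fp8_max).clamp(min=torch.finfo(x.dtype).eps)

print(dynamic_per_token_fp8_scales(torch.randn(4, 64, dtype=torch.bfloat16)).dtype)  # torch.bfloat16
print(dynamic_per_token_fp8_scales(torch.randn(4, 64, dtype=torch.float16)).dtype)   # torch.float16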

@dsikka dsikka requested a review from kylesayrs November 3, 2025 19:56
@dsikka dsikka requested a review from kylesayrs November 5, 2025 21:34
if device is not None:
    weight_packed = weight_packed.to(device)
compressed_dict["weight_packed"] = weight_packed
compressed_dict["weight_scale"] = scale.to(quantization_args.scale_dtype)
Collaborator:

Shouldn't this be round_to_quantized_type, with the eps replacement? That way you guarantee that the value is properly clamped and non-zero

@dsikka (Collaborator, Author), Nov 5, 2025:

This is already being applied when the scale is generated in calculate_qparams. We clamp to the fp8 range but maintain the dense dtype. We then cast to fp8 during compression.
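
A small sketch of the two-step flow described in this reply, with hypothetical helper names; it illustrates "clamp in the dense dtype when generating qparams, cast to the configured scale_dtype at compression", not the PR's exact code:

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def clamp_scale_in_dense_dtype(scale: torch.Tensor) -> torch.Tensor:
    # qparam generation: keep the dense dtype (bf16/fp32) but restrict values
    # to a positive, fp8-representable range (at least eps, at most fp8 max).
    return scale.clamp(min=torch.finfo(scale.dtype).eps, max=FP8_MAX)

def cast_scale_at_compression(scale: torch.Tensor, scale_dtype=torch.float8_e4m3fn) -> torch.Tensor:
    # compression: only now cast to the configured scale_dtype.
    return scale.to(scale_dtype)

dense_scale = clamp_scale_in_dense_dtype(torch.rand(8, dtype=torch.bfloat16) * 1000)
print(dense_scale.dtype)                             # torch.bfloat16
print(cast_scale_at_compression(dense_scale).dtype)  # torch.float8_e4m3fn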

# to further scale the local `scale` parameter
if global_scale is not None:
-    scale = scale.to(global_scale.dtype) / global_scale
+    scale = scale / global_scale
Collaborator:

Scale is still being implicitly cast to global_scale.dtype, right?

@dsikka (Collaborator, Author), Nov 5, 2025:

We apply the global_scale in calculate_qparams, so the scale should be fp32 here.

    return torch.clamp(tensor, finfo.min, finfo.max).to(dtype)
else:
    iinfo = torch.iinfo(dtype)
    return torch.round(torch.clamp(tensor, iinfo.min, iinfo.max))
Collaborator:

Do you need a final cast?

Suggested change:
-    return torch.round(torch.clamp(tensor, iinfo.min, iinfo.max))
+    return torch.round(torch.clamp(tensor, iinfo.min, iinfo.max)).to(dtype)

@dsikka (Collaborator, Author):

We use torch.round for all of our ints. I’m maintaining existing functionality

return torch.round(torch.clamp(tensor, iinfo.min, iinfo.max))
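
As a quick aside on the behavior discussed here: torch.round preserves the input's floating dtype, so without the suggested final .to(dtype) the returned values are integral but the tensor is still a float tensor. A tiny standalone check:

import torch

t = torch.tensor([3.7, -9.2, 200.0], dtype=torch.bfloat16)
iinfo = torch.iinfo(torch.int8)
rounded = torch.round(torch.clamp(t, iinfo.min, iinfo.max))
print(rounded)        # values land on 4, -9, 127 (the last clamped to the int8 max)
print(rounded.dtype)  # torch.bfloat16 -- integral values, but not an integer tensor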


def _round_args(tensor: torch.Tensor, args: QuantizationArgs):
Collaborator:

Why don't we need to clamp in this case? Maybe there's a way to combine this with _round_dtype?

@dsikka (Collaborator, Author):

On main, we use this with an outside clamp method, which is what I've kept for now:

)
scales = torch.where(
    scales == 0,
    torch.tensor(eps, dtype=scales.dtype, device=device),
Collaborator:

Consider folding this torch.tensor into _get_dtype_eps

@dsikka (Collaborator, Author):

I am not going to do that here. We can consider this in a follow-up
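
For reference, a rough sketch of what that follow-up might look like. _get_dtype_eps is named in the review but its real signature is not visible here, so this tensor-returning variant is purely hypothetical:

import torch

def _get_dtype_eps_tensor(dtype: torch.dtype, device: torch.device) -> torch.Tensor:
    # Hypothetical helper: wrap the dtype's eps in a tensor so callers don't
    # build torch.tensor(...) inline at every call site.
    return torch.tensor(torch.finfo(dtype).eps, dtype=dtype, device=device)

# Caller side, replacing the inline torch.tensor(...) in the snippet above:
scales = torch.rand(4, dtype=torch.bfloat16)
scales = torch.where(scales == 0, _get_dtype_eps_tensor(scales.dtype, scales.device), scales)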

@dsikka dsikka requested a review from kylesayrs November 5, 2025 23:38
