[Quantization Args] Add scale and zp dtype #508
Conversation
Dipika: todo: should try a W4A16 run with a zp to make sure it is saved correctly.
I'm unsure about zp_dtype = None meaning symmetric quantization if we're going to leave symmetric as its own field. It feels like either symmetric should be deprecated or zp_dtype should be ignored when symmetric is true. I strongly dislike scale_dtype = None meaning dynamic quantization; that seems entirely unintuitive. While zp_dtype = None could be understood as "there is no zp" -> symmetric quant, scale_dtype = None has no such logical progression to dynamic quant. It also has the same issue as above of duplicating the information in the dynamic field.
The point is to make it clear in the metadata what is compressed on disk. When doing symmetric or dynamic quantization, the zp or scale (respectively) is not saved or set in the checkpoint, so having it set in the config would be extremely confusing. You can also run dynamic generations with any fp dtype depending on how you load your model, as the scale will just match the dtype of the activations, so defining it in the config doesn't make a lot of sense. In the case of zp_dtype, it is ignored if symmetric and is set as None in the config.
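To make that behavior concrete, here is a minimal sketch (not the library implementation; the function name and signature are illustrative) of how the serialized dtypes mirror what is actually stored on disk:

```python
from typing import Optional, Tuple

import torch


def serialized_qparam_dtypes(
    symmetric: bool,
    dynamic: bool,
    scale_dtype: Optional[torch.dtype],
    zp_dtype: Optional[torch.dtype],
) -> Tuple[Optional[torch.dtype], Optional[torch.dtype]]:
    # symmetric quantization stores no zero point, so zp_dtype is recorded as None
    if symmetric:
        zp_dtype = None
    # dynamic quantization computes scales at runtime in the activation dtype,
    # so no scale dtype is recorded in the config
    if dynamic:
        scale_dtype = None
    return scale_dtype, zp_dtype
```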
src/compressed_tensors/compressors/quantized_compressors/fp4_quantized.py
```python
if device is not None:
    weight_packed = weight_packed.to(device)

compressed_dict["weight_packed"] = weight_packed
compressed_dict["weight_scale"] = scale.to(quantization_args.scale_dtype)
```
Shouldn't this be round_to_quantized_type, with the eps replacement? That way you guarantee that the value is properly clamped and non-zero
This is already being applied when the scale is generated in calculate_qparams. We clamp to the fp8 range but maintain the dense dtype. We then cast to fp8 during compression.
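A minimal sketch of that flow, assuming an fp8 e4m3 scale dtype (function names here are illustrative, not the PR's code):

```python
import torch


def clamp_scale_for_fp8(scale: torch.Tensor) -> torch.Tensor:
    # clamp to the representable fp8 range while keeping the dense dtype
    finfo = torch.finfo(torch.float8_e4m3fn)
    return torch.clamp(scale, min=finfo.min, max=finfo.max)


def compress_scale(scale: torch.Tensor) -> torch.Tensor:
    # the cast to fp8 happens only when writing the compressed checkpoint
    return scale.to(torch.float8_e4m3fn)
```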
```diff
 # to further scale the local `scale` parameter
 if global_scale is not None:
-    scale = scale.to(global_scale.dtype) / global_scale
+    scale = scale / global_scale
```
Scale is still being implicitly cast to global_scale.dtype, right?
We apply the global_scale in calculate_qparams, so the scale should be fp32 here.
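A small illustration of the dtype behavior in question (not library code): dividing an fp32 scale by a lower-precision global_scale does not downcast the result, since torch type promotion keeps the wider float dtype.

```python
import torch

scale = torch.rand(4, dtype=torch.float32)
global_scale = torch.tensor(2.0, dtype=torch.bfloat16)
print((scale / global_scale).dtype)  # torch.float32
```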
```python
    return torch.clamp(tensor, finfo.min, finfo.max).to(dtype)
else:
    iinfo = torch.iinfo(dtype)
    return torch.round(torch.clamp(tensor, iinfo.min, iinfo.max))
```
Do you need a final cast?
Suggested change:

```diff
-return torch.round(torch.clamp(tensor, iinfo.min, iinfo.max))
+return torch.round(torch.clamp(tensor, iinfo.min, iinfo.max)).to(dtype)
```
We use torch.round for all of our ints. I’m maintaining existing functionality
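For context, a small illustration (not library code) of why the suggestion came up: torch.round preserves the floating dtype of its input, so the values are rounded but still stored as floats unless an explicit cast is added.

```python
import torch

t = torch.tensor([1.7, -2.3, 130.0])
iinfo = torch.iinfo(torch.int8)
rounded = torch.round(torch.clamp(t, iinfo.min, iinfo.max))
print(rounded.dtype)                 # torch.float32: round keeps the float dtype
print(rounded.to(torch.int8).dtype)  # torch.int8: only an explicit cast changes it
```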
```python
    return torch.round(torch.clamp(tensor, iinfo.min, iinfo.max))


def _round_args(tensor: torch.Tensor, args: QuantizationArgs):
```
Why don't we need to clamp in this case? Maybe there's a way to combine this with _round_dtype?
On main, we use this with an outside clamp method, which is what I've kept for now:
```python
clamped_value = torch.clamp(
```
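For reference, a minimal sketch of that clamp-then-round pattern, with the clamp applied by the caller rather than inside the rounding helper (all names besides the torch calls are illustrative):

```python
import torch


def quantize_to_int8_sketch(
    x: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor
) -> torch.Tensor:
    iinfo = torch.iinfo(torch.int8)
    # the caller clamps ("outside clamp"), then the rounding step only rounds
    clamped_value = torch.clamp(x / scale + zero_point, iinfo.min, iinfo.max)
    return torch.round(clamped_value)
```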
```python
)
scales = torch.where(
    scales == 0,
    torch.tensor(eps, dtype=scales.dtype, device=device),
```
Consider folding this torch.tensor into _get_dtype_eps
I am not going to do that here. We can consider this in a follow-up
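For illustration, a sketch of the suggested follow-up (the tensor-returning _get_dtype_eps variant shown here is an assumption, not the PR's implementation):

```python
import torch


def _get_dtype_eps(dtype: torch.dtype, device: torch.device) -> torch.Tensor:
    # return eps already wrapped as a tensor so callers can pass it to torch.where
    return torch.tensor(torch.finfo(dtype).eps, dtype=dtype, device=device)


# usage: replace zero scales with eps to avoid dividing by zero later
scales = torch.tensor([0.0, 0.5, 0.0])
scales = torch.where(scales == 0, _get_dtype_eps(scales.dtype, scales.device), scales)
```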
Summary
- is_fp4 and some of the fp4 specific functionality that was tied closely to the global scale generation
- We are not applying this logic for now but would like to discuss with the team to gather thoughts:
  - zp_dtype to None if running symmetric quantization.
  - scale_dtype to None if running dynamic or local quantization.

Question:
Example Updates:
- KV Cache Scheme:
- NVFP4:
- FP8 Dynamic:
- W4A16 + Asym
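For a flavor of what the updated example configs could look like, here is a minimal illustrative sketch of the serialized quantization args for two of the schemes above. Field names other than scale_dtype and zp_dtype follow the existing QuantizationArgs fields; the exact values and serialized forms in the PR's real examples may differ.

```python
# hypothetical serialized quantization args for the weights of a W4A16 + Asym scheme
w4a16_asym_weights = {
    "num_bits": 4,
    "type": "int",
    "symmetric": False,
    "strategy": "group",
    "group_size": 128,
    "dynamic": False,
    "scale_dtype": "bfloat16",  # scales are stored, so their dtype is recorded
    "zp_dtype": "int8",         # asymmetric, so a zero point (and its dtype) exists
}

# hypothetical FP8 dynamic activations: nothing is stored, so both dtypes are None
fp8_dynamic_activations = {
    "num_bits": 8,
    "type": "float",
    "symmetric": True,
    "strategy": "token",
    "dynamic": True,
    "scale_dtype": None,
    "zp_dtype": None,
}
```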