block wise quantization support #1497
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Thanks for taking a look! Left a few comments. It would also be good to make sure block-wise quantization can run on vLLM, beyond just making sure the script runs. Apparently some of this is set up in vLLM already -- #1475 (comment)
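As a quick end-to-end check, the quantized checkpoint could be loaded into vLLM and asked to generate a few tokens. A minimal sketch, assuming the quantized model was saved to a local directory (the path below is hypothetical):

from vllm import LLM, SamplingParams

llm = LLM(model="./model-block-quant")  # hypothetical output dir of the quantized model
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)       # should be coherent text, not garbage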
src/llmcompressor/observers/base.py
self._scale, self._zero_point = self.calculate_qparams(
    observed, tensor_id=None, global_scale=global_scale
)
I think the majority of your logic you'll want to have in here, or in a helper method that this calls, to help with readability.
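To illustrate that suggestion, a helper along these lines could keep calculate_qparams readable. This is a sketch only; _blockwise_min_max, its arguments, and the zero-padding strategy are assumptions, not existing API:

import torch

def _blockwise_min_max(observed: torch.Tensor, block_h: int, block_w: int):
    # Reduce a 2-D weight to per-block min/max values.
    rows, cols = observed.shape
    # Zero-pad so both dimensions divide evenly into blocks (simplest strategy;
    # note the padding can skew min/max in edge blocks).
    pad_r = (-rows) % block_h
    pad_c = (-cols) % block_w
    padded = torch.nn.functional.pad(observed, (0, pad_c, 0, pad_r))
    nbr = padded.shape[0] // block_h
    nbc = padded.shape[1] // block_w
    blocks = padded.reshape(nbr, block_h, nbc, block_w)
    mins = blocks.amin(dim=(1, 3))  # shape (num_block_rows, num_block_cols)
    maxs = blocks.amax(dim=(1, 3))
    return mins, maxs

mins, maxs = _blockwise_min_max(torch.randn(512, 1024), 128, 128)
print(mins.shape)  # torch.Size([4, 8])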
src/llmcompressor/observers/base.py
scale_tensor = torch.zeros_like(observed)
zero_point_tensor = torch.zeros_like(observed, dtype=torch.int32)
These should have shape (rows, num_blocks), similar to how group-wise is set up here.
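A sketch of that allocation, assuming a 2-D weight and blocking along the column dimension (variable names are illustrative, not the observer's actual code):

import torch

observed = torch.randn(512, 1024)  # example weight
block_size = 128                   # illustrative block width

rows, cols = observed.shape
num_blocks = (cols + block_size - 1) // block_size  # ceil division
# One scale / zero-point per (row, block), mirroring the group-wise layout:
scale_tensor = torch.zeros(rows, num_blocks, dtype=observed.dtype)
zero_point_tensor = torch.zeros(rows, num_blocks, dtype=torch.int32)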
Generally speaking, all quantization types and their applications should live in compressed-tensors.
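For context, a hedged sketch of what a block-wise scheme looks like on the compressed-tensors side. The block_structure format has changed across releases (an "RxC" string in older versions, a two-element list in newer ones), so treat the value below as an assumption and check the installed version:

from compressed_tensors.quantization import QuantizationArgs

# Block-wise INT8 weight quantization with (assumed) 128x128 blocks.
args = QuantizationArgs(
    num_bits=8,
    type="int",
    symmetric=True,
    strategy="block",
    block_structure="128x128",  # may need [128, 128] on newer releases
)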
@dsikka @brian-dellabetta Is there any progress?
So I tried moving all the code to get_params and then got a shape mismatch issue. The implementation has shape mismatches that cause runtime errors when trying to update quantization parameters.
Hi @ved1beta, thanks for the update, feel free to push the changes. This is a more difficult issue than most of our "good first issue"s. I can take a look when I have some down time.
Hi @shuxiaobo, this is lower priority given the other work going on in
Hi @ved1beta, I am going to close this in favor of the PRs to support block-wise quantization from one of the vLLM maintainers. You can see how functionality was added in these PRs:
We appreciate you taking an initial stab at this, though. The implementation here is the meat of adding it to llm-compressor, but as you can see from the PRs there are a lot of other things to consider. We're still trying to figure out how best to label good first issues and encourage community involvement.
SUMMARY:
Added support for block-wise quantization; changes in
def calculate_qparams
Fixes #1475
TEST PLAN:
The repro script from the issue passes.
EDIT:
ERROR:
RuntimeError: output with shape [1] doesn't match the broadcast shape [512, 8]
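For context, this error typically surfaces when freshly computed per-block qparams are broadcast-copied into a parameter that is still registered with the old shape. A minimal reproduction of the failure mode (illustrative only, not the PR's actual call site):

import torch

scale = torch.zeros(1)           # scale parameter still registered with shape [1]
new_scale = torch.rand(512, 8)   # per-(row, block) scales for a 512-row weight
scale.copy_(new_scale)           # RuntimeError: output with shape [1] doesn't
                                 # match the broadcast shape [512, 8]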