
block wise quantization support #1497


Closed
wants to merge 2 commits

Conversation

@ved1beta (Contributor) commented Jun 1, 2025

SUMMARY:
Added support for block-wise quantization via changes in calculate_qparams:

def calculate_qparams(
       self,
       observed: Tensor,
       reduce_dims: Optional[Tuple[int]] = None,
       tensor_id: Optional[Any] = None,
       global_scale: Optional[Tensor] = None,
   ) -> Tuple[FloatTensor, IntTensor]: 

fixes #1475
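
For context, a rough sketch of what the per-block statistics could look like for a 2D weight. The function name and block_height / block_width parameters below are made up for illustration, and the resulting per-block min / max pairs would still go through the usual scale / zero-point math:

import math
from typing import Tuple

import torch
from torch import Tensor


def blockwise_min_max(
    observed: Tensor, block_height: int, block_width: int
) -> Tuple[Tensor, Tensor]:
    # per-block (min, max) for a 2D weight; each output has shape
    # (ceil(rows / block_height), ceil(cols / block_width))
    rows, cols = observed.shape
    n_row_blocks = math.ceil(rows / block_height)
    n_col_blocks = math.ceil(cols / block_width)

    mins = torch.empty(
        n_row_blocks, n_col_blocks, dtype=observed.dtype, device=observed.device
    )
    maxs = torch.empty_like(mins)

    for i in range(n_row_blocks):
        for j in range(n_col_blocks):
            block = observed[
                i * block_height : (i + 1) * block_height,
                j * block_width : (j + 1) * block_width,
            ]
            mins[i, j] = block.min()
            maxs[i, j] = block.max()
    return mins, maxs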

TEST PLAN:
The repro script from the issue passes:


from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationConfig,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"

# define a llmcompressor recipe for FP8 W8A8 quantization
# since the MoE gate layers are sensitive to quantization, we add them to the ignore
# list so they remain at full precision
recipe = [
    QuantizationModifier(
        ignore=["lm_head", "re:.*mlp.gate$"],
        config_groups={
            "group_0": QuantizationScheme(
                targets=["Linear"],
                weights=QuantizationArgs(
                    num_bits=4,
                    type=QuantizationType.INT,
                    dynamic=False,
                    symmetric=False,
                    strategy=QuantizationStrategy.BLOCK,
                    # group_size=128,
                    block_structure="128x128",
                ),
            )
        },
    )
]

SAVE_DIR = MODEL_ID + "-W4A16-BLOCK128"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="bfloat16", trust_remote_code=True
)


oneshot(
    model=model,
    recipe=recipe,
    save_compressed=True,
    output_dir=SAVE_DIR,
)

EDIT:
ERROR: RuntimeError: output with shape [1] doesn't match the broadcast shape [512, 8]


github-actions bot commented Jun 1, 2025

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@brian-dellabetta (Collaborator) left a comment


Thanks for taking a look! Left a few comments. It would also be good to make sure block-wise quantization can run on vllm, beyond just making sure the script runs. Apparently some of this is set up in vllm already -- #1475 (comment)

Comment on lines 311 to 313
self._scale, self._zero_point = self.calculate_qparams(
    observed, tensor_id=None, global_scale=global_scale
)

I think the majority of your logic you'll want to have in here, or in a helper method that this calls to help with readability.
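
A rough sketch of the structure being suggested; the class, method names, and placeholder math below are all illustrative stand-ins, not the real observer API:

import torch
from torch import Tensor


class ObserverSketch:
    # hypothetical stand-in for the observer base, only to show where
    # the block-wise branch could live
    def __init__(self, strategy: str = "block", block: int = 128):
        self.strategy = strategy
        self.block = block

    def calculate_qparams(self, observed: Tensor):
        # existing per-tensor path (placeholder math)
        scale = (observed.amax() - observed.amin()) / 255.0
        return scale, torch.zeros((), dtype=torch.int32)

    def _get_block_qparams(self, observed: Tensor):
        # hypothetical helper: one scale / zero_point per block in the grid
        rows, cols = observed.shape
        grid = (-(-rows // self.block), -(-cols // self.block))
        return torch.ones(grid), torch.zeros(grid, dtype=torch.int32)

    def get_qparams(self, observed: Tensor):
        # the branch the review comment suggests keeping at this call site
        # (or in a helper), rather than inside calculate_qparams itself
        if self.strategy == "block":
            return self._get_block_qparams(observed)
        return self.calculate_qparams(observed)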

Comment on lines 122 to 123
scale_tensor = torch.zeros_like(observed)
zero_point_tensor = torch.zeros_like(observed, dtype=torch.int32)

these should have shape (rows, num_blocks), similar to how group-wise is set up here
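
A small sketch of that difference in shapes, using illustrative sizes (a 512x1024 weight, 128-wide groups, 128x128 blocks); the exact layout convention is whatever compressed-tensors settles on:

import math

import torch

observed = torch.randn(512, 1024)  # e.g. a Linear weight
group_size = 128
block_height, block_width = 128, 128

rows, cols = observed.shape

# group-wise: one scale per group of columns in each row -> (rows, num_groups)
num_groups = math.ceil(cols / group_size)
scale_group = torch.empty(rows, num_groups)

# block-wise: one scale per entry in the block grid, not a tensor
# the full size of the weight as in zeros_like(observed)
num_row_blocks = math.ceil(rows / block_height)
num_col_blocks = math.ceil(cols / block_width)
scale_block = torch.empty(num_row_blocks, num_col_blocks)
zero_point_block = torch.empty(num_row_blocks, num_col_blocks, dtype=torch.int32)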

@dsikka (Collaborator) left a comment


Generally speaking, all quantization types and their applications should live in compressed tensors

@shuxiaobo

@dsikka @brian-dellabetta Is there any progress?

@ved1beta (Contributor, Author) commented Jun 9, 2025

So I tried moving all the code to get_params and then hit a shape mismatch issue: the implementation produces shape mismatches that cause runtime errors when updating the quantization parameters.
The main error is: output with shape [1] doesn't match the broadcast shape [512, 8] (or [768, 6]). This happens because a scalar parameter is being used where a 2D tensor with specific dimensions is expected.
Should I push the changes for you to have a look? I am not sure what to do next.
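
For what it's worth, the broadcast failure can be reproduced in isolation. The sketch below assumes a scale that was registered with the scalar, tensor-wise shape while the observer now returns a (512, 8) tensor of per-block scales; the real code path differs, but the error message comes from the same kind of in-place shape mismatch:

import torch

scale = torch.zeros(1)          # scale left at its scalar, tensor-wise shape
new_scale = torch.rand(512, 8)  # per-block scales produced by the observer

# in-place broadcast cannot write a [512, 8] result into a [1] tensor:
# RuntimeError: output with shape [1] doesn't match the broadcast shape [512, 8]
scale.mul_(new_scale)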

@brian-dellabetta (Collaborator)

Hi @ved1beta, thanks for the update, feel free to push the changes. This is a more difficult issue than most of our "good first issue"s. I can take a look when I have some downtime.

Hi @shuxiaobo, this is lower priority given the other work going on in llm-compressor, so it might take some time. And we'll have to figure out which configurations are optimized to work well in vllm. Consider using group instead of block for your runs, which has good support already.
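
For anyone taking the group-wise route in the meantime, here is a sketch of the recipe from the test plan above with the strategy swapped to GROUP and block_structure replaced by group_size; treat it as a starting point rather than a vetted recipe:

from llmcompressor.modifiers.quantization import QuantizationModifier
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)

recipe = [
    QuantizationModifier(
        ignore=["lm_head", "re:.*mlp.gate$"],
        config_groups={
            "group_0": QuantizationScheme(
                targets=["Linear"],
                weights=QuantizationArgs(
                    num_bits=4,
                    type=QuantizationType.INT,
                    dynamic=False,
                    symmetric=False,
                    strategy=QuantizationStrategy.GROUP,
                    group_size=128,
                ),
            )
        },
    )
]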

@ved1beta requested a review from dsikka June 11, 2025 09:34
@brian-dellabetta (Collaborator) commented Jun 30, 2025

Hi @ved1beta, I am going to close this in favor of the PRs to support block-wise quantization from one of the vllm maintainers. You can see how the functionality was added in these PRs:

We appreciate you taking an initial stab at this, though. The implementation here is the meat of adding it to llm-compressor, but as you can see from the PRs there are a lot of other things to consider. We're still trying to figure out how best to label good first issues and encourage community involvement.
