Enabling MOE Quantization using linear decomposition [WIP] #2043
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Enabling MOE Quantization using linear decomposition
Summary: This PR is a first step at optimizing moe inference using
torchAO. The goal for this step is to enable existing quantization
kernels and workflows to work for moe quantization by decomposing the
group gemm into a sequence of unbalanced linear ops that can use the
existing quantized kernels. To enable this we had to add support for
quantizing these 3D tensors as well as slicing and indexing. 2 methods
of achieving this were implemented. for int8wo, int8dq, int4wo, fp8wo,
fp8dq, the underlying quantized tensor subclass was adapted to both
support 3D tensors, indexing and slicing, as well as an updated
transformation function that can handle the
ConditionalFeedForwardAOQuantizable modules if the filter funciton in
quantize_ is used to target the aforementioned module. For some complex kernels
which use packed data that couldn't be made to easily work in 3D, we
also added FakeExtraDimTensor which can transform any
quantized tensor subclass into supporting the necessary slice and index
operations for moe quantization. This option is enabled by using
MoeQuantConfig.
This can be applied to huggingface llama4 for instance as shown int he
llama4_quant.py example. Since the hf moe module is implemented in a way
that's not condusive to quantization, it first requires a module swap to
the MOEFeedForwardAOQuantizable.
TODO final benchmark numbers from run.sh, consolidate 3x implementation
of MOEFeedForwardAOQuantizable and ConditionalFeedForwardAOQuantizable.
verify hqq
Test Plan:
python test/quantization/test_moe_quant.py
python
test/torchao/experimental/tests/test_int8_dynamic_activation_intx_weight.py
-k "test_moe_quant_intx"
sh torchao/_models/mixtral-moe/run.sh
Reviewers:
Subscribers:
Tasks:
Tags: