[Feat]Unquantized Linear to nz and control all nz-cast #3356
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Code Review
This pull request introduces a new environment variable, VLLM_ASCEND_ENABLE_NZ, to control the conversion of weights to the FRACTAL_NZ format, which is a valuable addition for performance tuning on Ascend hardware. The changes are applied consistently across various quantization methods and models. However, I've identified a few critical issues in the test files that would prevent the test suite from running, and a potential logic bug in vllm_ascend/attention/mla_v1.py involving dead code and an incorrect format constant. These issues need to be addressed to ensure the correctness and stability of the codebase.
tests/ut/ops/test_linear.py
Outdated
```python
linear = AscendReplicatedLinear(
    input_size=16,
    output_size=8,
)
self.assertTrue(isinstance(linear.quant_method,
                           AscendUnquantizedLinearMethod))
```
This code is at the class level, which will cause a NameError because self is not defined in this context. It should be moved inside a test method, for example test_init.
Suggested change:
```python
def test_init(self):
    linear = AscendReplicatedLinear(
        input_size=16,
        output_size=8,
    )
    self.assertTrue(isinstance(linear.quant_method,
                               AscendUnquantizedLinearMethod))
```
vllm_ascend/attention/mla_v1.py
Outdated
```python
elif isinstance(layer.quant_method, AscendUnquantizedLinearMethod):
    if getattr(layer.quant_method, "unquant_to_nz", False):
        layer.weight.data = torch_npu.npu_format_cast(
            layer.weight.data, ACL_FORMAT_FRACTAL_ND)
```
This block of code appears to be dead code: the condition getattr(layer.quant_method, "unquant_to_nz", False) will likely never be true, because AscendUnquantizedLinearMethod.process_weights_after_loading sets self.unquant_to_nz = False. Furthermore, even if this code were executed, it casts the weight to ACL_FORMAT_FRACTAL_ND, which is inconsistent with the pull request's goal of converting to FRACTAL_NZ. If this logic is intended to be used, please correct both the condition and the format; otherwise, remove it.
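If the flag were maintained correctly, a minimal sketch of the corrected branch might look like this (assuming FRACTAL_NZ is indeed the intended target format):

```python
# Sketch only: assumes unquant_to_nz is kept accurate upstream and that
# ACL_FORMAT_FRACTAL_NZ (not _ND) is the intended target format.
elif isinstance(layer.quant_method, AscendUnquantizedLinearMethod):
    if getattr(layer.quant_method, "unquant_to_nz", False):
        layer.weight.data = torch_npu.npu_format_cast(
            layer.weight.data, ACL_FORMAT_FRACTAL_NZ)
```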
vllm_ascend/quantization/w8a8.py
Outdated
```python
layer.weight.data = layer.weight.data.transpose(0, 1).contiguous()
if envs_ascend.VLLM_ASCEND_ENABLE_NZ:
    layer.weight.data = torch_npu.npu_format_cast(layer.weight.data,
                                                  ACL_FORMAT_FRACTAL_NZ)
```
This check can be extracted into a common function.
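For instance, a small helper along these lines could centralize it (a sketch; the function name and import paths are assumptions, not code from this PR):

```python
import torch
import torch_npu
import vllm_ascend.envs as envs_ascend
from vllm_ascend.utils import ACL_FORMAT_FRACTAL_NZ


def maybe_cast_to_nz(weight: torch.Tensor) -> torch.Tensor:
    # Centralize the feature-flag check: cast to FRACTAL_NZ only when
    # VLLM_ASCEND_ENABLE_NZ is enabled; otherwise return the weight as-is.
    if envs_ascend.VLLM_ASCEND_ENABLE_NZ:
        return torch_npu.npu_format_cast(weight, ACL_FORMAT_FRACTAL_NZ)
    return weight
```

Each call site would then reduce to layer.weight.data = maybe_cast_to_nz(layer.weight.data).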
vllm_ascend/ops/linear.py
Outdated
"8,3"): | ||
layer.weight.data = torch_npu.npu_format_cast( | ||
layer.weight.data, ACL_FORMAT_FRACTAL_NZ) | ||
self.unquant_to_nz = False |
This parameter can probably be removed.
Force-pushed from 4b07f81 to 6a3575f
Force-pushed from 24113ff to eb634f2
Why not set NZ by default, instead of adding an environment variable for control?
```python
class CustomRowParallelOp(CustomTensorParallelOp):
```
This class doesn't need to inherit from this base class.
Force-pushed from eb634f2 to af880aa
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Force-pushed from af880aa to eebe4da
What this PR does / why we need it?
Currently, when execution reaches the Linear layer of a model in vLLM-Ascend, the weight format is ND in both the unquantized case and the skipped-ascend case.
This PR supplements the execution logic for the Linear layer. We introduce a new environment variable, VLLM_ASCEND_ENABLE_NZ. When VLLM_ASCEND_ENABLE_NZ=1 and the CANN version is 8.3, the weights of the Linear layer are converted to FRACTAL_NZ, in both the unquantized case and the skipped-ascend case. We also use VLLM_ASCEND_ENABLE_NZ to gate the existing NZ conversions, such as the w8a8-quantized case.
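In outline, the gating described above amounts to something like this (a sketch; the helper name and the version-flag plumbing are assumptions, not the PR's literal code):

```python
import torch_npu
import vllm_ascend.envs as envs_ascend
from vllm_ascend.utils import ACL_FORMAT_FRACTAL_NZ


def cast_weight_to_nz_if_enabled(layer, cann_is_8_3: bool) -> None:
    # Convert the Linear weight to FRACTAL_NZ only when the feature flag
    # is set AND the CANN version is 8.3; otherwise keep the ND format.
    if envs_ascend.VLLM_ASCEND_ENABLE_NZ and cann_is_8_3:
        layer.weight.data = torch_npu.npu_format_cast(
            layer.weight.data, ACL_FORMAT_FRACTAL_NZ)
```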
Does this PR introduce any user-facing change?
Adds a new environment variable, VLLM_ASCEND_ENABLE_NZ. If you want to use the NZ format, set VLLM_ASCEND_ENABLE_NZ=1.
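For example (illustrative; the variable must be set before vLLM-Ascend reads its environment config):

```python
# From the launching shell:
#   export VLLM_ASCEND_ENABLE_NZ=1
# or programmatically, before vLLM-Ascend is imported:
import os
os.environ["VLLM_ASCEND_ENABLE_NZ"] = "1"
```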
How was this patch tested?