Consolidate MoE quantization parameters into FusedMoeQuantConfig #19396
Conversation
Commits:

…#19303)
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>

…Config

Consolidates multiple boolean quantization parameters (use_fp8_w8a8, use_int8_w8a8, use_int8_w8a16, use_int4_w4a16, per_channel_quant, block_shape) into a single type-safe FusedMoeQuantConfig object across the fused_experts, invoke_fused_moe_kernel, and fused_moe functions.

Key improvements:
- Type-safe configuration with QuantizationType enum
- Factory methods for common quantization patterns
- Built-in validation preventing conflicting configurations
- Seamless backward compatibility with deprecation warnings
- Performance optimizations with cached properties
- Cleaner, more maintainable API for future extensions

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>

Force-pushed from b3d520e to e30d84c.
        # Deprecated: keep for backward compatibility
        use_fp8_w8a8: Optional[bool] = None,
        use_int8_w8a8: Optional[bool] = None,
        use_int8_w8a16: Optional[bool] = None,
        use_int4_w4a16: Optional[bool] = None,
        per_channel_quant: Optional[bool] = None,
        block_shape: Optional[list[int]] = None) -> None:
We should just remove these: the interface is internal, so we should fix all the usages in the codebase instead.
        top_k: int,
        config: dict[str, Any],
        compute_type: tl.dtype,
        fused_moe_quant_config: Optional[FusedMoeQuantConfig] = None,
We can just name it quant_config.
 if fused_moe_quant_config.use_fp8_w8a8 or fused_moe_quant_config.use_int8_w8a8:
     assert B_scale is not None
-    assert (block_shape is None or triton.cdiv(B.shape[-2], block_shape[0])
+    assert (fused_moe_quant_config.block_shape is None or triton.cdiv(
+        B.shape[-2], fused_moe_quant_config.block_shape[0])
             == B_scale.shape[-2])
-    assert (block_shape is None or triton.cdiv(B.shape[-1], block_shape[1])
+    assert (fused_moe_quant_config.block_shape is None or triton.cdiv(
+        B.shape[-1], fused_moe_quant_config.block_shape[1])
             == B_scale.shape[-1])
You can get rid of a lot of these changes by just pulling it out to a local var, i.e. block_shape = fused_moe_quant_config.block_shape
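For illustration, a minimal sketch of that refactor (names taken from the diff above; not the PR's final code):

```python
import triton  # for triton.cdiv, as in the original kernel code

# Bind the config field to a local once; the assertion bodies then need no
# per-field rewrites when the config object is threaded through.
block_shape = fused_moe_quant_config.block_shape
if fused_moe_quant_config.use_fp8_w8a8 or fused_moe_quant_config.use_int8_w8a8:
    assert B_scale is not None
    assert (block_shape is None
            or triton.cdiv(B.shape[-2], block_shape[0]) == B_scale.shape[-2])
    assert (block_shape is None
            or triton.cdiv(B.shape[-1], block_shape[1]) == B_scale.shape[-1])
```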
per_channel_quant=per_channel_quant or False,
block_shape=block_shape)

if fused_moe_quant_config.use_fp8_w8a8 or fused_moe_quant_config.use_int8_w8a8:
I think it would be better to have a general check interface for multiple values, like quant_config.quant_type in (QuantizationType.FP8_W8A8, QuantizationType.INT8_W8A8).
Separately, maybe we can shorten QuantizationType -> QuantType.
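A hypothetical sketch of such a check interface, using the reviewer's shorter QuantType spelling (all names here are illustrative, not the PR's final API):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class QuantType(Enum):
    FP8_W8A8 = "fp8_w8a8"
    INT8_W8A8 = "int8_w8a8"
    INT8_W8A16 = "int8_w8a16"
    INT4_W4A16 = "int4_w4a16"

@dataclass
class FusedMoeQuantConfig:
    quant_type: Optional[QuantType] = None
    per_channel_quant: bool = False
    block_shape: Optional[list[int]] = None

    def is_quant_type(self, *types: QuantType) -> bool:
        # One membership test replaces chains of use_* boolean checks.
        return self.quant_type in types

# At the call site flagged above:
# if quant_config.is_quant_type(QuantType.FP8_W8A8, QuantType.INT8_W8A8): ...
```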
-        use_fp8_w8a8=use_fp8_w8a8,
-        use_int8_w8a8=use_int8_w8a8,
-        use_int8_w8a16=use_int8_w8a16,
-        use_int4_w4a16=use_int4_w4a16,
-        per_channel_quant=per_channel_quant,
+        use_fp8_w8a8=fused_moe_quant_config.use_fp8_w8a8,
+        use_int8_w8a8=fused_moe_quant_config.use_int8_w8a8,
+        use_int8_w8a16=fused_moe_quant_config.use_int8_w8a16,
+        use_int4_w4a16=fused_moe_quant_config.use_int4_w4a16,
+        per_channel_quant=fused_moe_quant_config.per_channel_quant,
This is an example of an internal usage where we should just be able to pass in quant_config directly.
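A minimal sketch of that suggestion (signature heavily simplified; real kernel arguments omitted):

```python
# Forward the config object and let the callee read the fields it needs,
# instead of unpacking five use_* kwargs at every internal call site.
def invoke_fused_moe_kernel(A, B, C, quant_config):
    if quant_config.use_fp8_w8a8 or quant_config.use_int8_w8a8:
        ...  # quantized path

# The call site then shrinks to a single argument:
# invoke_fused_moe_kernel(A, B, C, quant_config=fused_moe_quant_config)
```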
This pull request has merge conflicts that must be resolved before it can be merged.
Summary
This PR refactors the FusedMoE quantization system by consolidating multiple boolean parameters into a single, type-safe configuration object. This addresses the proliferation of use_* flags across MoE functions and provides a cleaner, more maintainable API.

Problem

The current MoE quantization API suffers from several issues:

Before (❌ Problems):
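For concreteness, a representative sketch of the pre-PR call shape, using the flag names from the commit message (other arguments elided, signature simplified):

```python
# Illustrative pre-PR call: six loosely related kwargs, all optional.
output = fused_experts(
    hidden_states, w1, w2, topk_weights, topk_ids,
    use_fp8_w8a8=True,
    use_int8_w8a8=False,
    use_int8_w8a16=False,
    use_int4_w4a16=False,
    per_channel_quant=False,
    block_shape=[128, 128],
)
# Nothing stops a caller from setting several use_* flags at once,
# and that combination's behavior is undefined.
```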
Issues: among others, the behavior with multiple use_*=True flags is undefined.

Solution
After (✅ Improvements):
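A hedged sketch of the consolidated call; the factory-method name fp8_w8a8 is hypothetical, standing in for the PR's factory methods:

```python
# Illustrative post-PR call: one validated config object replaces six kwargs.
quant_config = FusedMoeQuantConfig.fp8_w8a8(block_shape=[128, 128])
output = fused_experts(
    hidden_states, w1, w2, topk_weights, topk_ids,
    quant_config=quant_config,
)
```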
Key Features
🎯 Type-Safe Configuration
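Reusing the QuantType/FusedMoeQuantConfig sketch from the review thread above, the core idea in one line:

```python
# Exactly one scheme (or None) is selectable; the old API's four independent
# booleans allowed 16 combinations, most of them meaningless.
config = FusedMoeQuantConfig(quant_type=QuantType.INT4_W4A16)
```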
🏭 Factory Methods for Common Patterns
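A hypothetical shape for these factories (method names illustrative), written as methods to add to the config class sketched in the review thread above:

```python
@classmethod
def fp8_w8a8(cls, per_channel_quant: bool = False,
             block_shape: Optional[list[int]] = None) -> "FusedMoeQuantConfig":
    # One call builds a valid FP8 W8A8 config; no flag juggling.
    return cls(quant_type=QuantType.FP8_W8A8,
               per_channel_quant=per_channel_quant,
               block_shape=block_shape)

@classmethod
def int4_w4a16(cls, block_shape: Optional[list[int]] = None) -> "FusedMoeQuantConfig":
    return cls(quant_type=QuantType.INT4_W4A16, block_shape=block_shape)
```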
🔒 Built-in Validation
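One way such validation can work when converting the legacy flags (helper name and placement are assumptions):

```python
def _quant_type_from_flags(use_fp8_w8a8: bool, use_int8_w8a8: bool,
                           use_int8_w8a16: bool, use_int4_w4a16: bool):
    flags = {
        QuantType.FP8_W8A8: use_fp8_w8a8,
        QuantType.INT8_W8A8: use_int8_w8a8,
        QuantType.INT8_W8A16: use_int8_w8a16,
        QuantType.INT4_W4A16: use_int4_w4a16,
    }
    enabled = [t for t, on in flags.items() if on]
    if len(enabled) > 1:
        # The old API silently accepted this; now it fails fast.
        raise ValueError(f"Conflicting quantization flags: {enabled}")
    return enabled[0] if enabled else None
```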
🔄 Seamless Backward Compatibility
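A hedged sketch of a compatibility shim (not necessarily the PR's exact mechanism):

```python
import warnings

def fused_experts(hidden_states, w1, w2, topk_weights, topk_ids,
                  quant_config=None, use_fp8_w8a8=None, **legacy_flags):
    # Legacy boolean kwargs are still accepted, but warn and are converted
    # into a FusedMoeQuantConfig before any real work happens.
    if use_fp8_w8a8 is not None or legacy_flags:
        warnings.warn(
            "Boolean use_* flags are deprecated; pass quant_config instead.",
            DeprecationWarning, stacklevel=2)
        quant_config = quant_config or FusedMoeQuantConfig(
            quant_type=QuantType.FP8_W8A8 if use_fp8_w8a8 else None)
    ...  # dispatch to kernels using quant_config only
```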
⚡ Performance Optimizations
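The commit message credits cached properties for this; a minimal illustration of the pattern, written as an accessor to add to the (non-frozen) config sketched above:

```python
from functools import cached_property

@cached_property
def use_fp8_w8a8(self) -> bool:
    # Computed once per instance, then served from the instance __dict__,
    # so hot kernel paths don't re-evaluate enum comparisons on every call.
    return self.quant_type is QuantType.FP8_W8A8
```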
Migration Guide
Current users: No action required - your code will continue to work with deprecation warnings.
New users: Use the factory methods for better type safety:
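A hedged sketch of both styles (signatures simplified; the factory name fp8_w8a8 is hypothetical):

```python
# Old style: still accepted, now emits a DeprecationWarning.
out = fused_experts(hidden_states, w1, w2, topk_weights, topk_ids,
                    use_fp8_w8a8=True)

# New style: explicit, validated configuration.
out = fused_experts(hidden_states, w1, w2, topk_weights, topk_ids,
                    quant_config=FusedMoeQuantConfig.fp8_w8a8())
```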
Functions Refactored

fused_experts() - Core MoE expert computation
invoke_fused_moe_kernel() - Low-level kernel invocation
fused_moe() - High-level MoE interface
TritonExperts.__init__() - Triton-based expert implementation

Impact
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>