Subject: Question on FP8 Quantization in DeepSeek-v3 (Auto-wrapping vs Manual Handling in MoE)
Hi Torchtitan team and contributors,
Thank you for the impressive work on DeepSeek-v3 and the torchtitan implementation.
While studying the code, I had a few questions regarding FP8 quantization integration:
It appears that most of the model is automatically wrapped for FP8 using TorchAO, as specified in the training config file torchtitan/torchtitan/experiments/deepseek_v3/train_configs/deepseek_v2.toml (lines 70 to 73 at commit 61ef5cf).
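For reference, my rough mental model of that automatic path is something like the sketch below. This is only how I imagine the config maps onto TorchAO's `convert_to_float8_training` call; the actual torchtitan plumbing may differ, and the helper name `apply_fp8` is mine, not the repo's. The excluded FQNs are the ones I saw in the config.

```python
# Rough sketch of how I understand the TorchAO auto-wrapping path (illustrative
# only; the real torchtitan integration may be structured differently).
import torch.nn as nn
from torchao.float8 import convert_to_float8_training


def apply_fp8(model: nn.Module) -> nn.Module:
    # Modules listed in the config are skipped, so the output head and the
    # router gate stay in their original precision.
    excluded_fqns = ("output", "router.gate")

    def module_filter_fn(mod: nn.Module, fqn: str) -> bool:
        return not any(fqn.endswith(name) for name in excluded_fqns)

    # Swaps the remaining eligible nn.Linear modules for Float8Linear with
    # dynamic scaling.
    return convert_to_float8_training(model, module_filter_fn=module_filter_fn)
```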
However, I noticed there is also manual handling of quantization specifically for the MLP layers in the MoE module, as seen in group_gemms.py (my understanding is that we need precise control over quantization for the MoE, especially the grouping). Could you clarify how these two approaches, TorchAO's automatic FP8 wrapping and the manual quantization in the MoE, are intended to work together?
Additionally, I saw that the output and router.gate modules are explicitly excluded from FP8 quantization in the config. What is the reasoning behind excluding those components?

Thanks again for the excellent work and for open-sourcing this; it has been incredibly educational to dig into.
Best regards,
vHitsuji