Subject: Question on FP8 Quantization in DeepSeek-v3 (Auto-wrapping vs Manual Handling in MoE)
Hi Torchtitan team and contributors,
Thank you for the impressive work on DeepSeek-v3 and the torchtitan implementation.
While studying the code, I had a few questions regarding FP8 quantization integration:
It appears that most of the model is automatically wrapped for FP8 using TorchAO, as specified in the training config file torchtitan/torchtitan/experiments/deepseek_v3/train_configs/deepseek_v2.toml (lines 70 to 73 at commit 61ef5cf).
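For reference, my rough mental model of that automatic path is something like the sketch below. This is only how I imagine the config maps onto TorchAO's `convert_to_float8_training` call; the actual torchtitan plumbing may differ, and the helper name `apply_fp8` is mine, not the repo's. The excluded FQNs are the ones I saw in the config.

```python
# Rough sketch of how I understand the TorchAO auto-wrapping path (illustrative
# only; the real torchtitan integration may be structured differently).
import torch.nn as nn
from torchao.float8 import convert_to_float8_training


def apply_fp8(model: nn.Module) -> nn.Module:
    # Modules listed in the config are skipped, so the output head and the
    # router gate stay in their original precision.
    excluded_fqns = ("output", "router.gate")

    def module_filter_fn(mod: nn.Module, fqn: str) -> bool:
        return not any(fqn.endswith(name) for name in excluded_fqns)

    # Swaps the remaining eligible nn.Linear modules for Float8Linear with
    # dynamic scaling.
    return convert_to_float8_training(model, module_filter_fn=module_filter_fn)
```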
However, I noticed there is also manual handling of quantization specifically for the MLP layers in the MoE module, as seen in group_gemms.py (my understanding is that we need precise control over quantization for the MoE, especially the grouping). Could you clarify how these two approaches, TorchAO's automatic FP8 wrapping and the manual quantization in the MoE, are intended to work together?
Additionally, I saw that the output and router.gate modules are explicitly excluded from FP8 quantization in the config. What is the reasoning behind excluding those components?

Thanks again for the excellent work and for open-sourcing this; it has been incredibly educational to dig into.
Best regards,
vHitsuji