
[float8 moe training] validate float8 moe parallelism config #1360

Merged: 1 commit merged into pytorch:main on Jul 2, 2025

Conversation

@danielvegamyhre (Contributor) commented on Jul 1, 2025

Summary

Validate that only FSDP and HSDP are used for float8 MoE training. TP support is in progress, and CP/PP are untested; 2D+ parallelism combinations are also untested.
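
A minimal sketch of the kind of guard described above, assuming torchtitan-style `ParallelDims` flags (`tp_enabled`, `cp_enabled`, `pp_enabled`) and the `float8.moe_fqns_prototype` config field used in the test command below; the function name and exact field names are assumptions, not the PR's literal diff:

```python
def validate_float8_moe_parallelism(parallel_dims, float8_config):
    """Fail fast when float8 MoE training is combined with unsupported parallelism."""
    # Only float8 MoE training (selected via moe_fqns_prototype) needs this guard.
    if not float8_config.moe_fqns_prototype:
        return
    # Message matches the error observed in the test plan.
    assert not parallel_dims.tp_enabled, (
        "Float8 MoE training prototype does not yet support tensor parallelism"
    )
    # CP and PP are untested per the summary, so reject them too (illustrative).
    assert not parallel_dims.cp_enabled, (
        "Float8 MoE training prototype does not yet support context parallelism"
    )
    assert not parallel_dims.pp_enabled, (
        "Float8 MoE training prototype does not yet support pipeline parallelism"
    )
```

Failing fast at config-validation time surfaces the unsupported combination before any training step runs, which matches the error observed in the test plan.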

Test plan

  • Command: `NGPU=4 CONFIG_FILE="./torchtitan/experiments/llama4/train_configs/debug_model.toml" ./run_train.sh --training.steps=10 --model.converters="float8" --float8.recipe_name="rowwise" --float8.moe_fqns_prototype="experts" --parallelism.tensor_parallel_degree=2`
  • Error: `AssertionError: Float8 MoE training prototype does not yet support tensor parallelism`

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Jul 1, 2025
@tianyu-l (Contributor) left a comment:

Maybe need to add EP assertion after #1324 lands.
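
If expert parallelism lands with #1324, the suggested EP check could extend the same guard; `ep_enabled` is a hypothetical flag name here, since #1324's API is not shown in this thread:

```python
# Hypothetical EP check, pending #1324 (the ep_enabled flag name is assumed):
assert not parallel_dims.ep_enabled, (
    "Float8 MoE training prototype does not yet support expert parallelism"
)
```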

@danielvegamyhre merged commit c08c9d4 into pytorch:main on Jul 2, 2025
7 checks passed
mori360 pushed a commit to mori360/torchtitan that referenced this pull request Jul 8, 2025
Labels: CLA Signed (managed by the Meta Open Source bot), module: float8
3 participants