
[float8 moe training] validate float8 moe parallelism config #1360

Merged: 1 commit merged into pytorch:main on Jul 2, 2025

Conversation

@danielvegamyhre (Contributor) commented on Jul 1, 2025

Summary

Validate that only FSDP and HSDP are used for float8 MoE training. TP support is in progress, and CP/PP are untested; 2D+ parallelism combinations are also untested.
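
A minimal sketch of the kind of guard described above, assuming torchtitan-style `ParallelDims` flags (`tp_enabled`, `cp_enabled`, `pp_enabled`) and the `float8.moe_fqns_prototype` config field used in the test command below; the function name and exact field names are assumptions, not the PR's literal diff:

```python
def validate_float8_moe_parallelism(parallel_dims, float8_config):
    """Fail fast when float8 MoE training is combined with unsupported parallelism."""
    # Only float8 MoE training (selected via moe_fqns_prototype) needs this guard.
    if not float8_config.moe_fqns_prototype:
        return
    # Message matches the error observed in the test plan.
    assert not parallel_dims.tp_enabled, (
        "Float8 MoE training prototype does not yet support tensor parallelism"
    )
    # CP and PP are untested per the summary, so reject them too (illustrative).
    assert not parallel_dims.cp_enabled, (
        "Float8 MoE training prototype does not yet support context parallelism"
    )
    assert not parallel_dims.pp_enabled, (
        "Float8 MoE training prototype does not yet support pipeline parallelism"
    )
```

Failing fast at config-validation time surfaces the unsupported combination before any training step runs, which matches the error observed in the test plan.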

Test plan

  • Command: `NGPU=4 CONFIG_FILE="./torchtitan/experiments/llama4/train_configs/debug_model.toml" ./run_train.sh --training.steps=10 --model.converters="float8" --float8.recipe_name="rowwise" --float8.moe_fqns_prototype="experts" --parallelism.tensor_parallel_degree=2`
  • Error: `AssertionError: Float8 MoE training prototype does not yet support tensor parallelism`

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Jul 1, 2025
@tianyu-l (Contributor) left a comment:

Maybe need to add EP assertion after #1324 lands.
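
If expert parallelism lands with #1324, the suggested EP check could extend the same guard; `ep_enabled` is a hypothetical flag name here, since #1324's API is not shown in this thread:

```python
# Hypothetical EP check, pending #1324 (the ep_enabled flag name is assumed):
assert not parallel_dims.ep_enabled, (
    "Float8 MoE training prototype does not yet support expert parallelism"
)
```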

@danielvegamyhre merged commit c08c9d4 into pytorch:main on Jul 2, 2025
7 checks passed
mori360 pushed a commit to mori360/torchtitan that referenced this pull request Jul 8, 2025
Labels: CLA Signed (managed by the Meta Open Source bot), module: float8
3 participants