
make mxfp8 dim1 cast kernel configurable #1427

Merged

merged 1 commit into main from danielvegamyhre/stack/1 on Jul 25, 2025

Conversation

@danielvegamyhre (Contributor) commented on Jul 21, 2025

Stacked PRs:


make mxfp8 dim1 cast kernel configurable

Summary

  • We recently added a new CUDA kernel for the mxfp8 dim1 cast that is ~1.4x faster than the existing Triton kernel or torch.compile; using it yields an e2e training speedup of +1.5-2.5% TPS on Llama3 8b with FSDP=4/8 (Add CUDA kernel for MXFP8 dim1 casting ao#2513). The integration work for composability with torch.compile + FSDP is complete as well: integration of new mxfp8 casting cuda kernel ao#2564.
  • This PR updates the mxfp8 user-facing API, replacing the boolean flag --mx.use_fp8_dim1_cast_triton_kernel=[true|false] with --mx.mxfp8_dim1_cast_kernel_choice=[triton|cuda|torch].

Test plan

  • Triton: NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --training.steps=100 --model.converters="mx" --mx.recipe_name="mxfp8" --training.compile --mx.mxfp8_dim1_cast_kernel_choice="triton"
  • Cuda: NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --training.steps=100 --model.converters="mx" --mx.recipe_name="mxfp8" --training.compile --mx.mxfp8_dim1_cast_kernel_choice="cuda"
  • Torch: NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh --training.steps=100 --model.converters="mx" --mx.recipe_name="mxfp8" --training.compile --mx.mxfp8_dim1_cast_kernel_choice="torch"

Limitations

  • TP is not supported yet, as both the Triton kernel and the CUDA kernel hit the same error: RuntimeError: Attempting to use FunctionalTensor on its own. Instead, please use it with a corresponding FunctionalTensorMode(). This is a known issue we have been discussing with Brian and will continue to follow up on.

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Jul 21, 2025
@danielvegamyhre force-pushed the danielvegamyhre/stack/1 branch from 50e3ade to 372b083 on July 21, 2025 17:17
@danielvegamyhre mentioned this pull request on Jul 21, 2025
danielvegamyhre added a commit that referenced this pull request Jul 21, 2025
stack-info: PR: #1427, branch: danielvegamyhre/stack/1
@danielvegamyhre force-pushed the danielvegamyhre/stack/1 branch from 372b083 to 24c3d3a on July 21, 2025 17:19
@@ -556,7 +556,7 @@ class Float8:

 @dataclass
 class MX:
-    use_fp8_dim1_cast_triton_kernel: bool = True
+    mxfp8_dim1_cast_kernel_choice: Literal["triton", "cuda", "torch"] = "triton"
Contributor

what's the benefit of letting the user choose?

Contributor Author

  1. It makes it easy for torchao developers to benchmark the torch.compile, Triton, and CUDA implementations against each other as we work on perf improvements, especially on improving torch.compile performance for casts (see the sketch below).
  2. Users may be using a Python-only torchao installation that doesn't include the CUDA kernel. This is probably not common, but still worth considering.
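
For context, a minimal sketch of how a string-valued kernel choice like this can be validated and mapped to an enum on the consuming side. The names below (MXFP8Dim1CastKernelChoice, resolve_dim1_cast_kernel) are illustrative assumptions for this discussion, not necessarily the actual torchao or torchtitan symbols.

from enum import Enum

# Illustrative stand-in for a kernel-choice enum on the torchao side (hypothetical name).
class MXFP8Dim1CastKernelChoice(Enum):
    TRITON = "triton"
    CUDA = "cuda"
    TORCH = "torch"

def resolve_dim1_cast_kernel(choice: str) -> MXFP8Dim1CastKernelChoice:
    """Map the torchtitan config string to the kernel-choice enum, failing loudly on typos."""
    try:
        return MXFP8Dim1CastKernelChoice(choice.lower())
    except ValueError:
        valid = [c.value for c in MXFP8Dim1CastKernelChoice]
        raise ValueError(
            f"Invalid mxfp8_dim1_cast_kernel_choice {choice!r}; expected one of {valid}"
        )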

@danielvegamyhre marked this pull request as draft on July 21, 2025 19:31
danielvegamyhre added a commit that referenced this pull request Jul 22, 2025
stack-info: PR: #1427, branch: danielvegamyhre/stack/1
@danielvegamyhre force-pushed the danielvegamyhre/stack/1 branch from 24c3d3a to b4b53cb on July 22, 2025 03:33
@danielvegamyhre marked this pull request as ready for review on July 22, 2025 03:33
@danielvegamyhre (Contributor Author)

@vkuzo @tianyu-l this is ready for review

CI error is unrelated to this change:

Exception: Integration test failed, flavor : 2D eager, command : TORCH_TRACE="artifacts-to-be-uploaded/2d_eager/compile_trace" CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml NGPU=4 LOG_RANK=0,1,2,3 ./run_train.sh --job.dump_folder artifacts-to-be-uploaded/2d_eager --parallelism.tensor_parallel_degree 2

@vkuzo (Contributor) commented on Jul 24, 2025

lgtm, I'll let someone from titan team accept

@tianyu-l (Contributor) left a comment

TP is currently not supported yet, as both the Triton kernel and CUDA kernel are affected by an issue: RuntimeError: Attempting to use FunctionalTensor on its own. Instead, please use it with a corresponding FunctionalTensorMode(). This is a known issue we were talking to Brian about, will continue following up on it.

Do you think we can error out in mx.py, since you do have JobConfig on tp info?

Please rebase before merge.

danielvegamyhre added a commit that referenced this pull request on Jul 24, 2025
stack-info: PR: #1427, branch: danielvegamyhre/stack/1
@danielvegamyhre force-pushed the danielvegamyhre/stack/1 branch from b4b53cb to a6466e7 on July 24, 2025 23:42
danielvegamyhre added a commit that referenced this pull request on Jul 24, 2025
stack-info: PR: #1427, branch: danielvegamyhre/stack/1
@danielvegamyhre force-pushed the danielvegamyhre/stack/1 branch from a6466e7 to f79f833 on July 24, 2025 23:44
stack-info: PR: #1427, branch: danielvegamyhre/stack/1
@danielvegamyhre force-pushed the danielvegamyhre/stack/1 branch from f79f833 to 4806fdb on July 24, 2025 23:49
@danielvegamyhre (Contributor Author)

Confirmed the error is a CUDA driver error in async TP, which is not related to this change. As an aside, it looks like an error at the C++ symmetric memory level, or perhaps the containerized env running the test had its dependencies updated?

@danielvegamyhre (Contributor Author)

Do you think we can error out in mx.py, since you do have JobConfig on tp info?

Sure, done
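
For reference, a minimal sketch of what such an early check could look like. The config path job_config.parallelism.tensor_parallel_degree comes from the CI command above; the function name and error message are assumptions for illustration and may not match the exact code that landed.

def _validate_mx_tp_support(job_config) -> None:
    # Fail fast: the mxfp8 dim1 cast kernels (Triton and CUDA) currently hit the
    # FunctionalTensor error under tensor parallelism, so reject TP > 1 up front.
    tp_degree = job_config.parallelism.tensor_parallel_degree
    if tp_degree > 1:
        raise ValueError(
            "mxfp8 is not supported with tensor parallelism yet "
            f"(tensor_parallel_degree={tp_degree}); please set it to 1."
        )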

@danielvegamyhre merged commit f3e2a75 into main on Jul 25, 2025
6 of 7 checks passed
Labels: CLA Signed (managed by the Meta Open Source bot)

3 participants