[WIP] Document MX FP8 recipe #1350

lessw2020 · 2025-06-27T05:05:58Z

In progress - let's show how to use mxfp8 with Titan.

tianyu-l · 2025-06-28T18:50:50Z

docs/mxfp8.md

+
+MXFP8 training can provide substantial training speedups for models running on the Nvidia Blackwell architecture (G and B200s+). MXFP8 enables fine-grained quantization, where each 1 x 32 block of elements shares a single E8M0 scale factor, and this scaling can be done in hardware.
+
+We have tested MXFP8 training at 1856-GPU scale (Crusoe B200 cluster); for the Llama 3 70B model, we observed a ~19% speedup with near-equal or better convergence loss relative to BF16.


I think it's better if such claims (and most of this doc) are put in https://github.com/pytorch/torchtitan/tree/main/benchmarks
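
For readers new to the format, the 1 x 32 block scaling described in the diff above can be sketched in plain Python. This is only an illustration under stated assumptions, not the torchtitan or torchao implementation: the function names (`round_to_e4m3`, `quantize_mx_block`, `quantize_mxfp8`, `dequantize_mxfp8`) are hypothetical, the element format is assumed to be FP8 E4M3 (max magnitude 448), and the shared scale is assumed to be a power of two stored as a biased 8-bit exponent, following the shared-exponent rule `floor(log2(amax)) - emax_elem`. Real kernels do this on the GPU; the sketch only simulates the arithmetic on Python floats.

```python
import math

BLOCK_SIZE = 32            # elements sharing one scale (the "1 x 32" block)
E4M3_MAX = 448.0           # largest finite FP8 E4M3 magnitude
E4M3_EMAX = 8              # exponent of the largest E4M3 normal (448 = 1.75 * 2**8)
E4M3_MIN_NORMAL_EXP = -6   # smallest normal exponent in E4M3
E4M3_MANTISSA_BITS = 3
E8M0_BIAS = 127            # the shared scale is stored as a biased exponent


def round_to_e4m3(v):
    """Simulate rounding a float to the nearest FP8 E4M3 value (saturating)."""
    if v == 0.0:
        return 0.0
    sign = -1.0 if v < 0 else 1.0
    mag = min(abs(v), E4M3_MAX)
    _, e = math.frexp(mag)                  # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_NORMAL_EXP)   # subnormals reuse the minimum exponent
    step = 2.0 ** (exp - E4M3_MANTISSA_BITS)
    return sign * min(round(mag / step) * step, E4M3_MAX)


def quantize_mx_block(block):
    """Quantize one block; returns (biased_scale_exponent, quantized_elements)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        shared_exp = -E8M0_BIAS
    else:
        # floor(log2(amax)) computed exactly via frexp, then shifted by emax.
        shared_exp = (math.frexp(amax)[1] - 1) - E4M3_EMAX
        shared_exp = max(-E8M0_BIAS, min(E8M0_BIAS, shared_exp))
    scale = 2.0 ** shared_exp
    return shared_exp + E8M0_BIAS, [round_to_e4m3(x / scale) for x in block]


def quantize_mxfp8(values):
    """Split a flat list into 32-element blocks, one shared scale per block."""
    if len(values) % BLOCK_SIZE != 0:
        raise ValueError("length must be a multiple of 32")
    scales, elements = [], []
    for start in range(0, len(values), BLOCK_SIZE):
        biased, quantized = quantize_mx_block(values[start:start + BLOCK_SIZE])
        scales.append(biased)
        elements.extend(quantized)
    return scales, elements


def dequantize_mxfp8(scales, elements):
    """Multiply each element by its block's power-of-two scale."""
    return [
        q * 2.0 ** (scales[i // BLOCK_SIZE] - E8M0_BIAS)
        for i, q in enumerate(elements)
    ]


if __name__ == "__main__":
    data = [0.01 * (i + 1) for i in range(64)]
    scales, elements = quantize_mxfp8(data)
    restored = dequantize_mxfp8(scales, elements)
    print("scales (biased exponents):", scales)
    print("max abs error:", max(abs(a - b) for a, b in zip(data, restored)))
```

Because each group of 32 elements gets its own scale, an outlier only costs precision within its own block, which is the practical difference from per-tensor FP8 scaling.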

draft mxfp8

9610d27

lessw2020 requested review from tianyu-l, fegin and wwwjn as code owners June 27, 2025 05:05

facebook-github-bot added the CLA Signed label Jun 27, 2025

tianyu-l reviewed Jun 28, 2025


