Commit 820504e
add benchmarks folder and submission guidelines (#1296)
This is to unblock #1289 and requests from @danielvegamyhre to submit their benchmarking results. The `benchmarks` folder should be the central place to host torchtitan performance results.
1 parent bf835b5 commit 820504e

File tree: 3 files changed, +35 -3 lines

README.md

Lines changed: 2 additions & 2 deletions

@@ -17,7 +17,7 @@ To use the latest features of `torchtitan`, we recommend using the most recent P
 
 
 ## Latest News
-- [2025/04] Our paper has been accepted by [ICLR 2025](https://iclr.cc/virtual/2025/poster/29620). The poster will be presented on Friday April 25th.
+- [2025/04] Our paper was accepted by [ICLR 2025](https://iclr.cc/virtual/2025/poster/29620).
 - [2025/04] [Llama 4](torchtitan/experiments/llama4/) initial support is available as an experiment.
 - [2025/04] Training the diffusion model [FLUX](torchtitan/experiments/flux/) with FSDP/HSDP is available as an experiment.
 - [2025/04] The frontend implementation of [SimpleFSDP](torchtitan/experiments/simple_fsdp/), a compiler-based FSDP framework, is available as an experiment.

@@ -71,7 +71,7 @@ To accelerate contributions to and innovations around torchtitan, we are hosting
 - estimate FSDP/HSDP memory usage without materializing the model
 - run distributed inference with Tensor Parallel
 
-We report [performance](docs/performance.md) on up to 512 GPUs, and verify [loss converging](docs/converging.md) correctness of various techniques.
+We report [performance](benchmarks/llama3_h100_202412_torchtitan.md) on up to 512 GPUs, and verify [loss converging](docs/converging.md) correctness of various techniques.
 
 ### Dive into the code

benchmarks/README.md

Lines changed: 22 additions & 0 deletions (new file)

We welcome the community to submit reproducible benchmarking results.

## Submission Guidelines

A submission should be a file or files including the following information:

1. Entity, which could be your name, GitHub username, company, university, team, etc.
2. The model or theme of the benchmark, e.g. Llama 3.1, Async TP.
3. The hardware setup, including the types of GPUs, interconnects, etc.
4. The actual performance report with training configs, e.g. via
   - `.toml` files / command line arguments
   - complete configs, which can be found in the log with [`--print_args`](https://github.com/pytorch/torchtitan/blob/e7c0cae934df78d6e9c2835f42ff1f757dc3fddc/torchtitan/config_manager.py#L47) turned on (preferred, as default values not shown in the `.toml` or specified on the command line can change from time to time)
5. The versions and date/time of `torchtitan`, `torch`, `torchao`, and any other relevant dependencies.
6. Other notes which could help reproduce the results.
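Item 5 above (recording the versions of `torch`, `torchao`, etc. used for a run) can be sketched with a small stdlib-only helper; the function name and output shape here are assumptions for illustration, not part of torchtitan:

```python
# Hypothetical helper (not part of torchtitan): collect installed versions of
# the dependencies requested in item 5, plus the date of the run.
from datetime import date
from importlib import metadata


def dependency_versions(packages=("torch", "torchao")):
    """Return the run date and {package: version, or 'not installed'}."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = "not installed"
    return {"date": date.today().isoformat(), "versions": versions}


print(dependency_versions())
```

Pasting this dictionary into the submission covers the versions and date/time in one place.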
The name of the file should follow the format

```
[model/theme]_[hardware]_[date/time]_[entity].md
```

For example: `llama3.1_h100_202412_pytorch.md`, `asynctp_256xh100_20250613_alice+bob.md`.

An example can be found at [llama3_h100_202412_torchtitan.md](./llama3_h100_202412_torchtitan.md).
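The naming convention above can be captured in a small helper; this is an illustrative sketch (the function and its validation rule are assumptions, not provided by torchtitan):

```python
# Hypothetical helper: build a submission filename following the
# [model/theme]_[hardware]_[date/time]_[entity].md convention described above.
import re


def submission_filename(theme: str, hardware: str, date: str, entity: str) -> str:
    """Join the four fields with underscores and add the .md extension."""
    name = f"{theme}_{hardware}_{date}_{entity}.md"
    # Loose sanity check: exactly four underscore-separated fields ending in .md
    if not re.fullmatch(r"[^_]+_[^_]+_[^_]+_[^_]+\.md", name):
        raise ValueError(f"unexpected submission filename: {name}")
    return name


print(submission_filename("llama3.1", "h100", "202412", "pytorch"))
# -> llama3.1_h100_202412_pytorch.md
```

Note that the individual fields must not contain underscores themselves, since underscores are the field separators.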

docs/performance.md renamed to benchmarks/llama3_h100_202412_torchtitan.md

Lines changed: 11 additions & 1 deletion

@@ -1,11 +1,21 @@
+The following performance benchmarks were done by the `torchtitan` team at the end of 2024 using the latest `torch`, `torchao`, and `torchtitan` versions.
+
+### Models
+
+Llama 3.1 8B, 70B, 405B
+
 We demonstrate the effectiveness of elastic distributed training using torchtitan, via experiments on Llama 3.1 8B, 70B, and 405B models, from 1D parallelism to 4D parallelism, at the scale from 8 GPUs to 512 GPUs.
 
+### Hardware
+
 We ran our performance benchmarks on the [Grand Teton platform](https://engineering.fb.com/2022/10/18/open-source/ocp-summit-2022-grand-teton/), where
 - Each host has 8 NVIDIA H100 GPUs fully connected with NVLink.
 - Each H100 GPU is equipped with 96GB HBM2e with 2.4 TB/sec peak memory bandwidth.
 - Hosts are inter-connected with backend RDMA network with 400 Gb/s per GPU.
 - We used the default 500W power limit, although tuning it up to 700W TDP can potentially provide further speedups.
 
+### Results
+
 We note that, throughout our experimentation, memory readings are stable across the whole training process[^1], whereas throughput numbers (TPS/GPU) are calculated and logged every 10 iterations, and always read at the (arbitrarily determined) 90th iteration.
 
 We do not report Model FLOPS Utilization (MFU) because when Float8 is enabled (on `nn.Linear` modules), both BFLOAT16 Tensor Core and FP8 Tensor Core are involved in model training, but they have different peak FLOPS and the definition of MFU under such scenario is not well-defined. We note that the 1D Llama 3.1 8B model training on 8 or 128 H100 GPUs without Float8 achieves 33% to 39% MFU[^2] (with or without torch.compile, respectively).

@@ -58,8 +68,8 @@ We do not report Model FLOPS Utilization (MFU) because when Float8 is enabled (o
 | FSDP 2, CP 4 | 131072 | 31 | 77.1 |
 | FSDP 1, CP 8 | 262144 | 16 | 84.9 |
 
+### Versions and Dates
 
-#### Versions used for performance testing
 | repo | commit | date |
 | --- | --- | --- |
 | torch | [1963fc8](https://github.com/pytorch/pytorch/commit/1963fc83a1c32e162162e2414f78b043f0674bae) | 2024/12/23 |
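The benchmark report above reads throughput (TPS/GPU) averaged over a 10-iteration logging window. A hedged sketch of that bookkeeping (names and numbers here are illustrative, not torchtitan's actual code):

```python
# Illustrative sketch only: throughput averaged over a logging window,
# e.g. 10 iterations, then normalized per GPU.
def tps_per_gpu(tokens_in_window: int, seconds_in_window: float, num_gpus: int) -> float:
    """Tokens processed per second during the window, normalized per GPU."""
    return tokens_in_window / seconds_in_window / num_gpus


# 10 iterations, a global batch of 8 * 8192 tokens per iteration,
# over a 10-second window, on 8 GPUs
print(tps_per_gpu(10 * 8 * 8192, 10.0, 8))  # 8192.0
```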
