Commit 820504e
add benchmarks folder and submission guidelines (#1296)
This is to unblock #1289 and requests from @danielvegamyhre to submit their benchmarking results. The `benchmarks` folder should be the central place to host torchtitan performance results.
1 parent bf835b5 commit 820504e

File tree: 3 files changed, +35 -3 lines

README.md

Lines changed: 2 additions & 2 deletions

@@ -17,7 +17,7 @@ To use the latest features of `torchtitan`, we recommend using the most recent P
 
 
 ## Latest News
-- [2025/04] Our paper has been accepted by [ICLR 2025](https://iclr.cc/virtual/2025/poster/29620). The poster will be presented on Friday April 25th.
+- [2025/04] Our paper was accepted by [ICLR 2025](https://iclr.cc/virtual/2025/poster/29620).
 - [2025/04] [Llama 4](torchtitan/experiments/llama4/) initial support is available as an experiment.
 - [2025/04] Training the diffusion model [FLUX](torchtitan/experiments/flux/) with FSDP/HSDP is available as an experiment.
 - [2025/04] The frontend implementation of [SimpleFSDP](torchtitan/experiments/simple_fsdp/), a compiler-based FSDP framework, is available as an experiment.

@@ -71,7 +71,7 @@ To accelerate contributions to and innovations around torchtitan, we are hosting
 - estimate FSDP/HSDP memory usage without materializing the model
 - run distributed inference with Tensor Parallel
 
-We report [performance](docs/performance.md) on up to 512 GPUs, and verify [loss converging](docs/converging.md) correctness of various techniques.
+We report [performance](benchmarks/llama3_h100_202412_torchtitan.md) on up to 512 GPUs, and verify [loss converging](docs/converging.md) correctness of various techniques.
 
 ### Dive into the code

benchmarks/README.md

Lines changed: 22 additions & 0 deletions (new file)

We welcome the community to submit reproducible benchmarking results.

## Submission Guidelines

A submission should be a file or files including the following information:

1. Entity, which could be your name, GitHub username, company, university, team, etc.
2. The model or theme of the benchmark, e.g. Llama 3.1, Async TP.
3. The hardware setup, including the types of GPUs, interconnects, etc.
4. The actual performance report with training configs, e.g. via
   - `.toml` files / command line arguments
   - complete configs, which can be found in the log with [`--print_args`](https://github.com/pytorch/torchtitan/blob/e7c0cae934df78d6e9c2835f42ff1f757dc3fddc/torchtitan/config_manager.py#L47) turned on (preferred, as default values not shown in the `.toml` or specified on the command line can change from time to time)
5. The versions and date/time of `torchtitan`, `torch`, `torchao`, and any other relevant dependencies.
6. Other notes which could help reproduce the results.
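Item 5 above (recording the versions of `torch`, `torchao`, etc. used for a run) can be sketched with a small stdlib-only helper; the function name and output shape here are assumptions for illustration, not part of torchtitan:

```python
# Hypothetical helper (not part of torchtitan): collect installed versions of
# the dependencies requested in item 5, plus the date of the run.
from datetime import date
from importlib import metadata


def dependency_versions(packages=("torch", "torchao")):
    """Return the run date and {package: version, or 'not installed'}."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = "not installed"
    return {"date": date.today().isoformat(), "versions": versions}


print(dependency_versions())
```

Pasting this dictionary into the submission covers the versions and date/time in one place.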
The name of the file should follow the format

```
[model/theme]_[hardware]_[date/time]_[entity].md
```

For example: `llama3.1_h100_202412_pytorch.md`, `asynctp_256xh100_20250613_alice+bob.md`.

An example can be found at [llama3_h100_202412_torchtitan.md](./llama3_h100_202412_torchtitan.md).
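The naming convention above can be captured in a small helper; this is an illustrative sketch (the function and its validation rule are assumptions, not provided by torchtitan):

```python
# Hypothetical helper: build a submission filename following the
# [model/theme]_[hardware]_[date/time]_[entity].md convention described above.
import re


def submission_filename(theme: str, hardware: str, date: str, entity: str) -> str:
    """Join the four fields with underscores and add the .md extension."""
    name = f"{theme}_{hardware}_{date}_{entity}.md"
    # Loose sanity check: exactly four underscore-separated fields ending in .md
    if not re.fullmatch(r"[^_]+_[^_]+_[^_]+_[^_]+\.md", name):
        raise ValueError(f"unexpected submission filename: {name}")
    return name


print(submission_filename("llama3.1", "h100", "202412", "pytorch"))
# -> llama3.1_h100_202412_pytorch.md
```

Note that the individual fields must not contain underscores themselves, since underscores are the field separators.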

docs/performance.md renamed to benchmarks/llama3_h100_202412_torchtitan.md

Lines changed: 11 additions & 1 deletion

@@ -1,11 +1,21 @@
+The following performance benchmarks were done by the `torchtitan` team at the end of 2024 using the latest `torch`, `torchao`, and `torchtitan` versions.
+
+### Models
+
+Llama 3.1 8B, 70B, 405B
+
 We demonstrate the effectiveness of elastic distributed training using torchtitan, via experiments on Llama 3.1 8B, 70B, and 405B models, from 1D parallelism to 4D parallelism, at the scale from 8 GPUs to 512 GPUs.
 
+### Hardware
+
 We ran our performance benchmarks on the [Grand Teton platform](https://engineering.fb.com/2022/10/18/open-source/ocp-summit-2022-grand-teton/), where
 - Each host has 8 NVIDIA H100 GPUs fully connected with NVLink.
 - Each H100 GPU is equipped with 96GB HBM2e with 2.4 TB/sec peak memory bandwidth.
 - Hosts are inter-connected with backend RDMA network with 400 Gb/s per GPU.
 - We used the default 500W power limit, although tuning it up to 700W TDP can potentially provide further speedups.
 
+### Results
+
 We note that, throughout our experimentation, memory readings are stable across the whole training process[^1], whereas throughput numbers (TPS/GPU) are calculated and logged every 10 iterations, and always read at the (arbitrarily determined) 90th iteration.
 
 We do not report Model FLOPS Utilization (MFU) because when Float8 is enabled (on `nn.Linear` modules), both BFLOAT16 Tensor Core and FP8 Tensor Core are involved in model training, but they have different peak FLOPS and the definition of MFU under such scenario is not well-defined. We note that the 1D Llama 3.1 8B model training on 8 or 128 H100 GPUs without Float8 achieves 33% to 39% MFU[^2] (with or without torch.compile, respectively).

@@ -58,8 +68,8 @@ We do not report Model FLOPS Utilization (MFU) because when Float8 is enabled (o
 | FSDP 2, CP 4 | 131072 | 31 | 77.1 |
 | FSDP 1, CP 8 | 262144 | 16 | 84.9 |
 
+### Versions and Dates
 
-#### Versions used for performance testing
 | repo | commit | date |
 | --- | --- | --- |
 | torch | [1963fc8](https://github.com/pytorch/pytorch/commit/1963fc83a1c32e162162e2414f78b043f0674bae) | 2024/12/23 |
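The benchmark report above reads throughput (TPS/GPU) averaged over a 10-iteration logging window. A hedged sketch of that bookkeeping (names and numbers here are illustrative, not torchtitan's actual code):

```python
# Illustrative sketch only: throughput averaged over a logging window,
# e.g. 10 iterations, then normalized per GPU.
def tps_per_gpu(tokens_in_window: int, seconds_in_window: float, num_gpus: int) -> float:
    """Tokens processed per second during the window, normalized per GPU."""
    return tokens_in_window / seconds_in_window / num_gpus


# 10 iterations, a global batch of 8 * 8192 tokens per iteration,
# over a 10-second window, on 8 GPUs
print(tps_per_gpu(10 * 8 * 8192, 10.0, 8))  # 8192.0
```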
