Commit cbccb38

[benchmark] add h200 bench (#1361)
DO NOT MERGE: WIP. This is a baseline for multi-node pretraining on H200s, since currently there don't seem to be any numbers out for H200.
This benchmark was performed by the Trainy team on WhiteFiber in June 2025, to establish a performance baseline for the Trainy platform on H200s across multiple hosts.

### Models

Llama 3.1 8B

### Hardware

Each host has:

- 8 NVIDIA H200 GPUs connected via NVLink.
- Hosts are interconnected with a backend RDMA fabric with 400Gb/s (Mellanox CX-7) per GPU.

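As a back-of-the-envelope check, the per-host backend bandwidth implied by the spec above can be derived directly (illustrative arithmetic only, not from the original report):

```python
# Per-host backend RDMA bandwidth: 8 GPUs, each behind a 400 Gb/s NIC (Mellanox CX-7).
GPUS_PER_HOST = 8
GBPS_PER_GPU = 400  # gigabits per second per GPU

total_gbps = GPUS_PER_HOST * GBPS_PER_GPU  # aggregate gigabits/s per host
total_gigabytes_per_s = total_gbps / 8     # convert bits to bytes

print(total_gbps)              # 3200 Gb/s per host
print(total_gigabytes_per_s)   # 400 GB/s per host
```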
### Configuration

Runs were invoked with the following command, where `NUM_NODES` was `4` or `8`:

```
torchrun \
--nnodes $NUM_NODES \
--nproc_per_node 8 \
--rdzv_id 101 \
--rdzv_backend c10d \
--rdzv_endpoint "$MASTER_ADDR:29500" \
torchtitan/train.py \
--job.config-file torchtitan/models/llama3/train_configs/llama3_8b.toml \
--metrics.enable_wandb \
--training.local_batch_size=2 \
--training.compile \
--model.converters="float8" \
--float8.enable_fsdp_float8_all_gather \
--float8.precompute_float8_dynamic_scale_for_fsdp \
--float8.force_recompute_fp8_weight_in_bwd \
--profiling.profile_freq 1000000 \
--training.steps 2000
```
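For context, every host in a c10d rendezvous runs the same command with a shared `MASTER_ADDR`. A minimal sketch of a per-node wrapper is below; the `build_cmd` helper, the default `node-0` hostname, and the dry-run print are illustrative assumptions, not part of the original runs:

```shell
#!/usr/bin/env bash
# Illustrative sketch only: assembles the torchrun command each node would run.
# build_cmd is a hypothetical helper; MASTER_ADDR must name the rendezvous host.

build_cmd() {
  local num_nodes="$1" master_addr="$2"
  CMD=(torchrun
    --nnodes "$num_nodes"
    --nproc_per_node 8
    --rdzv_id 101
    --rdzv_backend c10d
    --rdzv_endpoint "${master_addr}:29500"
    torchtitan/train.py
    --job.config-file torchtitan/models/llama3/train_configs/llama3_8b.toml)
}

# Dry run: print the assembled command rather than launching training.
build_cmd "${NUM_NODES:-4}" "${MASTER_ADDR:-node-0}"
printf '%s ' "${CMD[@]}"
echo
```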

### Results

Detailed performance results and training configurations can be found in the table below and can be visualized in [this WandB report](https://api.wandb.ai/links/asaiacai/w4c46stp). `TPS` and `Memory(GiB)` are arbitrarily sampled at the 100th iteration:

| NUM_NODES | TPS/GPU | Memory(GiB) |
| ----- | ----: | ----: |
| 4 | 10938 | 47.96 |
| 8 | 10753 | 46.97 |
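As a quick sanity check on these numbers, the aggregate throughput and the 4-to-8-node scaling efficiency follow directly from the table above (a small illustrative script, not part of the original benchmark):

```python
# Derive aggregate throughput and scaling efficiency from the results table.
GPUS_PER_NODE = 8
tps_per_gpu = {4: 10938, 8: 10753}  # NUM_NODES -> TPS/GPU, from the table above

# Aggregate tokens/sec across all GPUs in each job.
aggregate = {n: n * GPUS_PER_NODE * tps for n, tps in tps_per_gpu.items()}

# Scaling efficiency from 4 to 8 nodes: fraction of per-GPU throughput retained.
efficiency = tps_per_gpu[8] / tps_per_gpu[4]

print(aggregate[4])        # 350016 tokens/sec on 32 GPUs
print(aggregate[8])        # 688192 tokens/sec on 64 GPUs
print(f"{efficiency:.1%}") # 98.3%
```

Per-GPU throughput drops by under 2% when doubling from 4 to 8 nodes, i.e. the run scales nearly linearly over the 400Gb/s backend fabric.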

### Versions and Dates

| repo | commit | date |
| --- | --- | --- |
| torch | [2.8.0a0+5228986c39](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-25-05.html) | 2025/05/29 |
| torchao | [0afa4c1](https://github.com/pytorch/ao/commit/0afa4c1bd28c82921e360ddbd1b27c9d6da5b947) | 2025/06/13 |
| torchtitan | [e7c0cae](https://github.com/pytorch/torchtitan/commit/e7c0cae934df78d6e9c2835f42ff1f757dc3fddc) | 2025/06/13 |
