This benchmark was performed by the Trainy team on WhiteFiber in June 2025 to establish a baseline for
the performance of the Trainy platform on NVIDIA H200 GPUs across multiple hosts.

### Models

Llama 3.1 8B

### Hardware

Each host has:

- 8 NVIDIA H200 GPUs connected via NVLink.
- A 400 Gb/s (Mellanox CX-7) link per GPU to the backend RDMA fabric that interconnects the hosts.

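As a rough sanity check on the hardware description above, the per-host bandwidth into the backend fabric can be worked out from the per-GPU figure. This is only a sketch: it assumes one 400 Gb/s link per GPU (so 8 links per host) and reports nominal line rate, not achieved throughput. The constant and function names are illustrative.

```python
GPUS_PER_HOST = 8          # from the hardware list above
LINK_GBIT_PER_GPU = 400    # nominal Gb/s per GPU, from the hardware list above


def host_fabric_gbit(gpus: int = GPUS_PER_HOST, link_gbit: int = LINK_GBIT_PER_GPU) -> int:
    """Nominal aggregate backend bandwidth per host in Gb/s (assumes one link per GPU)."""
    return gpus * link_gbit


def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbit_per_s / 8


total = host_fabric_gbit()
print(f"Per host: {total} Gb/s nominal, i.e. {gbit_to_gbyte(total):.0f} GB/s")
```

Under that assumption, each host has a nominal 3200 Gb/s (400 GB/s) into the RDMA fabric.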
### Configuration

Runs were invoked with the following command, with `NUM_NODES` set to `4` and `8`:

```
torchrun \
    --nnodes $NUM_NODES \
    --nproc_per_node 8 \
    --rdzv_id 101 \
    --rdzv_backend c10d \
    --rdzv_endpoint "$MASTER_ADDR:29500" \
    torchtitan/train.py \
    --job.config-file torchtitan/models/llama3/train_configs/llama3_8b.toml \
    --metrics.enable_wandb \
    --training.local_batch_size=2 \
    --training.compile \
    --model.converters="float8" \
    --float8.enable_fsdp_float8_all_gather \
    --float8.precompute_float8_dynamic_scale_for_fsdp \
    --float8.force_recompute_fp8_weight_in_bwd \
    --profiling.profile_freq 1000000 \
    --training.steps 2000
```

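To make the scale of each run concrete, the launch flags above can be turned into a world size, a global batch size, and a token count per optimizer step. This is a minimal sketch, not something taken from the benchmark itself: it assumes every GPU acts as a data-parallel (FSDP) rank with no gradient accumulation, and it assumes a sequence length of 8192 from `llama3_8b.toml`, which should be checked against the config at the benchmarked commit.

```python
GPUS_PER_NODE = 8       # --nproc_per_node 8
LOCAL_BATCH_SIZE = 2    # --training.local_batch_size=2
SEQ_LEN = 8192          # ASSUMPTION: seq_len in llama3_8b.toml; verify in the config file


def world_size(num_nodes: int) -> int:
    """Total number of GPU ranks launched by torchrun."""
    return num_nodes * GPUS_PER_NODE


def global_batch_size(num_nodes: int) -> int:
    """Sequences per step, assuming every rank is data-parallel and no grad accumulation."""
    return world_size(num_nodes) * LOCAL_BATCH_SIZE


def tokens_per_step(num_nodes: int) -> int:
    """Tokens consumed per optimizer step under the assumptions above."""
    return global_batch_size(num_nodes) * SEQ_LEN


for n in (4, 8):
    print(
        f"{n} nodes: {world_size(n)} GPUs, "
        f"global batch {global_batch_size(n)}, "
        f"{tokens_per_step(n):,} tokens/step"
    )
```

Under these assumptions, the 4-node run uses 32 GPUs with a global batch of 64 sequences, and the 8-node run uses 64 GPUs with a global batch of 128, so the global batch size doubles between the two runs rather than staying fixed.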
### Results

Detailed performance results and training configurations can be found in the tables below and can be visualized in [this WandB report](https://api.wandb.ai/links/asaiacai/w4c46stp). `TPS/GPU` (tokens per second per GPU) and `Memory(GiB)` are arbitrarily sampled at the 100th iteration:

| NUM_NODES | TPS/GPU | Memory(GiB) |
| ----- | ----: | ----: |
| 4 | 10938 | 47.96 |
| 8 | 10753 | 46.97 |

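One way to read the results table is to convert the per-GPU figures into aggregate throughput and a scaling efficiency between the two runs. The sketch below uses only the numbers in the table and the 8 GPUs per host stated under Hardware. The function names are illustrative, and because each value is a single sample at iteration 100, the outputs are indicative rather than averaged measurements.

```python
GPUS_PER_NODE = 8

# NUM_NODES -> TPS/GPU, copied from the results table
TPS_PER_GPU = {4: 10938, 8: 10753}


def aggregate_tps(num_nodes: int) -> int:
    """Cluster-wide tokens per second: per-GPU throughput times GPU count."""
    return TPS_PER_GPU[num_nodes] * num_nodes * GPUS_PER_NODE


def scaling_efficiency(base_nodes: int, scaled_nodes: int) -> float:
    """Fraction of per-GPU throughput retained when scaling out."""
    return TPS_PER_GPU[scaled_nodes] / TPS_PER_GPU[base_nodes]


for n in sorted(TPS_PER_GPU):
    print(f"{n} nodes: {aggregate_tps(n):,} tokens/s aggregate")

print(f"4 -> 8 nodes efficiency: {scaling_efficiency(4, 8):.1%}")
print(f"4 -> 8 nodes speedup: {aggregate_tps(8) / aggregate_tps(4):.2f}x")
```

With the table values, this works out to roughly 350k tokens/s on 4 nodes and 688k tokens/s on 8 nodes, meaning about 98.3% of per-GPU throughput is retained when doubling the node count.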
### Versions and Dates

| repo | commit | date |
| --- | --- | --- |
| torch | [2.8.0a0+5228986c39](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-25-05.html) | 2025/05/29 |
| torchao | [0afa4c1](https://github.com/pytorch/ao/commit/0afa4c1bd28c82921e360ddbd1b27c9d6da5b947) | 2025/06/13 |
| torchtitan | [e7c0cae](https://github.com/pytorch/torchtitan/commit/e7c0cae934df78d6e9c2835f42ff1f757dc3fddc) | 2025/06/13 |