Commit 69a1fc1

Document model status (#138)
* Document model status
* Update
* Move hf instruction
1 parent f23d3cf commit 69a1fc1


4 files changed: +225, -64 lines


README.md

Lines changed: 54 additions & 27 deletions
@@ -100,15 +100,51 @@ tp run torchprime/experimental/torchax_models/run.py global_batch_size=256
`tp run` will broadcast the specified command to all VMs in the XPK cluster,
which is the convention for running SPMD distributed workloads.

-#### Env var passed to the workload
+#### Env vars passed to the workload

`tp run` will pick up these environment variables locally and proxy them
to the distributed workload, if found:

- `HF_TOKEN`: HuggingFace token
- `XLA_IR_DEBUG`: [torch_xla debugging flag][torch_xla_debug_env]
- `XLA_HLO_DEBUG`: [torch_xla debugging flag][torch_xla_debug_env]
-- `LIBTPU_INIT_ARGS`: xla flag
+- `LIBTPU_INIT_ARGS`: XLA flags that affect compilation and execution behavior
+
+## Model status
+
+Here is the status of various models. In general, there are five stages for
+each model:
+
+- **TODO**: We need to implement the model.
+- **Implemented**: The model runs either a training or an inference step.
+- **Optimized**: We found the best scaling configuration for the model on one or
+more hardware platforms. One-off performance data is available.
+- **Convergence**: We tested that the training loss converges to a reasonable
+value, or that the loss curve tracks an existing reference if one exists.
+- **Production**: Not only is the model optimized and converging, its performance
+is also continuously monitored. This is a good state for using the model in
+production.
+
+All implemented models will at least have unit tests to verify basic numerical
+correctness, and the convergence verification stage serves as an additional
+correctness guarantee.
+
+If a model is at least implemented, you'll also find a training recipe linked
+from the checkmark emoji in the table. If a model is optimized, you'll also find
+MFU numbers linked from the table. Note that a model may continue to receive
+ongoing optimization thereafter.
+
+| **Model**            | **Implemented** | **Optimized** | **Converges** |
+| -------------------- | --------------- | ------------- | ------------- |
+| Llama 3.0 8B         | [✅](torchprime/torch_xla_models/README.md#llama-30-8b-on-v6e-256) | [✅](torchprime/torch_xla_models/README.md#llama-30-8b-on-v6e-256) | [TODO](https://github.com/AI-Hypercomputer/torchprime/issues/90) |
+| Llama 3.1 8B         | [✅](torchprime/torch_xla_models/README.md#llama-31-8b-on-v6e-256) | [TODO](https://github.com/AI-Hypercomputer/torchprime/issues/133) | TODO |
+| Llama 3.1 70B        | [TODO](https://github.com/AI-Hypercomputer/torchprime/issues/17) | TODO | TODO |
+| Llama 3.1 405B       | [✅](torchprime/torch_xla_models/README.md#llama-31-405b-on-v6e-256) | [TODO](https://github.com/AI-Hypercomputer/torchprime/milestone/2) | TODO |
+| Mixtral 8x7B         | [✅](torchprime/torch_xla_models/README.md#mixtral-8x7b-on-v6e-256) | [TODO](https://github.com/AI-Hypercomputer/torchprime/issues/44) | TODO |
+| Mixtral 8x22B        | [TODO](https://github.com/AI-Hypercomputer/torchprime/issues/45) | TODO | TODO |
+| DeepSeek V3/R1       | TODO | TODO | TODO |
+| Stable Diffusion 2.0 | [TODO](https://github.com/AI-Hypercomputer/torchprime/issues/87) | TODO | TODO |
+| Stable Diffusion 2.1 | [TODO](https://github.com/AI-Hypercomputer/torchprime/issues/88) | TODO | TODO |

## Structure

@@ -133,31 +169,6 @@ and attributes where this model code came from, if any. This also helps to
showcase what changes we have done to make it performant on TPU. The original
version is not expected to be run.

-## Run huggingface transformer models
-Torchprime supports run with huggingface models by taking advantage of `tp run`.
-To use huggingface models, you can clone
-[huggingface/transformers](https://github.com/huggingface/transformers) under
-torchprime and name it as `local_transformers`. This allows you to pick any
-branch or make code modifications in transformers for experiment:
-```
-git clone https://github.com/huggingface/transformers.git local_transformers
-```
-If huggingface transformer doesn't exist, torchprime will automatically clone
-the repo and build the docker for experiment. To switch to huggingface models,
-add flag `--use-hf` to `tp run` command:
-```
-tp run --use-hf torchprime/hf_models/train.py
-```
-
-## Run with local torch/torch_xla wheel
-Torchprime supports run with user specified torch and torch_xla wheels placed
-under `local_dist/` directory. The wheel will be automatically installed in the
-docker image when use `tp run` command. To use the wheel, add flag
-`--use-local-wheel` to `tp run` command:
-```
-tp run --use-local-wheel torchprime/hf_models/train.py
-```
-
## Contributing

Contributions are welcome! Please feel free to submit a pull request.
@@ -192,6 +203,21 @@ ruff check [--fix]
You can install a Ruff VSCode plugin to check errors and format files from
the editor.

+## Run distributed training with local torch/torch_xla wheel
+
+Torchprime supports running with user-specified torch and torch_xla wheels placed
+under the `local_dist/` directory. The wheels will be automatically installed in the
+docker image when using the `tp run` command. To use the wheels, add the
+`--use-local-wheel` flag to the `tp run` command:
+
+```sh
+tp run --use-local-wheel torchprime/hf_models/train.py
+```
+
+The wheels should be built inside a
+[PyTorch/XLA development docker image][torch_xla_dev_docker] or the PyTorch/XLA
+VSCode Dev Container to minimize compatibility issues.
+
## License

This project is licensed under the New BSD License - see the [LICENSE](LICENSE)
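To make the local-wheel workflow added in this hunk concrete, here is a hedged sketch of staging the wheels; the source path is a placeholder and the wheel filenames are whatever your PyTorch/XLA dev-container build produced:

```sh
# Stage locally built wheels where `tp run --use-local-wheel` expects them.
mkdir -p local_dist
cp /path/to/your/build/torch-*.whl /path/to/your/build/torch_xla-*.whl local_dist/

# The wheels under local_dist/ are installed into the docker image that tp run builds.
tp run --use-local-wheel torchprime/hf_models/train.py
```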
@@ -205,3 +231,4 @@ For more information on PyTorch/XLA, visit the
[xpk]: https://github.com/AI-Hypercomputer/xpk
[torch_xla_debug_env]: https://github.com/pytorch/xla/blob/master/docs/source/learn/troubleshoot.md#environment-variables
[hydra]: https://hydra.cc/docs/intro/
+[torch_xla_dev_docker]: https://github.com/pytorch/xla/blob/master/CONTRIBUTING.md#manually-build-in-docker-container
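As an editorial aside on the env-var proxying documented near the top of this README diff, here is a minimal sketch of the intended workflow; the `LIBTPU_INIT_ARGS` value is only an example lifted from the recipes later in this commit, not a published setting:

```sh
# Exported locally; `tp run` forwards these to the workers if they are set.
export HF_TOKEN='... hugging face token ...'
export XLA_IR_DEBUG=1
export XLA_HLO_DEBUG=1
export LIBTPU_INIT_ARGS='--xla_tpu_scoped_vmem_limit_kib=98304'

# The command itself is broadcast to all VMs in the XPK cluster.
tp run torchprime/torch_xla_models/train.py
```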

torchprime/experimental/torchax_models/README.md

Lines changed: 26 additions & 13 deletions
@@ -40,30 +40,40 @@ pip install optax tensorflow tensorboard-plugin-profile
pip install -e .[tpu] -f https://storage.googleapis.com/libtpu-releases/index.html
```

-## Running locally
+## Running locally on a TPU VM

-```bash
-python run.py --model_impl=<orig|scan|scan_manual>
+Set up the environment as per the [README][README-examples].
+
+```sh
+python run.py model_impl=<orig|scan|scan_manual>
```

-## Run on XPK
+### Llama 3.1 8B on v6e-8
+
+Recipe for global batch size 8, sequence length 8192.
+Expected step duration: 1.7s. MFU: 30%.
+
+```sh
+export LIBTPU_INIT_ARGS="--xla_tpu_scoped_vmem_limit_kib=98304 --xla_enable_async_all_gather=true --xla_tpu_overlap_compute_collective_tc=true --xla_tpu_enable_async_collective_fusion_multiple_steps=true --xla_tpu_enable_async_collective_fusion=true --xla_tpu_enable_async_collective_fusion_fuse_all_gather=true"

-Follow the guide in `tp use` to setup the cluster information.
+python run.py model_impl=scan tp=1 global_batch_size=8 seqlen=8192
+```

-Run `tp run <loal command>` to run the training command on the XPK cluster.
+## Running on an XPK cluster

-## Benchmarks (WIP)
+First follow the [distributed training][distributed-training] guide to set up the
+cluster information.

-|device| Model size | Batch size | seq length | step time | MFU | NOTEs|
-|-------| ----- | ----- | ----- | ----- | ----- | ---|
-|TPU v6e-8| 8B | 8 | 8192 | 1.7s | 30% | Scan, fsdp, host-offload|
-|TPU v6e-256 x 2| 405B | 256 | 8192 | 46.12s | 28.7% | Scan, fsdp + tp, host-offload|
+Run `tp run <local command>` to run the training command on the XPK cluster.

-<!-- TODO: support specifying different XLA flags -->
+### Llama 3.1 405B on 2 pods of v6e-256

-Llama 3.1 405B on v6e-256 x 2 command:
+Recipe for global batch size 256, sequence length 8192.
+Expected step duration: 46.12s. MFU: 28.7%.

```sh
+export LIBTPU_INIT_ARGS="--xla_tpu_scoped_vmem_limit_kib=98304 --xla_enable_async_all_gather=true --xla_tpu_overlap_compute_collective_tc=true --xla_tpu_enable_async_collective_fusion_multiple_steps=true --xla_tpu_enable_async_collective_fusion=true --xla_tpu_enable_async_collective_fusion_fuse_all_gather=true"
+
tp run torchprime/experimental/torchax_models/run.py \
global_batch_size=256 \
model_type=405B \
@@ -74,3 +84,6 @@ tp run torchprime/experimental/torchax_models/run.py \
tp=4 \
unroll_layers=1
```
+
+[README-examples]: ../../README.md#examples
+[distributed-training]: ../../README.md#distributed-training
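Purely as an illustration of the `model_impl=<orig|scan|scan_manual>` override shown above, a local comparison of the three implementations on a single TPU VM might look like the following sketch (default settings assumed for everything else):

```sh
# Run the three layer implementations back to back and compare step times.
for impl in orig scan scan_manual; do
  echo "=== model_impl=$impl ==="
  python run.py model_impl="$impl"
done
```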

torchprime/hf_models/README.md

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+# Run huggingface transformer models
+
+For contributors to torchprime, `tp run` also supports running the huggingface
+trainer, for debugging and comparison. This module implements an adapter over
+the huggingface trainer.
+
+To run the huggingface trainer, you can clone
+[huggingface/transformers][hf-transformers] under the root directory of
+torchprime and name it `local_transformers`. This allows you to pick any
+branch or make code modifications in transformers for experiments:
+
+```sh
+git clone https://github.com/huggingface/transformers.git local_transformers
+```
+
+If the local huggingface transformers checkout doesn't exist, torchprime will
+automatically clone the repo and build the docker image for the experiment. To
+switch to huggingface models, add the `--use-hf` flag to the `tp run` command:
+
+```sh
+tp run --use-hf torchprime/hf_models/train.py
+```
+
+[hf-transformers]: https://github.com/huggingface/transformers
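To make the `local_transformers` workflow above concrete, here is a hedged sketch of pinning the checkout to a specific tag before building the experiment image; the tag is only an example, and any branch or fork works the same way:

```sh
# Shallow-clone a specific transformers tag into the directory tp run looks for.
git clone --branch v4.48.0 --depth 1 \
  https://github.com/huggingface/transformers.git local_transformers

# Build and launch with the pinned checkout baked into the docker image.
tp run --use-hf torchprime/hf_models/train.py
```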

torchprime/torch_xla_models/README.md

Lines changed: 121 additions & 24 deletions
@@ -1,50 +1,147 @@
# torch_xla models

-## Features
+These models use the [torch_xla][1] framework.

-- Optimized for PyTorch/XLA
-- Demonstrates GSPMD parallelism
-- Supports large language models tasks
+## Running locally on a TPU VM

-## Running locally
+1. Set up the environment as per the [README][README-examples].

-1. Clone the repository:
+1. Export key environment variables:

-```
-git clone https://github.com/AI-Hypercomputer/torchprime.git
-cd torchprime
+```sh
+export HF_TOKEN='... hugging face token ...'
+export XLA_IR_DEBUG=1
+export XLA_HLO_DEBUG=1
```

-2. Install the package:
+1. Run the trainer. The default is to train Llama 3.0 8B sharded over 4 chips.

+```sh
+python3 torchprime/torch_xla_models/train.py
```
-pip install -e .
-```
-
-3. Run the training script:

-```
-XLA_IR_DEBUG=1 XLA_HLO_DEBUG=1 python3 torchprime/torch_xla_models/train.py
-```
+## Running on an XPK cluster

-## Running on XPK
+First follow the [distributed training][distributed-training] guide to set up the
+cluster information.

-Follow the guide in `tp use` to setup the cluster information.
+Then export key environment variables in your local environment:

```sh
export HF_TOKEN='... hugging face token ...'
export XLA_IR_DEBUG=1
-export XLA_HLO_DEBUG=1
+export XLA_HLO_DEBUG=1
+```
+
+Finally, pick one of these recipes, and it will build a docker image and
+launch it on XPK.
+
+### Llama 3.0 8B on v6e-256
+
+Recipe for global batch size 256, sequence length 8192.
+Expected step duration: 1.625s. MFU: 33.53%.

-tp run torchprime/torch_xla_models/train.py
+```sh
+tp run torchprime/torch_xla_models/train.py \
+model=llama-3-8b \
+global_batch_size=256 \
+block_size=8192 \
+profile_step=5 \
+ici_mesh.fsdp=256
+```
+
+Recipe for global batch size 512, sequence length 8192.
+Expected step duration: 2.991s. MFU: 36.43%.
+
+```sh
+tp run torchprime/torch_xla_models/train.py \
+model=llama-3-8b \
+global_batch_size=512 \
+block_size=8192 \
+profile_step=5 \
+ici_mesh.fsdp=256
```

-This will build the dockerfile and launch it on XPK.
+### Llama 3.1 8B on v6e-256

+<!-- TODO(https://github.com/AI-Hypercomputer/torchprime/issues/135): publish perf data. -->
+
+Recipe for global batch size 512, sequence length 8192:
+
+```sh
+tp run torchprime/torch_xla_models/train.py \
+model=llama-3.1-8b \
+global_batch_size=512 \
+block_size=8192 \
+profile_step=5 \
+ici_mesh.fsdp=256
+```
+
+### Llama 3.1 405B on v6e-256
+
+Recipe for global batch size 64, sequence length 8192.
+Expected step duration: 27.349s. MFU: 21.48%.
+
+```sh
+export LIBTPU_INIT_ARGS='--xla_tpu_enable_flash_attention=false --xla_tpu_enable_async_collective_fusion=true --xla_tpu_enable_async_collective_fusion_fuse_all_gather=true --xla_tpu_enable_async_collective_fusion_multiple_steps=true --xla_tpu_overlap_compute_collective_tc=true --xla_enable_async_all_gather=true --xla_tpu_scoped_vmem_limit_kib=98304'
+
+tp run torchprime/torch_xla_models/train.py \
+model=llama-3.1-405b \
+global_batch_size=64 \
+block_size=8192 \
+profile_step=5 \
+ici_mesh.fsdp=64 \
+ici_mesh.tensor=4
+```
+
+### Llama 3.1 405B on 2 pods of v6e-256
+
+Recipe for global batch size 128, sequence length 8192. We need to use a larger
+dataset and profile later for longer for the DCN performance to stabilize.
+
+Expected step duration: 30.933s. MFU: 18.99%.
+
+```sh
+export LIBTPU_INIT_ARGS='--xla_enable_async_all_gather=true --xla_tpu_enable_async_collective_fusion=true --xla_tpu_enable_async_collective_fusion_fuse_all_gather=true --xla_tpu_enable_async_collective_fusion_multiple_steps=true --xla_tpu_decompose_all_gather_einsum=true --xla_tpu_decompose_einsum_reduce_scatter=true --xla_tpu_scoped_vmem_limit_kib=98304 --xla_tpu_spmd_rng_bit_generator_unsafe=true --xla_tpu_overlap_compute_collective_tc=true --xla_tpu_use_enhanced_launch_barrier=true'
+
+tp run torchprime/torch_xla_models/train.py \
+model=llama-3.1-405b \
+global_batch_size=128 \
+dcn_mesh.fsdp=2 \
+ici_mesh.fsdp=64 \
+ici_mesh.tensor=4 \
+dataset_config_name=wikitext-103-raw-v1 \
+profile_step=15 \
+profile_duration=240000 \
+max_steps=50 \
+logging_steps=10
+```
+
+### Mixtral 8x7B on v6e-256
+
+<!-- TODO(https://github.com/AI-Hypercomputer/torchprime/issues/137): publish perf data -->
+
+Recipe for global batch size 512, sequence length 8192.
+
+```sh
+tp run torchprime/torch_xla_models/train.py \
+model=mixtral-8x7b \
+global_batch_size=512 \
+ici_mesh.fsdp=256 \
+dataset_config_name=wikitext-103-raw-v1 \
+profile_step=5
+```

## Key Components

- `train.py`: Main training script that sets up the model, data, and training loop
- `configs/base.yaml`: Configuration file for the training script
-- `configs/model`: Configuration files for the training models
-- `llama/model.py`: Implementation of the Llama model
+- `configs/model`: Configuration files for models
+- `configs/model/scaling`: Configuration files for scaling the training of a model, e.g.
+rematerialization, sharding tensors.
+- `llama/model.py`: Implementation of the Llama model family
+- `mixtral/model.py`: Implementation of the Mixtral model family
+
+[1]: https://github.com/pytorch/xla
+[README-examples]: ../../README.md#examples
+[distributed-training]: ../../README.md#distributed-training
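The recipes above compose plain `key=value` overrides on the training entrypoint. As a hedged example (the values below are assumptions for a quick local sanity check, not a published recipe), a short profiled run on the default 4-chip TPU VM setup could look like:

```sh
# Small, fast run for sanity-checking a config change locally.
python3 torchprime/torch_xla_models/train.py \
  model=llama-3-8b \
  global_batch_size=8 \
  ici_mesh.fsdp=4 \
  profile_step=5 \
  max_steps=20 \
  logging_steps=5
```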
