Transformer model research codebase
TMRC (Transformer Model Research Codebase) is a simple, explainable codebase to train transformer-based models. It was developed with simplicity and ease of modification in mind, particularly for researchers. The codebase will eventually be used to train foundation models and experiment with architectural and training modifications.
Installation
Step 1: Load required modules
If you are using the Kempner AI cluster, load required modules:
module load python/3.12.5-fasrc01
module load cuda/12.4.1-fasrc01
module load cudnn/9.1.1.17_cuda12-fasrc01
If you are not using the Kempner AI cluster, install the torch and CUDA dependencies following the instructions on the PyTorch website. TMRC has been tested with torch 2.5.0+cu124 and Python 3.12.
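After installing, you can verify that the versions match the tested configuration:
python -c "import torch; print(torch.__version__, torch.version.cuda)"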
Step 2: Create a Conda environment
conda create -n tmrc_env python=3.12
conda activate tmrc_env
Step 3: Clone the repository
git clone git@github.com:KempnerInstitute/tmrc.git
Step 4: Install the package
cd tmrc
pip install poetry
poetry install
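To confirm the installation succeeded (assuming the package is importable as tmrc, matching the src/tmrc layout), try:
python -c "import tmrc"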
Single-GPU training
Step 1: Log in to Weights & Biases to enable experiment tracking
wandb login
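If you need a non-interactive login (e.g., inside a batch script), wandb also reads your API key from the WANDB_API_KEY environment variable:
export WANDB_API_KEY=<your API key>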
Step 2: Request compute resources. For example, on the Kempner AI cluster, to request an H100 80GB GPU, run
salloc --partition=kempner_h100 --account=<fairshare account> --nodes=1 --ntasks=1 --cpus-per-task=24 --mem=375G --gres=gpu:1 --time=00-07:00:00
If you are not using the Kempner AI cluster, you can run experiments on your local machine (if you have a GPU) or on cloud services like AWS, GCP, or Azure. TMRC should automatically find the available GPU. If there are no GPUs available, it will run on CPU (though this is not recommended, since training will be prohibitively slow for any reasonable model size).
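A quick way to confirm which device training will land on (this mirrors the standard PyTorch availability check; TMRC's actual device-selection logic lives in the codebase):
python -c "import torch; print('cuda' if torch.cuda.is_available() else 'cpu')"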
Step 3: Activate the Conda environment
conda activate tmrc_env
Step 4: Launch training
python src/tmrc/core/training/train.py
Multi-GPU training
Step 1 is the same as for single-GPU training: log in to Weights & Biases with wandb login.
Step 2: Request compute resources. For example, on the Kempner AI cluster, to request eight H100 80GB GPUs on two nodes, run
salloc --partition=kempner_h100 --account=<fairshare account> --nodes=2 --ntasks-per-node=4 --ntasks=8 --cpus-per-task=24 --mem=375G --gres=gpu:4 --time=00-07:00:00
Step 3: Activate the Conda environment
conda activate tmrc_env
Step 4: Launch training
srun python src/tmrc/core/training/train.py
Note
For distributed training, TMRC uses Distributed Data Parallel (DDP) by default. For larger models, to use Fully Sharded Data Parallel (FSDP), set distributed_strategy to fsdp in the training section of the config file (sketched below), or see the next section on using a custom config file.
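As a sketch, the relevant block would look like the following (check the exact layout against configs/training/default_train_config.yaml; the value ddp for the default is an assumption):
training:
  distributed_strategy: fsdp  # assumed to default to ddp when unset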
By default, the training script uses the configuration defined in configs/training/default_train_config.yaml.
To use a custom configuration file
python src/tmrc/core/training/train.py --config-name YOUR_CONFIG
Note
The --config-name parameter should be specified without the .yaml extension.
Tip
Configuration files should be placed in the configs/training/ directory. For example, if your config is named my_experiment.yaml, use --config-name my_experiment. Make sure to update the path under the datasets block in the config file, as sketched below.
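Putting the tips together, a hypothetical configs/training/my_experiment.yaml might override just the fields mentioned above, copying the rest from default_train_config.yaml (the exact schema should be checked against that file):
# configs/training/my_experiment.yaml (hypothetical)
datasets:
  path: /path/to/your/dataset  # update to point at your data
training:
  distributed_strategy: fsdp   # or omit to keep the DDP default
It would then be launched with:
python src/tmrc/core/training/train.py --config-name my_experiment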
Building the documentation
Step 1: Install the required packages
poetry install --with dev
Step 2: Build the documentation
cd docs
make html
Step 3: Open the documentation in your browser
open _build/html/index.html
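Note that open is macOS-specific; on Linux, use xdg-open instead (or point your browser at the file directly):
xdg-open _build/html/index.html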