
TMRC

Transformer model research codebase

TMRC (Transformer Model Research Codebase) is a simple, explainable codebase to train transformer-based models. It was developed with simplicity and ease of modification in mind, particularly for researchers. The codebase will eventually be used to train foundation models and experiment with architectural and training modifications.

Documentation

TMRC Documentation

Installation

  • Step 1: Load required modules

    If you are using the Kempner AI cluster, load required modules:

    module load python/3.12.5-fasrc01
    module load cuda/12.4.1-fasrc01
    module load cudnn/9.1.1.17_cuda12-fasrc01 

    If you are not using the Kempner cluster, install PyTorch and CUDA dependencies following the instructions on the PyTorch website. TMRC has been tested with torch 2.5.0+cu124 and Python 3.12.

  • Step 2: Create a Conda environment

    conda create -n tmrc_env python=3.12
    conda activate tmrc_env
  • Step 3: Clone the repository

    git clone git@github.com:KempnerInstitute/tmrc.git
  • Step 4: Install the package

    cd tmrc
    pip install poetry
    poetry install

Running Experiments

  • Step 1: Log in to Weights & Biases to enable experiment tracking

    wandb login

Single-GPU Training

  • Step 2: Request compute resources. For example, on the Kempner AI cluster, to request an H100 80GB GPU, run:

    salloc --partition=kempner_h100 --account=<fairshare account> --nodes=1 --ntasks=1 --cpus-per-task=24 --mem=375G --gres=gpu:1 --time=00-07:00:00

    If you are not using the Kempner AI cluster, you can run experiments on your local machine (if you have a GPU) or on cloud services like AWS, GCP, or Azure. TMRC should automatically find the available GPU. If there are no GPUs available, it will run on CPU (though this is not recommended, since training will be prohibitively slow for any reasonable model size).
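    The GPU/CPU fallback described above can be sketched as follows (a minimal illustration of the selection logic, not TMRC's actual implementation; the function name `pick_device` is hypothetical):

    ```python
    def pick_device() -> str:
        """Return the best available device string, falling back to CPU."""
        try:
            import torch
            if torch.cuda.is_available():
                return "cuda"
        except ImportError:
            # torch not installed; nothing to probe
            pass
        return "cpu"

    print(pick_device())
    ```

    Running on a CPU-only machine prints `cpu`; on a machine with a visible CUDA device it prints `cuda`.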

  • Step 3: Activate the Conda environment

    conda activate tmrc_env
  • Step 4: Launch training

    python src/tmrc/core/training/train.py

Multi-node, Multi-GPU Training

  • Step 2: Request compute resources. For example, on the Kempner AI cluster, to request eight H100 80GB GPUs across two nodes (four GPUs per node), run:

    salloc --partition=kempner_h100 --account=<fairshare account> --nodes=2 --ntasks-per-node=4 --ntasks=8 --cpus-per-task=24 --mem=375G --gres=gpu:4 --time=00-07:00:00
  • Step 3: Activate the Conda environment

    conda activate tmrc_env
  • Step 4: Launch training

    srun python src/tmrc/core/training/train.py
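    Under srun, each of the eight tasks typically derives its process rank from Slurm environment variables (SLURM_PROCID, SLURM_LOCALID, SLURM_NTASKS). A minimal sketch of reading them, with single-process defaults when they are absent (`slurm_rank_info` is a hypothetical helper, not part of TMRC):

    ```python
    import os

    def slurm_rank_info() -> dict:
        """Read global rank, node-local rank, and world size from Slurm
        environment variables, defaulting to a single-process layout."""
        return {
            "rank": int(os.environ.get("SLURM_PROCID", 0)),
            "local_rank": int(os.environ.get("SLURM_LOCALID", 0)),
            "world_size": int(os.environ.get("SLURM_NTASKS", 1)),
        }

    print(slurm_rank_info())
    ```

    With the salloc request above, each task would see a distinct rank in 0..7 and a world size of 8.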

Note

For distributed training, TMRC uses Distributed Data Parallel (DDP) by default. For larger models, you can switch to Fully Sharded Data Parallel (FSDP) by setting distributed_strategy to fsdp in the training section of the config file; see the next section on using a custom config file.
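For example, a custom config could select FSDP like this (the field name distributed_strategy comes from the note above; the surrounding structure of the file is an assumption, so check default_train_config.yaml for the real layout):

    # configs/training/my_experiment.yaml (hypothetical custom config)
    training:
      distributed_strategy: fsdp   # default strategy is DDP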

Configuration

By default, the training script uses the configuration defined in configs/training/default_train_config.yaml.

To use a custom configuration file:

python src/tmrc/core/training/train.py --config-name YOUR_CONFIG

Note

The --config-name parameter should be specified without the .yaml extension.

Tip

Configuration files should be placed in the configs/training/ directory. For example, if your config is named my_experiment.yaml, use --config-name my_experiment

Make sure to update the paths under the datasets block in your config file to point at your data.
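As an illustration, the override might look like this (the exact field names inside the datasets block are an assumption; copy the real keys from default_train_config.yaml):

    # hypothetical fragment of configs/training/my_experiment.yaml
    datasets:
      path: /path/to/your/dataset   # update to your data location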

Build the documentation locally

  • Step 1: Install the required packages

    poetry install --with dev
  • Step 2: Build the documentation

    cd docs
    make html
  • Step 3: Open the documentation in your browser

    open _build/html/index.html
