amazon-science/collage

Introduction

Collage is a low-precision training strategy for large language models (LLMs). It uses multi-component floats to reduce the memory footprint during training, particularly in the optimizer, using purely low-precision (e.g., BFloat16) arithmetic without resorting to Float32.
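
For intuition, the core building block behind multi-component floats is an error-free transformation such as Fast2Sum, which keeps the rounding error of a low-precision addition as an explicit extra component instead of discarding it. The snippet below is a minimal conceptual sketch in PyTorch, not the repository's implementation; it only illustrates why carrying the rounding error lets purely BFloat16 arithmetic track an accumulation (such as an optimizer state update) far more accurately.

import torch

def fast_two_sum(a, b):
    # Error-free transformation (Fast2Sum): assuming |a| >= |b|, returns
    # (s, e) with a + b == s + e exactly, where s is the rounded bf16 sum.
    s = a + b
    e = b - (s - a)
    return s, e

# Accumulate many small BFloat16 increments into a running sum.
inc = torch.tensor(1e-3, dtype=torch.bfloat16)

plain = torch.tensor(0.0, dtype=torch.bfloat16)  # single bf16 accumulator
hi = torch.tensor(0.0, dtype=torch.bfloat16)     # multi-component: leading term
lo = torch.tensor(0.0, dtype=torch.bfloat16)     # multi-component: carried rounding error

for _ in range(4096):
    plain = plain + inc                   # rounding error is silently dropped
    hi, lo = fast_two_sum(hi, inc + lo)   # rounding error is carried forward

print(plain.item())       # far from 4.096: the plain bf16 sum stalls once the
                          # increment falls below half an ULP of the running total
print((hi + lo).item())   # close to 4.096, using only BFloat16 arithmetic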

Using Collage is simple: replace AdamW with our AdamW_collage optimizer and choose one of the Collage options, i.e., light or plus (see our paper for details). We provide Collage training for BERT & RoBERTa, as well as for multi-size GPTs with the NeMo Megatron framework.
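
The following sketch shows what this swap could look like in a minimal masked-LM training step. The import path of AdamW_collage and the keyword selecting the light/plus variant are assumptions for illustration only; please check the optimizer code in this repository for the exact module location and constructor signature.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hypothetical import path -- see this repository for the module that
# actually defines AdamW_collage.
from collage import AdamW_collage

model = AutoModelForMaskedLM.from_pretrained(
    "bert-base-uncased", torch_dtype=torch.bfloat16
)

# Drop-in replacement for torch.optim.AdamW; the keyword choosing the
# "light" or "plus" Collage variant is an assumption here.
optimizer = AdamW_collage(
    model.parameters(), lr=1e-4, weight_decay=0.01, collage="light"
)

# One masked-LM training step, identical in shape to training with AdamW.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer("Collage keeps optimizer states in low precision.", return_tensors="pt")
batch["labels"] = batch["input_ids"].clone()

loss = model(**batch).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()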

Requirements

  1. python 3.8 or above
  2. transformers 4.31.0 or above
  3. pytorch 1.13.1 + CUDA 11.7 or above

We recommend using NeMo r1.22.0 with the released container nemo:23.11:

docker pull nvcr.io/nvidia/nemo:23.11.framework

Datasets

Please follow AWS-Neuron-Tutorials-BERT to download the tokenized wikicorpus files for BERT and RoBERTa:

mkdir -p ./examples_datasets/
pushd ./examples_datasets/
aws s3 cp s3://neuron-s3/training_datasets/bert_pretrain_wikicorpus_tokenized_hdf5/bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar .  --no-sign-request
tar -xf bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar
rm bert_pretrain_wikicorpus_tokenized_hdf5_seqlen128.tar
aws s3 cp s3://neuron-s3/training_datasets/bert_pretrain_wikicorpus_tokenized_hdf5/bert_pretrain_wikicorpus_tokenized_hdf5_seqlen512.tar .  --no-sign-request
tar -xf bert_pretrain_wikicorpus_tokenized_hdf5_seqlen512.tar
rm bert_pretrain_wikicorpus_tokenized_hdf5_seqlen512.tar
popd

Please follow AWS-Neuron-Examples-GPT to download the Wikipedia dataset stored in S3:

export DATA_DIR=./examples_datasets/gpt2
mkdir -p ${DATA_DIR} && cd ${DATA_DIR}
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/my-gpt2_text_document.bin .  --no-sign-request
aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/my-gpt2_text_document.idx .  --no-sign-request
aws s3 cp s3://neuron-s3/training_datasets/gpt/wikipedia/license.txt .  --no-sign-request

Examples

Scripts for training BERT and RoBERTa are provided in the roBERTa/scripts folder. Scripts for multi-size (125M, 1.3B, 2.7B and 6.7B) GPTs can be found in the NeMo-GPT/scripts/nlp_language_modeling folder.

Cite us

If you find our work helpful in your research, please consider citing the following paper:

@inproceedings{yu2024collage,
    title={Collage: Light-Weight Low-Precision Strategy for LLM Training},
    author={Yu, Tao and Gupta, Gaurav and Gopalswamy, Karthick and Mamidala, Amith and Zhou, Hao and Huynh, Jeffrey and Park, Youngsuk and Diamant, Ron and Deoras, Anoop and Huan, Luke},
    booktitle={Proceedings of the 41st International Conference on Machine Learning (ICML 2024)},
    year={2024},
    organization={PMLR}
}

License

NeMo-GPT is modified from NVIDIA NeMo, which is released under an Apache 2.0 license.

Modifications Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

This code is being released solely for academic and scientific reproducibility purposes, in support of the methods and findings described in the associated publication. Pull requests are not being accepted in order to maintain the code exactly as it was used in the paper, but interested parties are encouraged to open an issue requesting open source community development.
