DeepSpeed

You can also train with Microsoft DeepSpeed's Sparse Attention, with any combination of dense and sparse attention that you'd like. However, you will have to endure the installation process. You will need:
- llvm-9-dev
- cmake
- gcc
- python3.7.x
- cudatoolkit=10.1
- pytorch=1.6.*
```bash
sudo apt-get -y install llvm-9-dev cmake
git clone https://github.com/microsoft/DeepSpeed.git /tmp/Deepspeed
cd /tmp/Deepspeed && DS_BUILD_SPARSE_ATTN=1 ./install.sh -s # Change this to -r if you need to run as root
pip install triton
cd ~
```
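Optionally, you can sanity-check the build before setting up the training environment. The snippet below is only a quick import check and assumes the sparse attention ops are exposed under `deepspeed.ops.sparse_attention` (as in DeepSpeed releases from this period); it verifies that `triton` and the sparse attention module can be imported, not that the kernels will compile for your GPU.

```python
# Quick, optional import check -- module path assumed, adjust for your DeepSpeed version
import triton
import deepspeed
from deepspeed.ops.sparse_attention import SparseSelfAttention

print('DeepSpeed', deepspeed.__version__, '- sparse attention module importable')
```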
Then you may either use conda or pip:

- Conda

```bash
#!/bin/bash
conda create -n dalle_env python=3.7
conda activate dalle_env
conda install pytorch==1.6.0 torchvision==0.7.0 cudatoolkit=10.1 -c pytorch
pip install "git+https://github.com/lucidrains/DALLE-pytorch.git"
```

- Pip
```bash
#!/bin/bash
python -m pip install virtualenv
python -m virtualenv -p python3.7 ~/.virtualenvs/dalle_env
source ~/.virtualenvs/dalle_env/bin/activate
# Make sure your terminal shows that you're inside the virtual environment - and then run:
pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
pip install "git+https://github.com/lucidrains/DALLE-pytorch.git"
```

If everything installed correctly, you now have access to a few new features:
```python
from dalle_pytorch import DALLE

dalle = DALLE(
    dim = 512,
    depth = 64,
    heads = 8,
    # vae and the other usual DALLE arguments are omitted here for brevity
    attn_types = ('full', 'sparse')  # interleave sparse and dense attention for the 64 layers
)
```

You should now run all training sessions with `deepspeed` instead of `python` if you wish to make use of its distributed features:
```bash
deepspeed train_dalle.py <...> --distributed_backend deepspeed
deepspeed train_dalle.py <...> --distributed_backend deepspeed --fp16
```
ZeRO stages 1-3 have been confirmed to work (for us) when using V100, A100, and RTX 3090 GPUs.

To use 16-bit floating point (fp16), simply pass `--fp16` to `train_dalle.py`:

```bash
deepspeed train_dalle.py --image_text_folder=/path/to/your/dataset --distributed_backend deepspeed --fp16
```
Stage 2 will try to use gradient accumulation in order to fill up the VRAM of each GPU more effectively. You may also optionally enable `cpu_offload` at this point in order to use the CPU-based Adam optimizer that DeepSpeed provides, as in the `deepspeed_config` below:
```python
deepspeed_config = {
    'zero_optimization': {
        'stage': 2,
        'cpu_offload': True,
    },
    'train_batch_size': BATCH_SIZE,
    'gradient_clipping': GRAD_CLIP_NORM,
    'fp16': {
        'enabled': args.fp16,
    },
}
```
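For reference, a config dictionary like this is what ultimately reaches `deepspeed.initialize`, which wraps the model in a DeepSpeed engine. The sketch below is illustrative only, not a copy of `train_dalle.py` (which routes this through its own distributed-backend helpers): the `optimizer` and `gradient_accumulation_steps` entries are standard DeepSpeed config keys added here as examples, and `model`, `args`, `dataloader`, `BATCH_SIZE`, `GRAD_CLIP_NORM` and `LEARNING_RATE` are assumed to already be defined.

```python
import deepspeed

# Illustrative sketch -- not the exact wiring used by train_dalle.py.
deepspeed_config = {
    'zero_optimization': {
        'stage': 2,
        'cpu_offload': True,  # lets DeepSpeed substitute its CPU-based Adam
    },
    'optimizer': {'type': 'Adam', 'params': {'lr': LEARNING_RATE}},
    'gradient_accumulation_steps': 2,  # example value
    'train_batch_size': BATCH_SIZE,
    'gradient_clipping': GRAD_CLIP_NORM,
    'fp16': {'enabled': args.fp16},
}

# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler);
# the engine handles ZeRO partitioning, gradient accumulation and fp16 scaling.
engine, _, _, _ = deepspeed.initialize(
    args = args,
    model = model,
    model_parameters = model.parameters(),
    config_params = deepspeed_config,
)

for text, images, mask in dataloader:
    loss = engine(text, images, mask, return_loss = True)
    engine.backward(loss)  # replaces loss.backward()
    engine.step()          # replaces optimizer.step() and zero_grad()
```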