
DeepSpeed Sparse Attention

You can also train with Microsoft DeepSpeed's Sparse Attention, using any combination of dense and sparse attention that you'd like. However, you will have to endure the installation process.

If everything installed correctly, you now have access to a few new features:

Sparse Attention (CUDA 10.1 Only)

from dalle_pytorch import DALLE

# The remaining required constructor arguments (e.g. the trained VAE and the
# text token settings) are omitted here for brevity.
dalle = DALLE(
    dim = 512,
    depth = 64,
    heads = 8,
    attn_types = ('full', 'sparse')  # interleave sparse and dense attention for the 64 layers
)
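
The attn_types tuple is repeated across the depth of the network, so the two entries above alternate over all 64 layers. A quick way to see which attention type each layer ends up with (an illustrative sketch, not the library's internal code):

from itertools import cycle, islice

depth = 64
attn_types = ('full', 'sparse')

# Layer 1 -> 'full', layer 2 -> 'sparse', layer 3 -> 'full', and so on.
per_layer = list(islice(cycle(attn_types), depth))
print(per_layer[:4])  # ['full', 'sparse', 'full', 'sparse']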

Distributed Training

Train on multiple GPUs at once

You should now run all training sessions with deepspeed instead of python if you wish to make use of its distributed features:

deepspeed train_dalle.py <...> --distributed_backend deepspeed

Train with 16-bit floating point (fp16):

deepspeed train_dalle.py <...> --distributed_backend deepspeed --fp16

Train with only 1 GPU:

deepspeed --num_gpus 1 train_dalle.py <...> --distributed_backend deepspeed

Modifying DeepSpeed behavior:

Change the deepspeed_config dictionary in train_dalle.py or train_vae.py to adjust DeepSpeed based on your setup. If you are interested in ZeRO-enabled training, see the Train with ZeRO section below.

FP16

To use 16-bit floating point, simply pass --fp16 to train_dalle.py (not available for train_vae.py):

deepspeed train_dalle.py --image_text_folder=/path/to/your/dataset --distributed_backend deepspeed --fp16

Train with ZeRO

ZeRO stages 1-3 have been confirmed to work for us on V100, A100, and RTX 3090 GPUs. ZeRO currently only works with half-precision training, so you have to pass the --fp16 flag when activating it:

Stage 1

deepspeed_config = {
    "zero_optimization": {
        "stage": 1,  # ZeRO stage 1 partitions the optimizer states across GPUs
    },
    'train_batch_size': BATCH_SIZE,       # global batch size across all GPUs
    'gradient_clipping': GRAD_CLIP_NORM,  # clip gradients to this norm
    'fp16': {
        'enabled': args.fp16,             # half precision; required for ZeRO
    },
}
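
For orientation, a dictionary like the one above is typically handed to DeepSpeed through deepspeed.initialize, roughly as sketched below. The training scripts already take care of this step for you, so this is not something you need to add yourself:

import deepspeed

# Sketch only: how a config dictionary is consumed by DeepSpeed.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=dalle,                          # the model built earlier
    model_parameters=dalle.parameters(),
    config_params=deepspeed_config,       # the dictionary shown above
)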

Stage 2

Stage 2 additionally partitions the gradients across GPUs and, together with gradient accumulation, lets you fill up the VRAM of each GPU more effectively (a sketch with an explicit accumulation setting follows the configuration block below). You may also optionally enable cpu_offload at this point in order to use the CPU-based Adam optimizer which DeepSpeed provides.

deepspeed_config = {
    "zero_optimization": {
        "stage": 2,            # ZeRO stage 2 also partitions the gradients
        "cpu_offload": True,   # optional: use DeepSpeed's CPU-based Adam
    },
    [...]
}
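
If you want to set the accumulation explicitly, DeepSpeed's standard gradient_accumulation_steps entry can be added next to the other top-level keys. A sketch, where the value 2 is only an example to tune for your setup (DeepSpeed derives the per-GPU micro-batch size from the global batch size, the accumulation steps, and the number of GPUs):

deepspeed_config = {
    "zero_optimization": {
        "stage": 2,
        "cpu_offload": True,
    },
    'train_batch_size': BATCH_SIZE,        # global batch size across all GPUs
    'gradient_accumulation_steps': 2,      # example value -- tune for your setup
    'gradient_clipping': GRAD_CLIP_NORM,
    'fp16': {
        'enabled': args.fp16,
    },
}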

Stage 3

deepspeed_config = {
    "zero_optimization": {
        "stage": 3,  # ZeRO stage 3 additionally partitions the model parameters
    },
    [...]
}
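
Since the three configurations above differ only in the zero_optimization section, one convenient pattern is to build that section from command-line arguments. In the sketch below, zero_stage and cpu_offload are hypothetical arguments used for illustration, not flags that train_dalle.py necessarily provides:

# zero_stage and cpu_offload are hypothetical arguments, shown for illustration only.
zero_section = {"stage": args.zero_stage}      # 1, 2 or 3
if args.zero_stage == 2 and args.cpu_offload:
    zero_section["cpu_offload"] = True         # CPU-based Adam, stage 2 only

deepspeed_config = {
    "zero_optimization": zero_section,
    'train_batch_size': BATCH_SIZE,
    'gradient_clipping': GRAD_CLIP_NORM,
    'fp16': {
        'enabled': args.fp16,
    },
}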