feat(torch): Updates & Stability Fixes #67

Merged
merged 20 commits into from
May 14, 2024
Conversation

Eta0
Collaborator

@Eta0 Eta0 commented May 13, 2024

Updates for PyTorch, Apex, DeepSpeed, Flash Attention, and nccl-tests Base Images, Fix for Triton in torch-nightly, and Caching Improvements

What a mouthful of a title!

Updates

This change contains the following library version updates:

  • torch: 2.2.2 → 2.3.0
  • vision: 0.17.2 → 0.18.0
  • audio: 2.2.2 → 2.3.0
  • apex: 2386a91 → a7de60e
  • deepspeed: 0.12.6 → 0.14.2
  • flash_attn: 2.4.2 → 2.5.8
  • numpy: The latest release of numpy is installed alongside all torch builds. Previously, only the python3-numpy package from the Ubuntu distribution in use was included in the final torch images.

Additionally, the torch:nccl build with Ubuntu 20.04 × CUDA 12.2.2 now uses an updated base image featuring NCCL v2.21.5-1.

Stability

  • Steps that were liable to intermittent failure due to network errors during the torch build are now retried after a random delay instead of cancelling the build run.
  • Approximately 1/3 of the available CPUs are used for concurrent compilation jobs during torch builds, instead of all CPUs.
  • Available disk space is reported prior to compiling torch, to better catch storage issues during CI.
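The retry and CPU-limiting behaviour described above could be sketched roughly as follows (function and variable names are illustrative assumptions, not the repo's actual build scripts; `MAX_JOBS` is the environment variable PyTorch's build system reads for compile concurrency):

```shell
# Hypothetical sketch: retry a flaky network-dependent step after a
# random delay instead of failing the whole build run.
retry() {
  local attempts=5 delay i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    delay=$((RANDOM % 30 + 5))  # wait a random 5-34 seconds before retrying
    echo "Attempt $i failed; retrying in ${delay}s..." >&2
    sleep "$delay"
  done
  return 1
}

# Use roughly one third of the available CPUs for concurrent
# compilation jobs, rather than all of them (rounding up to at least 1).
MAX_JOBS=$(( ($(nproc) + 2) / 3 ))
export MAX_JOBS

# Report available disk space up front to catch storage issues early in CI.
df -h .
```

Throttling compile concurrency like this trades some build speed for avoiding out-of-memory and I/O contention failures on shared CI runners.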

Fixes

  • A version of triton compiled from source is now included with torch-nightly builds, since custom triton source commits are used during PyTorch development that are not necessarily compatible with the distributed versions available through package managers.
  • The newest update of deepspeed included in torch-extras has ahead-of-time compilation of its CCL_COMM op disabled, so that it can compile without oneCCL installed.
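DeepSpeed controls ahead-of-time op compilation through `DS_BUILD_*` environment variables, so disabling one op while pre-building the rest might look like the sketch below (the exact invocation used in this repo is an assumption):

```shell
# Illustrative sketch: pre-build DeepSpeed ops ahead of time, but skip
# the oneCCL comm op so the build succeeds without oneCCL installed.
export DS_BUILD_OPS=1        # compile ops ahead of time rather than JIT
export DS_BUILD_CCL_COMM=0   # ...except the oneCCL-backed comm op
# pip install --no-binary deepspeed deepspeed==0.14.2
echo "DS_BUILD_CCL_COMM=${DS_BUILD_CCL_COMM}"
```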

Performance

  • The main ml-containers build workflow now accepts an optional cache-key argument for more granular caching.
    • The cache-key and image-name parameters are used together to form a complete cache key for a given image.
  • Different torch/torch-extras/nightly-torch/nightly-torch-extras image flavours (corresponding to their base images) now use different cache keys.
    • This leads to a greatly improved cache hit rate, as previously, dozens of completely incompatible image builds were competing for the same cache spots.
    • Improved cache hit rates at build time should correspond to better image pull times when newer images can more often share perfectly identical layers with older images.
  • The ccache cache size was increased to 5 GiB.
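The composite caching scheme above can be sketched as follows (the variable names and key format are illustrative assumptions; the point is that the per-flavour `cache-key` input is combined with the image name, so incompatible flavours no longer collide on one key):

```shell
# Hypothetical sketch of composing a complete cache key per image flavour.
IMAGE_NAME="torch"
CACHE_KEY="base-ubuntu2204-cuda12.2"   # per-flavour cache-key input
FULL_CACHE_KEY="${IMAGE_NAME}-${CACHE_KEY}"
echo "$FULL_CACHE_KEY"   # → torch-base-ubuntu2204-cuda12.2
```

With one key per flavour, builds of the same flavour reuse each other's layers instead of evicting unrelated flavours from the shared cache.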

@Eta0 Eta0 added the enhancement New feature or request label May 13, 2024
@Eta0 Eta0 requested a review from wbrown May 13, 2024 23:18
@Eta0 Eta0 self-assigned this May 13, 2024
@Eta0 Eta0 added the bug Something isn't working label May 13, 2024
Collaborator

@wbrown wbrown left a comment


Nice job. I especially like the stability improvements to the build process.

@wbrown wbrown merged commit 696032c into main May 14, 2024
67 of 108 checks passed
@wbrown wbrown deleted the es/torch-2.3 branch May 14, 2024 14:27