feat(torch): Updates & Stability Fixes #67

Merged
merged 20 commits into from
May 14, 2024
Conversation

Eta0
Collaborator

@Eta0 Eta0 commented May 13, 2024

Updates for PyTorch, Apex, DeepSpeed, Flash Attention, and nccl-tests Base Images, Fix for Triton in torch-nightly, and Caching Improvements

What a mouthful of a title!

Updates

This change contains the following library version updates:

  • torch: 2.2.2 → 2.3.0
  • vision: 0.17.2 → 0.18.0
  • audio: 2.2.2 → 2.3.0
  • apex: 2386a91 → a7de60e
  • deepspeed: 0.12.6 → 0.14.2
  • flash_attn: 2.4.2 → 2.5.8
  • numpy: The latest release of numpy is installed alongside all torch builds. Previously, only the python3-numpy package from the Ubuntu distribution in use was included in the final torch images.

Additionally, the torch:nccl build with Ubuntu 20.04 × CUDA 12.2.2 now uses an updated base image featuring NCCL v2.21.5-1.

Stability

  • Steps that were liable to intermittent failure due to network errors during the torch build are now retried after a random delay instead of cancelling the build run.
  • Approximately 1/3 of the available CPUs are used for concurrent compilation jobs during torch builds, instead of all CPUs.
  • Available disk space is reported prior to compiling torch, to better catch storage issues during CI.
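The retry and CPU-limiting behaviour described above could be sketched roughly as follows (function and variable names are illustrative assumptions, not the repo's actual build scripts; `MAX_JOBS` is the environment variable PyTorch's build system reads for compile concurrency):

```shell
# Hypothetical sketch: retry a flaky network-dependent step after a
# random delay instead of failing the whole build run.
retry() {
  local attempts=5 delay i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    delay=$((RANDOM % 30 + 5))  # wait a random 5-34 seconds before retrying
    echo "Attempt $i failed; retrying in ${delay}s..." >&2
    sleep "$delay"
  done
  return 1
}

# Use roughly one third of the available CPUs for concurrent
# compilation jobs, rather than all of them (rounding up to at least 1).
MAX_JOBS=$(( ($(nproc) + 2) / 3 ))
export MAX_JOBS

# Report available disk space up front to catch storage issues early in CI.
df -h .
```

Throttling compile concurrency like this trades some build speed for avoiding out-of-memory and I/O contention failures on shared CI runners.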

Fixes

  • A version of triton compiled from source is now included with torch-nightly builds, since custom triton source commits are used during PyTorch development that are not necessarily compatible with the distributed versions available through package managers.
  • The newest update of deepspeed included in torch-extras has ahead-of-time compilation of its CCL_COMM op disabled, so that it can compile without oneCCL installed.
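DeepSpeed controls ahead-of-time op compilation through `DS_BUILD_*` environment variables, so disabling one op while pre-building the rest might look like the sketch below (the exact invocation used in this repo is an assumption):

```shell
# Illustrative sketch: pre-build DeepSpeed ops ahead of time, but skip
# the oneCCL comm op so the build succeeds without oneCCL installed.
export DS_BUILD_OPS=1        # compile ops ahead of time rather than JIT
export DS_BUILD_CCL_COMM=0   # ...except the oneCCL-backed comm op
# pip install --no-binary deepspeed deepspeed==0.14.2
echo "DS_BUILD_CCL_COMM=${DS_BUILD_CCL_COMM}"
```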

Performance

  • The main ml-containers build workflow now accepts an optional cache-key argument for more granular caching.
    • The cache-key and image-name parameters are used together to form a complete cache key for a given image.
  • Different torch/torch-extras/nightly-torch/nightly-torch-extras image flavours (corresponding to their base images) now use different cache keys.
    • This leads to a greatly improved cache hit rate, as previously, dozens of completely incompatible image builds were competing for the same cache spots.
    • Improved cache hit rates at build time should correspond to better image pull times when newer images can more often share perfectly identical layers with older images.
  • The ccache cache size was increased to 5 GiB.
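The composite caching scheme above can be sketched as follows (the variable names and key format are illustrative assumptions; the point is that the per-flavour `cache-key` input is combined with the image name, so incompatible flavours no longer collide on one key):

```shell
# Hypothetical sketch of composing a complete cache key per image flavour.
IMAGE_NAME="torch"
CACHE_KEY="base-ubuntu2204-cuda12.2"   # per-flavour cache-key input
FULL_CACHE_KEY="${IMAGE_NAME}-${CACHE_KEY}"
echo "$FULL_CACHE_KEY"   # → torch-base-ubuntu2204-cuda12.2
```

With one key per flavour, builds of the same flavour reuse each other's layers instead of evicting unrelated flavours from the shared cache.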

@Eta0 Eta0 added the enhancement New feature or request label May 13, 2024
@Eta0 Eta0 requested a review from wbrown May 13, 2024 23:18
@Eta0 Eta0 self-assigned this May 13, 2024
@Eta0 Eta0 added the bug Something isn't working label May 13, 2024
Collaborator

@wbrown wbrown left a comment


Nice job. I especially like the stability improvements to the build process.

@wbrown wbrown merged commit 696032c into main May 14, 2024
67 of 108 checks passed
@wbrown wbrown deleted the es/torch-2.3 branch May 14, 2024 14:27