feat(torch): Update PyTorch and CUDA versions #82

Merged 8 commits on Oct 14, 2024
18 changes: 4 additions & 14 deletions .github/configurations/torch-base.yml
@@ -1,16 +1,6 @@
-cuda: [ 12.4.1, 12.3.2, 12.2.2, 12.0.1, 11.8.0 ]
+cuda: [ 12.6.1, 12.4.1, 12.2.2 ]
 os: [ ubuntu22.04, ubuntu20.04 ]
-exclude:
-  # Not a supported combination
-  - cuda: 11.8.0
-    os: ubuntu22.04
-  - cuda: 11.8.0
-    os: ubuntu20.04
-  - cuda: 12.0.1
-    os: ubuntu20.04
-  - cuda: 12.0.1
-    os: ubuntu22.04
 include:
-  - torch: 2.4.0
-    vision: 0.19.0
-    audio: 2.4.0
+  - torch: 2.4.1
+    vision: 0.19.1
+    audio: 2.4.1
55 changes: 20 additions & 35 deletions .github/configurations/torch-nccl.yml
@@ -1,52 +1,37 @@
 image:
   # Ubuntu 22.04
-  - cuda: 12.4.1
+  - cuda: 12.6.1
     cudnn: cudnn
     os: ubuntu22.04
-    nccl: 2.21.5-1
-    nccl-tests-hash: 85f9143
-  - cuda: 12.3.2
-    cudnn: cudnn9
+    nccl: 2.23.4-1
+    nccl-tests-hash: 2ff05b2
+  - cuda: 12.4.1
+    cudnn: cudnn
     os: ubuntu22.04
-    nccl: 2.20.3-1
-    nccl-tests-hash: 85f9143
+    nccl: 2.23.4-1
+    nccl-tests-hash: 2ff05b2
   - cuda: 12.2.2
     cudnn: cudnn8
     os: ubuntu22.04
-    nccl: 2.19.3-1
-    nccl-tests-hash: 85f9143
-  # - cuda: 12.0.1
-  # cudnn: cudnn8
-  # os: ubuntu22.04
-  # nccl: 2.18.5-1
-  # nccl-tests-hash: 85f9143
+    nccl: 2.23.4-1
+    nccl-tests-hash: 2ff05b2
   # Ubuntu 20.04
-  - cuda: 12.4.1
+  - cuda: 12.6.1
     cudnn: cudnn
     os: ubuntu20.04
-    nccl: 2.21.5-1
-    nccl-tests-hash: 85f9143
-  - cuda: 12.3.2
-    cudnn: cudnn9
+    nccl: 2.23.4-1
+    nccl-tests-hash: 2ff05b2
+  - cuda: 12.4.1
+    cudnn: cudnn
     os: ubuntu20.04
-    nccl: 2.20.3-1
-    nccl-tests-hash: 85f9143
+    nccl: 2.23.4-1
+    nccl-tests-hash: 2ff05b2
   - cuda: 12.2.2
     cudnn: cudnn8
     os: ubuntu20.04
     nccl: 2.21.5-1
-    nccl-tests-hash: 85f9143
-  # - cuda: 12.0.1
-  # cudnn: cudnn8
-  # os: ubuntu20.04
-  # nccl: 2.19.3-1
-  # nccl-tests-hash: 85f9143
-  # - cuda: 11.8.0
-  # cudnn: cudnn8
-  # os: ubuntu20.04
-  # nccl: 2.16.5-1
-  # nccl-tests-hash: 868dc3d
+    nccl-tests-hash: 2ff05b2
 include:
-  - torch: 2.4.0
-    vision: 0.19.0
-    audio: 2.4.0
+  - torch: 2.4.1
+    vision: 0.19.1
+    audio: 2.4.1
15 changes: 11 additions & 4 deletions torch-extras/Dockerfile
@@ -3,7 +3,7 @@
 ARG BASE_IMAGE
 ARG DEEPSPEED_VERSION="0.14.4"
 ARG APEX_COMMIT="23c1f86520e22b505e8fdfcf6298273dff2d93d8"
-ARG XFORMERS_VERSION="0.0.27.post2"
+ARG XFORMERS_VERSION="0.0.28.post1"
 
 FROM alpine/git:2.36.3 as apex-downloader
 WORKDIR /git
@@ -25,8 +25,7 @@ RUN export \
       CUDA_MINOR_VERSION=$(echo $CUDA_VERSION | cut -d. -f2) && \
     export \
       CUDA_PACKAGE_VERSION="${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" && \
-    #apt-get install -y --no-install-recommends \
-    apt-get install -y --no-install-recommends \
+    apt-get install -y --no-install-recommends --no-upgrade \
       cuda-nvcc-${CUDA_PACKAGE_VERSION} \
       cuda-nvml-dev-${CUDA_PACKAGE_VERSION} \
       libcurand-dev-${CUDA_PACKAGE_VERSION} \
@@ -36,11 +35,19 @@ RUN export \
       cuda-nvprof-${CUDA_PACKAGE_VERSION} \
       cuda-profiler-api-${CUDA_PACKAGE_VERSION} \
       cuda-nvtx-${CUDA_PACKAGE_VERSION} \
-      cuda-nvrtc-dev-${CUDA_PACKAGE_VERSION} \
+      cuda-nvrtc-dev-${CUDA_PACKAGE_VERSION} && \
+    apt-get -qq update && \
+    apt-get install -y --no-install-recommends \
       libaio-dev \
       ninja-build && \
     apt-get clean
 
+# Install the cuDNN dev package for building Apex
+# The cuDNN runtime is installed in the base torch image
+COPY --chmod=755 install_cudnn.sh /tmp/install_cudnn.sh
+RUN /tmp/install_cudnn.sh "${CUDA_VERSION}" dev && \
+    rm /tmp/install_cudnn.sh
+
 # Add Kitware's apt repository to get a newer version of CMake
 RUN apt-get -qq update && apt-get -qq install -y \
     software-properties-common lsb-release && \
41 changes: 41 additions & 0 deletions torch-extras/install_cudnn.sh
@@ -0,0 +1,41 @@
#!/bin/sh

CUDA_VERSION="$1";
if [ -z "$CUDA_VERSION" ]; then
exit 14;
fi;

INSTALL_DEV="$2";
if [ "$INSTALL_DEV" = "dev" ]; then
echo "Ensuring installation of cuDNN (dev)";
DEV_SUFFIX="-dev";
DEV_PREFIX="";
elif [ "$INSTALL_DEV" = "runtime" ]; then
echo "Ensuring installation of cuDNN (runtime)";
DEV_SUFFIX="";
DEV_PREFIX="lib";
else
exit 15;
fi;

CHECK_VERSION() {
dpkg-query --status "$1" 2>/dev/null \
| sed -ne 's/Version: //p' \
| grep .;
}

CUDA_MAJOR_VERSION=$(echo "$CUDA_VERSION" | cut -d. -f1);
LIBCUDNN_VER="$(
CHECK_VERSION "libcudnn8${DEV_SUFFIX}" || \
CHECK_VERSION "libcudnn9${DEV_SUFFIX}-cuda-${CUDA_MAJOR_VERSION}" || \
:;
)" || exit 16;

if [ -z "$LIBCUDNN_VER" ]; then
apt-get -qq update && \
apt-get -qq install --no-upgrade -y "${DEV_PREFIX}cudnn9-cuda-${CUDA_MAJOR_VERSION}" && \
apt-get clean && \
ldconfig;
else
echo "Found cuDNN version ${LIBCUDNN_VER}"
fi;
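
The name-building part of install_cudnn.sh can be exercised without touching apt. The following is a minimal sketch, not part of the PR: `cudnn_package_name` is a hypothetical helper that copies only the argument validation and the package-name construction from the script above, and it assumes no cuDNN 8 or cuDNN 9 package is already installed (the case in which the script actually calls `apt-get install`).

```shell
#!/bin/sh
# Hypothetical helper (not in the PR): prints the apt package name that
# install_cudnn.sh would request when no cuDNN package is found.
cudnn_package_name() {
    cuda_version="$1"
    mode="$2"
    # Same exit codes as the script: 14 for a missing CUDA version,
    # 15 for an unknown mode.
    [ -n "$cuda_version" ] || return 14
    case "$mode" in
        dev)     prefix="" ;;
        runtime) prefix="lib" ;;
        *)       return 15 ;;
    esac
    # Only the major version ends up in the package name.
    cuda_major=$(echo "$cuda_version" | cut -d. -f1)
    echo "${prefix}cudnn9-cuda-${cuda_major}"
}

cudnn_package_name 12.6.1 dev      # prints cudnn9-cuda-12
cudnn_package_name 12.6.1 runtime  # prints libcudnn9-cuda-12
```

Note that the two modes differ only by the `lib` prefix on the requested package, while the `-dev` suffix is used solely when probing for an existing installation.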
39 changes: 19 additions & 20 deletions torch/Dockerfile
@@ -1,10 +1,10 @@
 # syntax=docker/dockerfile:1.4
-ARG BUILDER_BASE_IMAGE="nvidia/cuda:12.0.1-devel-ubuntu22.04"
-ARG FINAL_BASE_IMAGE="nvidia/cuda:12.0.1-base-ubuntu22.04"
+ARG BUILDER_BASE_IMAGE="nvidia/cuda:12.4.1-devel-ubuntu22.04"
+ARG FINAL_BASE_IMAGE="nvidia/cuda:12.4.1-base-ubuntu22.04"
 
-ARG BUILD_TORCH_VERSION="2.4.0"
-ARG BUILD_TORCH_VISION_VERSION="0.19.0"
-ARG BUILD_TORCH_AUDIO_VERSION="2.4.0"
+ARG BUILD_TORCH_VERSION="2.4.1"
+ARG BUILD_TORCH_VISION_VERSION="0.19.1"
+ARG BUILD_TORCH_AUDIO_VERSION="2.4.1"
 ARG BUILD_TRANSFORMERENGINE_VERSION="458c7de038ed34bdaf471ced4e3162a28055def7"
 ARG BUILD_FLASH_ATTN_VERSION="2.6.3"
 ARG BUILD_TRITON_VERSION=""
@@ -59,6 +59,11 @@ RUN ./clone.sh pytorch/audio audio "${BUILD_TORCH_AUDIO_VERSION}"
 # The torchaudio build requires that this directory remain a full git repository,
 # so no rm -rf audio/.git is done for this one.
 
+# torchaudio is broken for CUDA 12.5+ without this patch (as of v2.4.1)
+# See https://github.com/pytorch/audio/pull/3811
+COPY torchaudio-cu125-pr3811.patch /git/patch
+RUN git -C audio apply --index /git/patch && rm /git/patch
+
 FROM downloader-base as transformerengine-downloader
 ARG BUILD_TRANSFORMERENGINE_VERSION
 RUN ./clone.sh NVIDIA/TransformerEngine TransformerEngine "${BUILD_TRANSFORMERENGINE_VERSION}"
@@ -119,28 +124,19 @@ RUN apt-get -qq update && apt-get -qq install -y \
     ln -s libomp.so.5 /usr/lib/x86_64-linux-gnu/libomp.so && \
     ldconfig
 
+COPY --link --chmod=755 install_cudnn.sh /tmp/install_cudnn.sh
+
 RUN export \
-      CUDA_MAJOR_VERSION=$(echo $CUDA_VERSION | cut -d. -f1) \
-      CUDA_MINOR_VERSION=$(echo $CUDA_VERSION | cut -d. -f2) && \
+      CUDA_MAJOR_VERSION=$(echo "$CUDA_VERSION" | cut -d. -f1) \
+      CUDA_MINOR_VERSION=$(echo "$CUDA_VERSION" | cut -d. -f2) && \
     export \
       CUDA_PACKAGE_VERSION="${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" && \
-    CHECK_VERSION() { \
-      dpkg-query --status "$1" 2>/dev/null \
-      | sed -ne 's/Version: //p' \
-      | grep .; \
-    } && \
-    LIBCUDNN_VER="$( \
-      CHECK_VERSION libcudnn8-dev || \
-      CHECK_VERSION "libcudnn9-dev-cuda-${CUDA_MAJOR_VERSION}" || \
-      :; \
-    )" && \
     apt-get -qq update && \
     apt-get -qq install --no-upgrade -y \
       cuda-nvtx-${CUDA_PACKAGE_VERSION} \
       cuda-nvrtc-dev-${CUDA_PACKAGE_VERSION} && \
-    if [ -z "$LIBCUDNN_VER" ]; then \
-      apt-get -qq install --no-upgrade -y "cudnn9-cuda-${CUDA_MAJOR_VERSION}"; \
-    fi && \
+    /tmp/install_cudnn.sh "${CUDA_VERSION}" dev && \
+    rm /tmp/install_cudnn.sh && \
     apt-get clean
 
 RUN mkdir /tmp/ccache-install && \
@@ -510,6 +506,7 @@ ENV TORCH_VISION_VERSION=$BUILD_TORCH_VISION_VERSION
 ENV TORCH_AUDIO_VERSION=$BUILD_TORCH_AUDIO_VERSION
 ENV TORCH_CUDA_ARCH_LIST=$BUILD_TORCH_CUDA_ARCH_LIST
 
+COPY --link --chmod=755 install_cudnn.sh /tmp/install_cudnn.sh
 # - libnvjitlink-X-Y only exists for CUDA versions >= 12-0.
 # - Don't mess with libnccl2 when using nccl-tests as a base,
 # checked via the existence of the directory "/opt/nccl-tests".
@@ -534,6 +531,8 @@ RUN export \
     { if [ ! -d /opt/nccl-tests ]; then \
       export NCCL_PACKAGE_VERSION="2.*+cuda${CUDA_MAJOR_VERSION}.${CUDA_MINOR_VERSION}" && \
       apt-get -qq install --no-upgrade -y "libnccl2=$NCCL_PACKAGE_VERSION"; fi; } && \
+    /tmp/install_cudnn.sh "$CUDA_VERSION" runtime && \
+    rm /tmp/install_cudnn.sh && \
     apt-get clean && \
     ldconfig
 
41 changes: 41 additions & 0 deletions torch/install_cudnn.sh
@@ -0,0 +1,41 @@
#!/bin/sh

CUDA_VERSION="$1";
if [ -z "$CUDA_VERSION" ]; then
exit 14;
fi;

INSTALL_DEV="$2";
if [ "$INSTALL_DEV" = "dev" ]; then
echo "Ensuring installation of cuDNN (dev)";
DEV_SUFFIX="-dev";
DEV_PREFIX="";
elif [ "$INSTALL_DEV" = "runtime" ]; then
echo "Ensuring installation of cuDNN (runtime)";
DEV_SUFFIX="";
DEV_PREFIX="lib";
else
exit 15;
fi;

CHECK_VERSION() {
dpkg-query --status "$1" 2>/dev/null \
| sed -ne 's/Version: //p' \
| grep .;
}

CUDA_MAJOR_VERSION=$(echo "$CUDA_VERSION" | cut -d. -f1);
LIBCUDNN_VER="$(
CHECK_VERSION "libcudnn8${DEV_SUFFIX}" || \
CHECK_VERSION "libcudnn9${DEV_SUFFIX}-cuda-${CUDA_MAJOR_VERSION}" || \
:;
)" || exit 16;

if [ -z "$LIBCUDNN_VER" ]; then
apt-get -qq update && \
apt-get -qq install --no-upgrade -y "${DEV_PREFIX}cudnn9-cuda-${CUDA_MAJOR_VERSION}" && \
apt-get clean && \
ldconfig;
else
echo "Found cuDNN version ${LIBCUDNN_VER}"
fi;
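
The `CHECK_VERSION` function relies on `grep .` to turn "no version line found" into a non-zero exit status, which is what lets the `||` chain in `LIBCUDNN_VER` fall through from the cuDNN 8 probe to the cuDNN 9 probe. A minimal sketch of that behaviour follows; `extract_version` is a hypothetical stand-in that reads status text from stdin instead of calling `dpkg-query`, and the sample package name and version are made up.

```shell
#!/bin/sh
# Hypothetical stand-in for CHECK_VERSION: the same sed/grep pipeline,
# fed from stdin rather than from `dpkg-query --status`.
extract_version() {
    # sed prints only the text after "Version: "; `grep .` succeeds
    # only if at least one non-empty line came through.
    sed -ne 's/Version: //p' | grep .
}

# Made-up status text for an installed package: prints 1.2.3-1, exit status 0.
printf 'Package: example\nStatus: install ok installed\nVersion: 1.2.3-1\n' \
    | extract_version

# Empty input, as when dpkg-query fails and its stderr is discarded:
# nothing is printed and the exit status is non-zero, so the fallback runs.
printf '' | extract_version || echo "not installed"
```

Because the final alternative in the script is `:`, the command substitution as a whole still succeeds when neither probe matches, leaving `LIBCUDNN_VER` empty and triggering the install branch.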
24 changes: 24 additions & 0 deletions torch/torchaudio-cu125-pr3811.patch
@@ -0,0 +1,24 @@
From 7797f83e1d66ff78872763e1da3a5fb2f0534c40 Mon Sep 17 00:00:00 2001
From: Markus Hennerbichler <markushennerbichler@gmail.com>
Date: Mon, 15 Jul 2024 14:07:13 +0100
Subject: [PATCH] Fix CUDA 12.5 build

CUDA 12.5 removed the FLT_MAX symbol.
This was previously used without being explicitly imported.
FLT_MAX is defined in <float.h>, including this header fixes the issue
---
src/libtorchaudio/cuctc/src/ctc_prefix_decoder_kernel_v2.cu | 1 +
1 file changed, 1 insertion(+)

diff --git a/src/libtorchaudio/cuctc/src/ctc_prefix_decoder_kernel_v2.cu b/src/libtorchaudio/cuctc/src/ctc_prefix_decoder_kernel_v2.cu
index 4ca8f1bf24..e6192155a2 100644
--- a/src/libtorchaudio/cuctc/src/ctc_prefix_decoder_kernel_v2.cu
+++ b/src/libtorchaudio/cuctc/src/ctc_prefix_decoder_kernel_v2.cu
@@ -24,6 +24,7 @@
// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#include <algorithm>
+#include <float.h>
#include "ctc_fast_divmod.cuh"
#include "cub/cub.cuh"
#include "device_data_wrap.h"
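
The Dockerfile applies this patch with `git apply --index`, which updates both the working tree and the index of the cloned `audio` repository. The sketch below shows the same mechanism in a throwaway repository, assuming `git` and `mktemp` are available; the function name `apply_patch_demo`, the file `kernel.cu`, and the generated `fix.patch` are all made up for the demonstration and are unrelated to the real torchaudio sources.

```shell
#!/bin/sh
# Sketch only: every path and file name below is invented for the demo.
apply_patch_demo() {
    workdir=$(mktemp -d) || return 1
    (
        cd "$workdir" &&
        git init -q repo &&
        cd repo &&
        printf '#include <algorithm>\n#include "kernel.h"\n' > kernel.cu &&
        git add kernel.cu &&
        git -c user.name=demo -c user.email=demo@example.com \
            commit -q -m "initial" &&
        # Produce a one-line patch by editing, diffing, and reverting.
        printf '#include <algorithm>\n#include <float.h>\n#include "kernel.h"\n' > kernel.cu &&
        git diff > ../fix.patch &&
        git checkout -q -- kernel.cu &&
        # --check reports whether the patch applies without changing anything.
        git apply --check ../fix.patch &&
        # --index applies the patch to the working tree and stages it.
        git apply --index ../fix.patch &&
        # Count the added include in the staged diff.
        git diff --cached | grep -c '^+#include <float.h>'
    )
    status=$?
    rm -rf "$workdir"
    return "$status"
}

apply_patch_demo   # prints 1
```

Running `git apply --check` first is optional; it is included here only to show a way of failing early, before any file is modified, if a patch no longer matches the checked-out revision.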