CUDA update policy and guide #74
Conversation
- Aligned decision policy provides transparent decision making;
- Incorporate decision points in the early release process to stagger CUDA migration from the RC release integration work.

## **Proposed Policy and Process**
Since this is an existing policy, in the past year when has this resulted in us deciding to not do a cuda upgrade?
Yes, this is the history of decisions we took over the past year: #74 (comment)
We supported CUDA 11.8, 12.1, 12.4, 12.6 and 12.8.
RFC-0039-cuda-support.md
Outdated
### Detailed Process of Introducing new CUDA version

1. Evaluate CUDA update necessity
Goal: When any of this is true:
Are there examples of CUDA version updates that do not meet these criteria?
The hard decisions are going to be when someone wants to argue that an upgrade does NOT meet the bar, so having real-world examples of those will help guide those future discussions.
Yes, I believe the history for the updates is here:
pytorch/pytorch#147383
pytorch/pytorch#145544
pytorch/pytorch#138609
pytorch/pytorch#134015
pytorch/pytorch#123456
1. Evaluate deprecation of legacy CUDA version from CI/CD
When: We completed CUDA update and previous experimental CUDA version is qualified to be stable and we have 3 supported versions (legacy, stable and experimental)
Goal: Make sure we support 2 versions of CUDA, supporting 3 versions can be an exception for certain release where we need to keep legacy version
> supporting 3 versions can be an exception for certain release where we need to keep legacy version
Can you share examples of when we have needed to keep a legacy version around in the past? Sounds like we might incur a significant CI cost to test 50% more CUDA versions
Yes, this is the current situation with CUDA 11.8: pytorch/pytorch#147383
Let's add this as an example here
Done
2. Deprecate legacy CUDA version from CI/CD
When: Evaluate deprecation of legacy CUDA version from CI/CD is complete
Goal: Support for legacy CUDA versions is dropped, starting from PyTorch Domain Libraries and then in PyTorch core. First we drop CD support and then CI support.
> First we drop CD support and then CI support.
Is the order important?
Yes, dropping CD support is easy to do. Dropping CI support, on the other hand, most likely means migrating those CI jobs to the next CUDA version and may take somewhat longer.
RFC-0039-cuda-support.md
Outdated
As soon as we introduce a new Experimental Version we should consider moving the previous Experimental Version to Stable, and decommission the previous Stable version. Typically we want to support 2-3 versions of CUDA as follows:

- Optional Legacy Version: If we need to have 1 version for backend compatibility or to work around the current limitation. For example: CUDA older driver is incompatible with newer CUDA version
- Stable Version: This is a stable CUDA version that is used most of the time. This is the version we want to upload to PyPI. Please note that if the stable version is equal or slightly older to the version then the version fbcode is using, we probably need to keep it in CI/CD system as Optional Legacy Version.
I don't see why we should talk about fbcode here?
If anything, we can have partners ask for a temporary legacy version, as above.
RFC-0039-cuda-support.md
Outdated
- Optional Legacy Version: If we need to have 1 version for backend compatibility or to work around the current limitation. For example: CUDA older driver is incompatible with newer CUDA version
- Stable Version: This is a stable CUDA version that is used most of the time. This is the version we want to upload to PyPI. Please note that if the stable version is equal or slightly older to the version then the version fbcode is using, we probably need to keep it in CI/CD system as Optional Legacy Version.
- Latest Experimental Version: This is the latest version of CUDA that we want to support. Minimal requirement is to have it available in nightly releases (CD)
minimal requirement for what?
RFC-0039-cuda-support.md
Outdated
### We would deprecate version of CUDA when

As soon as we introduce a new Experimental Version we should consider moving the previous Experimental Version to Stable, and decommission the previous Stable version. Typically we want to support 2-3 versions of CUDA as follows:
I would say 2 versions with an optional exception to make it clear that 3 is not the expected number.
RFC-0039-cuda-support.md
Outdated
### Detailed Process of Introducing new CUDA version

1. Evaluate CUDA update necessity
nit: this is duplicated with the policy above, you can just point to it
RFC-0039-cuda-support.md
Outdated
- It significantly reduces binary/memory footprint

2. Evaluate if we have all packages for update
When: As soon as Update determined to be necessary. Start by creating RFC [issue](https://github.com/pytorch/pytorch/issues/145544) with possible CUDA matrix to support for next release.
Suggested change:
- When: As soon as Update determined to be necessary. Start by creating RFC [issue](https://github.com/pytorch/pytorch/issues/145544) with possible CUDA matrix to support for next release.
+ When: As soon as Update determined to be necessary. Start by creating RFC issue (see [example](https://github.com/pytorch/pytorch/issues/145544)) with possible CUDA matrix to support for next release.
RFC-0039-cuda-support.md
Outdated
@@ -0,0 +1,197 @@

# [CUDA version support]
I think there is some ambiguity below on what it means to "support" a given version of CUDA.
In particular, make it clear that this is always about full releases, unless specified otherwise.
I would also add that we are defining PyTorch binary support. Source builds are, most of the time, supported at CUDA release time.
This completes CUDA and CUDNN upgrade. Congrats! PyTorch now has support for a new CUDA version and you made it happen!

## Upgrade CUDNN version only
Any broad policy we want to set for cudnn upgrade?
## **Motivation**

The proposal is to provide two main benefits
I am a bit curious why we didn't go for "predictability" as a motivation here?
Given that cuda versions are released at a relatively stable cadence, why wouldn't we want to align to that (with some delay) to enable partners to predict when these upgrades will happen?
(note that I'm not arguing we should do this, just want to hear the reasoning on your end)
Maybe @ptrblck can also chime in on this. I believe for CUDA version updates we would like to first evaluate whether the update is necessary and whether there are any major regressions in the new CUDA version; this is why we have a delay and an evaluation step. For example, we added support for CUDA 12.4, 12.6 and 12.8, skipping the 12.5 and 12.7 versions.
RFC-0039-cuda-support.md
Outdated
As soon as we introduce a new Experimental Version we should consider moving the previous Experimental Version to Stable, and decommission the previous Stable version. Typically we want to support 2-3 versions of CUDA as follows:

- Optional Legacy Version: If we need to have 1 version for backend compatibility or to work around the current limitation. For example: CUDA older driver is incompatible with newer CUDA version
> For example: CUDA older driver is incompatible with newer CUDA version
This is always the case for major updates in the CUDA toolkit. We kept CUDA 11.8 alive for the entire lifetime of CUDA 12.x as an unknown number of users supposedly could not update their NVIDIA driver in 2+ years (I don't know how many users). I would suggest keeping the "legacy" version as static as possible (i.e. no cuDNN, NCCL, etc. updates) to avoid mixing the legacy stack with latest libs and to reduce the churn in keeping the stable stack up-to-date.
RFC-0039-cuda-support.md
Outdated
Goal: Support for legacy CUDA versions is dropped, starting from PyTorch Domain Libraries and then in PyTorch core. First we drop CD support and then CI support.

# CUDA/CUDNN Upgrade Runbook
NIT: cuDNN
| CUDA | CUDNN | architectures supported | additional details |
| --- | --- | --- | --- |
| 11.8.0 | 9.1.0.70 | Kepler(3.7), Maxwell(5.0), Pascal(6.0), Volta(7.0), Turing(7.5), Ampere(8.0, 8.6), Hopper(9.0) | Legacy CUDA Release |
| 12.6.3 | 9.5.1.17 | Maxwell(5.0), Pascal(6.0), Volta(7.0), Turing(7.5), Ampere(8.0, 8.6), Hopper(9.0) | Stable CUDA Release |
| 12.8.0 | 9.7.1.26 | Turing(7.5), Ampere(8.0, 8.6), Hopper(9.0), Blackwell(10.0, 12.0+PTX) | Latest CUDA Release |
| | 9.8.0.87 | Turing(7.5), Ampere(8.0, 8.6), Hopper(9.0), Blackwell(10.0, 12.0+PTX) | Latest CUDA Nightly |
We should add this matrix to the install matrix on https://pytorch.org/get-started/locally/
Let's add it to RELEASE.md as well.
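For quick cross-checking against this matrix, the versions a given PyTorch wheel was actually built with can be queried from Python. A minimal sketch using public torch APIs (the printed values are illustrative and depend on the installed build):

```python
# Minimal sketch: report the CUDA / cuDNN versions and GPU architectures the
# installed PyTorch binary was built with, for comparison with the matrix above.
import torch

print("CUDA built with :", torch.version.cuda)              # e.g. "12.8", None on CPU-only builds
print("cuDNN built with:", torch.backends.cudnn.version())  # e.g. 90701, None if cuDNN is unavailable
print("architectures   :", torch.cuda.get_arch_list())      # e.g. ['sm_75', ..., 'sm_90']
```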
RFC-0039-cuda-support.md
Outdated
https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux_sbsa.run (aarch64)
https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/

2) CUDA is available on Docker hub images: https://hub.docker.com/r/nvidia/cuda
Do we really need this? I thought we moved to plain Linux containers and are using our install_cuda.sh script?
I believe this is still used; however, it should be deprecated soon once we migrate focal->jammy in our CI/CD: https://github.com/pytorch/pytorch/blob/main/.ci/docker/build.sh#L73
4. After merging the CI PR, Please open temporary issues for new builds and tests marking them unstable, example [issue](https://github.com/pytorch/pytorch/issues/127104). These issues should be closed after few days of opening, when newly added CI jobs are constantly green.

## 9. Update Linux Nvidia driver used during runner provisioning
If linux driver update is required. The driver should be updated during the runner provisioning otherwise nightly workflows will fail with multiple Nova workflows.
NVIDIA driver updates should not be required during the lifetime of a major CUDA release, as minor version compatibility will be used. If we add driver API calls, we should use cudaGetDriverEntryPoint, making sure this API is available on the driver we are using. Keeping an older driver on the CI nodes will also give us a better signal about what users can expect. Updating the driver is of course still valid, but might not be necessary in each CUDA toolkit update.
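To make the minor version compatibility point concrete, here is a rough sketch (my own illustration, not part of the runbook) that compares the CUDA version supported by the installed NVIDIA driver with the CUDA toolkit the PyTorch wheel was built against. It assumes Linux with libcuda.so.1 on the loader path and a CUDA-enabled wheel:

```python
# Rough sketch (assumes Linux, libcuda.so.1 available, CUDA-enabled wheel):
# with minor version compatibility, the driver only needs to support the same
# *major* CUDA version as the toolkit the wheel was built with.
import ctypes
import torch

libcuda = ctypes.CDLL("libcuda.so.1")                    # NVIDIA driver library
version = ctypes.c_int(0)
rc = libcuda.cuDriverGetVersion(ctypes.byref(version))   # e.g. 12040 -> CUDA 12.4
assert rc == 0, f"cuDriverGetVersion failed with CUresult {rc}"

driver_major = version.value // 1000
driver_minor = (version.value % 1000) // 10
built_major = int(torch.version.cuda.split(".")[0])      # e.g. "12.6" -> 12

print(f"driver supports CUDA {driver_major}.{driver_minor}, "
      f"wheel built with CUDA {torch.version.cuda}")
if driver_major == built_major:
    print("same major version: minor version compatibility applies, "
          "no driver update needed for a minor toolkit bump")
else:
    print("major version differs: a driver update is likely required")
```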
This PR introduces the CUDA update policy followed by the PyTorch team and consolidates the CUDA update guide from the pytorch/builder repo:
https://github.com/pytorch/builder/blob/main/CUDA_UPGRADE_GUIDE.MD
Related RFCs for CUDA update:
pytorch/pytorch#147383
pytorch/pytorch#145544
pytorch/pytorch#138609
pytorch/pytorch#134015
pytorch/pytorch#123456