
Commit 8ed6823

[doc] Add hpu resource description in ray train related docs (#47241)
HPU resources are already supported in Ray, and there are many examples that guide users in using HPU devices with Ray, so this PR adds instructions for the HPU device to the Ray Train related documents.

---------

Signed-off-by: KepingYan <keping.yan@intel.com>
Signed-off-by: matthewdeng <matt@anyscale.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Parent commit: 38b67cf

File tree

7 files changed (+87, -17 lines)


doc/source/train/common/torch-configure-run.rst

Lines changed: 3 additions & 2 deletions
@@ -1,5 +1,5 @@
-Configure scale and GPUs
-------------------------
+Configure scale and resources
+-----------------------------
 
 Outside of your training function, create a :class:`~ray.train.ScalingConfig` object to configure:
 

@@ -11,6 +11,7 @@ Outside of your training function, create a :class:`~ray.train.ScalingConfig` ob
     from ray.train import ScalingConfig
     scaling_config = ScalingConfig(num_workers=2, use_gpu=True)
 
+3. (Optional) :class:`resources_per_worker <ray.train.ScalingConfig>` - The resources reserved for each worker. If you want to allocate more than one CPU or GPU per training worker, or if you need to specify other accelerators, set this attribute.
 
 For more details, see :ref:`train_scaling_config`.
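As context for the new optional bullet above, a minimal sketch of how ``resources_per_worker`` is typically set alongside the other ``ScalingConfig`` arguments; the worker count and resource amounts here are illustrative and not part of this patch:

    from ray.train import ScalingConfig

    # Illustrative values: 2 workers, each reserving 2 CPUs and 1 GPU.
    # Another accelerator can be requested through its own resource key,
    # for example resources_per_worker={"HPU": 1}, if the cluster exposes it.
    scaling_config = ScalingConfig(
        num_workers=2,
        use_gpu=True,
        resources_per_worker={"CPU": 2, "GPU": 1},
    )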

doc/source/train/examples/lightning/dolly_lightning_fsdp_finetuning.ipynb

Lines changed: 1 addition & 1 deletion
@@ -338,7 +338,7 @@
 "source": [
 "## Fine-tune with Ray TorchTrainer\n",
 "\n",
-"Ray TorchTrainer allows you to scale your PyTorch Lightning training workload over multiple nodes. See {ref}`Configuring Scale and GPUs <train_scaling_config>` for more details."
+"Ray TorchTrainer allows you to scale your PyTorch Lightning training workload over multiple nodes. See {ref}`Configuring Scale and Resources <train_scaling_config>` for more details."
 ]
 },
 {

doc/source/train/getting-started-pytorch-lightning.rst

Lines changed: 3 additions & 3 deletions
@@ -7,9 +7,9 @@ This tutorial walks through the process of converting an existing PyTorch Lightn
 
 Learn how to:
 
-1. Configure the Lightning Trainer so that it runs distributed with Ray and on the correct CPU or GPU device.
+1. Configure the Lightning Trainer so that it runs distributed with Ray and on the correct CPU, GPU, or other accelerator device.
 2. Configure :ref:`training function <train-overview-training-function>` to report metrics and save checkpoints.
-3. Configure :ref:`scaling <train-overview-scaling-config>` and CPU or GPU resource requirements for a training job.
+3. Configure :ref:`scaling <train-overview-scaling-config>` and CPU, GPU, or other accelerator resource requirements for a training job.
 4. Launch a distributed training job with a :class:`~ray.train.torch.TorchTrainer`.
 
 Quickstart
@@ -31,7 +31,7 @@ For reference, the final code is as follows:
     result = trainer.fit()
 
 1. `train_func` is the Python code that executes on each distributed training worker.
-2. :class:`~ray.train.ScalingConfig` defines the number of distributed training workers and whether to use GPUs.
+2. :class:`~ray.train.ScalingConfig` defines the number of distributed training workers and whether to use GPUs or other types of accelerators.
 3. :class:`~ray.train.torch.TorchTrainer` launches the distributed training job.
 
 Compare a PyTorch Lightning training script with and without Ray Train.
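The "report metrics and save checkpoints" step called out in this file's first hunk maps to ``ray.train.report``; a minimal sketch, with a placeholder metric value and an empty checkpoint directory standing in for real training state (not code from this patch):

    import tempfile

    import ray.train
    from ray.train import Checkpoint

    def train_func():
        # ... run a training epoch here ...
        metrics = {"loss": 0.1}  # placeholder value for illustration

        # Report metrics, and optionally a checkpoint, back to Ray Train.
        with tempfile.TemporaryDirectory() as tmpdir:
            # Real code would save model state into tmpdir before reporting,
            # e.g. torch.save(model.state_dict(), f"{tmpdir}/model.pt").
            ray.train.report(metrics, checkpoint=Checkpoint.from_directory(tmpdir))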

doc/source/train/getting-started-pytorch.rst

Lines changed: 4 additions & 4 deletions
@@ -7,10 +7,10 @@ This tutorial walks through the process of converting an existing PyTorch script
 
 Learn how to:
 
-1. Configure a model to run distributed and on the correct CPU/GPU device.
-2. Configure a dataloader to shard data across the :ref:`workers <train-overview-worker>` and place data on the correct CPU or GPU device.
+1. Configure a model to run distributed and on the correct CPU, GPU, or other accelerator device.
+2. Configure a dataloader to shard data across the :ref:`workers <train-overview-worker>` and place data on the correct CPU, GPU, or other accelerator device.
 3. Configure a :ref:`training function <train-overview-training-function>` to report metrics and save checkpoints.
-4. Configure :ref:`scaling <train-overview-scaling-config>` and CPU or GPU resource requirements for a training job.
+4. Configure :ref:`scaling <train-overview-scaling-config>` and CPU, GPU, or other accelerator resource requirements for a training job.
 5. Launch a distributed training job with a :class:`~ray.train.torch.TorchTrainer` class.
 
 Quickstart
@@ -33,7 +33,7 @@ For reference, the final code will look something like the following:
     result = trainer.fit()
 
 1. `train_func` is the Python code that executes on each distributed training worker.
-2. :class:`~ray.train.ScalingConfig` defines the number of distributed training workers and whether to use GPUs.
+2. :class:`~ray.train.ScalingConfig` defines the number of distributed training workers, and whether to use CPUs, GPUs, or other types of accelerator devices.
 3. :class:`~ray.train.torch.TorchTrainer` launches the distributed training job.
 
 Compare a PyTorch training script with and without Ray Train.
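Items 1 and 2 in this file's first hunk (placing the model and dataloader on the correct device and sharding data across workers) generally map to ``ray.train.torch.prepare_model`` and ``ray.train.torch.prepare_data_loader``; a rough sketch with a toy model and dataset, not taken from this patch:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    import ray.train.torch

    def train_func():
        # Toy model and dataset purely for illustration.
        model = torch.nn.Linear(4, 1)
        dataset = TensorDataset(torch.randn(32, 4), torch.randn(32, 1))
        loader = DataLoader(dataset, batch_size=8)

        # prepare_model moves the model to the device Ray assigned to this
        # worker and wraps it for distributed training when needed.
        model = ray.train.torch.prepare_model(model)

        # prepare_data_loader adds a DistributedSampler so each worker sees
        # its own shard, and moves batches to the same device.
        loader = ray.train.torch.prepare_data_loader(loader)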

doc/source/train/getting-started-transformers.rst

Lines changed: 3 additions & 4 deletions
@@ -7,8 +7,8 @@ This tutorial shows you how to convert an existing Hugging Face Transformers scr
 
 In this guide, learn how to:
 
-1. Configure a :ref:`training function <train-overview-training-function>` that properly reports metrics and saves checkpoints.
-2. Configure :ref:`scaling <train-overview-scaling-config>` and resource requirements for CPUs or GPUs for your distributed training job.
+1. Configure a :ref:`training function <train-overview-training-function>` that reports metrics and saves checkpoints.
+2. Configure :ref:`scaling <train-overview-scaling-config>` and resource requirements for CPUs, GPUs or other accelerators for your distributed training job.
 3. Launch a distributed training job with :class:`~ray.train.torch.TorchTrainer`.
 
 
@@ -21,7 +21,6 @@ Install the necessary packages before you begin:
 
     pip install "ray[train]" torch "transformers[torch]" datasets evaluate numpy scikit-learn
 
-
 Quickstart
 ----------
 

@@ -44,7 +43,7 @@ Here's a quick overview of the final code structure:
 The key components are:
 
 1. `train_func`: Python code that runs on each distributed training worker.
-2. :class:`~ray.train.ScalingConfig`: Defines the number of distributed training workers and GPU usage.
+2. :class:`~ray.train.ScalingConfig`: Defines the number of distributed training workers and their CPUs, GPUs, or other types of accelerator devices.
 3. :class:`~ray.train.torch.TorchTrainer`: Launches and manages the distributed training job.
 
 Code Comparison: Hugging Face Transformers vs. Ray Train Integration

doc/source/train/huggingface-accelerate.rst

Lines changed: 1 addition & 1 deletion
@@ -205,7 +205,7 @@ Next, see these end-to-end examples below for more details:
 
 You may also find these user guides helpful:
 
-- :ref:`Configuring Scale and GPUs <train_scaling_config>`
+- :ref:`Configuring Scale and Resources <train_scaling_config>`
 - :ref:`Configuration and Persistent Storage <train-run-config>`
 - :ref:`Saving and Loading Checkpoints <train-checkpointing>`
 - :ref:`How to use Ray Data with Ray Train <data-ingest-torch>`

doc/source/train/user-guides/using-gpus.rst

Lines changed: 72 additions & 2 deletions
@@ -1,7 +1,7 @@
 .. _train_scaling_config:
 
-Configuring Scale and GPUs
-==========================
+Configuring Scale and Resources
+===============================
 Increasing the scale of a Ray Train training run is simple and can be done in a few lines of code.
 The main interface for this is the :class:`~ray.train.ScalingConfig`,
 which configures the number of workers and the resources they should use.
@@ -176,6 +176,76 @@ in a :ref:`Ray runtime environment <runtime-environments>`:
 
     trainer = TorchTrainer(...)
 
+.. _using-other-accelerators:
+
+Using other accelerators
+------------------------
+
+Using HPUs
+~~~~~~~~~~
+
+To use HPUs, specify the HPU resources using the ``resources_per_worker`` parameter and pass it to the :class:`~ray.train.ScalingConfig`.
+In the example below, training will run on 8 HPUs (8 workers, each using one HPU).
+
+.. testcode::
+
+    from ray.train import ScalingConfig
+
+    scaling_config = ScalingConfig(
+        num_workers=8,
+        resources_per_worker={"HPU": 1}
+    )
+
+Using HPUs in the training function
+"""""""""""""""""""""""""""""""""""
+
+After you set the ``resources_per_worker`` attribute to specify the HPU resources for each worker, Ray Train can set up environment variables in your training function so that the HPUs can be detected and used.
+
+You can get the associated devices with :meth:`ray.train.torch.get_device`.
+
+.. testcode::
+
+    import torch
+    from ray.train import ScalingConfig
+    from ray.train.torch import TorchTrainer, get_device
+
+
+    def train_func():
+        device = get_device()
+        assert device == torch.device("hpu")
+
+    trainer = TorchTrainer(
+        train_func,
+        scaling_config=ScalingConfig(
+            num_workers=1,
+            resources_per_worker={"HPU": 1}
+        )
+    )
+    trainer.fit()
+
+(PyTorch) Setting the communication backend
+"""""""""""""""""""""""""""""""""""""""""""
+
+PyTorch natively supports several communication backends, such as MPI, Gloo, and NCCL. Distributed communication on Intel® Gaudi® AI accelerators can be enabled with the Habana Collective Communication Library (HCCL) backend. When using HPU resources, you can set HCCL as the communication backend by configuring a :class:`~ray.train.torch.TorchConfig` and passing it to the :class:`~ray.train.torch.TorchTrainer` as follows.
+
+.. testcode::
+    :hide:
+
+    num_training_workers = 1
+
+.. testcode::
+
+    from ray.train.torch import TorchConfig, TorchTrainer
+
+    trainer = TorchTrainer(
+        train_func,
+        scaling_config=ScalingConfig(
+            num_workers=num_training_workers,
+            resources_per_worker={"CPU": 1, "HPU": 1},
+        ),
+        torch_config=TorchConfig(backend="hccl"),
+    )
+
 Setting the resources per worker
 --------------------------------
 If you want to allocate more than one CPU or GPU per training worker, or if you
