
Commit 8ed6823

[doc] Add hpu resource description in ray train related docs (#47241)
HPU resources are already supported in Ray, and there are many examples that guide users in using HPU devices with Ray, so this PR adds instructions for the HPU device to the Ray Train related documents.

---------

Signed-off-by: KepingYan <keping.yan@intel.com>
Signed-off-by: matthewdeng <matt@anyscale.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Parent commit: 38b67cf

File tree

7 files changed (+87, -17 lines)


doc/source/train/common/torch-configure-run.rst

Lines changed: 3 additions & 2 deletions
@@ -1,5 +1,5 @@
-Configure scale and GPUs
-------------------------
+Configure scale and resources
+-----------------------------
 
 Outside of your training function, create a :class:`~ray.train.ScalingConfig` object to configure:
 

@@ -11,6 +11,7 @@ Outside of your training function, create a :class:`~ray.train.ScalingConfig` ob
     from ray.train import ScalingConfig
     scaling_config = ScalingConfig(num_workers=2, use_gpu=True)
 
+3. (Optional) :class:`resources_per_worker <ray.train.ScalingConfig>` - The resources reserved for each worker. If you want to allocate more than one CPU or GPU per training worker, or if you need to specify other accelerators, set this attribute.
 
 For more details, see :ref:`train_scaling_config`.
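As context for the new optional bullet above, a minimal sketch of how ``resources_per_worker`` is typically set alongside the other ``ScalingConfig`` arguments; the worker count and resource amounts here are illustrative and not part of this patch:

    from ray.train import ScalingConfig

    # Illustrative values: 2 workers, each reserving 2 CPUs and 1 GPU.
    # Another accelerator can be requested through its own resource key,
    # for example resources_per_worker={"HPU": 1}, if the cluster exposes it.
    scaling_config = ScalingConfig(
        num_workers=2,
        use_gpu=True,
        resources_per_worker={"CPU": 2, "GPU": 1},
    )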

doc/source/train/examples/lightning/dolly_lightning_fsdp_finetuning.ipynb

Lines changed: 1 addition & 1 deletion
@@ -338,7 +338,7 @@
 "source": [
 "## Fine-tune with Ray TorchTrainer\n",
 "\n",
-"Ray TorchTrainer allows you to scale your PyTorch Lightning training workload over multiple nodes. See {ref}`Configuring Scale and GPUs <train_scaling_config>` for more details."
+"Ray TorchTrainer allows you to scale your PyTorch Lightning training workload over multiple nodes. See {ref}`Configuring Scale and Resources <train_scaling_config>` for more details."
 ]
 },
 {

doc/source/train/getting-started-pytorch-lightning.rst

Lines changed: 3 additions & 3 deletions
@@ -7,9 +7,9 @@ This tutorial walks through the process of converting an existing PyTorch Lightn
 
 Learn how to:
 
-1. Configure the Lightning Trainer so that it runs distributed with Ray and on the correct CPU or GPU device.
+1. Configure the Lightning Trainer so that it runs distributed with Ray and on the correct CPU, GPU, or other accelerator device.
 2. Configure :ref:`training function <train-overview-training-function>` to report metrics and save checkpoints.
-3. Configure :ref:`scaling <train-overview-scaling-config>` and CPU or GPU resource requirements for a training job.
+3. Configure :ref:`scaling <train-overview-scaling-config>` and CPU, GPU, or other accelerator resource requirements for a training job.
 4. Launch a distributed training job with a :class:`~ray.train.torch.TorchTrainer`.
 
 Quickstart
@@ -31,7 +31,7 @@ For reference, the final code is as follows:
     result = trainer.fit()
 
 1. `train_func` is the Python code that executes on each distributed training worker.
-2. :class:`~ray.train.ScalingConfig` defines the number of distributed training workers and whether to use GPUs.
+2. :class:`~ray.train.ScalingConfig` defines the number of distributed training workers and whether to use GPUs or other types of accelerators.
 3. :class:`~ray.train.torch.TorchTrainer` launches the distributed training job.
 
 Compare a PyTorch Lightning training script with and without Ray Train.
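The "report metrics and save checkpoints" step called out in this file's first hunk maps to ``ray.train.report``; a minimal sketch, with a placeholder metric value and an empty checkpoint directory standing in for real training state (not code from this patch):

    import tempfile

    import ray.train
    from ray.train import Checkpoint

    def train_func():
        # ... run a training epoch here ...
        metrics = {"loss": 0.1}  # placeholder value for illustration

        # Report metrics, and optionally a checkpoint, back to Ray Train.
        with tempfile.TemporaryDirectory() as tmpdir:
            # Real code would save model state into tmpdir before reporting,
            # e.g. torch.save(model.state_dict(), f"{tmpdir}/model.pt").
            ray.train.report(metrics, checkpoint=Checkpoint.from_directory(tmpdir))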

doc/source/train/getting-started-pytorch.rst

Lines changed: 4 additions & 4 deletions
@@ -7,10 +7,10 @@ This tutorial walks through the process of converting an existing PyTorch script
 
 Learn how to:
 
-1. Configure a model to run distributed and on the correct CPU/GPU device.
-2. Configure a dataloader to shard data across the :ref:`workers <train-overview-worker>` and place data on the correct CPU or GPU device.
+1. Configure a model to run distributed and on the correct CPU, GPU, or other accelerator device.
+2. Configure a dataloader to shard data across the :ref:`workers <train-overview-worker>` and place data on the correct CPU, GPU, or other accelerator device.
 3. Configure a :ref:`training function <train-overview-training-function>` to report metrics and save checkpoints.
-4. Configure :ref:`scaling <train-overview-scaling-config>` and CPU or GPU resource requirements for a training job.
+4. Configure :ref:`scaling <train-overview-scaling-config>` and CPU, GPU, or other accelerator resource requirements for a training job.
 5. Launch a distributed training job with a :class:`~ray.train.torch.TorchTrainer` class.
 
 Quickstart
@@ -33,7 +33,7 @@ For reference, the final code will look something like the following:
     result = trainer.fit()
 
 1. `train_func` is the Python code that executes on each distributed training worker.
-2. :class:`~ray.train.ScalingConfig` defines the number of distributed training workers and whether to use GPUs.
+2. :class:`~ray.train.ScalingConfig` defines the number of distributed training workers, and whether to use CPUs, GPUs, or other types of accelerator devices.
 3. :class:`~ray.train.torch.TorchTrainer` launches the distributed training job.
 
 Compare a PyTorch training script with and without Ray Train.
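Items 1 and 2 in this file's first hunk (placing the model and dataloader on the correct device and sharding data across workers) generally map to ``ray.train.torch.prepare_model`` and ``ray.train.torch.prepare_data_loader``; a rough sketch with a toy model and dataset, not taken from this patch:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    import ray.train.torch

    def train_func():
        # Toy model and dataset purely for illustration.
        model = torch.nn.Linear(4, 1)
        dataset = TensorDataset(torch.randn(32, 4), torch.randn(32, 1))
        loader = DataLoader(dataset, batch_size=8)

        # prepare_model moves the model to the device Ray assigned to this
        # worker and wraps it for distributed training when needed.
        model = ray.train.torch.prepare_model(model)

        # prepare_data_loader adds a DistributedSampler so each worker sees
        # its own shard, and moves batches to the same device.
        loader = ray.train.torch.prepare_data_loader(loader)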

doc/source/train/getting-started-transformers.rst

Lines changed: 3 additions & 4 deletions
@@ -7,8 +7,8 @@ This tutorial shows you how to convert an existing Hugging Face Transformers scr
 
 In this guide, learn how to:
 
-1. Configure a :ref:`training function <train-overview-training-function>` that properly reports metrics and saves checkpoints.
-2. Configure :ref:`scaling <train-overview-scaling-config>` and resource requirements for CPUs or GPUs for your distributed training job.
+1. Configure a :ref:`training function <train-overview-training-function>` that reports metrics and saves checkpoints.
+2. Configure :ref:`scaling <train-overview-scaling-config>` and resource requirements for CPUs, GPUs or other accelerators for your distributed training job.
 3. Launch a distributed training job with :class:`~ray.train.torch.TorchTrainer`.
 
 
@@ -21,7 +21,6 @@ Install the necessary packages before you begin:
 
     pip install "ray[train]" torch "transformers[torch]" datasets evaluate numpy scikit-learn
 
-
 Quickstart
 ----------
 

@@ -44,7 +43,7 @@ Here's a quick overview of the final code structure:
 The key components are:
 
 1. `train_func`: Python code that runs on each distributed training worker.
-2. :class:`~ray.train.ScalingConfig`: Defines the number of distributed training workers and GPU usage.
+2. :class:`~ray.train.ScalingConfig`: Defines the number of distributed training workers and their CPUs, GPUs, or other types of accelerator devices.
 3. :class:`~ray.train.torch.TorchTrainer`: Launches and manages the distributed training job.
 
 Code Comparison: Hugging Face Transformers vs. Ray Train Integration

doc/source/train/huggingface-accelerate.rst

Lines changed: 1 addition & 1 deletion
@@ -205,7 +205,7 @@ Next, see these end-to-end examples below for more details:
 
 You may also find these user guides helpful:
 
-- :ref:`Configuring Scale and GPUs <train_scaling_config>`
+- :ref:`Configuring Scale and Resources <train_scaling_config>`
 - :ref:`Configuration and Persistent Storage <train-run-config>`
 - :ref:`Saving and Loading Checkpoints <train-checkpointing>`
 - :ref:`How to use Ray Data with Ray Train <data-ingest-torch>`

doc/source/train/user-guides/using-gpus.rst

Lines changed: 72 additions & 2 deletions
@@ -1,7 +1,7 @@
 .. _train_scaling_config:
 
-Configuring Scale and GPUs
-==========================
+Configuring Scale and Resources
+===============================
 Increasing the scale of a Ray Train training run is simple and can be done in a few lines of code.
 The main interface for this is the :class:`~ray.train.ScalingConfig`,
 which configures the number of workers and the resources they should use.
@@ -176,6 +176,76 @@ in a :ref:`Ray runtime environment <runtime-environments>`:
 
     trainer = TorchTrainer(...)
 
+.. _using-other-accelerators:
+
+Using other accelerators
+------------------------
+
+Using HPUs
+~~~~~~~~~~
+
+To use HPUs, specify the HPU resources using the ``resources_per_worker`` parameter and pass it to the :class:`~ray.train.ScalingConfig`.
+In the example below, training will run on 8 HPUs (8 workers, each using one HPU).
+
+.. testcode::
+
+    from ray.train import ScalingConfig
+
+    scaling_config = ScalingConfig(
+        num_workers=8,
+        resources_per_worker={"HPU": 1}
+    )
+
+Using HPUs in the training function
+"""""""""""""""""""""""""""""""""""
+
+After you set the ``resources_per_worker`` attribute to specify the HPU resources for each worker, Ray Train can set up environment variables in your training function so that the HPUs can be detected and used.
+
+You can get the associated devices with :meth:`ray.train.torch.get_device`.
+
+.. testcode::
+
+    import torch
+    from ray.train import ScalingConfig
+    from ray.train.torch import TorchTrainer, get_device
+
+
+    def train_func():
+        device = get_device()
+        assert device == torch.device("hpu")
+
+    trainer = TorchTrainer(
+        train_func,
+        scaling_config=ScalingConfig(
+            num_workers=1,
+            resources_per_worker={"HPU": 1}
+        )
+    )
+    trainer.fit()
+
+(PyTorch) Setting the communication backend
+"""""""""""""""""""""""""""""""""""""""""""
+
+PyTorch natively supports several communication backends, such as MPI, Gloo, and NCCL. Distributed communication on Intel® Gaudi® AI accelerators can be enabled with the Habana Collective Communication Library (HCCL) backend. When using HPU resources, you can set HCCL as the communication backend by configuring a :class:`~ray.train.torch.TorchConfig` and passing it to the :class:`~ray.train.torch.TorchTrainer` as follows.
+
+.. testcode::
+    :hide:
+
+    num_training_workers = 1
+
+.. testcode::
+
+    from ray.train.torch import TorchConfig, TorchTrainer
+
+    trainer = TorchTrainer(
+        train_func,
+        scaling_config=ScalingConfig(
+            num_workers=num_training_workers,
+            resources_per_worker={"CPU": 1, "HPU": 1},
+        ),
+        torch_config=TorchConfig(backend="hccl"),
+    )
+
 Setting the resources per worker
 --------------------------------
 If you want to allocate more than one CPU or GPU per training worker, or if you
