Commit 84a6566: Add docs for training PyTorch models.
(3 files changed: 215 additions, 0 deletions)

Lines changed: 118 additions & 0 deletions
Train PyTorch Models
********************

.. versionadded:: 2.8.8

The :py:class:`~ads.jobs.PyTorchDistributedRuntime` is designed for training PyTorch models, including large language models (LLMs), with multiple GPUs across multiple nodes. If your training code is compatible with `torchrun <https://pytorch.org/docs/stable/elastic/run.html>`_, `DeepSpeed <https://www.deepspeed.ai/>`_, or `Accelerate <https://huggingface.co/docs/accelerate/index>`_, you can run it using OCI Data Science Jobs with zero code changes. For multi-node training, ADS launches multiple job runs, each corresponding to one node.

See `Distributed Data Parallel in PyTorch <https://pytorch.org/tutorials/beginner/ddp_series_intro.html>`_ for a series of tutorials on PyTorch distributed training.

.. admonition:: Prerequisite
   :class: note

   You need ``oracle-ads>=2.8.8`` to create a job with :py:class:`~ads.jobs.PyTorchDistributedRuntime`.

   You also need to specify a conda environment with ``pytorch>=1.10`` and ``oracle-ads>=2.6.8`` for the job. See :ref:`Conda Environment <conda_environment>` for details on specifying the conda environment for a job.

   We recommend using the ``pytorch20_p39_gpu_v1`` service conda environment and adding additional packages as needed.

   You need to specify a subnet ID and allow ingress traffic within the subnet.

Torchrun Example
================

Here is an example of training a GPT model using source code taken directly from the official PyTorch Examples GitHub repository. See the `Training "Real-World" Models with DDP <https://pytorch.org/tutorials/intermediate/ddp_series_minGPT.html>`_ tutorial for a walkthrough of the source code.

.. include:: ../jobs/tabs/pytorch_ddp_torchrun.rst

.. include:: ../jobs/tabs/run_job.rst

Source Code
===========

The source code location can be specified as a Git repository, a local path, or a remote URI supported by
`fsspec <https://filesystem-spec.readthedocs.io/en/latest/>`_.

You can use the :py:meth:`~ads.jobs.PyTorchDistributedRuntime.with_git` method to specify the source code ``url`` of a Git repository. You can optionally specify the ``branch`` or ``commit`` for checking out the source code.

For a public repository, we recommend the ``http://`` or ``https://`` URL.
Authentication may be required for the SSH URL even if the repository is public.

To use a private repository, you must first save an SSH key to
`OCI Vault <https://docs.oracle.com/en-us/iaas/Content/KeyManagement/Concepts/keyoverview.htm>`_ as a secret,
and provide the ``secret_ocid`` when calling :py:meth:`~ads.jobs.GitPythonRuntime.with_source`.
For more information about creating and using secrets,
see `Managing Secret with Vault <https://docs.oracle.com/en-us/iaas/Content/KeyManagement/Tasks/managingsecrets.htm>`_.
For a repository on GitHub, you can set up a
`GitHub Deploy Key <https://docs.github.com/en/developers/overview/managing-deploy-keys#deploy-keys>`_ as the secret.

.. admonition:: Git Version for Private Repository
   :class: note

   Git version 2.3 or later is required to use a private repository.

Alternatively, you can use the :py:meth:`~ads.jobs.PyTorchDistributedRuntime.with_source` method to specify the source code as a local path or a remote URI supported by
`fsspec <https://filesystem-spec.readthedocs.io/en/latest/>`_.
For example, you can specify files on OCI object storage using a URI like
``oci://bucket@namespace/path/to/prefix``. ADS will use the authentication method configured by
:py:meth:`ads.set_auth()` to fetch the files and upload them as the job artifact. The source code can be a single file, a compressed file/archive (zip/tar), or a folder.

Working Directory
=================

The default working directory depends on how the source code is specified:

* When the source code is specified as a Git repository URL, the default working directory is the root of the Git repository.
* When the source code is a single file (script), the default working directory is the directory containing the file.
* When the source code is specified as a local or remote directory, the default working directory is the directory containing the source code directory.

The working directory of your workload can be configured with :py:meth:`~ads.jobs.PyTorchDistributedRuntime.with_working_dir`. See :ref:`Python Runtime Working Directory <runtime_working_dir>` for more details.

Input Data
==========

You can specify the input (training) data for the job using the :py:meth:`~ads.jobs.PyTorchDistributedRuntime.with_inputs` method, which takes a dictionary mapping each "source" to a "destination". The "source" can be an OCI object storage URI, or an HTTP or FTP URL. The "destination" is the local path in the job run. If the "destination" is specified as a relative path, it is relative to the working directory.

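For reference, the corresponding ``inputs`` section in the YAML runtime specification maps each source to a destination. The following is a sketch; the bucket, namespace, and file names are placeholders:

.. code-block:: yaml

   inputs:
     # OCI object storage source, copied to a path relative to the working directory
     "oci://bucket_name@namespace/path/to/train.csv": "data/train.csv"
     # HTTP source
     "https://example.com/dataset/input.txt": "data/input.txt"
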
Outputs
=======

You can specify the output data to be copied to object storage using the :py:meth:`~ads.jobs.PyTorchDistributedRuntime.with_output` method.
It allows you to specify the output path (``output_path``)
in the job run and a remote URI (``output_uri``).
Files in the ``output_path`` are copied to the remote output URI after the job run finishes successfully.
Note that the ``output_path`` should be a path relative to the working directory.

An OCI object storage location can be specified in the format ``oci://bucket_name@namespace/path/to/dir``.
Please make sure you configure the IAM policy to allow the job run dynamic group to use object storage.

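As a sketch, an IAM policy statement granting a dynamic group of job runs access to objects in a bucket might look like the following (the dynamic group, compartment, and bucket names are placeholders to replace with your own):

.. code-block:: text

   allow dynamic-group job-run-dynamic-group to manage objects in compartment my-compartment where target.bucket.name='bucket_name'
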
Number of nodes
===============

The :py:meth:`~ads.jobs.PyTorchDistributedRuntime.with_replica` method helps you specify the number of nodes for the training job.

Command
=======

The command to start your workload is specified using the :py:meth:`~ads.jobs.PyTorchDistributedRuntime.with_command` method.

For ``torchrun``, ADS sets ``--nnodes``, ``--nproc_per_node``, ``--rdzv_backend``, and ``--rdzv_endpoint`` automatically. You do not need to specify them in the command unless you would like to override the values. The default ``rdzv_backend`` is ``c10d``. The default port for ``rdzv_endpoint`` is 29400.

If your workload uses DeepSpeed, you also need to set ``use_deepspeed`` to ``True`` when specifying the command. For DeepSpeed, ADS generates the hostfile automatically and sets up the SSH configuration.

For ``accelerate launch``, you can add your config YAML to the source code and specify it using the ``--config_file`` argument. In your config, please use ``LOCAL_MACHINE`` as the compute environment. The same config file is used by all nodes in a multi-node workload. ADS sets ``--num_processes``, ``--num_machines``, ``--machine_rank``, ``--main_process_ip``, and ``--main_process_port`` automatically. For these arguments, ADS overrides the values from your config YAML. If you would like to use your own values, specify them as command arguments. The default ``main_process_port`` is 29400.

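For reference, a minimal ``accelerate`` config YAML for a multi-GPU workload might look like the following. This is only a sketch: the remaining options depend on your workload, and ADS overrides the machine count, rank, and endpoint values as described above:

.. code-block:: yaml

   compute_environment: LOCAL_MACHINE
   distributed_type: MULTI_GPU
   mixed_precision: "no"
   num_machines: 2           # overridden by ADS
   num_processes: 2          # overridden by ADS
   machine_rank: 0           # overridden by ADS
   main_process_port: 29400  # overridden by ADS
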
Additional dependencies
=======================

The :py:meth:`~ads.jobs.PyTorchDistributedRuntime.with_dependency` method helps you specify additional dependencies to be installed into the conda environment before starting your workload:

* ``pip_req`` specifies the path of a ``requirements.txt`` file in your source code.
* ``pip_pkg`` specifies the packages to be installed, as a string.

Python Paths
============

The working directory is added to the Python paths automatically.
You can call :py:meth:`~ads.jobs.PyTorchDistributedRuntime.with_python_path` to add additional Python paths as needed.
The paths should be relative to the working directory.

Lines changed: 79 additions & 0 deletions
.. tabs::

   .. code-tab:: python
      :caption: Python

      from ads.jobs import Job, DataScienceJob, PyTorchDistributedRuntime

      job = (
          Job(name="PyTorch DDP Job")
          .with_infrastructure(
              DataScienceJob()
              # Configure logging for getting the job run outputs.
              .with_log_group_id("<log_group_ocid>")
              # Log resource will be auto-generated if log ID is not specified.
              .with_log_id("<log_ocid>")
              # If you are in an OCI data science notebook session,
              # the following configurations are not required.
              # Configurations from the notebook session will be used as defaults.
              .with_compartment_id("<compartment_ocid>")
              .with_project_id("<project_ocid>")
              .with_subnet_id("<subnet_ocid>")
              .with_shape_name("VM.GPU.A10.1")
              # Minimum/Default block storage size is 50 (GB).
              .with_block_storage_size(50)
          )
          .with_runtime(
              PyTorchDistributedRuntime()
              # Specify the service conda environment by slug name.
              .with_service_conda("pytorch20_p39_gpu_v1")
              .with_git(
                  url="https://github.com/pytorch/examples.git",
                  commit="d91085d2181bf6342ac7dafbeee6fc0a1f64dcec",
              )
              .with_dependency(pip_req="distributed/minGPT-ddp/requirements.txt")
              .with_inputs({
                  "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt": "data/input.txt"
              })
              .with_output("data", "oci://bucket_name@namespace/path/to/dir")
              .with_command("torchrun distributed/minGPT-ddp/mingpt/main.py data_config.path=data/input.txt trainer_config.snapshot_path=data/snapshot.pt")
              .with_replica(2)
          )
      )

   .. code-tab:: yaml
      :caption: YAML

      kind: job
      apiVersion: v1.0
      spec:
        name: PyTorch-MinGPT
        infrastructure:
          kind: infrastructure
          spec:
            blockStorageSize: 50
            compartmentId: "{{ compartment_id }}"
            logGroupId: "{{ log_group_id }}"
            logId: "{{ log_id }}"
            projectId: "{{ project_id }}"
            subnetId: "{{ subnet_id }}"
            shapeName: VM.GPU.A10.1
          type: dataScienceJob
        runtime:
          kind: runtime
          type: pyTorchDistributed
          spec:
            replicas: 2
            conda:
              type: service
              slug: pytorch20_p39_gpu_v1
            dependencies:
              pipRequirements: distributed/minGPT-ddp/requirements.txt
            inputs:
              "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt": "data/input.txt"
            outputDir: data
            outputUri: oci://bucket_name@namespace/path/to/dir
            git:
              url: https://github.com/pytorch/examples.git
              commit: d91085d2181bf6342ac7dafbeee6fc0a1f64dcec
            command: >-
              torchrun distributed/minGPT-ddp/mingpt/main.py
              data_config.path=data/input.txt
              trainer_config.snapshot_path=data/snapshot.pt
Lines changed: 18 additions & 0 deletions
To create and start running the job:

.. tabs::

   .. code-tab:: python
      :caption: Python

      # Create the job on OCI Data Science
      job.create()
      # Start a job run
      run = job.run()
      # Stream the job run outputs (from the first node)
      run.watch()

   .. code-tab:: bash
      :caption: YAML

      ads opctl run -f your_job.yaml
