
Commit 6fbe4e4

Documentation for Training LLMs with PyTorch DDP (#299)
2 parents 1660770 + 1004ab9 commit 6fbe4e4

File tree: 17 files changed, +425 −27 lines

ads/jobs/builders/runtimes/pytorch_runtime.py

Lines changed: 2 additions & 1 deletion
@@ -205,7 +205,8 @@ def run(self, dsc_job, **kwargs):
             if not envs:
                 envs = {}
             # Huggingface accelerate requires machine rank
-            envs["RANK"] = str(i)
+            # Here we use NODE_RANK to store the machine rank
+            envs["NODE_RANK"] = str(i)
             envs["WORLD_SIZE"] = str(replicas)
             if main_run:
                 envs["MAIN_JOB_RUN_OCID"] = main_run.id

ads/jobs/templates/driver_pytorch.py

Lines changed: 1 addition & 1 deletion
@@ -694,7 +694,7 @@ def __init__(self, code_dir: str = driver_utils.DEFAULT_CODE_DIR) -> None:
         # --multi_gpu will be set automatically if there is more than 1 GPU
         # self.multi_gpu = bool(self.node_count > 1 or self.gpu_count > 1)
         self.num_machines = self.node_count
-        self.machine_rank = os.environ["RANK"]
+        self.machine_rank = os.environ["NODE_RANK"]
         # Total number of processes across all nodes
         # Here we assume all nodes are having the same shape
         self.num_processes = (self.gpu_count if self.gpu_count else 1) * self.node_count

ads/jobs/templates/driver_utils.py

Lines changed: 5 additions & 1 deletion
@@ -276,7 +276,7 @@ def copy_inputs(mappings: dict = None):
             return
 
         for src, dest in mappings.items():
-            logger.debug("Copying %s to %s", src, dest)
+            logger.debug("Copying %s to %s", src, os.path.abspath(dest))
             # Create the dest dir if one does not exist.
             if str(dest).endswith("/"):
                 dest_dir = dest
@@ -439,6 +439,10 @@ def install_pip_packages(self, packages: str = None):
             packages = os.environ.get(CONST_ENV_PIP_PKG)
         if not packages:
             return self
+        # The package requirement may contain special characters like '>'.
+        # Here we wrap each package requirement in single quotes to make sure it can be installed correctly.
+        package_list = shlex.split(packages)
+        packages = " ".join([f"'{package}'" for package in package_list])
         self.run_command(
             f"pip install {packages}", conda_prefix=self.conda_prefix, check=True
         )

docs/source/user_guide/jobs/data_science_job.rst

Lines changed: 1 addition & 0 deletions
@@ -35,6 +35,7 @@ is available on `Data Science AI Sample GitHub Repository <https://github.com/or
 For more details, see :doc:`infra_and_runtime` configurations.
 You can also :doc:`run_notebook`, :doc:`run_script` and :doc:`run_git`.
 
+.. _yaml:
 
 YAML
 ====

docs/source/user_guide/jobs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ Data Science Jobs
    ../jobs/run_script
    ../jobs/run_container
    ../jobs/run_git
+   ../jobs/run_pytorch_ddp
    ../cli/opctl/_template/jobs
    ../cli/opctl/_template/monitoring
    ../cli/opctl/localdev/local_jobs

docs/source/user_guide/jobs/infra_and_runtime.rst

Lines changed: 3 additions & 0 deletions
@@ -253,6 +253,9 @@ Here are a few more examples:
 
 .. include:: ../jobs/tabs/runtime_args.rst
 
+
+.. _conda_environment:
+
 Conda Environment
 -----------------
 
docs/source/user_guide/jobs/run_python.rst

Lines changed: 0 additions & 10 deletions
@@ -14,16 +14,6 @@ Here is an example to define and run a job using :py:class:`~ads.jobs.PythonRunt
 
 .. include:: ../jobs/tabs/python_runtime.rst
 
-.. code-block:: python
-
-    # Create the job on OCI Data Science
-    job.create()
-    # Start a job run
-    run = job.run()
-    # Stream the job run outputs
-    run.watch()
-
-
 The :py:class:`~ads.jobs.PythonRuntime` uses an driver script from ADS for the job run.
 It performs additional operations before and after invoking your code.
 You can examine the driver script by downloading the job artifact from the OCI Console.
Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
Train PyTorch Models
********************

.. versionadded:: 2.8.8

The :py:class:`~ads.jobs.PyTorchDistributedRuntime` is designed for training PyTorch models, including large language models (LLMs), with multiple GPUs across multiple nodes. If you develop your training code to be compatible with `torchrun <https://pytorch.org/docs/stable/elastic/run.html>`_, `DeepSpeed <https://www.deepspeed.ai/>`_, or `Accelerate <https://huggingface.co/docs/accelerate/index>`_, you can run it using OCI Data Science Jobs with zero code change. For multi-node training, ADS launches multiple job runs, each corresponding to one node.

See `Distributed Data Parallel in PyTorch <https://pytorch.org/tutorials/beginner/ddp_series_intro.html>`_ for a series of tutorials on PyTorch distributed training.

.. admonition:: Prerequisite
  :class: note

  You need oracle-ads\>=2.8.8 to create a job with :py:class:`~ads.jobs.PyTorchDistributedRuntime`.

  You also need to specify a conda environment with PyTorch\>=1.10 and oracle-ads\>=2.6.8 for the job. See :ref:`Conda Environment <conda_environment>` for how to specify the conda environment for a job.

  We recommend using the ``pytorch20_p39_gpu_v1`` service conda environment and adding additional packages as needed.

  You need to specify a subnet ID and allow ingress traffic within the subnet.

Torchrun Example
================

Here is an example of training a GPT model using source code directly from the official PyTorch Examples GitHub repository. See the `Training "Real-World" models with DDP <https://pytorch.org/tutorials/intermediate/ddp_series_minGPT.html>`_ tutorial for a walkthrough of the source code.

.. include:: ../jobs/tabs/pytorch_ddp_torchrun.rst

.. include:: ../jobs/tabs/run_job.rst

Source Code
===========

The source code location can be specified as a Git repository, a local path, or a remote URI supported by
`fsspec <https://filesystem-spec.readthedocs.io/en/latest/>`_.

You can use the :py:meth:`~ads.jobs.PyTorchDistributedRuntime.with_git` method to specify the source code ``url`` of a Git repository. You can optionally specify the ``branch`` or ``commit`` for checking out the source code.

For a public repository, we recommend the "http://" or "https://" URL.
Authentication may be required for the SSH URL even if the repository is public.

To use a private repository, you must first save an SSH key to
`OCI Vault <https://docs.oracle.com/en-us/iaas/Content/KeyManagement/Concepts/keyoverview.htm>`_ as a secret,
and provide the ``secret_ocid`` when calling :py:meth:`~ads.jobs.GitPythonRuntime.with_source`.
For more information about creating and using secrets,
see `Managing Secret with Vault <https://docs.oracle.com/en-us/iaas/Content/KeyManagement/Tasks/managingsecrets.htm>`_.
For a repository on GitHub, you can set up a
`GitHub Deploy Key <https://docs.github.com/en/developers/overview/managing-deploy-keys#deploy-keys>`_ as the secret.

.. admonition:: Git Version for Private Repository
  :class: note

  Git version 2.3+ is required to use a private repository.

Alternatively, you can use the :py:meth:`~ads.jobs.PyTorchDistributedRuntime.with_source` method to specify the source code as a local path or a remote URI supported by
`fsspec <https://filesystem-spec.readthedocs.io/en/latest/>`_.
For example, you can specify files on OCI object storage using a URI like
``oci://bucket@namespace/path/to/prefix``. ADS will use the authentication method configured by
:py:meth:`ads.set_auth()` to fetch the files and upload them as the job artifact. The source code can be a single file, a compressed file/archive (zip/tar), or a folder.

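
For illustration, here is a minimal sketch of the two options; the repository URL, commit hash, and object storage prefix are placeholders:

.. code-block:: python

    from ads.jobs import PyTorchDistributedRuntime

    # Option 1: check out the source code from a (public) Git repository at a specific commit.
    runtime = (
        PyTorchDistributedRuntime()
        .with_service_conda("pytorch20_p39_gpu_v1")
        .with_git(
            url="https://github.com/your-org/your-training-repo.git",
            commit="<commit_hash>",
        )
    )

    # Option 2: fetch the source code from OCI object storage
    # (a single file, a zip/tar archive, or a folder under the prefix).
    runtime = (
        PyTorchDistributedRuntime()
        .with_service_conda("pytorch20_p39_gpu_v1")
        .with_source("oci://bucket@namespace/path/to/prefix")
    )
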
Working Directory
=================

The default working directory depends on how the source code is specified:

* When the source code is specified as a Git repository URL, the default working directory is the root of the Git repository.
* When the source code is a single file (script), the default working directory is the directory containing the file.
* When the source code is specified as a local or remote directory, the default working directory is the directory containing the source code directory.

The working directory of your workload can be configured with :py:meth:`~ads.jobs.PyTorchDistributedRuntime.with_working_dir`. See :ref:`Python Runtime Working Directory <runtime_working_dir>` for more details.

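
For example, continuing the ``runtime`` object sketched above, and assuming the workload lives in a hypothetical subdirectory of the source code:

.. code-block:: python

    # Run the workload from a subdirectory of the source code.
    # "examples/mingpt" is a hypothetical relative path.
    runtime = runtime.with_working_dir("examples/mingpt")
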
Input Data
==========

You can specify the input (training) data for the job using the :py:meth:`~ads.jobs.PyTorchDistributedRuntime.with_inputs` method, which takes a dictionary mapping each "source" to a "destination". The "source" can be an OCI object storage URI, or an HTTP or FTP URL. The "destination" is the local path in a job run. If the "destination" is specified as a relative path, it is relative to the working directory.

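
Continuing the ``runtime`` sketch, a minimal example with placeholder bucket, namespace, and paths:

.. code-block:: python

    # Copy the training data into the job run before the workload starts.
    # Relative destinations are resolved against the working directory.
    runtime = runtime.with_inputs({
        "oci://bucket@namespace/datasets/train.jsonl": "data/train.jsonl",
        "https://example.com/vocab.json": "data/vocab.json",
    })
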
Outputs
=======

You can specify the output data to be copied to object storage by using the :py:meth:`~ads.jobs.PyTorchDistributedRuntime.with_output` method.
It allows you to specify the output path in the job run (``output_path``) and a remote URI (``output_uri``).
Files in the ``output_path`` are copied to the remote output URI after the job run finishes successfully.
Note that the ``output_path`` should be a path relative to the working directory.

The OCI object storage location can be specified in the format ``oci://bucket_name@namespace/path/to/dir``.
Please make sure you configure the IAM policy to allow the job run dynamic group to use object storage.

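
A sketch with placeholder locations, continuing the same ``runtime`` object:

.. code-block:: python

    # Copy everything under "outputs" (relative to the working directory)
    # to object storage after the job run finishes successfully.
    runtime = runtime.with_output(
        "outputs",
        "oci://bucket_name@namespace/path/to/dir",
    )
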
Number of nodes
===============

The :py:meth:`~ads.jobs.PyTorchDistributedRuntime.with_replica` method lets you specify the number of nodes for the training job.

Command
=======

The command to start your workload is specified by using the :py:meth:`~ads.jobs.PyTorchDistributedRuntime.with_command` method.

For ``torchrun``, ADS will set ``--nnode``, ``--nproc_per_node``, ``--rdzv_backend`` and ``--rdzv_endpoint`` automatically. You do not need to specify them in the command unless you would like to override the values. The default ``rdzv_backend`` is ``c10d``, and the default port for ``rdzv_endpoint`` is 29400.

If your workload uses DeepSpeed, you also need to set ``use_deepspeed`` to ``True`` when specifying the command. For DeepSpeed, ADS will generate the hostfile automatically and set up the SSH configuration.

For ``accelerate launch``, you can add your config YAML to the source code and specify it using the ``--config_file`` argument. In your config, please use ``LOCAL_MACHINE`` as the compute environment. The same config file will be used by all nodes in a multi-node workload. ADS will set ``--num_processes``, ``--num_machines``, ``--machine_rank``, ``--main_process_ip`` and ``--main_process_port`` automatically. For these arguments, ADS will override the values from your config YAML. If you would like to use your own values, you need to specify them as command arguments. The default ``main_process_port`` is 29400.

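
A minimal sketch putting these pieces together; the script name and arguments are placeholders, and the commented-out line shows the ``use_deepspeed`` flag mentioned above:

.. code-block:: python

    # Two nodes; ADS fills in --nnode, --nproc_per_node,
    # --rdzv_backend and --rdzv_endpoint automatically.
    runtime = (
        runtime
        .with_replica(2)
        .with_command("torchrun train.py --epochs 10")
    )

    # For a DeepSpeed workload, set use_deepspeed=True so that ADS
    # generates the hostfile and SSH configuration:
    # runtime = runtime.with_command("deepspeed train.py --deepspeed ds_config.json", use_deepspeed=True)
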
Additional dependencies
=======================

The :py:meth:`~ads.jobs.PyTorchDistributedRuntime.with_dependency` method lets you specify additional dependencies to be installed into the conda environment before starting your workload (see the sketch below):

* ``pip_req`` specifies the path of a ``requirements.txt`` file in your source code.
* ``pip_pkg`` specifies the packages to be installed as a string.

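
A sketch of both options, assuming a ``requirements.txt`` at the root of the source code; the package pins are illustrative:

.. code-block:: python

    # Install from a requirements.txt shipped with the source code.
    runtime = runtime.with_dependency(pip_req="requirements.txt")

    # Or install packages given as a single string; quote requirements
    # that contain special characters such as '>'.
    runtime = runtime.with_dependency(pip_pkg="'transformers>=4.31.0' sentencepiece")
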
Python Paths
============

The working directory is added to the Python paths automatically.
You can call :py:meth:`~ads.jobs.PyTorchDistributedRuntime.with_python_path` to add additional Python paths as needed.
The paths should be relative to the working directory.

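
For example, assuming a hypothetical ``src`` subdirectory under the working directory:

.. code-block:: python

    # Make modules under "src" importable in addition to the working directory.
    runtime = runtime.with_python_path("src")
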
Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
.. tabs::

  .. code-tab:: python
    :caption: Python

    from ads.jobs import Job, DataScienceJob, PyTorchDistributedRuntime

    job = (
        Job(name="LLAMA2-Fine-Tuning")
        .with_infrastructure(
            DataScienceJob()
            .with_log_group_id("<log_group_ocid>")
            .with_log_id("<log_ocid>")
            .with_compartment_id("<compartment_ocid>")
            .with_project_id("<project_ocid>")
            .with_subnet_id("<subnet_ocid>")
            .with_shape_name("VM.GPU.A10.1")
            .with_block_storage_size(256)
        )
        .with_runtime(
            PyTorchDistributedRuntime()
            # Specify the service conda environment by slug name.
            .with_service_conda("pytorch20_p39_gpu_v1")
            .with_git(
                url="https://github.com/facebookresearch/llama-recipes.git",
                commit="03faba661f079ee1ecaeb66deaa6bdec920a7bab"
            )
            .with_dependency(
                pip_pkg=" ".join([
                    "'accelerate>=0.21.0'",
                    "appdirs",
                    "loralib",
                    "bitsandbytes==0.39.1",
                    "black",
                    "'black[jupyter]'",
                    "datasets",
                    "fire",
                    "'git+https://github.com/huggingface/peft.git'",
                    "'transformers>=4.31.0'",
                    "sentencepiece",
                    "py7zr",
                    "scipy",
                    "optimum"
                ])
            )
            .with_output("/home/datascience/outputs", "oci://bucket@namespace/outputs/$JOB_RUN_OCID")
            .with_command(" ".join([
                "torchrun llama_finetuning.py",
                "--enable_fsdp",
                "--pure_bf16",
                "--batch_size_training 1",
                "--micro_batch_size 1",
                "--model_name $MODEL_NAME",
                "--dist_checkpoint_root_folder /home/datascience/outputs",
                "--dist_checkpoint_folder fine-tuned"
            ]))
            .with_replica(2)
            .with_environment_variable(
                MODEL_NAME="meta-llama/Llama-2-7b-hf",
                HUGGING_FACE_HUB_TOKEN="<access_token>",
                LD_LIBRARY_PATH="/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/opt/conda/lib",
            )
        )
    )

  .. code-tab:: yaml
    :caption: YAML

    kind: job
    apiVersion: v1.0
    spec:
      name: LLAMA2-Fine-Tuning
      infrastructure:
        kind: infrastructure
        spec:
          blockStorageSize: 256
          compartmentId: "<compartment_ocid>"
          logGroupId: "<log_group_id>"
          logId: "<log_id>"
          projectId: "<project_id>"
          subnetId: "<subnet_id>"
          shapeName: VM.GPU.A10.2
        type: dataScienceJob
      runtime:
        kind: runtime
        type: pyTorchDistributed
        spec:
          git:
            url: https://github.com/facebookresearch/llama-recipes.git
            commit: 03faba661f079ee1ecaeb66deaa6bdec920a7bab
          command: >-
            torchrun llama_finetuning.py
            --enable_fsdp
            --pure_bf16
            --batch_size_training 1
            --micro_batch_size 1
            --model_name $MODEL_NAME
            --dist_checkpoint_root_folder /home/datascience/outputs
            --dist_checkpoint_folder fine-tuned
          replicas: 2
          conda:
            type: service
            slug: pytorch20_p39_gpu_v1
          dependencies:
            pipPackages: >-
              'accelerate>=0.21.0'
              appdirs
              loralib
              bitsandbytes==0.39.1
              black
              'black[jupyter]'
              datasets
              fire
              'git+https://github.com/huggingface/peft.git'
              'transformers>=4.31.0'
              sentencepiece
              py7zr
              scipy
              optimum
          outputDir: /home/datascience/outputs
          outputUri: oci://bucket@namespace/outputs/$JOB_RUN_OCID
          env:
            - name: MODEL_NAME
              value: meta-llama/Llama-2-7b-hf
            - name: HUGGING_FACE_HUB_TOKEN
              value: "<access_token>"
            - name: LD_LIBRARY_PATH
              value: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/opt/conda/lib
