Commit f9c5423

Update docs for training LLM with PyTorch.
1 parent 8631416 commit f9c5423

3 files changed: 7 additions & 4 deletions

docs/source/user_guide/jobs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ Data Science Jobs
     ../jobs/run_script
     ../jobs/run_container
     ../jobs/run_git
+    ../jobs/run_pytorch_ddp
     ../cli/opctl/_template/jobs
     ../cli/opctl/_template/monitoring
     ../cli/opctl/localdev/local_jobs

docs/source/user_guide/jobs/run_pytorch_ddp.rst

Lines changed: 3 additions & 3 deletions
@@ -3,9 +3,9 @@ Train PyTorch Models

 .. versionadded:: 2.8.8

-The :py:class:`~ads.jobs.PyTorchDistributedRuntime` is designed for training PyTorch models, including large language models (LLMs) with multiple GPUs from multiple nodes. If you develop your training code to be compatible with `torchrun <https://pytorch.org/docs/stable/elastic/run.html>`_, `DeepSpeed <https://www.deepspeed.ai/>`_, or `Accelerate<https://huggingface.co/docs/accelerate/index>`_, you can run it using OCI Data Science Jobs with zero code change. For multi-node training, ADS will launch multiple job runs, each corresponding to one node.
+The :py:class:`~ads.jobs.PyTorchDistributedRuntime` is designed for training PyTorch models, including large language models (LLMs) with multiple GPUs from multiple nodes. If you develop your training code to be compatible with `torchrun <https://pytorch.org/docs/stable/elastic/run.html>`_, `DeepSpeed <https://www.deepspeed.ai/>`_, or `Accelerate <https://huggingface.co/docs/accelerate/index>`_, you can run it using OCI Data Science Jobs with zero code change. For multi-node training, ADS will launch multiple job runs, each corresponding to one node.

-See `Distributed Data Parallel in PyTorch<https://pytorch.org/tutorials/beginner/ddp_series_intro.html>`_ for a series of tutorials on PyTorch distributed training.
+See `Distributed Data Parallel in PyTorch <https://pytorch.org/tutorials/beginner/ddp_series_intro.html>`_ for a series of tutorials on PyTorch distributed training.

 .. admonition:: Prerequisite
    :class: note
@@ -22,7 +22,7 @@ See `Distributed Data Parallel in PyTorch<https://pytorch.org/tutorials/beginner
 Torchrun Example
 ================

-Here is an example of training a GPT model using the source code directly from the official PyTorch Examples GitHub repository. See the `Training "Real-World" models with DDP<https://pytorch.org/tutorials/intermediate/ddp_series_minGPT.html>`_ tutorial for a walkthrough of the source code.
+Here is an example of training a GPT model using the source code directly from the official PyTorch Examples GitHub repository. See the `Training "Real-World" models with DDP <https://pytorch.org/tutorials/intermediate/ddp_series_minGPT.html>`_ tutorial for a walkthrough of the source code.

 .. include:: ../jobs/tabs/pytorch_ddp_torchrun.rst

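For context, the ``.. include::`` directive above pulls in a tabbed job definition. A rough sketch of such a job is shown below, assuming the ADS builder API (``Job``, ``DataScienceJob``, ``PyTorchDistributedRuntime``); the OCIDs, shape name, conda slug, and repository paths are illustrative placeholders, and the included ``pytorch_ddp_torchrun.rst`` tab remains the authoritative example.

.. code-block:: python

    from ads.jobs import Job, DataScienceJob, PyTorchDistributedRuntime

    job = (
        Job(name="pytorch-ddp-minGPT")
        .with_infrastructure(
            DataScienceJob()
            # Placeholders: replace with OCIDs from your tenancy.
            .with_compartment_id("<compartment_ocid>")
            .with_project_id("<project_ocid>")
            .with_subnet_id("<subnet_ocid>")
            .with_log_group_id("<log_group_ocid>")
            .with_log_id("<log_ocid>")
            # Assumed GPU shape; pick one available in your region.
            .with_shape_name("VM.GPU.A10.2")
            .with_block_storage_size(256)
        )
        .with_runtime(
            PyTorchDistributedRuntime()
            # Assumed service conda pack with PyTorch pre-installed.
            .with_service_conda("pytorch20_p39_gpu_v1")
            # Clone the PyTorch examples repo onto each node.
            .with_git(url="https://github.com/pytorch/examples.git")
            # Assumed: install the example's pip requirements before training.
            .with_dependency(pip_req="distributed/minGPT-ddp/requirements.txt")
            # The command is launched with torchrun on every node.
            .with_command("torchrun distributed/minGPT-ddp/mingpt/main.py")
            # Two replicas: one job run per node.
            .with_replica(2)
        )
    )

    job.create()
    run = job.run()

As the paragraph above notes, calling ``job.run()`` on a multi-node configuration launches one job run per replica, each corresponding to one node.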

docs/source/user_guide/model_training/training_llm.rst

Lines changed: 3 additions & 1 deletion
@@ -55,5 +55,7 @@ The same training script also support Parameter-Efficient Fine-Tuning (PEFT). Yo

 .. code-block:: bash

-    torchrun llama_finetuning.py --enable_fsdp --use_peft --peft_method lora --pure_bf16 --batch_size_training 1 --micro_batch_size 1 --model_name /home/datascience/llama --output_dir /home/datascience/outputs
+    torchrun llama_finetuning.py --enable_fsdp --use_peft --peft_method lora \
+        --pure_bf16 --batch_size_training 1 --micro_batch_size 1 \
+        --model_name /home/datascience/llama --output_dir /home/datascience/outputs

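When this fine-tuning command is run on OCI Data Science Jobs, it can be passed to the distributed runtime described in ``run_pytorch_ddp.rst``. Below is a minimal sketch assuming the same builder methods as above, with the infrastructure block (documented earlier in ``training_llm.rst``) omitted; the conda slug and replica count are placeholders.

.. code-block:: python

    from ads.jobs import PyTorchDistributedRuntime

    runtime = (
        PyTorchDistributedRuntime()
        # Assumed service conda pack; use the one configured for your job.
        .with_service_conda("pytorch20_p39_gpu_v1")
        # The same torchrun command as in the bash block above.
        .with_command(
            "torchrun llama_finetuning.py --enable_fsdp --use_peft --peft_method lora "
            "--pure_bf16 --batch_size_training 1 --micro_batch_size 1 "
            "--model_name /home/datascience/llama --output_dir /home/datascience/outputs"
        )
        # One job run is launched per replica (node).
        .with_replica(2)
    )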
