
Commit e4668a1

Update distributed training user guide (#54)
2 parents be3c66a + ebdccf5

3 files changed: +28 additions, -18 deletions
Lines changed: 14 additions & 12 deletions
@@ -1,7 +1,7 @@
 **Test Locally:**
 
 Before submitting the workload to jobs, you can run it locally to test your code, dependencies, configurations etc.
-With ``-b local`` flag, it uses a local backend. Further when you need to run this workload on odsc jobs, simply use ``-b job`` flag instead.
+With ``-b local`` flag, it uses a local backend. Further when you need to run this workload on OCI data science jobs, simply use ``-b job`` flag instead.
 
 .. code-block:: bash
 
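
For reference, a minimal sketch of the two invocations this paragraph describes (``train.yaml`` stands in for your own workload YAML):

.. code-block:: bash

    # test the workload against the local backend first
    ads opctl run -f train.yaml -b local

    # then submit the same workload to OCI Data Science Jobs
    ads opctl run -f train.yaml -b job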
@@ -13,9 +13,10 @@ If your code requires to use any oci services (like object bucket), you need to
 
     oci_key_mnt = ~/.oci:/home/oci_dist_training/.oci
 
-**Submit the workload:**
-
+Note that the local backend requires the source code for your workload is available locally in the source folder specified in the ``config.ini`` file.
+If you specified Git repository or OCI object storage location as source code location in your workflow YAML, please make sure you have a local copy available for local testing.
 
+**Submit the workload:**
 
 .. code-block:: bash
 
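
A hedged sketch of that local-testing requirement (the repository URL and target folder below are placeholders, not values from this guide):

.. code-block:: bash

    # keep a local copy of the source referenced in your workload YAML,
    # in the source folder configured in config.ini
    git clone https://github.com/<your-org>/<your-training-repo>.git <source_folder>

    # then exercise the workload with the local backend
    ads opctl run -f train.yaml -b local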
@@ -24,22 +25,23 @@ If your code requires to use any oci services (like object bucket), you need to
 **Note:**: This will automatically push the docker image to the
 OCI `container registry repo <https://docs.oracle.com/en-us/iaas/Content/Registry/Concepts/registryoverview.htm>`_ .
 
-Once running, you will see on the terminal an output similar to the below. Note that this yaml
-can be used as input to ``ads opctl distributed-training show-config -f <info.yaml>`` - to both
-save and see the run info use ``tee`` - for example:
-
-.. code-block:: bash
-
-  ads opctl run -f train.yaml | tee info.yaml
+Once running, you will see on the terminal outputs similar to the below
 
 .. code-block:: yaml
   :caption: info.yaml
 
   jobId: oci.xxxx.<job_ocid>
   mainJobRunId:
     mainJobRunIdName: oci.xxxx.<job_run_ocid>
-  workDir: oci://my-bucket@my-namespace/daskcluster-testing/005
+  workDir: oci://my-bucket@my-namespace/cluster-testing/005
   otherJobRunIds:
     - workerJobRunIdName_1: oci.xxxx.<job_run_ocid>
     - workerJobRunIdName_2: oci.xxxx.<job_run_ocid>
-    - workerJobRunIdName_3: oci.xxxx.<job_run_ocid>
+    - workerJobRunIdName_3: oci.xxxx.<job_run_ocid>
+
+This information can be saved as YAML file and used as input to ``ads opctl distributed-training show-config -f <info.yaml>``.
+You can use ``--job-info`` to save the job run info into YAML, for example:
+
+.. code-block:: bash
+
+  ads opctl run -f train.yaml --job-info info.yaml
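
As a sketch of how the two commands in this hunk fit together (``info.yaml`` is the example file name used above):

.. code-block:: bash

    # submit the workload and save the job run info to a YAML file
    ads opctl run -f train.yaml --job-info info.yaml

    # later, feed the saved run info back to show-config
    ads opctl distributed-training show-config -f info.yaml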

docs/source/user_guide/model_training/distributed_training/dask/creating.rst

Lines changed: 1 addition & 1 deletion
@@ -230,7 +230,7 @@ To view the logs from a job run, you could run -
 
     ads opctl watch oci.xxxx.<job_run_ocid>
 
-You could stream the logs from any of the job run ocid using ``ads opctl watch`` command. Your could run this comand from mutliple terminal to watch all of the job runs. Typically, watching ``mainJobRunId`` should yeild most informative log.
+You could stream the logs from any of the job run ocid using ``ads opctl watch`` command. You could run this command from multiple terminal to watch all of the job runs. Typically, watching ``mainJobRunId`` should yield most informative log.
 
 To find the IP address of the scheduler dashboard, you could check the configuration file generated by the Main job by running -
 
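
For example, a sketch of following several runs at once (the OCIDs are placeholders in the same style as the ``info.yaml`` shown earlier):

.. code-block:: bash

    # terminal 1: follow the main job run, typically the most informative
    ads opctl watch oci.xxxx.<main_job_run_ocid>

    # terminal 2: follow one of the worker job runs
    ads opctl watch oci.xxxx.<worker_job_run_ocid>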

docs/source/user_guide/model_training/distributed_training/pytorch/creating.rst

Lines changed: 13 additions & 5 deletions
@@ -7,7 +7,7 @@ Creating PyTorch Distributed Workloads
 **Write your training code:**
 
 For this example, the code to run was inspired from an example
-`found here <https://github.com/Azure/azureml-examples/blob/main/python-sdk/workflows/train/pytorch/cifar-distributed/src/train.py>`_
+`found here <https://github.com/Azure/azureml-examples/blob/32eeda9e9f394bd6c3b687b55e2740abc50b116c/sdk/python/jobs/single-step/pytorch/distributed-training/src/train.py>`_
 
 Note that ``MASTER_ADDR``, ``MASTER_PORT``, ``WORLD_SIZE``, ``RANK``, and ``LOCAL_RANK`` are environment variables
 that will automatically be set.
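
As a quick sanity check, a sketch (assuming a shell inside the running training container) that prints the variables the launcher is expected to set:

.. code-block:: bash

    # populated automatically for each job run
    echo "MASTER_ADDR=$MASTER_ADDR MASTER_PORT=$MASTER_PORT"
    echo "WORLD_SIZE=$WORLD_SIZE RANK=$RANK LOCAL_RANK=$LOCAL_RANK"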
@@ -20,7 +20,7 @@ that will automatically be set.
   # BSD 3-Clause License
   #
   # Script adapted from:
-  # https://github.com/Azure/azureml-examples/blob/main/python-sdk/workflows/train/pytorch/cifar-distributed/src/train.py
+  # https://github.com/Azure/azureml-examples/blob/32eeda9e9f394bd6c3b687b55e2740abc50b116c/sdk/python/jobs/single-step/pytorch/distributed-training/src/train.py
   # ==============================================================================
 
 
@@ -302,7 +302,7 @@ Specify image name and tag
     export TAG=latest
 
 
-Build the container image.
+Build the container image
 
 .. code-block:: bash
 
@@ -318,7 +318,7 @@ The code is assumed to be in the current working directory. To override the sour
     ads opctl distributed-training build-image \
       -t $TAG \
       -reg $IMAGE_NAME \
-      -df oci_dist_training_artifacts/horovod/v1/oci_dist_training_artifacts/pytorch/v1/Dockerfile
+      -df oci_dist_training_artifacts/pytorch/v1/Dockerfile
       -s <code_dir>
 
 If you are behind proxy, ads opctl will automatically use your proxy settings (defined via ``no_proxy``, ``http_proxy`` and ``https_proxy``).
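
Putting the pieces of this file together, a hedged sketch of a full build invocation (the registry path in ``IMAGE_NAME`` is a placeholder, and trailing backslashes are added so the command runs as one line):

.. code-block:: bash

    # image name and tag consumed by build-image (placeholder registry path)
    export IMAGE_NAME=<region>.ocir.io/<tenancy-namespace>/<repo-name>
    export TAG=latest

    # build the image from the PyTorch Dockerfile, overriding the source directory
    ads opctl distributed-training build-image \
        -t $TAG \
        -reg $IMAGE_NAME \
        -df oci_dist_training_artifacts/pytorch/v1/Dockerfile \
        -s <code_dir>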
@@ -397,7 +397,15 @@ the output from the dry run will show all the actions and infrastructure configu
 
 .. include:: ../_test_and_submit.rst
 
-.. _hvd_saving_artifacts:
+**Monitoring the workload logs**
+
+To view the logs from a job run, you could run -
+
+.. code-block:: bash
+
+  ads opctl watch oci.xxxx.<job_run_ocid>
+
+You could stream the logs from any of the job run ocid using ``ads opctl watch`` command. You could run this command from multiple terminal to watch all of the job runs. Typically, watching ``mainJobRunId`` should yield most informative log.
 
 .. include:: ../_save_artifacts.rst
 
