
Commit e4668a1

Update distributed training user guide (#54)
2 parents be3c66a + ebdccf5

3 files changed: +28 additions, -18 deletions
Lines changed: 14 additions & 12 deletions
@@ -1,7 +1,7 @@
 **Test Locally:**
 
 Before submitting the workload to jobs, you can run it locally to test your code, dependencies, configurations etc.
-With ``-b local`` flag, it uses a local backend. Further when you need to run this workload on odsc jobs, simply use ``-b job`` flag instead.
+With ``-b local`` flag, it uses a local backend. Further when you need to run this workload on OCI data science jobs, simply use ``-b job`` flag instead.
 
 .. code-block:: bash
 
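
For reference, a minimal sketch of the two invocations this paragraph describes (``train.yaml`` stands in for your own workload YAML):

.. code-block:: bash

    # test the workload against the local backend first
    ads opctl run -f train.yaml -b local

    # then submit the same workload to OCI Data Science Jobs
    ads opctl run -f train.yaml -b job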
@@ -13,9 +13,10 @@ If your code requires to use any oci services (like object bucket), you need to
 
     oci_key_mnt = ~/.oci:/home/oci_dist_training/.oci
 
-**Submit the workload:**
-
+Note that the local backend requires the source code for your workload is available locally in the source folder specified in the ``config.ini`` file.
+If you specified Git repository or OCI object storage location as source code location in your workflow YAML, please make sure you have a local copy available for local testing.
 
+**Submit the workload:**
 
 .. code-block:: bash
 
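
A hedged sketch of that local-testing requirement (the repository URL and target folder below are placeholders, not values from this guide):

.. code-block:: bash

    # keep a local copy of the source referenced in your workload YAML,
    # in the source folder configured in config.ini
    git clone https://github.com/<your-org>/<your-training-repo>.git <source_folder>

    # then exercise the workload with the local backend
    ads opctl run -f train.yaml -b local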
@@ -24,22 +25,23 @@ If your code requires to use any oci services (like object bucket), you need to
 **Note:**: This will automatically push the docker image to the
 OCI `container registry repo <https://docs.oracle.com/en-us/iaas/Content/Registry/Concepts/registryoverview.htm>`_ .
 
-Once running, you will see on the terminal an output similar to the below. Note that this yaml
-can be used as input to ``ads opctl distributed-training show-config -f <info.yaml>`` - to both
-save and see the run info use ``tee`` - for example:
-
-.. code-block:: bash
-
-  ads opctl run -f train.yaml | tee info.yaml
+Once running, you will see on the terminal outputs similar to the below
 
 .. code-block:: yaml
   :caption: info.yaml
 
   jobId: oci.xxxx.<job_ocid>
   mainJobRunId:
     mainJobRunIdName: oci.xxxx.<job_run_ocid>
-  workDir: oci://my-bucket@my-namespace/daskcluster-testing/005
+  workDir: oci://my-bucket@my-namespace/cluster-testing/005
   otherJobRunIds:
     - workerJobRunIdName_1: oci.xxxx.<job_run_ocid>
     - workerJobRunIdName_2: oci.xxxx.<job_run_ocid>
-    - workerJobRunIdName_3: oci.xxxx.<job_run_ocid>
+    - workerJobRunIdName_3: oci.xxxx.<job_run_ocid>
+
+This information can be saved as YAML file and used as input to ``ads opctl distributed-training show-config -f <info.yaml>``.
+You can use ``--job-info`` to save the job run info into YAML, for example:
+
+.. code-block:: bash
+
+  ads opctl run -f train.yaml --job-info info.yaml
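
As a sketch of how the two commands in this hunk fit together (``info.yaml`` is the example file name used above):

.. code-block:: bash

    # submit the workload and save the job run info to a YAML file
    ads opctl run -f train.yaml --job-info info.yaml

    # later, feed the saved run info back to show-config
    ads opctl distributed-training show-config -f info.yaml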

docs/source/user_guide/model_training/distributed_training/dask/creating.rst

Lines changed: 1 addition & 1 deletion
@@ -230,7 +230,7 @@ To view the logs from a job run, you could run -
 
     ads opctl watch oci.xxxx.<job_run_ocid>
 
-You could stream the logs from any of the job run ocid using ``ads opctl watch`` command. Your could run this comand from mutliple terminal to watch all of the job runs. Typically, watching ``mainJobRunId`` should yeild most informative log.
+You could stream the logs from any of the job run ocid using ``ads opctl watch`` command. You could run this command from multiple terminal to watch all of the job runs. Typically, watching ``mainJobRunId`` should yield most informative log.
 
 To find the IP address of the scheduler dashboard, you could check the configuration file generated by the Main job by running -
 
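
For example, a sketch of following several runs at once (the OCIDs are placeholders in the same style as the ``info.yaml`` shown earlier):

.. code-block:: bash

    # terminal 1: follow the main job run, typically the most informative
    ads opctl watch oci.xxxx.<main_job_run_ocid>

    # terminal 2: follow one of the worker job runs
    ads opctl watch oci.xxxx.<worker_job_run_ocid>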

docs/source/user_guide/model_training/distributed_training/pytorch/creating.rst

Lines changed: 13 additions & 5 deletions
@@ -7,7 +7,7 @@ Creating PyTorch Distributed Workloads
 **Write your training code:**
 
 For this example, the code to run was inspired from an example
-`found here <https://github.com/Azure/azureml-examples/blob/main/python-sdk/workflows/train/pytorch/cifar-distributed/src/train.py>`_
+`found here <https://github.com/Azure/azureml-examples/blob/32eeda9e9f394bd6c3b687b55e2740abc50b116c/sdk/python/jobs/single-step/pytorch/distributed-training/src/train.py>`_
 
 Note that ``MASTER_ADDR``, ``MASTER_PORT``, ``WORLD_SIZE``, ``RANK``, and ``LOCAL_RANK`` are environment variables
 that will automatically be set.
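
As a quick sanity check, a sketch (assuming a shell inside the running training container) that prints the variables the launcher is expected to set:

.. code-block:: bash

    # populated automatically for each job run
    echo "MASTER_ADDR=$MASTER_ADDR MASTER_PORT=$MASTER_PORT"
    echo "WORLD_SIZE=$WORLD_SIZE RANK=$RANK LOCAL_RANK=$LOCAL_RANK"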
@@ -20,7 +20,7 @@ that will automatically be set.
   # BSD 3-Clause License
   #
   # Script adapted from:
-  # https://github.com/Azure/azureml-examples/blob/main/python-sdk/workflows/train/pytorch/cifar-distributed/src/train.py
+  # https://github.com/Azure/azureml-examples/blob/32eeda9e9f394bd6c3b687b55e2740abc50b116c/sdk/python/jobs/single-step/pytorch/distributed-training/src/train.py
   # ==============================================================================
 
 
@@ -302,7 +302,7 @@ Specify image name and tag
     export TAG=latest
 
 
-Build the container image.
+Build the container image
 
 .. code-block:: bash
 
@@ -318,7 +318,7 @@ The code is assumed to be in the current working directory. To override the sour
     ads opctl distributed-training build-image \
       -t $TAG \
       -reg $IMAGE_NAME \
-      -df oci_dist_training_artifacts/horovod/v1/oci_dist_training_artifacts/pytorch/v1/Dockerfile
+      -df oci_dist_training_artifacts/pytorch/v1/Dockerfile
       -s <code_dir>
 
 If you are behind proxy, ads opctl will automatically use your proxy settings (defined via ``no_proxy``, ``http_proxy`` and ``https_proxy``).
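
Putting the pieces of this file together, a hedged sketch of a full build invocation (the registry path in ``IMAGE_NAME`` is a placeholder, and trailing backslashes are added so the command runs as one line):

.. code-block:: bash

    # image name and tag consumed by build-image (placeholder registry path)
    export IMAGE_NAME=<region>.ocir.io/<tenancy-namespace>/<repo-name>
    export TAG=latest

    # build the image from the PyTorch Dockerfile, overriding the source directory
    ads opctl distributed-training build-image \
        -t $TAG \
        -reg $IMAGE_NAME \
        -df oci_dist_training_artifacts/pytorch/v1/Dockerfile \
        -s <code_dir>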
@@ -397,7 +397,15 @@ the output from the dry run will show all the actions and infrastructure configu
 
 .. include:: ../_test_and_submit.rst
 
-.. _hvd_saving_artifacts:
+**Monitoring the workload logs**
+
+To view the logs from a job run, you could run -
+
+.. code-block:: bash
+
+  ads opctl watch oci.xxxx.<job_run_ocid>
+
+You could stream the logs from any of the job run ocid using ``ads opctl watch`` command. You could run this command from multiple terminal to watch all of the job runs. Typically, watching ``mainJobRunId`` should yield most informative log.
 
 .. include:: ../_save_artifacts.rst
 
