# Using TensorHive for running distributed trainings using TF_CONFIG

This example shows how to use the TensorHive `task nursery` module to
conveniently orchestrate distributed trainings configured using
the TF_CONFIG environment variable. This
[MSG-GAN training application](https://github.com/roscisz/dnn_training_benchmarks/tree/master/TensorFlowV2_MSG-GAN_Fashion-MNIST)
was used for the example.

## Running the training without TensorHive

In order to run the training manually, a separate `python train.py` process
has to be started on each node, with the appropriate parameter values set as follows.

**TF_CONFIG**

The TF_CONFIG environment variable has to be configured appropriately,
depending on the set of nodes taking part in the computations.
For example, a training on two nodes, gl01 and gl02, would require the
following TF_CONFIG settings:

gl01:
```bash
TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222"]}, "task":{"type": "worker", "index": 0}}'
```

gl02:
```bash
TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222"]}, "task":{"type": "worker", "index": 1}}'
```
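
Since TF_CONFIG is plain JSON, the per-node values can also be generated with a short
script instead of being written by hand. Below is a minimal Python sketch of this idea
(the `tf_config_for` helper is hypothetical, introduced here only for illustration):

```python
import json

def tf_config_for(workers, index):
    # Build the TF_CONFIG JSON for the worker at position `index`
    # in the shared worker list.
    return json.dumps({
        "cluster": {"worker": workers},
        "task": {"type": "worker", "index": index},
    })

workers = ["gl01:2222", "gl02:2222"]
for index, address in enumerate(workers):
    host = address.split(":")[0]
    print(f"{host}: TF_CONFIG='{tf_config_for(workers, index)}'")
```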

**Other environment variables**

Depending on the environment, other environment variables may have to be configured.
For example, because our TensorFlow build uses a custom MPI library, the LD_LIBRARY_PATH
environment variable has to be set to /usr/mpi/gcc/openmpi-4.0.0rc5/lib/ for each process.

**Choosing the appropriate Python version**

In some cases, a specific Python binary has to be used for the training.
For example, in our environment a Python binary from a virtual environment
is used, so it has to be specified as follows:

```
/home/roy/venv/p37avxmpitf2/bin/python
```
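
Before launching the actual training, it may be worth checking that this combination of
binary and library path works. A quick sanity check along these lines (using the paths
from our environment) should print the TensorFlow version without linker errors:

```bash
# Point the dynamic linker at the custom MPI library used by our TensorFlow build,
# then verify that TensorFlow imports cleanly under the virtual environment's Python.
export LD_LIBRARY_PATH='/usr/mpi/gcc/openmpi-4.0.0rc5/lib/'
/home/roy/venv/p37avxmpitf2/bin/python -c 'import tensorflow as tf; print(tf.__version__)'
```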

**Summary**

Finally, the full commands required to start the training in our exemplary
environment are as follows:

gl01:

```bash
export TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222"]}, "task":{"type": "worker", "index": 0}}'
export LD_LIBRARY_PATH='/usr/mpi/gcc/openmpi-4.0.0rc5/lib/'
/home/roy/venv/p37avxmpitf2/bin/python /home/roy/dnn_training_benchmarks/TensorFlowV2_MSG-GAN_Fashion-MNIST/train.py
```

gl02:

```bash
export TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222"]}, "task":{"type": "worker", "index": 1}}'
export LD_LIBRARY_PATH='/usr/mpi/gcc/openmpi-4.0.0rc5/lib/'
/home/roy/venv/p37avxmpitf2/bin/python /home/roy/dnn_training_benchmarks/TensorFlowV2_MSG-GAN_Fashion-MNIST/train.py
```
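
For context, the training script consumes TF_CONFIG through a TensorFlow distribution
strategy. The following minimal sketch illustrates the mechanism (it is not the actual
train.py from the linked repository):

```python
import json
import os

import tensorflow as tf

# MultiWorkerMirroredStrategy reads the cluster layout and this worker's
# role from the TF_CONFIG environment variable.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
task = json.loads(os.environ["TF_CONFIG"])["task"]
print(f"worker {task['index']}: {strategy.num_replicas_in_sync} replicas in sync")

with strategy.scope():
    # Variables created inside the scope are mirrored across all workers.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")
```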


## Running the training with TensorHive

The TensorHive `task nursery` module allows convenient orchestration of distributed trainings.
It is available in the Tasks Overview view. The `CREATE TASKS FROM TEMPLATE` button makes it
possible to conveniently configure tasks supporting a specific framework or distribution method.
In this example we choose the `Tensorflow - TF_CONFIG` template and click `GO TO TASK CREATOR`:

![](image)

In the task creator, we set the Command to:

```bash
/home/roy/venv/p37avxmpitf2/bin/python /home/roy/dnn_training_benchmarks/TensorFlowV2_MSG-GAN_Fashion-MNIST/train.py
```

In order to add the LD_LIBRARY_PATH environment variable, we enter the parameter name,
select Static (the same value for all processes) and click `ADD AS ENV VARIABLE TO ALL TASKS`:

![](image)

Then, we set the appropriate value of the environment variable (/usr/mpi/gcc/openmpi-4.0.0rc5/lib/).

The task creator also allows other command-line arguments to be conveniently specified.
For example, to specify the batch size, we enter the parameter name --batch_size, again
select Static, click `ADD AS PARAMETER TO ALL TASKS` and set its value (32 in our case).
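
The value ends up as a regular command-line argument appended to each task's command,
so the training script can read it in the usual way. Assuming argparse-based parsing
(a sketch, not necessarily how the linked train.py does it):

```python
import argparse

# TensorHive appends `--batch_size 32` to the command of every task,
# so a matching argument definition is all the script needs.
parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=32)
args = parser.parse_args()
print(f"training with batch size {args.batch_size}")
```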

Select the required hostname and resource (CPU/GPU_N) for each specified training process. The resulting
command that TensorHive will execute on the selected node is displayed above the process specification:

![](image)

Note that the TF_CONFIG and CUDA_VISIBLE_DEVICES variables are configured automatically.
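
For instance, the command assembled for gl01 in our setup would be roughly equivalent to
the manual variant from the previous section, plus the pieces configured in the task
creator (the exact formatting and the selected GPU index are assumptions here):

```bash
export TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222"]}, "task":{"type": "worker", "index": 0}}'
export CUDA_VISIBLE_DEVICES=0  # assuming GPU0 was selected as the resource
export LD_LIBRARY_PATH='/usr/mpi/gcc/openmpi-4.0.0rc5/lib/'
/home/roy/venv/p37avxmpitf2/bin/python /home/roy/dnn_training_benchmarks/TensorFlowV2_MSG-GAN_Fashion-MNIST/train.py --batch_size 32
```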

Now, use the `ADD TASK` button to duplicate the processes and modify the target hosts
to create your training processes. For example, this screenshot shows the configuration
for training on 4 hosts: gl01, gl02, gl03 and gl04:

![](image)

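For reference, the automatically generated TF_CONFIG follows the same pattern as in the
manual setup. Assuming the default port from the earlier examples, the value for, e.g.,
gl03 would be:

```bash
TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222", "gl03:2222", "gl04:2222"]}, "task":{"type": "worker", "index": 2}}'
```
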
After clicking the `CREATE ALL TASKS` button, the processes will be available in the process list for further actions.
To run the processes, select them and use the `Spawn selected tasks` button. If TensorHive is configured properly,
the task status should change to `running`:

![](image)

Note that the appropriate process PID will be displayed in the `pid` column. The Tasks Overview can
be used to schedule, spawn, stop, kill and edit the tasks, as well as to view logs from their execution.