Commit fbd0fdf

Merge pull request #240 from roscisz/develop

r0.3.2

2 parents c614970 + 9368edd

44 files changed (+2218 −887 lines)

README.md

Lines changed: 12 additions & 9 deletions
```diff
@@ -1,7 +1,7 @@
 TensorHive
 ===
-![](https://img.shields.io/badge/release-v0.3.1-brightgreen.svg?style=popout-square)
-![](https://img.shields.io/badge/pypi-v0.3.1-brightgreen.svg?style=popout-square)
+![](https://img.shields.io/badge/release-v0.3.2-brightgreen.svg?style=popout-square)
+![](https://img.shields.io/badge/pypi-v0.3.2-brightgreen.svg?style=popout-square)
 ![](https://img.shields.io/badge/Issues%20and%20PRs-welcome-yellow.svg?style=popout-square)
 ![](https://img.shields.io/badge/platform-Linux-blue.svg?style=popout-square)
 ![](https://img.shields.io/badge/hardware-Nvidia-green.svg?style=popout-square)
```
```diff
@@ -150,7 +150,8 @@ Features
 #### Core
 - [x] :mag_right: Monitor metrics on each host
 - [x] :tm: Nvidia GPUs
-- [ ] :pager: CPU, RAM, HDD
+- [x] :pager: CPU, RAM
+- [ ] :open_file_folder: HDD
 - [x] :customs: Protection of reserved resources
 - [x] :warning: Send warning messages to terminal of users who violate the rules
 - [x] :mailbox_with_no_mail: Send e-mail warnings
```
```diff
@@ -224,19 +225,21 @@ This diagram will help you to grasp the rough concept of the system.
 
 Contibution and feedback
 ------------------------
-**Project is still in early beta version**, so there will be some inconveniences, just be patient and keep an eye on upcoming updates.
-
 We'd :heart: to collect your observations, issues and pull requests!
 
 Feel free to **report any configuration problems, we will help you**.
 
-We plan to develop examples of running distributed DNN training applications
-in `Task nursery` along with templates for TF_CONFIG and PyTorch, deadline - March 2020 :shipit:, so stay tuned!
+We are working on user groups for differentiated GPU access control,
+grouping tasks into jobs and process-killing reservation violation handler,
+deadline - July 2020 :shipit:, so stay tuned!
+
+If you consider becoming a contributor, please look at issues labeled as
+[**good-first-issue**](https://github.com/roscisz/TensorHive/issues?q=is%3Aissue+is%3Aopen+label%3Agood-first-issue)
+and
+[**help wanted**](https://github.com/roscisz/TensorHive/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22).
 
 Credits
 -------
-
-
 TensorHive has been greatly supported within a joint project between [**VoiceLab.ai**](https://voicelab.ai) and
 [**Gdańsk University of Technology**](https://pg.edu.pl/) titled: "Exploration and selection of methods
 for parallelization of neural network training using multiple GPUs".
```

examples/TF_CONFIG/README.md

Lines changed: 112 additions & 0 deletions
# Using TensorHive for running distributed trainings using TF_CONFIG

This example shows how to use the TensorHive `task nursery` module to conveniently orchestrate distributed trainings configured using the TF_CONFIG environment variable. This [MSG-GAN training application](https://github.com/roscisz/dnn_training_benchmarks/tree/master/TensorFlowV2_MSG-GAN_Fashion-MNIST) was used for the example.

## Running the training without TensorHive

To run the training manually, a separate `python train.py` process has to be started on each node with the appropriate parameter values, set as follows.

**TF_CONFIG**

The TF_CONFIG environment variable has to be configured according to the set of nodes taking part in the computations. For example, a training on two nodes gl01 and gl02 would require the following settings of TF_CONFIG:

gl01:
```bash
TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222"]}, "task":{"type": "worker", "index": 0}}'
```

gl02:
```bash
TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222"]}, "task":{"type": "worker", "index": 1}}'
```
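The per-node values share the same `cluster` section and differ only in `task.index`. As a rough illustration (not part of TensorHive itself), the values above could be generated with a short Python sketch; the host list and port are the ones from this example:

```python
import json

def tf_config_for(hosts, index, port=2222):
    """Build the TF_CONFIG value for the worker with the given index."""
    return json.dumps({
        "cluster": {"worker": [f"{h}:{port}" for h in hosts]},
        "task": {"type": "worker", "index": index},
    })

hosts = ["gl01", "gl02"]
for i, host in enumerate(hosts):
    # One TF_CONFIG line per node; only the task index changes.
    print(f"{host}: TF_CONFIG='{tf_config_for(hosts, i)}'")
```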

**Other environment variables**

Depending on the environment, some other environment variables may have to be configured. For example, because our TensorFlow build uses a custom MPI library, the LD_LIBRARY_PATH environment variable has to be set to /usr/mpi/gcc/openmpi-4.0.0rc5/lib/ for each process.

**Choosing the appropriate Python version**

In some cases, a specific Python binary has to be used for the training. For example, in our environment a Python binary from a virtual environment is used, so it has to be given explicitly:

```
/home/roy/venv/p37avxmpitf2/bin/python
```

**Summary**

Finally, the full commands required to start the training in our exemplary environment are as follows:

gl01:

```bash
export TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222"]}, "task":{"type": "worker", "index": 0}}'
export LD_LIBRARY_PATH='/usr/mpi/gcc/openmpi-4.0.0rc5/lib/'
/home/roy/venv/p37avxmpitf2/bin/python /home/roy/dnn_training_benchmarks/TensorFlowV2_MSG-GAN_Fashion-MNIST/train.py
```

gl02:

```bash
export TF_CONFIG='{"cluster":{"worker":["gl01:2222", "gl02:2222"]}, "task":{"type": "worker", "index": 1}}'
export LD_LIBRARY_PATH='/usr/mpi/gcc/openmpi-4.0.0rc5/lib/'
/home/roy/venv/p37avxmpitf2/bin/python /home/roy/dnn_training_benchmarks/TensorFlowV2_MSG-GAN_Fashion-MNIST/train.py
```
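Without TensorHive, commands like these have to be typed by hand on each node (for example in separate SSH sessions). A minimal dry-run sketch of assembling the per-node shell lines; the paths and host names are just this example's assumptions:

```python
import json

# All paths and host names below are the example's assumptions, not defaults.
HOSTS = ["gl01", "gl02"]
PYTHON = "/home/roy/venv/p37avxmpitf2/bin/python"
SCRIPT = "/home/roy/dnn_training_benchmarks/TensorFlowV2_MSG-GAN_Fashion-MNIST/train.py"
LD_PATH = "/usr/mpi/gcc/openmpi-4.0.0rc5/lib/"

def launch_command(index):
    """Assemble (but do not run) the launch line for worker `index`."""
    tf_config = json.dumps({
        "cluster": {"worker": [f"{h}:2222" for h in HOSTS]},
        "task": {"type": "worker", "index": index},
    })
    # A single shell line suitable for e.g. `ssh <host> '<line>'`
    return (f"export TF_CONFIG='{tf_config}' && "
            f"export LD_LIBRARY_PATH='{LD_PATH}' && "
            f"{PYTHON} {SCRIPT}")

for i, host in enumerate(HOSTS):
    print(f"# on {host}:\n{launch_command(i)}")
```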

## Running the training with TensorHive

The TensorHive `task nursery` module allows convenient orchestration of distributed trainings. It is available in the Tasks Overview view. The `CREATE TASKS FROM TEMPLATE` button allows tasks supporting a specific framework or distribution method to be configured conveniently. In this example we choose the TensorFlow - TF_CONFIG template and click `GO TO TASK CREATOR`:

![choose_template](https://github.com/roscisz/TensorHive/tree/master/examples/TF_CONFIG/img/choose_template.png)

In the task creator, we set the Command to

```
/home/roy/venv/p37avxmpitf2/bin/python /home/roy/dnn_training_benchmarks/TensorFlowV2_MSG-GAN_Fashion-MNIST/train.py
```

To add the LD_LIBRARY_PATH environment variable, we enter the parameter name, select Static (the same value for all processes) and click `ADD AS ENV VARIABLE TO ALL TASKS`:

![env_var](https://github.com/roscisz/TensorHive/tree/master/examples/TF_CONFIG/img/env_var.png)

Then, set the appropriate value of the environment variable (/usr/mpi/gcc/openmpi-4.0.0rc5/lib/).

The task creator also allows other command-line arguments to be specified conveniently. For example, to specify the batch size, we enter the parameter name --batch_size, again select Static, click `ADD AS PARAMETER TO ALL TASKS` and set its value (in our case 32).

Select the required hostname and resource (CPU/GPU_N) for the specified training process. The resulting command that will be executed by TensorHive on the selected node is displayed above the process specification:

![single_process](https://github.com/roscisz/TensorHive/tree/master/examples/TF_CONFIG/img/single_process.png)

Note that the TF_CONFIG and CUDA_VISIBLE_DEVICES variables are configured automatically. Now, use the `ADD TASK` button to duplicate the processes and modify the required target hosts to create your training processes. For example, this screenshot shows the configuration for training on 4 hosts: gl01, gl02, gl03, gl04:

![multi_process](https://github.com/roscisz/TensorHive/tree/master/examples/TF_CONFIG/img/multi_process.png)

After clicking the `CREATE ALL TASKS` button, the processes will be available in the process list for further actions. To run the processes, select them and use the `Spawn selected tasks` button. If TensorHive is configured properly, the task status should change to `running`:

![running](https://github.com/roscisz/TensorHive/tree/master/examples/TF_CONFIG/img/running.png)

Note that the appropriate process PID is displayed in the `pid` column. The task overview can be used to schedule, spawn, stop, kill, and edit the tasks, and to view logs from their execution.
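Conceptually, the task creator composes one shell line per process out of the command, the env variables, and the parameters described above. A hypothetical sketch of that composition (the exact format TensorHive uses may differ; `compose_task` is an illustrative helper, not a TensorHive API):

```python
def compose_task(command, env=None, params=None):
    """Join env variable exports, the command, and CLI parameters into one shell line."""
    env = env or {}
    params = params or {}
    exports = " ".join(f"export {k}={v};" for k, v in env.items())
    args = " ".join(f"{k} {v}" for k, v in params.items())
    return " ".join(part for part in (exports, command, args) if part)

# Values taken from this example; CUDA_VISIBLE_DEVICES stands in for
# the resource selection TensorHive performs automatically.
line = compose_task(
    "/home/roy/venv/p37avxmpitf2/bin/python "
    "/home/roy/dnn_training_benchmarks/TensorFlowV2_MSG-GAN_Fashion-MNIST/train.py",
    env={"LD_LIBRARY_PATH": "/usr/mpi/gcc/openmpi-4.0.0rc5/lib/",
         "CUDA_VISIBLE_DEVICES": "0"},
    params={"--batch_size": "32"},
)
print(line)
```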
examples/TF_CONFIG/img/env_var.png (8.62 KB)

examples/TF_CONFIG/img/running.png (97.5 KB)

setup.py

Lines changed: 3 additions & 3 deletions
```diff
@@ -14,12 +14,12 @@
             'tensorhive = tensorhive.__main__:main'
         ],
     },
-    description='Lightweight computing resource management tool for executing distributed TensorFlow programs',
-    author='Pawel Rosciszewski, Michal Martyniak, Filip Schodowski, Tomasz Menet',
+    description='A user-friendly GPU management tool for distributed machine learning workloads',
+    author='Pawel Rosciszewski, Michal Martyniak, Filip Schodowski',
     author_email='pawel.rosciszewski@pg.edu.pl',
     url='https://github.com/roscisz/TensorHive',
     download_url='https://github.com/roscisz/TensorHive/archive/{}.tar.gz'.format(tensorhive.__version__),
-    keywords='distributed machine learning tensorflow resource management',
+    keywords='reservation monitoring machine learning distributed tensorflow pytorch',
     install_requires=[
         'parallel-ssh==1.9.1',
         'passlib==1.7.1',
```

tensorhive/__init__.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -1 +1 @@
-__version__ = '0.3.1'
+__version__ = '0.3.2'
```

tensorhive/api/api_specification.yml

Lines changed: 38 additions & 2 deletions
```diff
@@ -571,6 +571,29 @@ paths:
           description: {{RESPONSES['general']['auth_error']}}
       security:
         - Bearer: []
+  /nodes/{hostname}/cpu/metrics:
+    get:
+      tags:
+        - nodes
+      summary: Get node's CPU metric data
+      description: Puts null if some data is unavailable
+      operationId: tensorhive.controllers.nodes.cpu_controller.get_metrics
+      parameters:
+        - $ref: '#/parameters/hostnameParam'
+        - $ref: '#/parameters/cpuMetricTypeQuery'
+      responses:
+        200:
+          description: {{RESPONSES['general']['ok']}}
+          schema:
+            $ref: '#/definitions/CPUMetrics'
+        401:
+          description: {{RESPONSES['general']['unauthorized']}}
+        404:
+          description: {{RESPONSES['nodes']['hostname']['not_found']}}
+        422:
+          description: {{RESPONSES['general']['auth_error']}}
+      security:
+        - Bearer: []
   /nodes/{hostname}/gpu/processes:
     get:
       tags:
@@ -1400,7 +1423,7 @@ definitions:
     type: object
     example:
       <GPU_UUID (All metrics case)>:
-        gpu_util:
+        utilization:
           unit: '%'
           value: 95
         power:
@@ -1409,6 +1432,8 @@ definitions:
       <GPU_UUID (Specific metric case)>:
         unit: '%'
         value: 95
+  CPUMetrics:
+    type: object
   GPUProcesses:
     type: object
     example:
@@ -1437,10 +1462,21 @@ parameters:
       - mem_free
       - mem_used
       - mem_total
-      - gpu_util
+      - utilization
       - mem_util
       - temp
       - power
+  cpuMetricTypeQuery:
+    description: Metric type. If not present, queries for all metrics
+    in: query
+    name: metric_type
+    required: false
+    type: string
+    enum:
+      - mem_free
+      - mem_used
+      - mem_total
+      - utilization
 securityDefinitions:
   Bearer:
     type: apiKey
```
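The new endpoint can be queried like the existing per-node metric endpoints. A hedged client sketch: only the path shape and the optional `metric_type` query parameter come from the spec above; the base URL is a placeholder, and a real request would also carry the Bearer token required by the `security` section:

```python
from urllib.parse import urlencode

def cpu_metrics_url(base_url, hostname, metric_type=None):
    """Build the URL for GET /nodes/{hostname}/cpu/metrics.

    Omitting metric_type queries all CPU metrics, per the spec's
    cpuMetricTypeQuery description.
    """
    url = f"{base_url.rstrip('/')}/nodes/{hostname}/cpu/metrics"
    if metric_type is not None:
        url += "?" + urlencode({"metric_type": metric_type})
    return url

# Placeholder deployment URL and hostname for illustration.
print(cpu_metrics_url("https://tensorhive.example/api", "gl01", "utilization"))
```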

0 commit comments