Description
This looks really nice. Just a general comment on Dask at large scale. The issue we came across is that Dask will pack as many tasks as possible onto a single worker. That means if you set your cluster parameters to request 5 GB of memory because that is how much one job requires, the limit is violated almost immediately: the 5 GB is given to one worker, which then runs as many jobs on it as it can.
The way I am trying to avoid this is to assign worker resources that limit how many tasks a worker can run at once, e.g. 1 GPU per model training means only one model can run on the worker at a time, or, for espresso, 1 "espresso" resource per simulation. But in the case you have here, if your task were a large matrix computation and you heavily parallelised over nodes, I think you would hit a dead worker pretty fast.
Originally posted by @SamTov in #21 (review)
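
A minimal sketch of the worker-resource approach described above, not the actual setup from #21. It assumes a worker started with a custom `espresso` resource (via `dask-worker ... --resources "espresso=1"`); the scheduler address, `run_simulation`, and `configs` are placeholders for illustration. Because each task claims the worker's single `espresso` unit, the scheduler will not stack a second task on that worker while one is running.

```python
# Start the worker with a custom resource and a memory cap, e.g.:
#   dask-worker tcp://scheduler:8786 --resources "espresso=1" --memory-limit 5GB
from dask.distributed import Client


def run_simulation(config):
    """Placeholder for the memory-heavy job (e.g. an espresso run)."""
    ...


client = Client("tcp://scheduler:8786")  # hypothetical scheduler address

configs = [{"run": i} for i in range(10)]  # hypothetical job configurations

# Each task reserves the worker's single "espresso" unit, so only one job
# runs per worker at a time instead of Dask packing several onto it.
futures = [
    client.submit(run_simulation, cfg, resources={"espresso": 1})
    for cfg in configs
]
results = client.gather(futures)
```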