Multiprocessing not improving i/o bound performance - how to train effectively with larger datasets? #20458
Unanswered
openSourcerer9000 asked this question in Q&A
So I'm trying to train a model on some n-dimensional data. The data is stored locally as a .zarr and read into Python as an xarray Dataset object (lazily loaded, reading from disk during __getitem__). The whole dataset is larger than memory. This is my code:
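A minimal sketch of the kind of loader in question, assuming a keras.utils.PyDataset subclass that slices the lazily opened zarr store in __getitem__; the class name, variable and dimension names, and batch size here are illustrative rather than the exact code:

```python
# Illustrative sketch only (not the exact code from this post): a PyDataset that
# slices a lazily opened, zarr-backed xarray Dataset in __getitem__, so disk I/O
# happens per batch. The "x"/"y" variables, the "sample" dimension, and the
# batch size are assumptions.
import math
import numpy as np
import xarray as xr
import keras


class ZarrPyDataset(keras.utils.PyDataset):
    def __init__(self, path, batch_size=32, **kwargs):
        # workers / use_multiprocessing / max_queue_size pass through to PyDataset
        super().__init__(**kwargs)
        self.ds = xr.open_zarr(path)  # lazy: nothing is read from disk yet
        self.batch_size = batch_size

    def __len__(self):
        return math.ceil(self.ds.sizes["sample"] / self.batch_size)

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        batch = self.ds.isel(sample=sl)  # the actual read from disk happens here
        x = batch["x"].values.astype(np.float32)
        y = batch["y"].values.astype(np.float32)
        return x, y


train = ZarrPyDataset("train.zarr", batch_size=32, workers=4, use_multiprocessing=True)
# model.fit(train, epochs=10)  # "model" would be a compiled keras.Model
```

The multiprocessing flag mentioned below presumably maps to use_multiprocessing=True (together with workers) in the PyDataset constructor, which is where Keras 3 takes those settings.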
The docs recommend PyDataset for multiprocessing, and I tested that xarray Datasets can be pickled. If I have the multiprocessing flag on, my script just freezes at model.fit for several minutes before starting. Whether or not multiprocessing is on, it runs excruciatingly slowly (the same time per epoch either way), with my GPU just sitting at 4%; I'm not entirely convinced it's actually doing anything. With the flag off, the GPU sits at 0% with occasional blips of 4%; with it on, the GPU holds a continuous 4%, but the script hangs for several minutes before starting and again upon completion, making it much slower overall than not using multiprocessing.

Has anyone successfully used Keras with n-dimensional data in xarray, or with larger-than-memory datasets at all, for that matter? Xarray supports state-of-the-art parallelization with dask.distributed, so there seems to be a giant wrench in the machine somewhere in whatever Keras is doing for parallelization.
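For comparison, the xarray/dask side on its own handles parallel reads of the same store cleanly; a rough sketch of that pattern (the store path, chunk sizes, and dimension name are assumptions):

```python
# Rough sketch of the xarray + dask.distributed read path referred to above;
# the store path, chunking, and the "sample" dimension name are assumptions.
import xarray as xr
from dask.distributed import Client

client = Client()  # local cluster of worker processes; becomes dask's default scheduler
ds = xr.open_zarr("data.zarr", chunks={"sample": 32})  # lazy, dask-backed, larger than memory
batch = ds.isel(sample=slice(0, 32)).load()  # only this slice is read, in parallel across workers
```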
Replies: 1 comment

You could create your own custom