We started developing a way to switch the parallel backend of
I've been using Hugging Face's datasets library for some computer vision tasks, and `dataset.map(..., num_proc > 1)` is really handy for running things in parallel. But most of my processing happens in C libraries that don't actually hold the GIL (OpenCV, NumPy, Pillow, etc.), and spinning up new subprocesses seems a bit overkill, especially because `dataset.map` uses the fork method, which may carry over some of the parent process's lifecycle callbacks, for example.

I wonder if it would be interesting for `dataset.map()` to accept an optional `pool: concurrent.futures.Executor` parameter, which would be either a `ProcessPoolExecutor` or a `ThreadPoolExecutor`, so the caller could choose which type of parallelization better suits their use case.

There is a gotcha with this proposal: `ProcessPoolExecutor` seems to use `spawn` instead of the `fork` approach currently used by `dataset.map`, so it wouldn't work with inner functions and lambdas. Because of this, we probably shouldn't replace the current `num_proc` way of doing it, but I wonder if a new `pool` parameter could be useful for more people.

If this proposal seems reasonable, I can prepare a PR to further discuss the implementation.
EDIT: related bug report: #5976
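
To make the idea a bit more concrete, here is a minimal sketch of how an optional executor argument could behave. It does not touch the datasets library at all: `map_with_pool` and `brightness` are hypothetical names made up for illustration, and the plain Python lists stand in for real image data.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Any, Callable, Iterable, List, Optional


def map_with_pool(
    fn: Callable[[Any], Any],
    examples: Iterable[Any],
    pool: Optional[Executor] = None,
) -> List[Any]:
    """Hypothetical stand-in for the proposed `pool` parameter (not the datasets API)."""
    if pool is None:
        # No executor given: run sequentially in the calling thread.
        return [fn(example) for example in examples]
    # Executor.map returns results in input order for thread and process pools alike.
    return list(pool.map(fn, examples))


def brightness(pixels: List[int]) -> float:
    # Placeholder for work that would release the GIL, e.g. an OpenCV or NumPy call.
    return sum(pixels) / len(pixels)


images = [[0, 128, 255], [10, 20, 30], [255, 255, 255]]

with ThreadPoolExecutor(max_workers=2) as pool:
    threaded = map_with_pool(brightness, images, pool=pool)
    # Threads share the parent's memory, so lambdas and inner functions work too.
    doubled = map_with_pool(lambda px: [2 * p for p in px], images, pool=pool)

sequential = map_with_pool(brightness, images)
```

The same helper would accept a `ProcessPoolExecutor`, but then `fn` has to be picklable by the standard `pickle` module, so a module-level function like `brightness` would work while the lambda call would fail. That is the gotcha described above, and the reason this would sit alongside `num_proc` rather than replace it.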