Streaming and creating refactored dataset with shards using Generator #7235
WillPowellUk asked this question in Q&A
I am trying to stream a dataset (i.e. to disk, not to memory), refactor it using a generator and `map`, and then push it back to the Hub. The following methodology achieves this, but it is slow because of this warning:
Setting num_proc from 16 back to 1 for the train split to disable multiprocessing as it only contains one shard.
N.B. there is a related GitHub issue here, but I have not been able to build a working solution with `gen_kwargs`.
Here is my minimal reproducible code:
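Roughly, the script does the following (the dataset names, column name, and the refactoring step are simplified placeholders):

```python
from datasets import Dataset, load_dataset

def refactor_generator():
    # Stream the source dataset so it is never fully loaded into memory
    streamed = load_dataset("my-org/source-dataset", split="train", streaming=True)
    for example in streamed:
        # Placeholder refactoring of each example
        yield {"text": example["text"].lower()}

# Materialise the refactored examples as a new on-disk dataset
refactored = Dataset.from_generator(refactor_generator, num_proc=16)

# Additional per-example processing, then push the result back to the Hub
def add_length(example):
    return {"num_chars": len(example["text"])}

refactored = refactored.map(add_length, num_proc=16)
refactored.push_to_hub("my-org/refactored-dataset")
```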
In the `from_generator` examples, the documentation says it should be implemented as follows:
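The example is along these lines (paraphrased; the shard file names are illustrative):

```python
from datasets import Dataset

def gen(shards):
    # Each element of `shards` becomes one shard of the resulting dataset
    for shard in shards:
        with open(shard) as f:
            for line in f:
                yield {"line": line}

shards = [f"data{i}.txt" for i in range(32)]
# With a list in gen_kwargs, num_proc workers can each handle a subset of shards
ds = Dataset.from_generator(gen, gen_kwargs={"shards": shards}, num_proc=4)
```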
Therefore I experimented with a new script that uses `gen_kwargs` to take in a series of shards from another dataset. This removes the warning (so I assume `num_proc` really is set to 16), however it is even slower than using one CPU.
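For reference, the sharded attempt is along these lines (again with placeholder names):

```python
from datasets import Dataset, load_dataset

NUM_SHARDS = 16

def gen(shard_indices, num_shards):
    # Each worker re-opens the streamed dataset and keeps only the examples
    # that fall into its assigned shards
    streamed = load_dataset("my-org/source-dataset", split="train", streaming=True)
    shard_set = set(shard_indices)
    for i, example in enumerate(streamed):
        if i % num_shards in shard_set:
            yield {"text": example["text"].lower()}  # placeholder refactoring

# The list in gen_kwargs is split across the num_proc workers, so the
# "only contains one shard" warning goes away
refactored = Dataset.from_generator(
    gen,
    gen_kwargs={"shard_indices": list(range(NUM_SHARDS)), "num_shards": NUM_SHARDS},
    num_proc=16,
)
refactored.push_to_hub("my-org/refactored-dataset")
```

Note that in this sketch every worker still iterates the entire stream and discards the examples that belong to other shards, so the download and decoding work is effectively multiplied by the number of workers.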