-
Not really. Unless I'm mistaken, the only way to do this would be to manually modify the code so that it processes only one crawl, but that would still be very compute-intensive (@adarob can confirm). FWIW, there is a rumor that AI2 will release mC4 like they did for C4. Maybe @dodgejesse can comment.
-
Hi! I'm not sure if this is the right place to ask this question. Since the mT5 repo doesn't have a Discussions tab, I'm asking it here.
Is there any way to download a small portion of mC4? More than 20T seems like a really large amount of space. It would be nice if we could download the dataset partially (language-wise, or even language-wise random samples).
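To make the "language-wise random samples" idea concrete, here is a rough sketch of what I have in mind. It assumes the records can already be streamed one at a time as dicts with a hypothetical `lang` field, so it does not avoid the cost of generating mC4 in the first place. The function name `sample_per_language` and the record layout are my own illustration, not part of any mC4 or mT5 API.

```python
import random
from collections import defaultdict


def sample_per_language(records, k, seed=0):
    """Keep a uniform random sample of up to k records per language.

    Uses reservoir sampling, so the stream is read once and only
    k records per language are held in memory at any time.
    `records` is any iterable of dicts with a (hypothetical) "lang" key.
    """
    rng = random.Random(seed)
    reservoirs = defaultdict(list)  # language -> sampled records
    seen = defaultdict(int)         # language -> records seen so far

    for record in records:
        lang = record["lang"]
        seen[lang] += 1
        if len(reservoirs[lang]) < k:
            # Fill the reservoir with the first k records of this language.
            reservoirs[lang].append(record)
        else:
            # Replace an existing entry with probability k / seen[lang].
            j = rng.randrange(seen[lang])
            if j < k:
                reservoirs[lang][j] = record

    return dict(reservoirs)


if __name__ == "__main__":
    # Toy stream standing in for real data.
    stream = [{"lang": "en", "text": f"doc {i}"} for i in range(1000)]
    stream += [{"lang": "sw", "text": f"hati {i}"} for i in range(3)]
    sample = sample_per_language(stream, k=5)
    for lang, docs in sample.items():
        print(lang, len(docs))
```

Something like this would only need a single pass and a small, fixed amount of memory per language, which is why being able to stream or download per-language shards would already be enough for my use case.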