Dolma dedup import or reimplementation #57
chris-ha458 started this conversation in Ideas
Replies: 1 comment
-
Good idea. I have found it best to write Python wrappers for multithreaded Rust programs; for example, Google's suffix array implementation is wrapped in Python in this repo. It is probably also easier to create a dedicated script for this purpose than to modify the current implementation.
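To illustrate the wrapper idea, here is a minimal sketch of calling an external multithreaded binary from Python via `subprocess`. The binary name `dedup-bin` and its `--input`, `--output`, and `--threads` flags are hypothetical placeholders, not the interface of Dolma or of any tool in this repo; a real wrapper would need to match the actual CLI.

```python
import subprocess
from typing import List, Optional


def build_dedup_command(binary: str, input_path: str,
                        output_path: str, threads: int) -> List[str]:
    """Assemble the argument list for a HYPOTHETICAL dedup binary.

    The flag names below are assumptions made for illustration only.
    """
    return [
        binary,
        "--input", input_path,
        "--output", output_path,
        "--threads", str(threads),
    ]


def run_tool(cmd: List[str], timeout: Optional[float] = None) -> str:
    """Run an external command and return its stdout.

    check=True raises CalledProcessError on a non-zero exit code,
    so failures in the wrapped program surface as Python exceptions.
    """
    result = subprocess.run(
        cmd, capture_output=True, text=True, check=True, timeout=timeout
    )
    return result.stdout
```

Keeping the command construction separate from the process call makes the wrapper easy to test without the binary being installed, and the parallelism stays entirely inside the wrapped program.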
-
The dolma repo, a data curation toolkit built for the Dolma dataset, includes a Rust-based, thread-safe Bloom filter dedup implementation. (It is unclear whether it is multithreaded yet.)
Currently, this repo's Bloom filter implementation is Python-based and single-threaded (in both the embedding and processing stages).
Since both codebases are licensed under Apache 2.0, I do not foresee any licensing issue in either reimplementing it or importing it for use here.
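For readers unfamiliar with the technique, here is a minimal single-threaded sketch of Bloom filter deduplication in Python. It is neither Dolma's implementation nor this repo's; the class and function names, the SHA-256 double-hashing scheme, and the default sizes are all assumptions chosen for illustration.

```python
import hashlib
import math
from typing import Iterable, Iterator, List


class BloomFilter:
    """Illustrative Bloom filter sized with the standard formulas."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hash functions
        ln2 = math.log(2)
        self.num_bits = max(
            8, math.ceil(-expected_items * math.log(false_positive_rate) / ln2 ** 2)
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * ln2))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Double hashing: derive k bit positions from two 64-bit halves
        # of a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add_if_new(self, item: str) -> bool:
        """Insert item; return True if it was definitely not seen before."""
        is_new = False
        for pos in self._positions(item):
            byte_index, bit_index = divmod(pos, 8)
            mask = 1 << bit_index
            if not self.bits[byte_index] & mask:
                is_new = True
                self.bits[byte_index] |= mask
        return is_new


def dedup(items: Iterable[str], expected_items: int = 1_000_000,
          false_positive_rate: float = 0.01) -> List[str]:
    """Keep the first occurrence of each item, in order.

    A Bloom filter never misses a true duplicate, but it may drop a small
    fraction of unique items as false positives.
    """
    bloom = BloomFilter(expected_items, false_positive_rate)
    return [item for item in items if bloom.add_if_new(item)]
```

The filter holds only a fixed-size bit array rather than the items themselves, which is what makes the approach attractive for large corpora. The loop above is sequential, which is the bottleneck a thread-safe Rust implementation would aim to remove.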