Replies: 1 comment
I addressed your comment in the linked issue, so let's continue our discussion there.
Hi. I want to run a simple regex on a dataset, but I am running into the issue that the result is not being cached. There was a similar issue (and a pull request to fix it) for the nlp library (the predecessor of the datasets library). Does anyone have pointers on how to do this properly?
More concretely, I want to run this expression with re.findall() on a dataset:
pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
The expression itself comes from GPT2Tokenizer, but I couldn't find any specific function that makes compiled regex expressions usable under datasets.map() with caching.
Thanks!
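For anyone landing here, one possible workaround, sketched under the assumption that the cache miss comes from the compiled pattern object being captured by the function handed to map(): keep only the pattern string at module level and pass that string to findall(). The function name split_tokens and the "text" and "pieces" column names are made up for illustration. Note that the \p{L} and \p{N} classes need the third-party regex package; the standard re module rejects them, so the fallback branch below uses a rough approximation.

```python
# Sketch only: assumes the compiled pattern object is what breaks caching.
try:
    import regex as re_mod  # third-party package; understands \p{L} and \p{N}
    PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
except ImportError:
    import re as re_mod  # stdlib fallback; it does not accept \p{...}
    # Rough approximation: [^\W\d_] for letters, \d for decimal digits only.
    PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+(?!\S)|\s+"""


def split_tokens(example):
    """Hypothetical map function; assumes the dataset has a "text" column."""
    # Only a plain string is referenced here, not a compiled pattern object.
    return {"pieces": re_mod.findall(PATTERN, example["text"])}


print(split_tokens({"text": "Hello world's 42!"}))
```

With the datasets library this would be passed as dataset.map(split_tokens). I have not verified that this restores caching on every datasets version, so treat it as something to try rather than a confirmed fix.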