Keep multi-words in sparknlp.annotator.Tokenizer together #9021
a-kliuieva asked this question in Q&A
I want to extract keywords using `sparknlp.annotator.YakeKeywordExtraction`, but first I need to tokenize my text. My Spark DataFrame looks something like this:
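A minimal sketch with hypothetical data of the same shape (a single `text` column; not the original values):

```python
import sparknlp

# Start a Spark session with Spark NLP on the classpath
spark = sparknlp.start()

# Hypothetical stand-in for the original single-column text DataFrame
df = spark.createDataFrame(
    [("Cosmic rays from beyond the milky way reach the solar system.",)],
    ["text"],
)
```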
After applying `sparknlp.annotator.Tokenizer`, I need all multi-word expressions (like 'solar system', 'cosmic rays', 'milky way', etc.) to stay together as single tokens. If I use the following pipeline, my multi-word expressions are broken into separate tokens:
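A sketch of such a pipeline, reconstructed as the standard assembler-plus-tokenizer setup (not necessarily the exact original code):

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pipeline = Pipeline(stages=[document_assembler, tokenizer])
result = pipeline.fit(df).transform(df)

# 'solar system' is split into the two tokens 'solar' and 'system'
result.selectExpr("explode(token.result) AS token").show(truncate=False)
```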
If I add the `.setExceptions([" "])` parameter to the `Tokenizer()`, then I get my entire string as one token (which is also wrong).

I've tried a different approach: I modified my DataFrame to have each phrase as a new row:
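Presumably the reshaped DataFrame looked roughly like this, one phrase per row (hypothetical values):

```python
# Hypothetical per-phrase layout: one phrase per row
phrases_df = spark.createDataFrame(
    [("solar system",), ("cosmic rays",), ("milky way",)],
    ["text"],
)
```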
and then applied the following pipeline:
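Given the mention of "grouping separate tokens" below, the second pipeline may have looked something like the following sketch: the phrases are declared as tokenizer exceptions so each single-phrase row yields one token, and the per-row tokens are then grouped back into a single array. This is a guess at the setup, not the original code:

```python
from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer

# Guessed reconstruction: declaring the phrases as exceptions keeps
# each one intact, so every single-phrase row yields exactly one token
assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
phrase_tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setExceptions(["solar system", "cosmic rays", "milky way"])

tokenized = Pipeline(stages=[assembler, phrase_tokenizer]) \
    .fit(phrases_df).transform(phrases_df)

# Merge the per-row token annotations back into one token array
grouped = tokenized.agg(F.flatten(F.collect_list("token")).alias("token"))
```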
In this case, the multi-word expressions are not split during tokenization and stay together.
However, when I apply `YakeKeywordExtraction`, I get the following error:

`IllegalArgumentException: requirement failed: Wrong or missing inputCols annotators in YakeKeywordExtraction_02d8c88211de.`
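For context, a `YakeKeywordExtraction` stage is normally wired to a token column like this; the parameters here are illustrative, not from the original post:

```python
from sparknlp.annotator import YakeKeywordExtraction

yake = YakeKeywordExtraction() \
    .setInputCols(["token"]) \
    .setOutputCol("keywords") \
    .setMinNGrams(1) \
    .setMaxNGrams(3)

# Applying it to the grouped DataFrame raises the exception quoted above
keywords = yake.transform(grouped)
```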
I compared the structure of the output after the usual tokenization (first approach) with the one I get by grouping separate tokens: they are completely identical except for the `begin` and `end` values, so I don't understand what is wrong.

So, if there is a way to keep multi-word expressions together during tokenization, I'll be very grateful for recommendations!
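One diagnostic worth noting when comparing the two outputs: Spark NLP validates input columns through the annotator-type metadata attached to the column, not only the struct schema, so a check like the following (using the hypothetical names above) can expose a difference that `printSchema` does not show:

```python
# The struct schemas can be identical while the Spark NLP column
# metadata differs; aggregations like collect_list typically drop it.
print(result.schema["token"].metadata)   # expected: annotator-type metadata
print(grouped.schema["token"].metadata)  # may be empty after grouping
```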