Keep multi-words in sparknlp.annotator.Tokenizer together #9021
a-kliuieva asked this question in Q&A
I want to extract keywords using `sparknlp.annotator.YakeKeywordExtraction`, but first I need to tokenize my text. My Spark DataFrame looks something like this:
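A minimal sketch with hypothetical data of the same shape (a single `text` column; not the original values):

```python
import sparknlp

# Start a Spark session with Spark NLP on the classpath
spark = sparknlp.start()

# Hypothetical stand-in for the original single-column text DataFrame
df = spark.createDataFrame(
    [("Cosmic rays from beyond the milky way reach the solar system.",)],
    ["text"],
)
```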
After applying `sparknlp.annotator.Tokenizer`, I need all multi-word expressions (like 'solar system', 'cosmic rays', 'milky way', etc.) to stay together as single tokens. If I use the following pipeline, my multi-word expressions are broken into separate tokens:
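A sketch of such a pipeline, reconstructed as the standard assembler-plus-tokenizer setup (not necessarily the exact original code):

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pipeline = Pipeline(stages=[document_assembler, tokenizer])
result = pipeline.fit(df).transform(df)

# 'solar system' is split into the two tokens 'solar' and 'system'
result.selectExpr("explode(token.result) AS token").show(truncate=False)
```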
If I add the `.setExceptions([" "])` parameter to the `Tokenizer()`, then I get my entire string as one token (which is also wrong).

I've tried a different approach: I modified my DataFrame to have each phrase as a new row:
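Presumably the reshaped DataFrame looked roughly like this, one phrase per row (hypothetical values):

```python
# Hypothetical per-phrase layout: one phrase per row
phrases_df = spark.createDataFrame(
    [("solar system",), ("cosmic rays",), ("milky way",)],
    ["text"],
)
```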
and then applied the following pipeline:
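Given the mention of "grouping separate tokens" below, the second pipeline may have looked something like the following sketch: the phrases are declared as tokenizer exceptions so each single-phrase row yields one token, and the per-row tokens are then grouped back into a single array. This is a guess at the setup, not the original code:

```python
from pyspark.sql import functions as F
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer

# Guessed reconstruction: declaring the phrases as exceptions keeps
# each one intact, so every single-phrase row yields exactly one token
assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
phrase_tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setExceptions(["solar system", "cosmic rays", "milky way"])

tokenized = Pipeline(stages=[assembler, phrase_tokenizer]) \
    .fit(phrases_df).transform(phrases_df)

# Merge the per-row token annotations back into one token array
grouped = tokenized.agg(F.flatten(F.collect_list("token")).alias("token"))
```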
In this case, the multi-word expressions are not split during tokenization and stay together.
However, when I apply `YakeKeywordExtraction`, I get the following error:

`IllegalArgumentException: requirement failed: Wrong or missing inputCols annotators in YakeKeywordExtraction_02d8c88211de.`
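For context, a `YakeKeywordExtraction` stage is normally wired to a token column like this; the parameters here are illustrative, not from the original post:

```python
from sparknlp.annotator import YakeKeywordExtraction

yake = YakeKeywordExtraction() \
    .setInputCols(["token"]) \
    .setOutputCol("keywords") \
    .setMinNGrams(1) \
    .setMaxNGrams(3)

# Applying it to the grouped DataFrame raises the exception quoted above
keywords = yake.transform(grouped)
```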
I compared the structure of the output after the usual tokenization (first approach) with the one I get by grouping separate tokens: they are completely identical except for the `begin` and `end` values, so I don't understand what is wrong.

So, if there is a way to keep multi-word expressions together during tokenization, I'll be very grateful for recommendations!
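One diagnostic worth noting when comparing the two outputs: Spark NLP validates input columns through the annotator-type metadata attached to the column, not only the struct schema, so a check like the following (using the hypothetical names above) can expose a difference that `printSchema` does not show:

```python
# The struct schemas can be identical while the Spark NLP column
# metadata differs; aggregations like collect_list typically drop it.
print(result.schema["token"].metadata)   # expected: annotator-type metadata
print(grouped.schema["token"].metadata)  # may be empty after grouping
```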