Replies: 1 comment
-
Hi @sakib-NSL
If you can share a simple code sample, I can look deeper into this and see whether we need to improve our WordSegmenterModel() and its models.
-
Hello everyone,
I am a beginner in Spark NLP. I trained a model on a Japanese dataset in Spark NLP using BERT embeddings. I tokenized the data with the spaCy tokenizer, converted it to BIO format, and then used it for training. The results on the test data are satisfactory, but when I run the same test data through the prediction pipeline, the performance decreases. I have tried both Tokenizer() and WordSegmenterModel() in the prediction pipeline, but neither worked. Can I use a different, customized tokenizer in the pipeline?
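For context, the BIO conversion described above can be sketched in plain Python. This is a generic illustration, not the exact preprocessing used in the question; the token and entity formats (character-offset tuples) are assumptions:

```python
# Minimal BIO-tagging sketch. Assumed formats:
#   tokens:   list of (text, start, end) with character offsets
#   entities: list of (start, end, label) character spans
def to_bio(tokens, entities):
    tags = []
    for text, t_start, t_end in tokens:
        tag = "O"
        for e_start, e_end, label in entities:
            if t_start >= e_start and t_end <= e_end:
                # First token inside the span gets B-, later tokens get I-.
                tag = ("B-" if t_start == e_start else "I-") + label
                break
        tags.append(tag)
    return tags

tokens = [("東京", 0, 2), ("に", 2, 3), ("住む", 3, 5)]
entities = [(0, 2, "LOC")]
print(to_bio(tokens, entities))  # → ['B-LOC', 'O', 'O']
```

Note that the tags are tied to one specific tokenization: if a different tokenizer splits 東京 into two tokens, the same entity span yields `['B-LOC', 'I-LOC']` instead, which is exactly why the tokenizer must match between training and prediction.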
Here is the training pipeline:
Here is the prediction pipeline:
Questions:
Thank you in advance.
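The suspected cause (a mismatch between the training-time and prediction-time tokenizers) can be checked directly by running both tokenizers over the same sentences. The tokenizers below are placeholders standing in for spaCy and the pipeline's Tokenizer()/WordSegmenterModel(); the real ones would be plugged in:

```python
# Compare two tokenizations of the same sentences; any difference means the
# NER model sees token boundaries at prediction time that it was never
# trained on, which typically degrades token-level scores.
def mismatch_rate(sentences, tokenize_train, tokenize_predict):
    differing = sum(
        1 for s in sentences if tokenize_train(s) != tokenize_predict(s)
    )
    return differing / len(sentences)

# Placeholder tokenizers (assumptions, for illustration only):
train_tok = lambda s: s.split()                   # word-level split
predict_tok = lambda s: list(s.replace(" ", ""))  # character-level split

sentences = ["東京 に 住む", "大阪 へ 行く"]
print(mismatch_rate(sentences, train_tok, predict_tok))  # → 1.0
```

A rate near 0 rules tokenization out; a high rate suggests the prediction pipeline needs the same tokenizer that produced the BIO training data.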