Replies: 1 comment
-
Hey, @int8! With #2783, we added support for a custom tokenizer in `PreProcessor`. So you can do the following:

```python
import pickle

from haystack.nodes.preprocessor import PreProcessor
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# define your custom tokenizer
punkt_param = PunktParameters()
abbreviation = ['my', 'custom', 'abbreviations']  # lowercase, without the trailing dot
punkt_param.abbrev_types = set(abbreviation)
tokenizer = PunktSentenceTokenizer(punkt_param)

# save the tokenizer; it's fundamental that the tokenizer filename
# corresponds to the language code
with open('./my_tokenizer/pl.pickle', 'wb') as fo:
    pickle.dump(tokenizer, fo)

# initialize the PreProcessor and specify tokenizer_model_folder
processor = PreProcessor(
    ...
    language="pl",  # the language must match the tokenizer file name
    tokenizer_model_folder="./my_tokenizer/",
)
```

Hope it works for you...
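As a quick sanity check, you can call the tokenizer directly before pickling it (a minimal sketch; the sample text and abbreviation are my assumptions, not from the thread, and it reuses the `tokenizer` built above):

```python
# hypothetical check: 'custom' is in abbrev_types, so the dot after it
# should not be treated as a sentence boundary
sample = "We rely on custom. abbreviations here. This is a second sentence."
print(tokenizer.tokenize(sample))
# only the other full stops should split the text into sentences
```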
-
Hello there,
I have the following case: I use `PreProcessor` to chunk big text documents into smaller pieces. It's as simple as requesting a split size of fewer than 100 words while keeping only full sentences in my chunks (a sketch of what I have in mind is below).
My input text naturally contains lots of dots (a result of frequent abbreviations). What I think I need to do is tell Haystack to pass extra `abbreviation` parameters to the `sent_tokenizer` that is used underneath. I'm not sure that's possible; what options do I have?