Replies: 1 comment
-
Hey, @int8! With #2783, we added support for a custom tokenizer in `PreProcessor`. So you can do the following:

```python
import pickle

from haystack.nodes.preprocessor import PreProcessor
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# define your custom tokenizer
punkt_param = PunktParameters()
abbreviation = ['my', 'custom', 'abbreviations']  # lowercase, without the trailing dot
punkt_param.abbrev_types = set(abbreviation)
tokenizer = PunktSentenceTokenizer(punkt_param)

# save the tokenizer; it's fundamental that the tokenizer filename
# corresponds to the language code
with open('./my_tokenizer/pl.pickle', 'wb') as fo:
    pickle.dump(tokenizer, fo)

# initialize the PreProcessor and specify tokenizer_model_folder
processor = PreProcessor(
    ...
    language="pl",  # the language must match the tokenizer file name
    tokenizer_model_folder="./my_tokenizer/",
)
```

Hope it works for you...
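As a quick sanity check, you can call the tokenizer directly before pickling it (a minimal sketch; the sample text and abbreviation are my assumptions, not from the thread, and it reuses the `tokenizer` built above):

```python
# hypothetical check: 'custom' is in abbrev_types, so the dot after it
# should not be treated as a sentence boundary
sample = "We rely on custom. abbreviations here. This is a second sentence."
print(tokenizer.tokenize(sample))
# only the other full stops should split the text into sentences
```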
-
Hello there,
I have the following case: I use `PreProcessor` to chunk big text documents into smaller pieces. It's as simple as requesting a split size of fewer than 100 words while keeping only full sentences in my chunks (a sketch of what I have in mind is below).
My input text naturally contains lots of dots (a result of frequent abbreviations). What I think I need to do is tell Haystack to pass extra `abbreviation` parameters to the `sent_tokenizer` that is used underneath. I'm not sure that's possible; what options do I have?