Open
Description
Mostly about improving the quality of the current TyposCorrector model:
- Improve the data quality: improve TokenParser in cases containing abbreviations (ml#403) and integrate the neural token splitter (ml#402). This is the most important part, since several stages of the pipeline depend strongly on it.
- -> Improve the vocabulary. Right now it's mostly fine, but it should get much better once the splitting is good.
- Work out the best fastText configuration. I'm fine with the one I have now: it's light and gives some boost to the quality, so this isn't a high priority.
- Work on the model training configuration. I haven't touched it yet, and I'm not sure there is a much better one than the current default: most of the mistakes I see now can be explained by bad splits, the vocabulary, or a lack of training data. The last one will be fixed by training on the bigger dataset, which is easy.
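
To illustrate why splitting matters so much for the first two points, here is a minimal sketch of a purely heuristic identifier splitter. It is not the project's TokenParser: the regex and the `split_identifier` name are assumptions made up for this example. It handles regular snake_case and camelCase, but breaks on some identifiers that contain abbreviations or have no case boundaries at all, and every bad split like that ends up as a bad token in the vocabulary.

```python
import re
from typing import List

# Hypothetical heuristic, for illustration only (not the real TokenParser).
# Alternatives, in order: an uppercase run followed by a capitalized word,
# an optionally capitalized lowercase word, a remaining uppercase run, digits.
_PART_RE = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def split_identifier(identifier: str) -> List[str]:
    """Split an identifier on underscores, case changes and digit runs,
    returning the lowercased parts."""
    return [m.group(0).lower() for m in _PART_RE.finditer(identifier)]


if __name__ == "__main__":
    # Regular cases work as expected.
    print(split_identifier("get_user_id"))        # ['get', 'user', 'id']
    print(split_identifier("parseHTTPResponse"))  # ['parse', 'http', 'response']
    # An abbreviation followed by a lowercase word is split in the wrong place.
    print(split_identifier("XMLparser"))          # ['xm', 'lparser']
    # Without any case boundary, nothing is split at all.
    print(split_identifier("getuserid"))          # ['getuserid']
```

The last two cases are the kind of input where a rule-based approach has nothing to go on, which is where a learned splitter could plausibly help.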