Improve typos analyzer quality #758

@irinakhismatullina

This is mostly about improving the quality of the current TyposCorrector model:

  • Improve the data quality: "Improve TokenParser in cases containing abbreviations" (ml#403) and "Integrate the neural token splitter" (ml#402). This is the most important part, since several stages of the pipeline depend strongly on it.
  • As a consequence, improve the vocabulary. It's mostly fine right now, but good splitting should make it much better.
  • Work out the best fasttext configuration. I'm already fine with the one I have: it's light and gives some boost to the quality, so this is no longer a high priority.
  • Work on the model training configuration. I haven't touched it yet, and I'm not sure there is one much better than the current default: most of the mistakes I see now can be explained by bad splits, the vocabulary, or a lack of training data (the last will be addressed by training on the bigger dataset, which is easy).
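
To illustrate why splitting quality feeds into the vocabulary, here is a minimal sketch of a heuristic identifier splitter that keeps abbreviations together. It is not the project's TokenParser or the neural splitter from ml#402; the names `split_identifier` and `build_vocabulary` and the regular expression are assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical heuristic, not the project's TokenParser:
#   1. a run of capitals not followed by a lowercase letter (an abbreviation),
#   2. an optionally capitalized lowercase word,
#   3. a run of digits.
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(identifier):
    """Split an identifier into lowercase sub-tokens."""
    return [part.lower() for part in _PART.findall(identifier)]


def build_vocabulary(identifiers):
    """Count sub-token frequencies over a collection of identifiers."""
    counts = Counter()
    for identifier in identifiers:
        counts.update(split_identifier(identifier))
    return counts


if __name__ == "__main__":
    print(split_identifier("parseHTTPResponse"))  # ['parse', 'http', 'response']
    print(build_vocabulary(["parseHTTPResponse", "HTTPServer", "getResponse"]))
```

With a splitter that breaks abbreviations apart (for example into single letters), the vocabulary would fill up with fragments instead of `http`, which is the kind of downstream effect the first two items are about.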