[vocabs] Extend the list of predefined vocabularies #1883
Comments
Already done:
hey @felixdittrich92, I am working on this.
@sarjil77 Thanks sounds great 👍
hello @felixdittrich92,
Thanks 👍 It's assigned to you
full utf-8 charlist: https://www.fileformat.info/info/charset/UTF-8/list.htm
Hello, please tell me, did you create a Russian dataset for training the model based on this dictionary? Or maybe there is already a text recognition model for the Russian language? If not, can you advise which tool is best to create a dataset?
Hi @SergShulga 👋, Not yet, but I am currently working on a synthetic multilingual dataset and have already collected some resources (wordlists / fonts / etc.). From my experience, a modified version of SynthTiger works best (nothing official yet - only my own dirty modified version 😅): https://github.com/felixdittrich92/synthtiger/tree/doctr-modified (branch: doctr-modified)
🚀 The feature
If we want to train / provide multilingual recognition models, we need to extend our predefined vocabularies.
reference PR: https://github.com/mindee/doctr/pull/1355/files
Here’s your updated checklist, extended with the missing languages from the list you provided (excluding the ones you've already marked as done):
Absolutely! Here's your list split by writing script. Some languages use multiple scripts (like Serbian or Uzbek), and I’ve grouped them accordingly.
Cyrillic
Latin
Arabic Script
Devanagari
Other Brahmic Scripts
Javanese Script
Sudanese Script
Georgian Script
Armenian Script
Hebrew Script
Ethiopic Script
Thaana / Burmese / Other Unique Scripts
East Asian Scripts
For example, https://github.com/eymenefealtun/all-words-in-all-languages could be used to extract language-specific charsets.
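A minimal sketch of how a charset could be pulled out of such a word list, assuming the list is available as a plain UTF-8 text file with one word per line. The helper names `extract_charset` and `charset_from_wordlist` are hypothetical and not part of doctr; the actual file layout of the linked repository may differ.

```python
import unicodedata
from collections.abc import Iterable


def extract_charset(words: Iterable[str]) -> str:
    """Return the unique non-whitespace characters of `words`, sorted by code point."""
    chars = set()
    for word in words:
        # NFC normalization merges base letter + combining mark into one code point
        # where a precomposed form exists, so "e" + U+0301 is counted as "é".
        normalized = unicodedata.normalize("NFC", word)
        chars.update(ch for ch in normalized if not ch.isspace())
    return "".join(sorted(chars))


def charset_from_wordlist(path: str, encoding: str = "utf-8") -> str:
    """Hypothetical helper: read one word per line and extract its charset."""
    with open(path, encoding=encoding) as f:
        return extract_charset(line.strip() for line in f)


if __name__ == "__main__":
    # Tiny in-memory example instead of a real word list file.
    print(extract_charset(["привет", "мир"]))
```

The resulting string would still need a manual review (digits, punctuation, stray foreign characters in the word list) before it is added as a vocab entry.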
The current multilingual vocabs entry can be extended with the newly created language entries to provide a deduplicated list that covers multilingual characters as completely as possible.
latin_extended (German, Spanish, Czech, and so on), cyrillic and hebrew should be really low-hanging fruit to include for training.
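The deduplication step could look roughly like the sketch below. It assumes each vocab is a plain string of characters; `merge_vocabs` and the sample entries are hypothetical and deliberately tiny, not the real doctr vocab definitions.

```python
def merge_vocabs(*vocabs: str) -> str:
    """Concatenate vocab strings, keeping only the first occurrence of each character."""
    # dict.fromkeys preserves insertion order, so the merged vocab keeps the
    # order of the inputs while dropping characters that were already seen.
    return "".join(dict.fromkeys("".join(vocabs)))


# Illustrative sample entries only (assumption): real vocabs would be much larger.
sample_vocabs = {
    "latin_sample": "abcäöü",
    "cyrillic_sample": "абвa",  # trailing Latin "a" is a deliberate duplicate
    "hebrew_sample": "אבג",
}

multilingual_sample = merge_vocabs(*sample_vocabs.values())

if __name__ == "__main__":
    print(multilingual_sample)
    print(len(multilingual_sample) == len(set(multilingual_sample)))
```

Keeping the first-seen order rather than sorting means existing character indices stay stable when new language entries are appended at the end.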
Resources:
https://sites.google.com/site/worldfactsinc/Non-Latin-Script-Languages-Of-The-World
https://www.omniglot.com/writing/langalph.htm