[vocabs] Extend the list of predefined vocabularies #1883

Open
69 tasks
felixdittrich92 opened this issue Mar 5, 2025 · 8 comments · May be fixed by #1928
Assignees
Labels
ext: docs Related to docs folder good first issue Good for newcomers module: datasets Related to doctr.datasets topic: documentation Improvements or additions to documentation type: enhancement Improvement
Milestone

Comments

@felixdittrich92
Contributor

felixdittrich92 commented Mar 5, 2025

🚀 The feature

If we want to train and provide multilingual recognition models, we need to extend our predefined vocabularies.

reference PR: https://github.com/mindee/doctr/pull/1355/files

Checklist of the missing languages (excluding the ones already marked as done), split by writing script. Some languages use multiple scripts (like Serbian or Uzbek) and are grouped accordingly.


Cyrillic


Latin


Arabic Script

  • persian
  • pashto
  • urdu
  • sindhi
  • uyghur
  • kurdish (arabic)

Devanagari

  • marathi
  • nepali
  • sanskrit

Other Brahmic Scripts

  • assamese
  • kannada
  • malayalam
  • oriya
  • punjabi
  • tamil
  • telugu
  • sinhala

Javanese Script

  • javanese

Sundanese Script

  • sundanese

Georgian Script

  • georgian

Armenian Script

  • armenian

Hebrew Script

  • yiddish

Ethiopic Script

  • amharic
  • oromo

Thaana / Burmese / Other Unique Scripts

  • burmese
  • lao
  • thai
  • khmer

East Asian Scripts

  • japanese
  • chinese (simplified)
  • korean (hangul)

For example, https://github.com/eymenefealtun/all-words-in-all-languages could be used to extract language-specific charsets.
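A minimal sketch of how a charset could be pulled out of such a word list. The function name `extract_charset` and the sample words are hypothetical, and the sketch assumes the word list has already been read into an iterable of strings (the file layout of the linked repository is not assumed here):

```python
import unicodedata


def extract_charset(words):
    """Return the sorted, deduplicated characters found in an iterable of words."""
    chars = set()
    for word in words:
        # NFC normalization merges base letter + combining mark into one
        # precomposed code point where Unicode defines one.
        for ch in unicodedata.normalize("NFC", word.strip()):
            # Skip whitespace and control/format characters (categories "C*").
            if ch.isspace() or unicodedata.category(ch).startswith("C"):
                continue
            chars.add(ch)
    return "".join(sorted(chars))


# Hypothetical sample; a real run would iterate over a full word list.
sample_words = ["Straße", "Äpfel", "über", "Apfel"]
german_chars = extract_charset(sample_words)
```

The result is only as complete as the word list, so digits, punctuation, and currency symbols would still have to be added separately.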

The current multilingual vocabs entry can be extended with the newly created language entries to provide a deduplicated character list that is as complete a multilingual representation as possible.
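The deduplicated merge could look roughly like this. It is a sketch that assumes each vocabulary is a plain string of characters keyed by name; `example_vocabs` and `merge_vocabs` are hypothetical stand-ins with shortened character sets, not the real entries:

```python
def merge_vocabs(*vocabs):
    """Concatenate vocab strings, keeping only the first occurrence of each character."""
    # dict.fromkeys preserves insertion order, so characters that were already
    # in the first vocab keep their position (and therefore their index).
    return "".join(dict.fromkeys("".join(vocabs)))


# Hypothetical, shortened vocabularies for illustration only.
example_vocabs = {
    "multilingual": "abcdeéü",
    "czech": "abcděščř",
    "russian": "абвгд",
}

example_vocabs["multilingual"] = merge_vocabs(
    example_vocabs["multilingual"],
    example_vocabs["czech"],
    example_vocabs["russian"],
)
```

Appending new characters at the end instead of re-sorting keeps existing character indices stable, which may matter if a model was trained against the previous ordering.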

latin_extended (German, Spanish, Czech, and so on), cyrillic, and hebrew should be low-hanging fruit to include for training.

resources:

https://sites.google.com/site/worldfactsinc/Non-Latin-Script-Languages-Of-The-World

https://www.omniglot.com/writing/langalph.htm

@felixdittrich92 felixdittrich92 added the type: enhancement Improvement label Mar 5, 2025
@felixdittrich92
Contributor Author

felixdittrich92 commented Mar 5, 2025

Already done:

  • latin
  • english
  • french
  • portuguese
  • spanish
  • italian
  • german
  • arabic
  • czech
  • polish
  • dutch
  • norwegian
  • danish
  • finnish
  • swedish
  • vietnamese
  • hebrew
  • hindi
  • gujarati
  • bangla
  • ukrainian
  • russian
  • croatian

@felixdittrich92 felixdittrich92 added topic: documentation Improvements or additions to documentation module: datasets Related to doctr.datasets ext: docs Related to docs folder good first issue Good for newcomers labels Mar 5, 2025
@felixdittrich92 felixdittrich92 added this to the 0.12.0 milestone Mar 5, 2025
@sarjil77
Contributor

sarjil77 commented Mar 8, 2025

hey @felixdittrich92 ,

i am working on this.

@felixdittrich92
Contributor Author

> hey @felixdittrich92 ,
>
> i am working on this.

@sarjil77 Thanks, sounds great 👍
One PR per language please :)

@Madhavi258
Contributor

hello @felixdittrich92,
I'm working on Russian vocabulary.

@felixdittrich92
Contributor Author

> hello @felixdittrich92, I'm working on Russian vocabulary.

Thanks 👍 It's assigned to you

@felixdittrich92
Contributor Author

felixdittrich92 commented Apr 25, 2025

@SergShulga

> hello @felixdittrich92, I'm working on Russian vocabulary.

Hello, please tell me, did you create a Russian dataset for training the model based on this dictionary? Or maybe there is already a text recognition model for the Russian language?

If not, can you advise which tool is best to create a dataset?

@felixdittrich92
Contributor Author

> > hello @felixdittrich92, I'm working on Russian vocabulary.
>
> Hello, please tell me, did you create a Russian dataset for training the model based on this dictionary? Or maybe there is already a text recognition model for the Russian language?
>
> If not, can you advise which tool is best to create a dataset?

Hi @SergShulga 👋,

Not yet, but I'm currently working on a synthetic multilingual dataset.

I have already collected some resources (wordlists, fonts, etc.):
https://huggingface.co/datasets/Felix92/docTR-resource-collection

From my experience, a modified version of SynthTiger works best (nothing official yet, only my own dirty modified version 😅): https://github.com/felixdittrich92/synthtiger/tree/doctr-modified (branch: doctr-modified)


5 participants