[vocabs] Extend the list of predefined vocabularies #1883

felixdittrich92 · 2025-03-05T14:59:40Z

🚀 The feature

If we want to train and provide multilingual recognition models, we need to extend our predefined vocabularies.

Reference PR: https://github.com/mindee/doctr/pull/1355/files

The languages still missing are listed below, split by writing script. Some languages use multiple scripts (such as Serbian or Uzbek) and are grouped accordingly.

Cyrillic

Latin

Arabic Script

Devanagari

marathi
nepali
sanskrit

Other Brahmic Scripts

Javanese Script

javanese

Sundanese Script

sundanese

Georgian Script

georgian

Armenian Script

armenian

Hebrew Script

yiddish

Ethiopic Script

amharic
oromo

Thaana / Burmese / Other Unique Scripts

burmese
lao
thai
khmer

East Asian Scripts

japanese
chinese (simplified)
korean (hangul)


For example, https://github.com/eymenefealtun/all-words-in-all-languages could be used to extract language-specific charsets.
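As a rough illustration of the extraction step, a charset can be derived from a word list by collecting every distinct character. This is only a minimal sketch: `extract_charset` and its `min_count` parameter are hypothetical names, and the word list is assumed to be already loaded as a list of strings (the file layout of the linked repository is not assumed here).

```python
import unicodedata
from collections import Counter


def extract_charset(words, min_count=1):
    """Return a sorted string of the distinct characters found in `words`.

    Hypothetical helper, not part of docTR.
    """
    counts = Counter()
    for word in words:
        # NFC normalization so precomposed and decomposed forms count as one character
        counts.update(unicodedata.normalize("NFC", word.strip()))
    # Drop whitespace and characters rarer than `min_count` (likely noise in the list)
    kept = (ch for ch, n in counts.items() if n >= min_count and not ch.isspace())
    return "".join(sorted(kept))


# Tiny inline sample instead of a real word list
print(extract_charset(["привет", "мир", "ёж"]))
```

Raising `min_count` is one way to filter out stray characters from other languages that tend to show up in scraped word lists.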

The current multilingual vocabs entry can then be extended with the newly created language entries to provide a deduplicated character list that covers as many languages as possible.
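The deduplication could look roughly like the sketch below. It assumes that each vocabulary is a plain string of characters stored in a dict keyed by name; the `vocabs` dict and the `merge_vocabs` helper here are hypothetical stand-ins with shortened sample strings, not the actual docTR definitions.

```python
def merge_vocabs(*vocabs):
    """Concatenate vocab strings, keeping only the first occurrence of each character."""
    # dict.fromkeys preserves insertion order and drops duplicate keys
    return "".join(dict.fromkeys("".join(vocabs)))


# Hypothetical, shortened entries for illustration only
vocabs = {
    "english": "abcdefghijklmnopqrstuvwxyz0123456789",
    "german": "abcdefghijklmnopqrstuvwxyz0123456789äöüß",
    "russian": "абвгдеёжзийклмнопрстуфхцчшщъыьэюя0123456789",
}

vocabs["multilingual"] = merge_vocabs(vocabs["english"], vocabs["german"], vocabs["russian"])
print(len(vocabs["multilingual"]))
```

Keeping first-seen order (rather than sorting) means existing characters keep their position when new languages are appended, which may matter if trained models depend on the character index.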

latin_extended (German, Spanish, Czech, and so on), cyrillic, and hebrew should be low-hanging fruit to include for training.

Resources:

https://sites.google.com/site/worldfactsinc/Non-Latin-Script-Languages-Of-The-World

https://www.omniglot.com/writing/langalph.htm


felixdittrich92 · 2025-03-05T15:03:35Z

sarjil77 · 2025-03-08T20:21:41Z

Hey @felixdittrich92,

I am working on this.

felixdittrich92 · 2025-03-13T13:47:22Z

> hey @felixdittrich92,
>
> i am working on this.

@sarjil77 Thanks sounds great 👍
One PR / language please :)

Madhavi258 · 2025-03-18T15:30:17Z

hello @felixdittrich92,
I'm working on Russian vocabulary.

felixdittrich92 · 2025-03-18T15:33:57Z

> hello @felixdittrich92, I'm working on Russian vocabulary.

Thanks 👍 It's assigned to you

felixdittrich92 · 2025-04-25T13:35:02Z

full utf-8 charlist: https://www.fileformat.info/info/charset/UTF-8/list.htm

this > https://symbl.cc/en/unicode-table/
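For scripts that occupy a well-defined Unicode block, a first draft of a vocab can also be generated directly from the code point range shown in such a table. The sketch below uses only the standard library; `block_chars` is a hypothetical helper, and the result depends on the Unicode version bundled with the Python interpreter, so it should be reviewed by hand before being added as a vocab.

```python
import unicodedata


def block_chars(start, end, categories=("L",)):
    """Return assigned characters in [start, end] whose general category
    starts with one of the given prefixes (default: letters only)."""
    return "".join(
        chr(cp)
        for cp in range(start, end + 1)
        # Unassigned code points have category "Cn" and are skipped
        if unicodedata.category(chr(cp)).startswith(categories)
    )


# Hebrew block: U+0590 to U+05FF (letters only, no points or punctuation)
hebrew_letters = block_chars(0x0590, 0x05FF)
print(hebrew_letters)
```

For scripts that rely on combining marks (for example Devanagari vowel signs), the mark categories would need to be included as well, e.g. `categories=("L", "M")`.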

SergShulga · 2025-05-06T10:27:08Z

> hello @felixdittrich92, I'm working on Russian vocabulary.

Hello, please tell me, did you create a Russian dataset for training the model based on this dictionary? Or maybe there is already a text recognition model for the Russian language?

If not, can you advise which tool is best to create a dataset?

felixdittrich92 · 2025-05-06T10:35:39Z

> > hello @felixdittrich92, I'm working on Russian vocabulary.
>
> Hello, please tell me, did you create a Russian dataset for training the model based on this dictionary? Or maybe there is already a text recognition model for the Russian language?
>
> If not, can you advise which tool is best to create a dataset?

Hi @SergShulga 👋,

Not yet, but I am currently working on a synthetic multilingual dataset.

I have already collected some resources (wordlists, fonts, etc.):
https://huggingface.co/datasets/Felix92/docTR-resource-collection

From my experience, a modified version of SynthTiger works best (nothing official yet, only my own dirty modified version 😅): https://github.com/felixdittrich92/synthtiger/tree/doctr-modified (branch: doctr-modified)

felixdittrich92 added the "type: enhancement" label Mar 5, 2025

felixdittrich92 added the "topic: documentation", "module: datasets", "ext: docs", and "good first issue" labels Mar 5, 2025

felixdittrich92 added this to the 0.12.0 milestone Mar 5, 2025

felixdittrich92 assigned felixdittrich92 and sebastianMindee Mar 5, 2025

felixdittrich92 mentioned this issue Mar 17, 2025

Specify Tajik VOCAB for training #1899

Closed

felixdittrich92 mentioned this issue Apr 18, 2025

Multilingual support #1699

Open

felixdittrich92 pinned this issue Apr 24, 2025

felixdittrich92 linked a pull request Apr 30, 2025 that will close this issue

[datasets] Massively extend the pre-defined vocabs #1928

Open
