Multilingual support #1699


Open
decadance-dance opened this issue Aug 20, 2024 · 48 comments

@decadance-dance

🚀 The feature

Support for multiple languages (i.e. VOCABS["multilingual"]) in the pretrained models.

Motivation, pitch

It would be great to use models which support multiple languages, because it would significantly improve the user experience in various cases.

Alternatives

No response

Additional context

No response

@felixdittrich92
Contributor

Hi @decadance-dance 👋,

Have you already tried:
docTR: https://huggingface.co/Felix92/doctr-torch-parseq-multilingual-v1
OnnxTR: https://huggingface.co/Felix92/onnxtr-parseq-multilingual-v1
? :)

Depends a bit if there is any data from mindee we could use.
Question goes to @odulcy-mindee ^^

@decadance-dance
Author

Hi @felixdittrich92,
I have used docTR for more than half a year but have never come across this multilingual model, lol.
So I am going to try it, thanks.

@felixdittrich92
Contributor

Ah, let's keep this issue open, there is more to do I think :)

@felixdittrich92
Contributor

Hi @felixdittrich92, I have used docTR for more than half a year but have never come across this multilingual model, lol. So I am going to try it, thanks.

Happy about any feedback on how it works for you :)
The model was fine-tuned only on synthetic data.

@odulcy-mindee
Contributor

Depends a bit if there is any data from mindee we could use.
Question goes to @odulcy-mindee ^^

Unfortunately, we don't have such data

@felixdittrich92
Contributor

felixdittrich92 commented Aug 27, 2024

@decadance-dance
For training such recognition models I don't see a problem.. we can generate synthetic train data and, in the best case, only need real val samples.
But for detection we would need real data, that's the main issue.

In general we would need the help of the community to collect documents (newspapers, receipt photos, etc.) in diverse languages (can be unlabeled). / This would require signing a license so that we can freely use this data.
With enough diverse data we could use Azure Document AI, for example, to pre-label it.
Later on I wouldn't see an issue with open-sourcing this dataset.

But I'm not sure how to trigger such an "event" 😅 @odulcy-mindee

@nikokks
Contributor

nikokks commented Sep 6, 2024

Hello =)
I found some public datasets for various tasks:
english documents
mathematics documents
latex ocr
latex ocr
chinese ocr
chinese ocr
chinese ocr

@nikokks
Contributor

nikokks commented Sep 6, 2024

Moreover, it could be interesting for Chinese detection models to compose multiple recognition samples in the same image without intersections. This should help a Chinese detection model perform better without real detection data.
Anyone interested in creating random multilingual data for detection models (hindi, chinese, etc.)?
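
One way to compose such synthetic detection pages, sketched below with PIL, is to paste pre-rendered word crops at random positions and reject any placement that would overlap an already placed box. This is only an illustrative sketch; the word crops and page size are assumptions, not an existing docTR utility, and each crop is assumed to fit on the page.

import random
from PIL import Image

def boxes_intersect(a, b):
    # Axis-aligned overlap test for (x0, y0, x1, y1) boxes
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def compose_page(word_crops, size=(1024, 1448), max_tries=50):
    # Paste word crop images onto a blank page without intersections and
    # return the page image plus the placed boxes (the detection labels).
    page = Image.new("RGB", size, "white")
    placed = []
    for crop in word_crops:
        w, h = crop.size
        for _ in range(max_tries):
            x0 = random.randint(0, size[0] - w)
            y0 = random.randint(0, size[1] - h)
            box = (x0, y0, x0 + w, y0 + h)
            if not any(boxes_intersect(box, other) for other in placed):
                page.paste(crop, (x0, y0))
                placed.append(box)
                break
    return page, placed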

@felixdittrich92
Contributor

Hi @nikokks 😃
Recognition should not be such a big deal, I already found a good way to generate such data for fine-tuning.

Collecting multilingual data for detection is troublesome because it should be real data (or, if possible, really well generated data, for example with a fine-tuned FLUX model maybe!?).
We need different kinds of layouts/documents (newspapers, invoices, receipts, cards, etc.), so the data should come close to real use cases (not only scans but also document photos, etc.)
:)

@decadance-dance
Author

Collecting multilingual data for detection is troublesome because it should be real data

Can you estimate how much data we would need to provide multilingual capabilities on the same level as the English-only OCR?

@felixdittrich92
Contributor

felixdittrich92 commented Oct 10, 2024

Hi @decadance-dance 👋,

I think if we could collect ~100-150 different types of documents for each language we would have a good starting point (in the end the language doesn't matter, it's more about the different char sets / fonts / text sizes) - for example:
[attached image: bild_design]
is super useful because it captures a lot of different fonts / text sizes
or something "in the wild":
[attached image: img_03771]

In the end it's more critical to make sure that we can really use such images legally.

The tricky part is the detection because we need completely real data.. if we have this, the recognition part should be much easier: we could create some synth data and eval on the already collected real data.

I think if we are able to collect the data up to the end of January, I could provide pre-labeling via Azure's Document AI.

Currently missing parts are:

  • handwritten (for the detection model - recognition is another story)
  • chinese (symbols)
  • hindi
  • bulgarian/ukrainian/russian/serbian (cyrillic)
  • special symbols (bullet points, etc.)
  • more latin based (spanish, czech, ..)
  • ...

CC @odulcy-mindee

Lang list: https://github.com/eymenefealtun/all-words-in-all-languages

@decadance-dance
Author

@felixdittrich92, thank you for a detailed answer.
I'd help collect data. It would be great if we could promote this initiative within the community. I think if everyone provides at least a couple of samples, a good amount of data can be collected.
BTW, is there any flow or established process for collecting and submitting data?

@felixdittrich92
Contributor

felixdittrich92 commented Oct 10, 2024

@decadance-dance Not yet.. maybe the easiest would be to create a Hugging Face Space for this, because from there you could also easily take pictures with your smartphone, and under the hood we would push the taken or uploaded images into an HF dataset (roughly as sketched below).

In this case we could also add an agreement before any data can be uploaded, stating that the uploader has all rights to the image and uploads it knowing that the uploaded images will be provided openly to everyone who downloads the dataset.

Wdyt ?

Again CC @odulcy-mindee :D
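
A minimal sketch of what such a Space could do under the hood, using the huggingface_hub client; the dataset repo id is a placeholder and the Gradio wiring around it is omitted.

import uuid
from huggingface_hub import HfApi

api = HfApi()  # assumes an HF token is configured in the Space secrets

def push_uploaded_image(local_path: str):
    # Push a captured or uploaded image into a (hypothetical) HF dataset repo
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=f"raw/{uuid.uuid4().hex}.jpg",
        repo_id="doctr/multilingual-raw-data",  # placeholder repo id
        repo_type="dataset",
    )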

@ramSeraph

I found one possible dataset of printed documents for multiple languages: Wikisource. They have text and images at the page level, originally created using some existing OCR (Google Vision/Tesseract), and the data has then been corrected/proofread by people. They have annotations to differentiate what has been proofread and what has not. An example - https://te.wikisource.org/wiki/పుట%3AAandhrakavula-charitramu.pdf/439. The license would be CC-BY-SA and I am expecting them to only have pulled books for which copyright has expired. Collecting fonts for various languages is a bigger problem though (because of licenses).

@felixdittrich92 felixdittrich92 pinned this issue Oct 11, 2024
@felixdittrich92
Contributor

Thanks @ramSeraph for sharing, I will have a look 👍

@decadance-dance @nikokks

I created a Space which can be used to collect some data (only raw data to start), wdyt?
https://huggingface.co/spaces/Felix92/docTR-multilingual-Datacollector

Later on, once we think we have collected enough raw data, we can filter it and pre-label with Azure Document AI.

@decadance-dance
Author

decadance-dance commented Oct 21, 2024 via email

@felixdittrich92
Contributor

@decadance-dance @nikokks @ramSeraph and all others

I created a request to the mindee team to provide support on this task.
https://mindee-community.slack.com/archives/C02HGHMUJH0/p1730452486444309

Would be nice if you could write a comment in the thread about your needs to support this 🙏

@felixdittrich92
Contributor

felixdittrich92 commented Nov 12, 2024

The first stage would be to improve the detection models; for the second stage, the recognition part, we could generate additional synthetic data.

@felixdittrich92
Contributor

felixdittrich92 commented Nov 19, 2024

Short update here:

I collected ~30k samples containing:
~7k arabic
~1k hindi
~1k chinese
~1k thai
~4k cyrillic
~1k greek
~5k additional latin extended (polish, spanish, and so on)
(including ~15% handwritten - most russian, arabic and latin)
~10k receipts around the globe

Now I need to find a way to annotate all this data - AWS Textract & Azure Document AI failed as useful pre-labeling solutions.

Best results were reached with docTR/OnnxTR (detection only) - but there are still too many issues to include it directly into our dataset for pretraining.
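
For reference, a pre-labeling pass with docTR itself could look roughly like the sketch below; the image paths and output file are placeholders, and the exported JSON would still need manual correction.

import json
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Default pretrained detection + recognition pipeline
model = ocr_predictor(pretrained=True)

def prelabel(image_paths, out_path="prelabels.json"):
    # Run docTR on raw images and dump the exported predictions for later correction
    doc = DocumentFile.from_images(image_paths)
    result = model(doc)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(result.export(), f, ensure_ascii=False)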

@decadance-dance
Author

Now I need to find a way to annotate all this data - AWS Textract & Azure Document AI failed as useful pre-labeling solutions.

Why did they fail?

@felixdittrich92
Contributor

Now I need to find a way to annotate all this data - AWS Textract & Azure Document AI failed as useful pre-labeling solutions.

Why did they fail?

The detection results were really poor for many samples

@decadance-dance
Author

For training such recognition models I don't see a problem.. we can generate synthetic train data and, in the best case, only need real val samples.

Which way of generating synthetic word text do you think is more beneficial?
a) use a predefined vocab and randomly sample characters from it within a given length range, like you do in _WordGenerator
b) use a predefined text corpus and randomly sample entire words from it
c) combine (a) and (b)

@decadance-dance
Author

The detection results were really poor for many samples

How did you evaluate them? As I understand it, your data is not annotated yet.
Did you check the samples manually?

@decadance-dance
Author

AWS Textract & Azure Document AI failed as useful pre-labeling solutions

Maybe EasyOCR would work for you?

@felixdittrich92
Contributor

For training such recognition models I don't see a problem.. we can generate synthetic train data and, in the best case, only need real val samples.

Which way of generating synthetic word text do you think is more beneficial? a) use a predefined vocab and randomly sample characters from it within a given length range, like you do in _WordGenerator b) use a predefined text corpus and randomly sample entire words from it c) combine (a) and (b)

I would go with option b and augment a fixed part of this data (words) with low-frequency characters (like the % symbol).

I did the same to train the multilingual parseq model :)
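
A rough sketch of option (b) with the augmentation mentioned above; the corpus file, the set of low-frequency characters, and the augmentation probability are illustrative assumptions, not docTR's actual generator.

import random

LOW_FREQ_CHARS = "%&@#§€"  # characters that rarely occur in natural corpora

def sample_words(corpus_path: str, n_samples: int, augment_prob: float = 0.1):
    # Sample whole words from a corpus and inject a low-frequency character
    # into a fixed fraction of them (option b plus augmentation)
    with open(corpus_path, encoding="utf-8") as f:
        words = [w for w in f.read().split() if w]
    samples = []
    for _ in range(n_samples):
        word = random.choice(words)
        if random.random() < augment_prob:
            pos = random.randint(0, len(word))
            word = word[:pos] + random.choice(LOW_FREQ_CHARS) + word[pos:]
        samples.append(word)
    return samples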

@felixdittrich92
Contributor

I think the only option is to label a part of the data manually -> fine-tune -> pre-label -> correct, and so on, in an iterative process 🙈😅 (really time consuming)

@felixdittrich92 felixdittrich92 modified the milestones: 1.0.0, 0.12.0 Jan 28, 2025
@felixdittrich92 felixdittrich92 self-assigned this Jan 30, 2025
@murilosimao

I had an idea that could help speed things up when dealing with documents. What if there were a database of text-selectable PDFs or other documents (DOCX, PPTX) in the desired languages? Then you could extract the text with certainty, convert the PDF into the desired image format at the required resolution/DPI, adjust the bounding boxes according to the resolution and text, and voilà. I have around 80k selectable documents in Brazilian Portuguese (latin) and can start testing to see if this works.
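
A sketch of that extraction step using PyMuPDF, assuming born-digital (text-selectable) PDFs; word boxes come back in PDF points and are scaled to the rendered resolution. The file names are placeholders.

import fitz  # PyMuPDF

def render_page_with_boxes(pdf_path: str, page_index: int = 0, dpi: int = 300):
    # Render one PDF page to an image and return word-level boxes in pixel coordinates
    doc = fitz.open(pdf_path)
    page = doc[page_index]
    pix = page.get_pixmap(dpi=dpi)  # rasterize at the desired DPI
    pix.save(f"page_{page_index}.png")
    scale = dpi / 72.0  # PDF coordinates are expressed in 72-dpi points
    words = []
    for x0, y0, x1, y1, text, *_ in page.get_text("words"):
        words.append((text, (x0 * scale, y0 * scale, x1 * scale, y1 * scale)))
    doc.close()
    return words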

@felixdittrich92
Contributor

Hey @murilosimao 👋,

Yep, sounds great, feel free to post an update here if you have some results 👍

I will (hopefully soon) also discuss a strategy with @sebastianMindee

@cyanic-selkie
Contributor

Hi, I have a question regarding how exactly the multilingual support will be implemented. Other solutions currently have a different model for each script and no way to detect the script beforehand, so you need to know which script you're OCR'ing.

Will the multilingual models simply support all supported scripts/languages, or will they also be split? If so, would you consider also training a script detector?

@felixdittrich92
Contributor

Hi @cyanic-selkie 👋,

We have planned to train unique multilingual models; currently we decided to go with 3 stages for this:

  1. latin extended (en, fr, pt, es, de, etc incl. vietnamese) + cyrillic extended (ru, uk, bl, etc.) + ancient greek + hebrew - here we know already that this works pretty well
  2. arabic script + hindi script
  3. simplified chinese + japanese + korean (This will be the endboss 😅)

Currently we are working on getting everything together - fonts, complete vocabs, wordlists - to generate enough synth data with a mindee-internal tool

This will slightly increase the inference latency - but I think it's much more user-friendly and a benefit, especially for multilingual documents

Later on, to control this, the vision is to implement a kind of blacklisting/whitelisting under the hood:

from doctr.models import ocr_predictor
from doctr.datasets import VOCABS

model = ocr_predictor(pretrained=True, whitelist=VOCABS["russian"] + VOCABS["german"])

I already tried some possible ways (#1876 (comment)) but will have another look with @SiddhantBahuguna

@felixdittrich92
Contributor

Everyone is btw invited to help us with the vocab completion in #1883 😄

@felixdittrich92
Contributor

A latin extended model can already be found here:
docTR: https://huggingface.co/Felix92/doctr-torch-parseq-multilingual-v1
OnnxTR: https://huggingface.co/Felix92/onnxtr-parseq-multilingual-v1

That started as an experiment of my own, but people seem to like it 😅

@cyanic-selkie
Contributor

cyanic-selkie commented Apr 18, 2025

@felixdittrich92

This will slightly increase the inference latency - but I think it's much more user-friendly and a benefit, especially for multilingual documents

Are you saying that the script detection (i.e., model selection between the 3) will be done on the fly, and hence the inference will be slower or?

Everyone is btw invited to help us with the vocab completion in #1883 😄

Regarding the dictionaries, I've noticed that the linked repository (https://github.com/eymenefealtun/all-words-in-all-languages) is really not good. I can't speak for many other languages, but the Croatian one is extremely poor. Since many other languages have about the same number of words, I'm guessing it's the same situation.

May I suggest using Hunspell dictionaries? They're used everywhere for autocorrect (they can easily be extracted from LibreOffice, for example, if one is missing or outdated in the available GitHub repos). A simple script could be written that generates all possible words given the dictionary and the affix rules (there is the wordforms command that generates all possible forms of a given word).

For example, given the Croatian index.dic and index.aff, running this command (where the word optika is listed in the index.dic):

wordforms index.aff index.dic optika

Gives me all of these words:

optike
optika
optiku
optike
optiko
optikom
optikom
optikama
optika
optikama
optiku
optici

One issue with this is that it's all lowercased; is this a problem for the way you do OCR?

I could submit a PR with the script and then run it for all supported languages. I don't know what the source for the existing languages is, but I'd wager this would be higher quality.
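
Such a script might look roughly like the sketch below, which drives wordforms over every stem in the .dic file (whose first line is an entry count and whose entries may carry affix flags after a slash); it assumes the wordforms binary from hunspell-tools is on PATH, and the file names are placeholders.

import subprocess

def expand_dictionary(aff_path: str, dic_path: str, out_path: str):
    # Run hunspell's wordforms on every stem of a .dic file to build a full wordlist
    forms = set()
    with open(dic_path, encoding="utf-8") as f:
        lines = f.readlines()[1:]  # skip the entry-count header line
    stems = [line.split("/")[0].strip() for line in lines if line.strip()]
    for stem in stems:
        result = subprocess.run(
            ["wordforms", aff_path, dic_path, stem],
            capture_output=True, text=True,
        )
        forms.update(result.stdout.split())
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(sorted(forms)))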

Also, for languages that don't have an available Hunspell dictionary, fineweb-2 was recently released. Much care was given to the quality of the data, particularly deduplication, which is very nice for frequency thresholding words. Also, it's available in 1000+ languages (although many of them have very little data), so theoretically you could support quite a lot of languages, albeit with a somewhat dirtier dataset.

@felixdittrich92
Contributor

felixdittrich92 commented Apr 18, 2025

@cyanic-selkie
We don't need any script detection in this case because the recognition models will be able to recognize any character they were trained on, independent of the language :)

About the inference latency, it depends on the solution we find for the black-/whitelisting - but I don't expect a high latency increase

Mh no, this should be fine, thanks for sharing; we could randomly uppercase the first letter and augment with underrepresented chars 👍

Would you like to add the Croatian vocab to our predefined ones? :) See #1883

@felixdittrich92
Contributor

CC @sebastianMindee That's a really good point, if required we could extract word lists / vocabs from the fineweb dataset

@cyanic-selkie
Contributor

@felixdittrich92

We don't need any script detection in this case because the recognition models will be able to recognize any character they were trained on, independent of the language :)

Yeah, but I'm saying if I want to seamlessly support all three (CJK + arabic/hindi + others), I would have to detect the script beforehand.

@felixdittrich92
Contributor

@felixdittrich92

We don't need any script detection in this case because the recognition models will be able to recognize any character they were trained on, independent of the language :)

Yeah, but I'm saying if I want to seamlessly support all three (CJK + arabic/hindi + others), I would have to detect the script beforehand.

Why? 😅
Maybe it was not 100% clear from the previous answer, but if you keep the blacklist empty it will recognize any character independent of the language (that's the plan).
Or are you thinking about the reading order (right-to-left and reverse)?

@cyanic-selkie
Contributor

cyanic-selkie commented Apr 19, 2025

I'm so confused right now, sorry 😢

After rereading your original post I noticed you said "3 stages" not "3 models", but you also said "unique multilingual models".

So, the clarification I need here is:

Will a single recognition model be able to handle all scripts/languages (i.e., be trained on all scripts/languages), and do the "stages" you mention refer to how the support for different scripts will grow over time for that single multilingual model?

Would you like to add the Croatian vocab to our predefined ones? :) See #1883

Yes, I'll make a PR for Croatian and other BCMS languages at the very least.

@felixdittrich92
Contributor

Will a single recognition model be able to handle all scripts/languages (i.e., be trained on all scripts/languages), and do the "stages" you mention refer to how the support for different scripts will grow over time for that single multilingual model?

Correct

With "models" I meant that we have different architectures, for example: parseq, vitstr, crnn, etc. ^^
But yeah, at the end of training stage 3 all these models should be able to recognize latin_ext, cyrillic_ext, hindi, arabic, chinese, ...

@felixdittrich92
Contributor

@cyanic-selkie I created a HF Space which can be used to upload wordlists & fonts
https://huggingface.co/spaces/Felix92/docTR-resources-collection

We will filter out the data we need later on
CC @sebastianMindee

@cyanic-selkie
Contributor

cyanic-selkie commented Apr 23, 2025

@felixdittrich92 I had some issues with rate limits for whatever reason, but I managed to upload the Croatian wordlist. I didn't realize the naming convention was wordlist-lang.txt instead of wordlist_lang.txt, so you might want to rename it.

@felixdittrich92
Contributor

@felixdittrich92 I had some issues with rate limits for whatever reason, but I managed to upload the Croatian wordlist. I didn't realize the naming convention was wordlist-lang.txt instead of wordlist_lang.txt, so you might want to rename it.

Yeah 😅 I'm uploading lots of fonts atm, so the rate limit will be at its limit over the next days - that's fine, thanks a lot 👍

@felixdittrich92
Contributor

I also updated the list in #1883, so if anyone can help here that would be awesome 🙏

@felixdittrich92
Contributor

Short update here:

I cracked it, a first multilingual experimental model was trained for the following languages:

latin_based = [
    "english", "albanian", "afrikaans", "azerbaijani", "basque", "bosanski",
    "catalan", "croatian", "czech", "danish", "dutch", "estonian", "esperanto",
    "french", "legacy_french", "finnish", "frisian", "galician", "german",
    "hausa", "hungarian", "icelandic", "indonesian", "irish", "italian",
    "latvian", "lithuanian", "luxembourgish", "malagasy", "malay", "maltese",
    "maori", "montenegrin", "norwegian", "polish", "portuguese", "quechua",
    "romanian", "scottish_gaelic", "serbian_latin", "slovak", "slovene",
    "somali", "spanish", "swahili", "swedish", "tagalog", "turkish",
    "uzbek_latin", "vietnamese", "welsh", "yoruba", "zulu",
]

hebrew = ["hebrew"]
greek = ["greek"]

cyrillic_based = [
    "russian", "belarusian", "ukrainian", "tatar", "tajik", "kazakh", "kyrgyz",
    "bulgarian", "macedonian", "mongolian", "yakut", "serbian_cyrillic",
    "uzbek_cyrillic",
]

mostly no confusion between the scripts

Next experiments starting soon including:

arabic_based = ["arabic", "persian", "urdu", "kurdish", "pashto", "uyghur", "sindhi"]
indic_based = ["devanagari", "hindi", "bangla", "gujarati", "tamil", "telugu", "kannada", "sinhala", "malayalam", "punjabi", "odia"]
thai_based = ["thai", "lao", "khmer"]
others = ["armenian", "sudanese", "ethiopic", "georgian", "burmese", "javanese"]

@cyanic-selkie
Contributor

@felixdittrich92 Awesome! What are some ways I can further contribute to the multilingual effort? What do you expect the timeline for a production-ready model to be?

On an unrelated note, is there any support for, or do you plan to support, quantization-aware training followed by ONNX export? I've had great success with this workflow on some CNNs before, running in int8 on edge devices with the XNNPACK execution provider. I could also contribute in those areas if you're interested.

@felixdittrich92
Contributor

Hey @cyanic-selkie,

To further improve robustness, we need to explore a way to implement "whitelisting" for character constraints. It would be best to discuss the details on Slack or LinkedIn:

Future Plan

We’re considering adding a whitelist parameter to the predictor:

ocr_predictor(pretrained=True, whitelist=VOCABS["german"] + VOCABS["hebrew"] + "ABc")

This would require introducing a ConstrainedBeamSearchPostProcessor, which would be used in place of the current default when a whitelist is provided. The implementation is quite involved, though - see the Hugging Face reference for inspiration:
https://github.com/huggingface/transformers/blob/main/src/transformers/generation/beam_search.py
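
As a much simpler illustration of the whitelisting idea (not the planned ConstrainedBeamSearchPostProcessor), one could mask the logits of characters outside the whitelist before greedy decoding; the vocab string, the logits layout, and the EOS handling below are assumptions for the sketch.

import numpy as np

def greedy_decode_with_whitelist(logits: np.ndarray, vocab: str, whitelist: str) -> str:
    # logits: shape (seq_len, len(vocab) + 1), last column assumed to be the EOS/blank class
    allowed = np.array([c in whitelist for c in vocab] + [True])  # always allow EOS/blank
    masked = np.where(allowed, logits, -np.inf)
    chars = []
    for idx in masked.argmax(axis=-1):
        if idx == len(vocab):  # EOS/blank reached
            break
        chars.append(vocab[idx])
    return "".join(chars)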

Ideally, we aim to finalize this feature - including pretrained recognition models for all architectures - by fall/winter. That timeline will depend a bit on the support from @SiddhantBahuguna and @sebastianMindee.

On Quantization-Aware Training

It would also be great to integrate QAT into our training scripts!
Currently, the INT8 models in OnnxTR are only post-training calibrated, mainly because I haven’t had a solid dataset for proper QAT… yet 😅
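
For context, a minimal PyTorch eager-mode QAT sketch; the model, data loader, loss, and backend are placeholders, and OnnxTR's actual export path may differ.

import torch
import torch.ao.quantization as tq

def train_qat(model: torch.nn.Module, train_loader, epochs: int = 3):
    # Prepare a float model for quantization-aware training and convert it to int8
    model.train()
    model.qconfig = tq.get_default_qat_qconfig("fbgemm")
    tq.prepare_qat(model, inplace=True)  # insert fake-quant observers

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, targets in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), targets)
            loss.backward()
            optimizer.step()

    model.eval()
    return tq.convert(model)  # fold fake-quant observers into int8 modules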

And I still need fonts for the simplified_chinese vocab which supports all "chars/symbols"
