Training SAR on Portuguese #565

fmobrj · 2021-10-30T10:36:27Z

Thanks for your great work and library.

I am impressed with the accuracy of your trained recognition models on Portuguese texts. However, Portuguese uses some accented characters that French does not. So I am creating a Portuguese synthetic dataset (10MM words) using Synthtiger (https://arxiv.org/pdf/2107.09313v1.pdf) and will try to train a Portuguese recognition model.

So I have some questions about how you trained the French SAR model. In the SAR paper, the authors looped through groups of dataset samples from both real and synthetic images. My question: did you follow the paper's steps or use only a French synthetic dataset? What dataset size did you use, and for how many epochs did you train? Since your results were superb, I think emulating what you did would be a good start for a Portuguese model.

Best regards,
Fabio.


fg-mindee · 2021-10-30T11:18:53Z

Hello @fmobrj 👋

Thanks a lot for your previous PR and your contribution to Portuguese support in docTR!

To train a SAR model on your dataset, here is what you need to do:

1. Format your dataset to match this structure: https://github.com/mindee/doctr/blob/main/references/recognition/README.md#data-format
2. Make sure your PyTorch or TensorFlow backend is properly set up
3. Install docTR
4. Clone the repo and run the text recognition training script for your DL framework as per https://github.com/mindee/doctr/blob/main/references/recognition/README.md#usage (you can specify the vocab to use with --vocab portuguese)
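
As a rough illustration of the dataset-formatting step, the sketch below writes a `labels.json` file that maps image file names to their transcriptions and flags labels that use characters outside the target vocabulary. The layout (an `images/` folder next to `labels.json`) is an assumption based on the linked README and should be checked against it. The helpers `out_of_vocab` and `write_labels`, the sample file names, and the vocabulary string (taken from the PR quoted in this thread) are illustrative, not part of docTR.

```python
import json
import string
import tempfile
import unicodedata
from pathlib import Path

# Assumed character set: the Portuguese entry proposed in the PR quoted in this
# thread. Check doctr/datasets/vocabs.py for the value the library ships.
PORTUGUESE_VOCAB = (
    string.digits + string.ascii_letters + string.punctuation + "°"
    + "àâáãéêíïóôõúüçÀÂÃÁÉÊÍÏÔÓÕÚÜÇ" + "£€¥¢฿"
)


def out_of_vocab(labels, vocab):
    """Map each offending file name to the characters its label uses outside vocab."""
    allowed = set(unicodedata.normalize("NFC", vocab))
    problems = {}
    for name, text in labels.items():
        unknown = sorted(set(unicodedata.normalize("NFC", text)) - allowed)
        if unknown:
            problems[name] = unknown
    return problems


def write_labels(root, labels):
    """Create root/images/ and root/labels.json (layout assumed from the README)."""
    (root / "images").mkdir(parents=True, exist_ok=True)
    labels_path = root / "labels.json"
    labels_path.write_text(
        json.dumps(labels, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return labels_path


# Hypothetical samples: the third label contains "ñ", which is not in the vocab.
sample = {"img_1.jpg": "coração", "img_2.jpg": "pão", "img_3.jpg": "señor"}
problems = out_of_vocab(sample, PORTUGUESE_VOCAB)
clean = {name: text for name, text in sample.items() if name not in problems}

with tempfile.TemporaryDirectory() as tmp:
    labels_path = write_labels(Path(tmp) / "train", clean)
    reloaded = json.loads(labels_path.read_text(encoding="utf-8"))

print(problems)
```

Running a check like this before training helps catch labels the chosen vocab cannot encode, which matters here because the Portuguese entry is exactly what differs from the French one.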

Regarding the training hyperparameters, our training set is about the same size as yours, so I'd recommend using the training script defaults for now, apart from the number of epochs (increase it to 15 or 20).

Also, please note that while SAR performs well, it is much slower than CRNN. I would suggest training a crnn_mobilenet_v3_large rather than a sar_resnet31: the difference in size and inference speed is significant 😅

Let me know if you have any questions :)

0 replies

fmobrj · 2021-10-30T11:23:52Z

Wow. Thank you very much. I will try the CRNN and report my results here as soon as I make progress!

Best regards, Fabio.

7 replies

fmobrj Oct 31, 2021
Author

Hi, @fg-mindee!

Strange. If you look at the content of my PR, it is different from what is in the repository.

My PR:

VOCABS: Dict[str, str] = {
    'digits': string.digits,
    'ascii_letters': string.ascii_letters,
    'punctuation': string.punctuation,
    'currency': '£€¥¢฿',
    'latin': string.digits + string.ascii_letters + string.punctuation + '°',
    'french': string.digits + string.ascii_letters + string.punctuation + '°' + 'àâéèêëîïôùûçÀÂÉÈËÎÏÔÙÛÇ' + '£€¥¢฿',
    'portuguese': string.digits + string.ascii_letters + string.punctuation + '°' + 'àâáãéêíïóôõúüçÀÂÃÁÉÊÍÏÔÓÕÚÜÇ' + '£€¥¢฿',
}

https://github.com/mindee/doctr/pull/464/files/ba76b8a1fcb9f39cda8e37fb09e73e8064480ae3#diff-a1521f79b7ee0d99c9c14e15763ad4f66fa65b590ee5cba26aab39a12d0ee098

And refactored to delete the superscript "a":

b2722d7

Repository:

https://github.com/mindee/doctr/blob/main/doctr/datasets/vocabs.py

VOCABS['portuguese'] = VOCABS['english'] + 'áàâãéêëíïóôõúüçÁÀÂÃÉËÍÏÓÔÕÚÜÇ' + '¡¿'

I will look for an official Portuguese alphabet source and send it here.

Best regards,
Fabio.
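
To make the discrepancy described above concrete, here is a small sketch that compares only the accented-character strings quoted in this comment (the PR entry versus the repository entry). It does not compare the base sets, since the contents of `VOCABS['english']` are not quoted here. The helper name `charset` is illustrative.

```python
import unicodedata


def charset(chars):
    """Return the set of characters, normalized to precomposed (NFC) form."""
    return set(unicodedata.normalize("NFC", chars))


# Accented characters from the PR's 'portuguese' entry (quoted above).
pr_extra = charset("àâáãéêíïóôõúüçÀÂÃÁÉÊÍÏÔÓÕÚÜÇ")
# Extra characters from the repository's 'portuguese' entry (quoted above).
repo_extra = charset("áàâãéêëíïóôõúüçÁÀÂÃÉËÍÏÓÔÕÚÜÇ" + "¡¿")

only_in_pr = sorted(pr_extra - repo_extra)
only_in_repo = sorted(repo_extra - pr_extra)

print("only in PR:", only_in_pr)
print("only in repository:", only_in_repo)
```

On the two strings as quoted, the repository entry lacks `Ê` and adds `ë`, `Ë`, `¡` and `¿`.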

fmobrj Oct 31, 2021
Author

We had two Portuguese language agreements in which every Lusophone country took part in standardizing the Portuguese language: one in 1945 and another in 1990.

The most succinct and objective source I found is Wikipedia: https://pt.wikipedia.org/wiki/Alfabeto_portugu%C3%AAs

So the basic letters of the alphabet are set by the latter agreement, which reinstated k, y and w in the alphabet.
The accents are described in the "Acentos gráficos e sinais diacríticos" section of the page.
The punctuation can be found at: https://pt.wikipedia.org/wiki/Pontua%C3%A7%C3%A3o

You can check the formal orthography proposal on the sites of the Academia de Ciências de Lisboa and the Academia Brasileira de Letras. They are renowned and more formal sources for everything related to formal Portuguese.

https://volp-acl.pt/index.php/ortografia

https://www.academia.org.br/nossa-lingua/formulario-ortografico

The text is not very objective. But everything is there.

Best regards,
Fabio.

fg-mindee Oct 31, 2021

Oh, about the Portuguese entry, I think that was my fault when I refactored the vocabs in #467!
Would you like to open a PR to fix it? Otherwise I can take a look on Tuesday 👍

fmobrj Oct 31, 2021
Author

No problem. Tomorrow I will try to open a PR. I hope I manage to do it right again. It will be my 2nd PR to date.

Another question. Since French and Portuguese are so alike, do you think it is a good idea to finetune your checkpoint with Portuguese data, since the main difference is a few accents? Or will the data be too unbalanced against the new accents, so that it is better to train from scratch?

Best regards,
Fabio.

fg-mindee Nov 1, 2021

> It will be my 2nd PR to date.

Yes indeed, step by step entering the open source community :)

> do you think it is a good idea to finetune your checkpoint with Portuguese data, since the main difference is a few accents?

It certainly is. If you want to make it even more efficient, it may be possible to remap the final layers between vocabs. But for now, I think retraining only the final layers is already quite good.
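
One way to read "remap the final layers between vocabs" is to reuse the classifier row of every character that both vocabularies share and freshly initialize only the rows for new characters. The sketch below shows that idea on plain Python lists; it is an assumption about what the remapping could look like, not docTR code. The name `remap_rows` and the toy vocabularies are hypothetical, and a real CRNN or SAR head also has extra classes (such as a blank or end-of-sequence token) that would need to be carried over separately.

```python
def remap_rows(old_rows, old_vocab, new_vocab, init_row):
    """Build one classifier row per character of new_vocab.

    Rows of characters present in old_vocab are copied; the others are
    created by calling init_row(). Returns (new_rows, reused_characters).
    """
    old_index = {ch: i for i, ch in enumerate(old_vocab)}
    new_rows, reused = [], []
    for ch in new_vocab:
        if ch in old_index:
            new_rows.append(list(old_rows[old_index[ch]]))
            reused.append(ch)
        else:
            new_rows.append(init_row())
    return new_rows, reused


# Toy example: 3-dimensional features, one weight row per character.
old_vocab = "ab\u00e9"       # a, b, é
new_vocab = "a\u00e3b"       # a, ã, b
old_rows = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]

new_rows, reused = remap_rows(old_rows, old_vocab, new_vocab, lambda: [0.0, 0.0, 0.0])
print(reused)
print(new_rows)
```

With a tensor-based framework the same index map can drive a row copy into the new layer's weight and bias before finetuning starts.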

fmobrj · 2021-11-19T11:28:32Z

Hi, @fg-mindee!

Would you mind sharing the train / test split percentages you used? I will train on the 10MM images I created with the Synthtiger script, but I am wondering whether 20% (2MM) or even 10% (1MM) for validation is too much. What do you think? Any suggestions?
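
Whatever fraction ends up being used, the split can be made reproducible by shuffling sample indices with a fixed seed. The following is only a minimal sketch of that mechanism; the function name `split_indices`, the seed, and the toy sizes are illustrative, and it takes no position on which fraction is appropriate.

```python
import random


def split_indices(n_samples, val_fraction, seed=42):
    """Shuffle indices deterministically and return (train_indices, val_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * val_fraction))
    return indices[n_val:], indices[:n_val]


# Toy size; the thread discusses 10MM images with 10% or 20% held out.
train_idx, val_idx = split_indices(1000, 0.10)
print(len(train_idx), len(val_idx))
```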

A second question: do you think using differential learning rates would lead to better results when finetuning the French checkpoint with the Portuguese dataset and vocab? For example: the default lr for the new linear layer and lr/10 for the rest of the model parameters.
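
The lr versus lr/10 idea can be sketched as two parameter groups. The helper below is framework-free and only builds the list of groups; the name `build_param_groups` and the `"linear."` prefix for the new head are assumptions for illustration. In PyTorch, a list of dicts with `params` and `lr` keys is the format optimizers in `torch.optim` accept for per-parameter options, so the output of such a helper applied to `model.named_parameters()` could be passed to the optimizer directly.

```python
def build_param_groups(named_params, head_prefix, base_lr, body_lr_factor=0.1):
    """Split (name, param) pairs into a head group and a body group.

    The head (names starting with head_prefix) gets base_lr; everything
    else gets base_lr * body_lr_factor.
    """
    head, body = [], []
    for name, param in named_params:
        (head if name.startswith(head_prefix) else body).append(param)
    return [
        {"params": head, "lr": base_lr},
        {"params": body, "lr": base_lr * body_lr_factor},
    ]


# Strings stand in for tensors so the sketch runs without any framework.
fake_named_params = [
    ("feat_extractor.conv.weight", "w_conv"),
    ("decoder.weight", "w_dec"),
    ("linear.weight", "w_head"),
    ("linear.bias", "b_head"),
]
groups = build_param_groups(fake_named_params, head_prefix="linear.", base_lr=1e-3)
for group in groups:
    print(group["lr"], group["params"])
```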

Best regards,
Fabio.

6 replies

fmobrj Nov 23, 2021
Author

Thanks, @fg-mindee !

Surprisingly, training from scratch is showing better results than finetuning the French model without differential learning rates. But I still cannot match the output quality of your French model (my results are OK to good, but a little worse than the French model and not as great as I expected).

My best result was a validation loss of 0.2469, with 0.817379 for exact match and 0.852275 for partial match. Of course, these results are only comparable across my own experiments, since my dataset is unique (10MM images synthesized with Synthtiger).
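
The thread does not define the two metrics, so the sketch below is only one plausible reading: "exact" counts predictions identical to the target string, while "partial" is assumed here to ignore case and accents. The function names and sample words are illustrative; the training script's own definition should be checked before comparing numbers.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after decomposing characters (é -> e, ç -> c)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def match_rates(targets, predictions):
    """Return (exact_rate, partial_rate) over paired target/prediction strings."""
    if not targets or len(targets) != len(predictions):
        raise ValueError("targets and predictions must be non-empty and aligned")
    pairs = list(zip(targets, predictions))
    exact = sum(t == p for t, p in pairs)
    # Assumed relaxed comparison: case-insensitive and accent-insensitive.
    partial = sum(
        strip_accents(t).lower() == strip_accents(p).lower() for t, p in pairs
    )
    return exact / len(pairs), partial / len(pairs)


targets = ["coração", "Não", "pão", "casa"]
predictions = ["coracao", "não", "pão", "cosa"]
exact_rate, partial_rate = match_rates(targets, predictions)
print(exact_rate, partial_rate)
```

Under this reading, a gap between the two rates points at accent or case confusions, which is the error type most relevant when moving from a French to a Portuguese vocab.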

Now I am experimenting with other training strategies (Ranger + flat-cosine scheduler, from scratch). The next step is to finetune the French model using the best optimizer and scheduler from the previous experiments together with differential learning rates.
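
For readers unfamiliar with the "flat cos" schedule: the learning rate is held constant for an initial portion of training and then annealed along a cosine curve. The sketch below shows only the schedule, not the Ranger optimizer. The 75% flat portion and the final value of 0 are assumptions, since the thread does not state the settings used.

```python
import math


def flat_cos_lr(step, total_steps, base_lr, pct_flat=0.75, final_lr=0.0):
    """Constant base_lr for the first pct_flat of training, then cosine decay."""
    flat_steps = int(total_steps * pct_flat)
    if step < flat_steps:
        return base_lr
    # progress goes from 0 at the end of the flat phase to 1 at total_steps
    progress = (step - flat_steps) / max(1, total_steps - flat_steps)
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [flat_cos_lr(step, 100, 1e-3) for step in range(101)]
print(schedule[0], schedule[75], schedule[88], schedule[100])
```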

I will keep you updated.

Best regards.

fg-mindee Nov 23, 2021

About from scratch vs. pretraining: from scratch is expected to perform better, but this means you can't freeze anything so it's tougher on your hardware and takes longer to train. Additionally, it might be harder to reach convergence compared to using transfer learning.

It's already quite interesting that you managed to reach 80%+ of exact match (note that this metric is pretty hard) on a synthesized dataset. Really looking forward to your results 🙌

Feel free to ping us if you need any help 👌

fmobrj Nov 24, 2021
Author

Thanks @fg-mindee! I am still training with Ranger + flat cos, and the results keep improving on the validation set. Now, in the 8th epoch, I get these results: Epoch 8/20 - Validation loss: 0.226993 (Exact: 82.85% | Partial: 86.28%).

However, when applied to text documents, the results are pretty good but still worse than your pretrained model. I suspect it has something to do with using Synthtiger. Despite the improvements Synthtiger brings, especially for scene text recognition, all of its images are colored, while 9MM of the MJSynth images are grayscale (I don't know whether you use the MJSynth method to produce your training data). If I apply a random transform that converts, say, 50%-80% of the images to grayscale, maybe I can improve the results for text documents.
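
The random grayscale idea can be sketched without any imaging library: with probability `p`, replace each RGB pixel by its luminance repeated on all three channels, so the input shape the model expects is unchanged. The luminance weights are the common ITU-R BT.601 ones; the function names and the nested-list image format are illustrative only. In a PyTorch pipeline, `torchvision.transforms.RandomGrayscale(p=...)` serves the same purpose.

```python
import random


def to_grayscale_rgb(pixel):
    """Convert one (r, g, b) pixel to gray, keeping three channels."""
    r, g, b = pixel
    y = round(0.299 * r + 0.587 * g + 0.114 * b)
    return (y, y, y)


def random_grayscale(image, p, rng):
    """With probability p, return a grayscale copy of image (rows of RGB tuples)."""
    if rng.random() < p:
        return [[to_grayscale_rgb(px) for px in row] for row in image]
    return image


rng = random.Random(0)
image = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (255, 255, 255)]]

always_gray = random_grayscale(image, 1.0, rng)
never_gray = random_grayscale(image, 0.0, rng)
# Fraction of grayscale outputs when p is in the 50%-80% range mentioned above.
converted = sum(random_grayscale(image, 0.65, rng) is not image for _ in range(1000))
print(always_gray[0][0], converted / 1000)
```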

I will try this after this run is over.

Best regards.

fmobrj Dec 6, 2021
Author

Hi, @fg-mindee! I finally finished my first experiments training a crnn_mobilenet_v3_large on the 10MM Portuguese synthetic dataset I created with Synthtiger. My best results were 0.1822 for the validation loss, 0.858631 for exact match and 0.889008 for partial match.

Despite the good indicators on the validation data, when applied to real Portuguese documents the results are pretty good but still not as good as the French model.

Next I will try to incorporate some real document data into my synthetic dataset, using the method you suggested in another thread. I was thinking of using AWS Textract, keeping only texts recognized with 0.9+ confidence, and merging these samples with my synthetic dataset. What do you think? I hope this improves the results on real documents.
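
A minimal sketch of the confidence filter described above, assuming a Textract-style response: a dict with a `Blocks` list whose `WORD` blocks carry `Text`, `Confidence` and `Geometry.BoundingBox`. Textract reports confidence on a 0 to 100 scale, so "0.9+" corresponds to a threshold of 90. The function name, the optional vocab check, and the fake response are illustrative; no AWS call is made here.

```python
def high_confidence_words(response, min_confidence=90.0, vocab=None):
    """Keep WORD blocks whose confidence is at least min_confidence.

    If vocab is given, words containing characters outside it are dropped too.
    Returns a list of {"text": ..., "box": ...} dicts.
    """
    allowed = set(vocab) if vocab is not None else None
    kept = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "WORD":
            continue
        text = block.get("Text", "")
        if not text or block.get("Confidence", 0.0) < min_confidence:
            continue
        if allowed is not None and not set(text) <= allowed:
            continue
        box = block.get("Geometry", {}).get("BoundingBox")
        kept.append({"text": text, "box": box})
    return kept


# Hand-written stand-in for a real response.
fake_response = {
    "Blocks": [
        {"BlockType": "LINE", "Text": "Nota fiscal", "Confidence": 99.0},
        {"BlockType": "WORD", "Text": "Nota", "Confidence": 99.1,
         "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.1, "Width": 0.2, "Height": 0.05}}},
        {"BlockType": "WORD", "Text": "fiscal", "Confidence": 71.4,
         "Geometry": {"BoundingBox": {"Left": 0.35, "Top": 0.1, "Width": 0.2, "Height": 0.05}}},
    ]
}
samples = high_confidence_words(fake_response)
print(samples)
```

The kept boxes can then be used to crop word images from the page, giving (image, label) pairs in the same form as the synthetic samples.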

fg-mindee Dec 7, 2021

Still exciting results @fmobrj 👏

Yes, that's a good idea, but using AWS services won't be cheap. Depending on your budget constraints, you could go for open source solutions. Also, bear in mind that if you use annotations produced by a third-party model, you will hit a glass ceiling in terms of performance that is almost the same as that model's :/

But that is certainly the quickest way to improve robustness. On our side, we'll gradually move from basic synthetic images to more realistically augmented synthetic images, so that we are certain to have perfect labels 👍 (but we're not talking about this month, cf. #262)
