[datasets] Massively extend the pre-defined vocabs #1928

Open: wants to merge 16 commits into base: main


@felixdittrich92 (Contributor) commented Apr 30, 2025

This PR:

  • Extends the pre-defined vocabs
  • Sorts datasets.rst into the same order we have in vocabs.py

NOTE:

  • I verified that every single character is supported by at least one font (Google Fonts) and that it renders correctly rather than as a placeholder character such as 𱃱

CC @sebastianMindee This was a f***ing pain, especially the non-Cyrillic / non-Latin ones 😅. The list was created with a combination of ChatGPT + Gemini + lots of manual / semi-automated evaluation (Wikipedia, Unicode tables & other sources).

Total unique characters in VOCABS: 20814
VOCABS characters supported by at least one font: 20814
VOCABS characters NOT supported by any font: 0
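
The coverage numbers above could be produced by a check along these lines. This is a minimal sketch and not the script used for this PR: `coverage_report`, the toy vocabs and the toy font tables are hypothetical, and each font is assumed to be given as a set of supported code points (for real fonts these could be read, for example, with fontTools' `TTFont.getBestCmap()`). A matching code point only shows that the font maps the character to a glyph; whether it renders correctly still needs a separate rendering check.

```python
from typing import Dict, List, Set, Union


def coverage_report(
    vocabs: Dict[str, str],
    font_cmaps: Dict[str, Set[int]],
) -> Dict[str, Union[int, List[str]]]:
    """Count the unique vocab characters covered by at least one font.

    vocabs:     hypothetical mapping of vocab name -> string of characters
    font_cmaps: hypothetical mapping of font name -> supported code points
    """
    # Characters shared between vocabs are counted once.
    unique_chars = set("".join(vocabs.values()))
    # A character counts as supported if any font maps its code point.
    supported_cps = set().union(*font_cmaps.values())
    supported = {ch for ch in unique_chars if ord(ch) in supported_cps}
    missing = unique_chars - supported
    return {
        "total": len(unique_chars),
        "supported": len(supported),
        "missing": sorted(missing),
    }


# Toy data for illustration only (assumption, not the real VOCABS or fonts).
toy_vocabs = {"digits": "0123456789", "hex": "0123456789abcdef"}
toy_fonts = {
    "FontA": {ord(c) for c in "0123456789"},
    "FontB": {ord(c) for c in "abcde"},
}

report = coverage_report(toy_vocabs, toy_fonts)
print(f"Total unique characters in VOCABS: {report['total']}")
print(f"VOCABS characters supported by at least one font: {report['supported']}")
print(f"VOCABS characters NOT supported by any font: {len(report['missing'])}")
```

Returning the missing characters themselves, not only their count, makes it easy to see which entries need another font or should be dropped from a vocab.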

Overwrites: #1925 #1926

Closes: #1883

@felixdittrich92 added the labels topic: documentation, type: enhancement, module: datasets, topic: text recognition, and ext: docs on Apr 30, 2025
@felixdittrich92 added this to the 0.12.0 milestone on Apr 30, 2025
@felixdittrich92 self-assigned this on Apr 30, 2025

codecov bot commented Apr 30, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.83%. Comparing base (559279d) to head (dd00b4c).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1928      +/-   ##
==========================================
+ Coverage   96.76%   96.83%   +0.06%     
==========================================
  Files         172      172              
  Lines        8442     8520      +78     
==========================================
+ Hits         8169     8250      +81     
+ Misses        273      270       -3     
Flag Coverage Δ
unittests 96.83% <100.00%> (+0.06%) ⬆️


@felixdittrich92
Contributor Author

@sebastianMindee

Should be mostly fine (I think the East Asian vocabs need some adjustments in a follow-up PR).

@felixdittrich92
Contributor Author

Closes: #1935

Development

Successfully merging this pull request may close these issues.

[vocabs] Extend the list of predefined vocabularies
2 participants