[datasets] Massively extend the pre-defined vocabs #1928

felixdittrich92 · 2025-04-30T14:08:25Z

This PR:

Extends the pre-defined vocabs
Sorts datasets.rst into the same order as vocabs.py

NOTE:

I verified that every single character is supported by at least one font (Google Fonts) and that it renders correctly instead of showing up as a placeholder glyph such as 𱃱.

CC @sebastianMindee This was a f***ing pain, especially the non-Cyrillic / non-Latin ones 😅. The list was created with a combination of ChatGPT + Gemini plus lots of manual / semi-automated evaluation (Wikipedia, Unicode tables and other sources).

Total unique characters in VOCABS: 20814
VOCABS characters supported by at least one font: 20814
VOCABS characters NOT supported by any font: 0
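
For reference, a minimal sketch of how counts like the ones above could be produced. This is not the script used for this PR: `font_coverage_report`, `toy_vocabs` and `toy_cmaps` are hypothetical names, and the per-font code point sets are toy data. In practice they could be read from real font files, for example from the keys of fontTools' `TTFont(path).getBestCmap()`, which is an assumption about tooling rather than something stated here. Note that a cmap entry only shows that a font maps the code point; it does not prove the glyph renders correctly, so the rendering check mentioned in the note still has to be done separately.

```python
from typing import Dict, Set


def font_coverage_report(vocabs: Dict[str, str], font_cmaps: Dict[str, Set[int]]) -> Dict[str, object]:
    """Summarise which vocab characters are covered by at least one font.

    vocabs:     vocab name -> string of characters (same shape as a VOCABS dict)
    font_cmaps: font name -> set of Unicode code points that the font maps
    """
    # Deduplicate characters across all vocabs
    unique_chars = set("".join(vocabs.values()))

    # Union of the code points covered by any of the fonts
    covered_codepoints: Set[int] = set()
    for codepoints in font_cmaps.values():
        covered_codepoints |= codepoints

    unsupported = sorted(c for c in unique_chars if ord(c) not in covered_codepoints)
    return {
        "total": len(unique_chars),
        "supported": len(unique_chars) - len(unsupported),
        "unsupported": unsupported,
    }


# Toy data (hypothetical, not the real VOCABS or real font tables)
toy_vocabs = {"digits": "0123456789", "toy_latin": "abc\u00e9"}
toy_cmaps = {
    "FontA": set(range(0x30, 0x3A)) | {ord("a"), ord("b")},
    "FontB": {ord("c")},
}

report = font_coverage_report(toy_vocabs, toy_cmaps)
print(f"Total unique characters in VOCABS: {report['total']}")
print(f"VOCABS characters supported by at least one font: {report['supported']}")
print(f"VOCABS characters NOT supported by any font: {len(report['unsupported'])}")
```

With the toy data, the only uncovered character is `é` (U+00E9), since neither toy font maps it.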

Supersedes: #1925 #1926

Closes: #1883

codecov · 2025-04-30T14:41:46Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.83%. Comparing base (559279d) to head (dd00b4c).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1928      +/-   ##
==========================================
+ Coverage   96.76%   96.83%   +0.06%     
==========================================
  Files         172      172              
  Lines        8442     8520      +78     
==========================================
+ Hits         8169     8250      +81     
+ Misses        273      270       -3

Flag	Coverage Δ
unittests	`96.83% <100.00%> (+0.06%)`	⬆️


felixdittrich92 · 2025-05-11T13:55:03Z

@sebastianMindee

This should be mostly fine now (I think the East Asian vocabs need some adjustments in a follow-up PR).

felixdittrich92 · 2025-05-13T03:34:23Z

Closes: #1935

docs/source/modules/datasets.rst

felixdittrich92 added the topic: documentation, type: enhancement, module: datasets, topic: text recognition, and ext: docs labels Apr 30, 2025

felixdittrich92 added this to the 0.12.0 milestone Apr 30, 2025

felixdittrich92 requested a review from sebastianMindee April 30, 2025 14:08

felixdittrich92 self-assigned this Apr 30, 2025

This was referenced Apr 30, 2025

[datasets] Add several cyrillic based vocabs #1925

Closed

[datasets] Extend latin based vocabs #1926

Closed

felixdittrich92 mentioned this pull request May 13, 2025

Ancient Greek characters #1935

Open

felixdittrich92 added 13 commits May 13, 2025 10:17

update

935d2a7

update

54df47b

update

c9baadf

update

411a6c0

update

f00922f

fix uppercase

52d902c

Add missing currency signs

0c44d0f

Fix hebrew & frisian & greek

2ce3258

Update multilingual vocab up to hebrew & add 3 missing

6b24643

minor fix

90b20f0

Apply greek vocab suggestions

ae35513

minor fix

6c6c3c1

rebase

bb1a64a

felixdittrich92 force-pushed the other-vocabs branch from 089f4e9 to bb1a64a Compare May 13, 2025 08:18

felixdittrich92 added 2 commits May 14, 2025 05:52

Update east asian vocabs

d5e2983

Update east asian vocabs

57c9eec

cyanic-selkie reviewed May 14, 2025

docs/source/modules/datasets.rst (outdated review thread)

Correct bosanski to bosnian

dd00b4c
