Ancient Greek characters #1935

metheofanis · 2025-05-11T10:18:49Z

It looks like most of the Ancient Greek characters are missing from vocabs.py.
Am I missing something?
I attach the full set below.

grc_chars.txt

felixdittrich92 · 2025-05-11T11:20:10Z

Hi @metheofanis 👋,

Thanks for pointing this out 👍
Is this a specific "extension"? I wasn't able to find anything online related to the char list you shared.

24 basic letters + 2 forms for sigma + upsilon + Xi <- that's what all resources share

felixdittrich92 · 2025-05-11T13:39:42Z

Is it this? https://en.wikipedia.org/wiki/Greek_Extended

metheofanis · 2025-05-12T11:10:45Z

Yes, exactly. Greek Extended is the correct Unicode range, and the character set!
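
For reference, the Greek Extended block spans U+1F00 to U+1FFF. A minimal sketch, assuming only Python's standard `unicodedata` module, of listing the letters assigned in that block. The helper name `greek_extended_letters` is hypothetical and not part of docTR, and the exact result depends on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# Greek Extended block: U+1F00 through U+1FFF (range end is exclusive).
GREEK_EXTENDED = range(0x1F00, 0x2000)


def greek_extended_letters():
    """Return all letters assigned in the Greek Extended block as one string."""
    return "".join(
        chr(cp)
        for cp in GREEK_EXTENDED
        # Categories starting with "L" are letters. This skips unassigned
        # code points ("Cn") and standalone accent symbols ("Sk").
        if unicodedata.category(chr(cp)).startswith("L")
    )


if __name__ == "__main__":
    letters = greek_extended_letters()
    print(len(letters), letters)
```

Filtering on the letter categories keeps the list to characters that can appear inside words, which is usually what a recognition vocab needs.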

metheofanis · 2025-05-12T11:37:55Z

If you also need a wordlist, I could provide one.

felixdittrich92 · 2025-05-12T11:57:09Z

@metheofanis This would be good indeed 👍
On the other hand, I think this will not work as expected in the end.

For example:

ῢ
ΰ

There is mostly no visual difference - same for other chars 🤔

So we can add it as a vocab, but then I would suggest providing it as an extra entry instead of merging the new chars into the current greek vocab.
Something like: VOCABS["greek_extended"] = VOCABS["greek"] + "..."

metheofanis · 2025-05-12T12:22:40Z

Yes, you are right. A lot of the characters look almost the same.
The difference is in the accent marks.
If you notice,
ῢ : Has 2 dots and the accent mark in the middle is like \ backslash (βαρεία)
ΰ : Has 2 dots and the accent mark in the middle is like / forward slash (οξεία)
So the difference is the middle accent line.
I think the basic "Greek" should include the simple accented chars that are used in modern Greek. (ά, έ, ή, ί, ό, ύ, ϋ, ΰ, ώ and their capitals)

It looks like we need a specialized engine for Ancient (Polytonic) Greek: one process to find the character and another process to find the correct accent marks.
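
The split between base character and accent marks described above already exists at the Unicode level. A minimal sketch, assuming only Python's standard `unicodedata` module (the helper name `describe` is hypothetical): canonical decomposition (NFD) breaks a precomposed polytonic character into its base letter plus combining marks, so the two upsilons from the example differ only in the last mark.

```python
import unicodedata


def describe(char):
    """Return the Unicode names of the base letter and combining marks."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", char)]


if __name__ == "__main__":
    # U+1FE2: upsilon with dialytika and varia (grave accent).
    print(describe("\u1fe2"))
    # U+1FE3: upsilon with dialytika and oxia (acute accent).
    print(describe("\u1fe3"))
```

This only shows how the characters are encoded. Whether a recognition model can tell the two accents apart in an image is a separate question.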

Since I'm not an expert in OCR or in programming, please do whatever you think is best. I'll be glad to help.

The file I gave above has duplicates removed. Some characters are defined in both Greek and Greek Extended, and they will always look the same.
Like : ά (03AC in Greek range) and ά (1F71 in Greek Extended range)
I attach the list below.

Greek chars duplicates.docx
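
The duplicates mentioned above can also be found programmatically. A minimal sketch, assuming only Python's standard `unicodedata` module (the helper name `canonical_duplicates` is hypothetical): a Greek Extended code point whose NFC normalization is a different single character is canonically equivalent to that character, which is the case for U+1F71 and U+03AC.

```python
import unicodedata


def canonical_duplicates():
    """Map Greek Extended chars to the single char NFC replaces them with."""
    duplicates = {}
    for cp in range(0x1F00, 0x2000):
        char = chr(cp)
        normalized = unicodedata.normalize("NFC", char)
        # Unassigned code points and chars without an equivalent are left
        # unchanged by NFC, so they are skipped here.
        if normalized != char and len(normalized) == 1:
            duplicates[char] = normalized
    return duplicates


if __name__ == "__main__":
    for ext, basic in canonical_duplicates().items():
        print(f"U+{ord(ext):04X} -> U+{ord(basic):04X}")
```

Normalizing labels to NFC before training is one way to make sure only one member of each pair ever appears in the targets.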

felixdittrich92 · 2025-05-12T13:11:36Z

OK, the complete vocab would then be:

VOCABS["greek_extended"] = VOCABS["greek"] + "ͶͷΆΈΉΊΌΎΏΐΪΫάέήίΰϊϋόύώϜϝἀἁἂἃἄἅἆἇἈἉἊἋἌἍἎἏἐἑἒἓἔἕἘἙἚἛἜἝἠἡἢἣἤἥἦἧἨἩἪἫἬἭἮἯἰἱἲἳἴἵἶἷἸἹἺἻἼἽἾἿὀὁὂὃὄὅὈὉὊὋὌὍὐὑὒὓὔὕὖὗὙὛὝὟὠὡὢὣὤὥὦὧὨὩὪὫὬὭὮὯὰὲὴὶὸὺὼᾀᾁᾂᾃᾄᾅᾆᾇᾈᾉᾊᾋᾌᾍᾎᾏᾐᾑᾒᾓᾔᾕᾖᾗᾘᾙᾚᾛᾜᾝᾞᾟᾠᾡᾢᾣᾤᾥᾦᾧᾨᾩᾪᾫᾬᾭᾮᾯᾲᾳᾴᾶᾷᾺᾼῂῃῄῆῇῈῊῌῒΐῖῗῚῢΰῤῥῦῧῪῬῲῳῴῶῷῸῺῼ"

extracted from your .txt file

And άέήίόύώΆΈΉΊΌΎΏ are the chars we should add in any case for modern greek ?

metheofanis · 2025-05-12T15:27:42Z

Some thoughts:

The chars to add to the modern Greek should also include: ϋ, ΰ, ϊ, ΐ and the capitals Ϋ, Ϊ
So the full list is άέήίϊΐόύϋΰώΆΈΉΊΪΌΎΫΏ
I also attach the new list of Greek Extended-only chars, with the accented chars above removed.

Greek Extended only.txt
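
The modern Greek accented set listed above (20 characters) can be cross-checked by composing it from base vowels and combining marks. A minimal sketch, assuming only Python's standard `unicodedata` module; the helper names are hypothetical. NFC composition of a vowel with the combining acute accent yields the precomposed "with tonos" character in the basic Greek block.

```python
import unicodedata

TONOS = "\u0301"      # combining acute accent
DIALYTIKA = "\u0308"  # combining diaeresis

# Small alpha, epsilon, eta, iota, omicron, upsilon, omega.
VOWELS = "\u03b1\u03b5\u03b7\u03b9\u03bf\u03c5\u03c9"
# Only iota and upsilon take the dialytika in modern Greek.
DIALYTIKA_VOWELS = "\u03b9\u03c5"


def compose(base, *marks):
    """Compose a base letter and combining marks into one NFC string."""
    return unicodedata.normalize("NFC", base + "".join(marks))


def modern_greek_accented():
    """Return the accented small and capital letters used in modern Greek."""
    chars = []
    for vowel in VOWELS:
        chars.append(compose(vowel, TONOS))
        chars.append(compose(vowel.upper(), TONOS))
    for vowel in DIALYTIKA_VOWELS:
        chars.append(compose(vowel, DIALYTIKA))
        chars.append(compose(vowel.upper(), DIALYTIKA))
        # Dialytika plus tonos exists precomposed only for small letters.
        chars.append(compose(vowel, DIALYTIKA, TONOS))
    return "".join(chars)
```

All resulting characters fall in the basic Greek and Coptic block, so none of them needs the Greek Extended range.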

felixdittrich92 · 2025-05-13T03:37:40Z

Hi @metheofanis 👋 ,

I updated the greek vocab in #1928 and added an extra entry for the full extended version (greek extended) but as already mentioned only included the "basic" greek vocab for now 👍

metheofanis · 2025-05-13T08:30:12Z

Does this mean that it will be able to OCR Greek Extended documents?

felixdittrich92 · 2025-05-13T08:40:25Z

We are currently working on making docTR multilingual (training the recognition models on a multilingual dataset) and, as mentioned, we would include the "basic" greek vocab here as a first step.

Planned for the first stage:
latin_ext + cyrillic_ext + greek + hebrew

If we see that this works well, we could consider including greek_ext instead.

stage 2: arabic + hindi + brahmic script
stage 3: japanese + simplified chinese + korean

felixdittrich92 mentioned this issue May 13, 2025

[datasets] Massively extend the pre-defined vocabs #1928

Open

felixdittrich92 self-assigned this May 13, 2025

felixdittrich92 added the labels "type: enhancement" (Improvement) and "module: datasets" (Related to doctr.datasets) May 13, 2025

felixdittrich92 added this to the 0.12.0 milestone May 13, 2025
