Ancient Greek characters #1935

Open
metheofanis opened this issue May 11, 2025 · 11 comments

@metheofanis

It looks like most of the Ancient Greek characters are missing from vocabs.py.
Am I missing something?
I attach the full set below.

grc_chars.txt

@felixdittrich92
Contributor

Hi @metheofanis 👋,

Thanks for pointing this out 👍
Is this a specific "extension"? I wasn't able to find anything online related to the char list you shared.

24 basic letters + 2 forms for sigma + upsilon + Xi <- that's what all resources share

@felixdittrich92
Contributor

@metheofanis
Author

Yes, exactly. Greek Extended is the correct Unicode range, and the character set!

@metheofanis
Author

If you also need a wordlist, I could provide one.

@felixdittrich92
Contributor

felixdittrich92 commented May 12, 2025

@metheofanis This would be good indeed 👍
On the other hand, I think this will not work as expected in the end.

For example:

ῢ
ΰ

There is almost no visual difference, and the same goes for other chars 🤔

So we can add it as a vocab, but then I would suggest providing it as an extra entry instead of merging the new chars into the current greek vocab.
Something like: VOCABS["greek_extended"] = VOCABS["greek"] + "..."
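
A minimal sketch of how such a separate entry could be built without introducing repeated characters. It assumes a plain dict named VOCABS; the "greek" value below is a shortened placeholder, not docTR's actual vocab, and `extend_vocab` is a hypothetical helper, not part of the library.

```python
def extend_vocab(base: str, extra: str) -> str:
    """Append the characters of `extra` to `base`, skipping any already present."""
    seen = set(base)
    out = list(base)
    for ch in extra:
        if ch not in seen:
            seen.add(ch)
            out.append(ch)
    return "".join(out)


# Placeholder values for illustration only; not docTR's actual vocab contents.
VOCABS = {"greek": "αβγ"}
# Three polytonic letters plus a deliberate repeat of "α" (U+03B1).
extra_chars = "\u1f00\u1f01\u03b1\u1f02"

# The basic entry stays untouched; the extended one is stored next to it.
VOCABS["greek_extended"] = extend_vocab(VOCABS["greek"], extra_chars)
```

Keeping the two entries separate means a model can still be trained on the smaller basic vocab while the extended one remains available.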

@metheofanis
Author

Yes, you are right. Many of the characters look almost the same.
The difference is in the accent marks.
If you look closely:
ῢ : has 2 dots, and the accent mark in the middle slants like a backslash \ (βαρεία)
ΰ : has 2 dots, and the accent mark in the middle slants like a forward slash / (οξεία)
So the difference is the middle accent line.
I think the basic "Greek" should include the simple accented chars that are used in modern Greek. (ά, έ, ή, ί, ό, ύ, ϋ, ΰ, ώ and their capitals)

It looks like we need a specialized engine for the Ancient (Polytonic) Greek. One process to find the character and another process to find the correct accent marks.
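
The split between "the character" and "the accent marks" can be illustrated with Unicode canonical decomposition. This is only a sketch of the idea using Python's standard `unicodedata` module, not a description of how any OCR engine works; `split_marks` is a hypothetical helper name.

```python
import unicodedata


def split_marks(ch: str) -> tuple[str, list[str]]:
    """Return the base letter and the Unicode names of its combining marks."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0]
    marks = [unicodedata.name(c) for c in decomposed[1:]]
    return base, marks


grave = "\u1fe2"  # upsilon with dialytika and varia (Greek Extended block)
acute = "\u03b0"  # upsilon with dialytika and tonos (Greek block)

base_grave, marks_grave = split_marks(grave)
base_acute, marks_acute = split_marks(acute)

# Both characters share the same base letter and the same diaeresis;
# only the last combining mark (grave vs. acute) differs.
print(base_grave, marks_grave)
print(base_acute, marks_acute)
```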

Since I'm not an expert in OCR or in programming, please do whatever you think is best. I'll be glad to help.

The file I gave above has duplicates removed. There are some characters that are defined both in Greek and in Greek Extended. They will always look the same.
Like : ά (03AC in Greek range) and ά (1F71 in Greek Extended range)
I attach below the list.

Greek chars duplicates.docx
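
Duplicates of this kind can also be found programmatically: Unicode NFC normalization maps the Greek Extended code point onto its Greek-block counterpart. A small sketch using Python's standard `unicodedata` module; `greek_extended_duplicates` is a hypothetical helper, and its result is not guaranteed to match the attached list exactly.

```python
import unicodedata

greek_alpha_tonos = "\u03ac"    # GREEK SMALL LETTER ALPHA WITH TONOS
extended_alpha_oxia = "\u1f71"  # GREEK SMALL LETTER ALPHA WITH OXIA

# The two code points differ as raw strings...
raw_equal = greek_alpha_tonos == extended_alpha_oxia
# ...but NFC maps the Greek Extended one onto the Greek-block one.
nfc_equal = unicodedata.normalize("NFC", extended_alpha_oxia) == greek_alpha_tonos


def greek_extended_duplicates() -> dict[str, str]:
    """Map each Greek Extended code point that NFC rewrites to a single
    character of the Greek and Coptic block (U+0370 to U+03FF)."""
    dupes = {}
    for cp in range(0x1F00, 0x2000):
        ch = chr(cp)
        normalized = unicodedata.normalize("NFC", ch)
        if (
            normalized != ch
            and len(normalized) == 1
            and 0x0370 <= ord(normalized) <= 0x03FF
        ):
            dupes[ch] = normalized
    return dupes
```

Normalizing text to NFC before building a vocab or labelling training data would collapse each such pair into one character.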

@felixdittrich92
Contributor

OK, the complete vocab would then be:

VOCABS["greek_extended"] = VOCABS["greek"] + "ͶͷΆΈΉΊΌΎΏΐΪΫάέήίΰϊϋόύώϜϝἀἁἂἃἄἅἆἇἈἉἊἋἌἍἎἏἐἑἒἓἔἕἘἙἚἛἜἝἠἡἢἣἤἥἦἧἨἩἪἫἬἭἮἯἰἱἲἳἴἵἶἷἸἹἺἻἼἽἾἿὀὁὂὃὄὅὈὉὊὋὌὍὐὑὒὓὔὕὖὗὙὛὝὟὠὡὢὣὤὥὦὧὨὩὪὫὬὭὮὯὰὲὴὶὸὺὼᾀᾁᾂᾃᾄᾅᾆᾇᾈᾉᾊᾋᾌᾍᾎᾏᾐᾑᾒᾓᾔᾕᾖᾗᾘᾙᾚᾛᾜᾝᾞᾟᾠᾡᾢᾣᾤᾥᾦᾧᾨᾩᾪᾫᾬᾭᾮᾯᾲᾳᾴᾶᾷᾺᾼῂῃῄῆῇῈῊῌῒΐῖῗῚῢΰῤῥῦῧῪῬῲῳῴῶῷῸῺῼ"

extracted from your .txt file

And άέήίόύώΆΈΉΊΌΎΏ are the chars we should add in any case for modern greek ?

@metheofanis
Author

Some thoughts:

  1. The chars to add to the modern Greek should also include: ϋ, ΰ, ϊ, ΐ and the capitals Ϋ, Ϊ
    So the full list is άέήίϊΐόύϋΰώΆΈΉΊΪΌΎΫΏ
    I also attach the chars that are only in Greek Extended. I've removed the accented chars above from it.

Greek Extended only.txt

@felixdittrich92
Contributor

Hi @metheofanis 👋 ,

I updated the greek vocab in #1928 and added an extra entry for the full extended version (greek extended), but as already mentioned, I only included the "basic" greek vocab for now 👍

@felixdittrich92 felixdittrich92 self-assigned this May 13, 2025
@felixdittrich92 felixdittrich92 added type: enhancement Improvement module: datasets Related to doctr.datasets labels May 13, 2025
@felixdittrich92 felixdittrich92 added this to the 0.12.0 milestone May 13, 2025
@metheofanis
Author

Does this mean that it will be able to OCR Greek Extended documents?

@felixdittrich92
Contributor

We are currently working on making docTR multilingual (training the recognition models on a multilingual dataset) and, as mentioned, we would include the "basic" greek vocab here as a first step.

Planned for the first stage:
latin_ext + cyrillic_ext + greek + hebrew

If we see that this works well, we could consider including greek_ext instead.

stage 2: arabic + hindi + brahmic script
stage 3: japanese + simplified chinese + korean
