Ancient Greek characters #1935
Comments
Hi @metheofanis 👋, thanks for pointing this out 👍 24 basic letters + 2 forms for sigma + upsilon + Xi <- that's what all resources share
Yes, exactly. Greek Extended is the correct Unicode range and character set!
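For reference, a minimal sketch of how the assigned characters of the Greek Extended block (U+1F00 to U+1FFF) could be listed with Python's standard `unicodedata` module. The helper name `assigned_chars` is made up for illustration and is not part of docTR.

```python
import unicodedata

# Unicode block "Greek Extended": U+1F00 .. U+1FFF (end of range is exclusive)
GREEK_EXTENDED = range(0x1F00, 0x2000)


def assigned_chars(block):
    """Return the characters of `block` that have a Unicode name, i.e. are assigned."""
    return [chr(cp) for cp in block if unicodedata.name(chr(cp), None) is not None]


greek_extended_chars = assigned_chars(GREEK_EXTENDED)
# Print how many code points are assigned and a short sample of them.
print(len(greek_extended_chars), "".join(greek_extended_chars[:8]))
```

Some code points in the block are unassigned, so the list is shorter than the 256 slots of the range.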
If you also need a wordlist, I could provide one.
@metheofanis This would be good indeed 👍 For example:
There is mostly no visual difference - same for other chars 🤔 So we can add it as a vocab, but then I would suggest providing it as an extra entry instead of merging the current Greek vocab with the new chars.
Yes, you are right. Many of the characters look almost the same. It looks like we need a specialized engine for Ancient (Polytonic) Greek: one process to find the character and another process to find the correct accent marks. I'm not an expert in OCR or in programming, so please do whatever you think is best. I'll be glad to help. The file I gave above has duplicates removed. There are some characters that are defined both in Greek and in Greek Extended. They will always look the same.
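To illustrate the two points above, here is a small sketch using only Python's standard `unicodedata` module. It is not how docTR works; the function names `split_base_and_marks` and `same_after_nfc` are hypothetical. Canonical decomposition (NFD) separates a polytonic character into its base letter and its combining marks, and NFC normalization shows which code points in Greek and Greek Extended are canonical duplicates of each other.

```python
import unicodedata


def split_base_and_marks(ch):
    """Split one character into (base letter, names of its combining marks) via NFD."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0]
    marks = [unicodedata.name(m) for m in decomposed[1:] if unicodedata.combining(m)]
    return base, marks


def same_after_nfc(a, b):
    """True if two strings are canonically equivalent (identical after NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# U+1F84: alpha with psili, oxia and ypogegrammeni
print(split_base_and_marks("\u1f84"))
# U+03AC (Greek block, alpha with tonos) vs U+1F71 (Greek Extended, alpha with oxia)
print(same_after_nfc("\u03ac", "\u1f71"))
```

Normalizing a character list with NFC before deduplicating would collapse such pairs into a single entry, which matches the observation that they always look the same.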
OK, the complete set would then be:
(extracted from your .txt file)
Some thoughts:
Hi @metheofanis 👋, I updated the Greek vocab in #1928 and added an extra entry for the full extended version (Greek Extended), but as already mentioned, only the "basic" Greek vocab is included for now 👍
Does this mean that it will be able to OCR Greek Extended documents?
We are currently working on making docTR multilingual (training the recognition models on a multilingual dataset) and, as mentioned, we would include the "basic" Greek vocab here as a first step. For the first stage it's planned: if we see that this works well, we could consider including greek_ext instead. Stage 2: Arabic + Hindi + Brahmic script
It looks like most of the Ancient Greek characters are missing from vocabs.py.
Am I missing something?
I attach the full set below.
grc_chars.txt