Japanese & Korean OCRs are wildly inaccurate #7708

thealexmay · 2023-12-05T17:10:14Z

Not sure if anyone here has experience OCRing Japanese and Korean subtitles. I've been trying to OCR through Tesseract, just updated to 5.3.3. I've downloaded the Japanese and Korean language packs as well, including the vert ones.

But no matter what I do or what I select in the OCR menu, the OCR is wildly inaccurate. It rarely gets even a single character right, especially for Japanese. It's at the point where it isn't useful: transcribing the subtitles by hand would probably take less time than quality-checking the OCR. Every single line is completely wrong.

Am I missing something here? How do I get accurate OCRs with Japanese or Korean subtitles?

Answered by coastal45 on Dec 14, 2023

coastal45 · 2023-12-11T17:48:10Z

This is likely a Tesseract issue, not an SE one. Have you tried running Tesseract from the command line?
In any case, many factors affect Tesseract's accuracy. Getting more familiar with Tesseract itself might help, if you aren't already.
East Asian characters are also more complex than Western ones, so OCR is inherently more difficult for them.

It might help to know what command SE passes to Tesseract.
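
For anyone who has not run Tesseract from the command line before, a direct call can be sketched in Python roughly as follows. This is only a minimal sketch and not the command SE actually passes: the helper names `build_tesseract_cmd` and `run_ocr` and the file name `line.png` are made up for illustration, and it assumes the `tesseract` executable is on PATH with the `jpn` or `kor` language data installed. The `-l` (language) and `--psm` (page segmentation mode) options are described in the Tesseract manual page; mode 6 assumes a single uniform block of text and mode 7 a single text line.

```python
import shutil
import subprocess


def build_tesseract_cmd(image_path, lang="jpn", psm=6):
    """Build the argument list for a Tesseract CLI call.

    "stdout" as the output base makes Tesseract print the recognized
    text instead of writing a .txt file.
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def run_ocr(image_path, lang="jpn", psm=6):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_cmd(image_path, lang, psm),
        capture_output=True,
        encoding="utf-8",  # Tesseract emits UTF-8 text
        check=True,        # raise CalledProcessError if Tesseract fails
    )
    return result.stdout


# Example call (hypothetical file name), one subtitle line in Korean:
# print(run_ocr("line.png", lang="kor", psm=7))
```

Running the same image with a few different `psm` values, or with `jpn_vert` / `kor_vert` for vertical text, is a quick way to see whether the problem lies in the settings or in the image itself.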

1 reply

thealexmay Dec 13, 2023
Author

I haven't tried running Tesseract from the command line. I don't know how to do that as I've never done it. I can do some googling and give it a shot.

Japanese I could see being problematic in terms of complex characters, especially with Kanji thrown in there. But I would think Korean would be alright since Hangul is fairly consistent.

coastal45 · 2023-12-14T05:37:21Z

The Tesseract manual page is here: TESSERACT(1) Manual Page. As you can see, there are many options. At a minimum you need to specify the language, and 8-bit black text on a clean white background works best. I don't know any Korean, but I seem to recall reading that characters with vertical lines on the left side (common in Hangul) do not OCR well.
If you have a specific problem, you can bring it up on the Tesseract issues page.
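
The "8-bit black on clean white background" advice can be sketched as a small preprocessing step. The code below is a rough illustration only, using plain Python lists so it has no dependencies: the function names `to_black_on_white` and `pgm_bytes`, the fixed threshold of 128, and the "background is the majority of pixels" rule are all assumptions made for this example, not anything SE or Tesseract does internally. In practice the pixel rows would come from whatever image library you use to load the subtitle image.

```python
def to_black_on_white(rows, threshold=128):
    """Binarize 8-bit grayscale rows so text is black (0) on white (255).

    Assumption: the background covers most of the image. If most pixels
    are darker than the threshold (e.g. white subtitle text on a dark
    background), the result is inverted.
    """
    flat = [p for row in rows for p in row]
    dark = sum(1 for p in flat if p < threshold)
    invert = dark > len(flat) / 2

    out = []
    for row in rows:
        new_row = []
        for p in row:
            is_light = p >= threshold
            if invert:
                is_light = not is_light
            new_row.append(255 if is_light else 0)
        out.append(new_row)
    return out


def pgm_bytes(rows):
    """Encode grayscale rows as a binary 8-bit PGM (P5) image."""
    height = len(rows)
    width = len(rows[0])
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(p for row in rows for p in row)


# Example: light "text" pixels on a dark background get inverted.
sample = [
    [0, 0, 0, 0],
    [0, 255, 255, 0],
    [0, 0, 0, 0],
]
cleaned = to_black_on_white(sample)
```

The bytes from `pgm_bytes(cleaned)` can be written to a `.pgm` file and passed to Tesseract, which reads PNM images through Leptonica. A fixed threshold is the simplest possible choice; subtitles with outlines, shadows, or coloured text may need something more careful.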

1 reply

thealexmay Dec 16, 2023
Author

Thanks for the link, I'll do some digging and testing with Tesseract and see if I can get it to be more consistent.
