Japanese & Korean OCRs are wildly inaccurate #7708
-
Not sure if anyone here has experience OCRing Japanese and Korean subtitles. I've been trying to OCR with Tesseract and just updated to 5.3.3. I've also downloaded the Japanese and Korean language packs, including the vert ones. But no matter what I do or what I select in the OCR menu, the OCR is wildly inaccurate: it rarely gets even a single character right, especially for Japanese. It's at the point where it isn't useful, and transcribing the subtitles by hand would probably take less time than quality-checking the OCR output. Every single line is completely wrong. Am I missing something here? How do I get accurate OCR results for Japanese or Korean subtitles?
Replies: 2 comments 2 replies
-
This is likely a Tesseract issue, not an SE one. Have you tried running Tesseract from the command line? It might help to know what command SE passes to Tesseract.
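For anyone who wants to try this outside SE, here is a minimal sketch of calling the Tesseract command-line tool from Python. It is not the command SE actually passes (I don't know what that is). The helper names and the file name `line.png` are hypothetical, and it assumes `tesseract` is on your PATH with the `jpn` / `kor` traineddata installed. The `-l` and `--psm` options and the `stdout` output base are taken from the Tesseract manual page.

```python
import subprocess


def build_tesseract_command(image_path, lang="jpn", psm=6, executable="tesseract"):
    """Build the argument list for one Tesseract run (hypothetical helper).

    lang: traineddata name, e.g. "jpn", "jpn_vert", "kor", "kor_vert".
    psm:  page segmentation mode; 6 treats the image as a single uniform
          block of text, 7 as a single text line.
    """
    # "stdout" as the output base makes Tesseract print the text
    # instead of writing a .txt file.
    return [executable, str(image_path), "stdout", "-l", lang, "--psm", str(psm)]


def run_tesseract(image_path, lang="jpn", psm=6):
    """Run Tesseract on one image and return the recognized text."""
    cmd = build_tesseract_command(image_path, lang=lang, psm=psm)
    result = subprocess.run(
        cmd, capture_output=True, text=True, encoding="utf-8", check=True
    )
    return result.stdout


if __name__ == "__main__":
    # "line.png" is a placeholder for one exported subtitle image.
    print(run_tesseract("line.png", lang="jpn", psm=7))
```

If the result is just as bad when run this way, that points at Tesseract or the input image rather than at SE.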
-
The Tesseract manual page is here: TESSERACT(1) Manual Page. As you can see, there are many options. At a minimum you need to specify the language, and 8-bit black text on a clean white background works best. I don't know any Korean, but I recall reading that characters with vertical lines on the left side (common in Hangul) do not OCR well.
If you have a specific problem, you can bring it up on the Tesseract issues page.
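To make the "black on clean white" suggestion concrete, here is a rough sketch of the idea in plain Python. It assumes the subtitle image is light text on a dark background (common for bitmap subtitles, but an assumption here) and that you already have the pixels as rows of `(r, g, b)` tuples. The function name and the threshold of 128 are made up for illustration; in practice you would do this with an image library or an image editor.

```python
def to_black_on_white(rows, threshold=128):
    """Turn light-on-dark RGB pixels into 8-bit black-on-white values.

    rows: list of rows, each a list of (r, g, b) tuples in 0..255.
    Returns rows of grayscale values: 0 (black) where the input was
    light, which is assumed to be the text, and 255 (white) elsewhere.
    """
    result = []
    for row in rows:
        out_row = []
        for r, g, b in row:
            # Standard luma weights for converting RGB to grayscale.
            luminance = 0.299 * r + 0.587 * g + 0.114 * b
            # Light pixels become black text, everything else white.
            out_row.append(0 if luminance >= threshold else 255)
        result.append(out_row)
    return result


if __name__ == "__main__":
    sample = [[(255, 255, 255), (0, 0, 0)], [(10, 10, 10), (200, 200, 200)]]
    for line in to_black_on_white(sample):
        print(line)
```

A fixed threshold is the crudest possible choice. Subtitles with outlines or shadows may need a different cutoff so the outline ends up white rather than black.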