Text extraction from an Image using OCR stored in PDF #4414
Unanswered
Prasaderp asked this question in Looking for help
PyMuPDF supports "partial OCR":
... get_textpage_ocr which results in a joint "corpus" of all text on the page. The extraction sequence of this is however
So you need to sort by geometrical information when required. A good first approximation can be achieved by this snippet:

    # full=False requests partial OCR: only the image areas of the page are OCRed
    textpage = page.get_textpage_ocr(dpi=150, full=False)  # further optional arguments omitted
    blocks = page.get_text("blocks", textpage=textpage, sort=True)
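For reference, here is a slightly fuller sketch of the same idea. It is not an official recipe: the helper names `sort_blocks` and `extract_text_with_partial_ocr` and the `line_tol` tolerance are made up for illustration. It assumes a PyMuPDF version that can be imported as `pymupdf`, and that Tesseract and its language data are installed where PyMuPDF can find them. The manual sort is only a rough alternative to `sort=True`, for cases where an image title and the neighbouring text start at slightly different heights.

```python
from typing import List, Tuple

# One entry of page.get_text("blocks"):
# (x0, y0, x1, y1, text, block_no, block_type)
Block = Tuple[float, float, float, float, str, int, int]


def sort_blocks(blocks: List[Block], line_tol: float = 3.0) -> List[Block]:
    """Order blocks top-to-bottom, then left-to-right.

    Blocks whose top edge lies within line_tol points of the first block
    of a row are treated as belonging to that row (hypothetical heuristic).
    """
    rows: List[List[Block]] = []
    for block in sorted(blocks, key=lambda b: (b[1], b[0])):
        if rows and abs(block[1] - rows[-1][0][1]) <= line_tol:
            rows[-1].append(block)
        else:
            rows.append([block])
    ordered: List[Block] = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda b: b[0]))
    return ordered


def extract_text_with_partial_ocr(pdf_path: str, dpi: int = 150) -> List[str]:
    """Return one string per page, combining normal text and OCRed image text."""
    import pymupdf  # PyMuPDF; older releases are imported as "fitz"

    pages: List[str] = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            # full=False: OCR only the image areas, keep normal text as it is
            textpage = page.get_textpage_ocr(dpi=dpi, full=False)
            blocks = page.get_text("blocks", textpage=textpage)
            pages.append("\n".join(b[4].strip() for b in sort_blocks(blocks)))
    return pages
```

Calling `extract_text_with_partial_ocr("input.pdf")` (the file name is a placeholder) should then give the title text from the image together with the regular text of each page. A higher `dpi` may improve recognition at the cost of speed.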
I have a PDF whose title is stored as an image, alongside normal text. How can I extract both the normal text and the OCR text from the image together using PyMuPDF? I am able to extract the normal text from the PDF, but I also want to extract the text in the image, which is actually the title of the PDF.
In the image below, I can extract all the normal text, which I have selected using Ctrl+A, but some text inside the images (e.g. "Attention to", "Farm") cannot be extracted. How can I achieve that too?