Open
Labels
bug: Something isn't working
Description
hOCR files converted from PAGE contain every TextEquiv, as opposed to a single variant and, when fontshape has been run, the style determined by fontshape.
I start with an empty workspace, add an image to it, and run
ocrd process "tesserocr-recognize -P segmentation_level region -P textequiv_level word -P find_tables true -P model pol -I images -O OCR-D-OCR"
then I annotate it with
ocrd-tesserocr-fontshape -I OCR-D-OCR -O OCR-D-OCR-FONTSHAPE -P model pol
and finally convert it to hOCR:
ocrd-fileformat-transform -I OCR-D-OCR-FONTSHAPE -O hocr -P from-to "page hocr"
The resulting file has the words/segments doubled and, when fontshape is used, tripled.
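To check whether the duplication already exists in the PAGE input (several TextEquiv children per Word) or only appears during conversion, something like the following sketch could help. It uses only Python's standard library and matches elements by local name, so it does not depend on a particular PAGE namespace version. The function name `textequiv_counts` and the `SAMPLE` document are hypothetical and made up for illustration; they are not output of the commands above.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local(tag):
    """Strip the '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit('}', 1)[-1]


def textequiv_counts(page_xml):
    """Return a Counter mapping 'number of direct TextEquiv children'
    to 'how many Word elements have that many'."""
    root = ET.fromstring(page_xml)
    counts = Counter()
    for elem in root.iter():
        if local(elem.tag) == "Word":
            n = sum(1 for child in elem if local(child.tag) == "TextEquiv")
            counts[n] += 1
    return counts


# Hypothetical, hand-written PAGE fragment (assumption, not real OCR output):
# the first Word carries two TextEquiv variants, the second only one.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1">
        <Word id="w1">
          <TextEquiv index="1"><Unicode>kot</Unicode></TextEquiv>
          <TextEquiv index="2"><Unicode>kat</Unicode></TextEquiv>
        </Word>
        <Word id="w2">
          <TextEquiv><Unicode>pies</Unicode></TextEquiv>
        </Word>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    # Keys greater than 1 mean some words carry more than one TextEquiv.
    print(dict(textequiv_counts(SAMPLE)))
```

Running the same function on the text of a PAGE file from OCR-D-OCR and from OCR-D-OCR-FONTSHAPE would show whether words there carry more than one TextEquiv, which would be consistent with the doubled and tripled words in the hOCR output.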