Converting from PAGE to hocr creates double results

hOCR files converted from PAGE include every TextEquiv of each element, instead of a single text variant (plus, where fontshape was run, the style it determined).

I start with an empty workspace, add an image to it, and run
`ocrd process "tesserocr-recognize -P segmentation_level region -P textequiv_level word -P find_tables true -P model pol -I images -O OCR-D-OCR"`
then I annotate it with
`ocrd-tesserocr-fontshape -I OCR-D-OCR -O OCR-D-OCR-FONTSHAPE -P model pol`
and finally, convert it to hocr
`ocrd-fileformat-transform -I OCR-D-OCR-FONTSHAPE -O hocr -P from-to "page hocr"`

The resulting file has the words/segments doubled, and tripled when fontshape is used.
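
To narrow this down, it may help to check how many TextEquiv entries each Word carries in the PAGE file before conversion. The following is only a minimal sketch: the function name `textequiv_histogram` and the `SAMPLE` document are hypothetical and made up for illustration, not taken from my workspace. It matches elements by local name, so it should not depend on the exact PAGE namespace version.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local(tag: str) -> str:
    """Strip the '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def textequiv_histogram(page_xml: str) -> Counter:
    """Map 'number of direct TextEquiv children' -> 'number of Word elements'."""
    root = ET.fromstring(page_xml)
    hist = Counter()
    for elem in root.iter():
        if local(elem.tag) == "Word":
            # Count only direct children, not TextEquivs of nested Glyphs.
            n = sum(1 for child in elem if local(child.tag) == "TextEquiv")
            hist[n] += 1
    return hist


# Hypothetical sample (not real output): one Word with two TextEquivs,
# one Word with a single TextEquiv.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1">
        <Word id="w1">
          <TextEquiv index="1"><Unicode>foo</Unicode></TextEquiv>
          <TextEquiv index="2"><Unicode>fao</Unicode></TextEquiv>
        </Word>
        <Word id="w2">
          <TextEquiv><Unicode>bar</Unicode></TextEquiv>
        </Word>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    for count, words in sorted(textequiv_histogram(SAMPLE).items()):
        print(f"{words} Word(s) with {count} TextEquiv(s)")
```

If most Words in the real PAGE file show two or three TextEquivs, that would suggest the converter is emitting all of them rather than picking one.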

Converting from PAGE to hocr creates double results #34

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Converting from PAGE to hocr creates double results #34

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions