Two-column TOC in a scanned/OCR'd PDF -- suggestion for best tool to extract the TOC text? #1

@stillhope

Hello, can you suggest the best tool for extracting the TOC text from a two-column TOC? The PDF is scanned and OCR'd.

The problem with OCR space is that it does not read the text column by column (first column, then second column). Instead it reads left to right across the whole page, so text from the two columns is interleaved and ends up in the wrong place.

For example, this is the extraction result from OCR space (Chapter Six is in column 2 of the TOC, yet the tool has read it on line 1):

Contents
Number Chapter Six: Units..............:.......48
Length, mass, capacity
Chapter One: Types Of and time.... ....
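
To make the column-order problem concrete, here is a minimal sketch of the reordering I am after. It assumes the OCR tool can export each word with its left/top position; the `Word` record, the `split_x` column boundary and the `line_tol` tolerance are all hypothetical names chosen for illustration, not part of any tool's API.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """Hypothetical OCR word record: text plus top-left position in pixels."""
    text: str
    left: float
    top: float


def _join(words):
    """Join the words of one text line in left-to-right order."""
    return " ".join(w.text for w in sorted(words, key=lambda w: w.left))


def read_in_column_order(words, split_x, line_tol=5.0):
    """Return text lines with the left column first, then the right column.

    split_x  -- assumed x position of the gutter between the two columns
    line_tol -- words whose tops differ by at most this much share a line
    """
    lines = []
    columns = (
        [w for w in words if w.left < split_x],
        [w for w in words if w.left >= split_x],
    )
    for column in columns:
        current, current_top = [], None
        for w in sorted(column, key=lambda w: (w.top, w.left)):
            # A large vertical jump means a new text line has started.
            if current and abs(w.top - current_top) > line_tol:
                lines.append(_join(current))
                current = []
            if not current:
                current_top = w.top
            current.append(w)
        if current:
            lines.append(_join(current))
    return lines


if __name__ == "__main__":
    sample = [
        Word("Contents", 50, 10),
        Word("Chapter", 400, 10), Word("Six", 470, 10),
        Word("Chapter", 50, 40), Word("One", 120, 41),
    ]
    for line in read_in_column_order(sample, split_x=300):
        print(line)
```

With a plain left-to-right read, "Chapter Six" would land next to "Contents"; splitting at the gutter first keeps each column together.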

The problem with Tabular is that I could not find any template for a two-column TOC. As a new user I tried to create my own template, and it did a very average job (e.g. it did not recognise the end of a sentence, and it kept the leading ..... before the page number). I could not find any automated scripts for the Sublime Text editor to handle the typical TOC text-editing issues either. Sample output:

Nuntber,
Chapter One: Types of,
number ........................................... 2,
Squares and square roots .................,2
Cubes and cube roots .......................,2
Multiples .......................................,4
Prime factorisation ..........................,6
Chapter Two: Using numbers .....1 0,
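
The kind of clean-up I currently do by hand could be sketched roughly as below. This is only an illustration under assumptions: the function name `split_toc_line` is made up, a run of two or more dots is treated as the leader, and a space inside the page number (as in `1 0`) is assumed to be an OCR artifact and is removed.

```python
import re

# Title, then a dot leader (2+ dots, possibly with stray ':' or ','),
# then the page number, then optional trailing commas/whitespace.
LEADER = re.compile(
    r"^(?P<title>.*?)\s*\.{2,}[\s.,:]*(?P<page>\d[\d ]*)[\s,]*$"
)


def split_toc_line(line):
    """Split one raw TOC line into (title, page).

    page is None when the line has no dot leader and page number,
    e.g. a heading that continues on the next line.
    """
    line = line.strip()
    match = LEADER.match(line)
    if match is None:
        return line.rstrip(",").strip(), None
    # Assumption: spaces inside the digits are OCR noise ("1 0" -> 10).
    page = int(match.group("page").replace(" ", ""))
    return match.group("title"), page


if __name__ == "__main__":
    for raw in [
        "Squares and square roots .................,2",
        "Chapter Two: Using numbers .....1 0,",
        "Chapter One: Types of,",
    ]:
        print(split_toc_line(raw))
```

Lines without a page number are returned with `None` so that wrapped titles (such as "Chapter One: Types of," followed by "number ...") can still be joined in a later step.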

Tabular is better than OCR space in that the text comes out in the correct order, but it still takes a lot of manipulation in Sublime Text to get the TOC text file into the layout required to auto-create TOC bookmarks in the PDF (i.e. using one of the apps pdftk or jpdfbookmarks).
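
For reference, the target layout for pdftk is the bookmark block format that `pdftk ... dump_data` prints and `update_info` reads back. A minimal sketch of generating it from cleaned (title, page) pairs follows; the function name, the rule that titles starting with "Chapter" are level 1 and everything else level 2, and the `page_offset` parameter (printed page number versus PDF page number) are my own assumptions.

```python
def to_pdftk_bookmarks(entries, page_offset=0):
    """Render (title, page) pairs as pdftk bookmark blocks.

    page_offset -- assumed difference between the printed page number
                   in the TOC and the page index inside the PDF file
    """
    blocks = []
    for title, page in entries:
        # Assumption: chapter headings are top level, the rest nest below.
        level = 1 if title.lower().startswith("chapter") else 2
        blocks.append(
            "BookmarkBegin\n"
            f"BookmarkTitle: {title}\n"
            f"BookmarkLevel: {level}\n"
            f"BookmarkPageNumber: {page + page_offset}\n"
        )
    return "".join(blocks)


if __name__ == "__main__":
    toc = [
        ("Chapter One: Types of number", 2),
        ("Squares and square roots", 2),
        ("Multiples", 4),
    ]
    print(to_pdftk_bookmarks(toc, page_offset=0), end="")
```

The resulting text can be saved to a file, for example `bookmarks.txt`, and applied with `pdftk in.pdf update_info bookmarks.txt output out.pdf`.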

Tabular currently offers no way to ask questions or get help. On GitHub, its Issues tab is not showing.
