
NoraExtraction

StephanOepen edited this page Aug 26, 2009 · 29 revisions

Overview

Many packages exist for extracting text from PDF. Some rely on OCR-like techniques (primarily for scanned documents); others work as limited PDF interpreters, reading out a plain text stream from 'digitally born' documents. One of the more widely used packages appears to be Apache [http://incubator.apache.org/pdfbox/ PDFBox], which we will evaluate as our baseline, in parallel with much ongoing work in the international ACL community.

Other open-source tools that we should assess include [http://pdftohtml.sourceforge.net/ PDFtoHTML], [http://poppler.freedesktop.org/ Poppler], and [http://www.unixuser.org/~euske/python/pdfminer/index.html PDFMiner]. For a smaller sample of NORA documents, it may also make sense to compare against non-open tools like [http://a-pdf.com/text/index.htm A-PDF Text Extractor] and Adobe Acrobat. Some of these packages were briefly discussed at the 2009 DELPH-IN Summit; please see the [http://wiki.delph-in.net/moin/BarcelonaPreprocessing discussion notes] for details.
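
One possible way to make such an assessment concrete is to score each tool's output against a hand-corrected reference text for the same document. The sketch below is only an illustration under stated assumptions: this page does not define an evaluation metric, so the character-level similarity ratio, the whitespace normalization, and the function names `normalize`, `similarity`, and `rank_tools` are all hypothetical choices. It uses only the Python standard library and assumes the text has already been extracted by each tool.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse all whitespace runs so line-break differences do not dominate."""
    return " ".join(text.split())


def similarity(extracted: str, reference: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means identical after normalization.

    Assumption: a character-level similarity ratio is an acceptable
    stand-in for whatever metric the evaluation eventually adopts.
    """
    matcher = SequenceMatcher(
        None, normalize(extracted), normalize(reference), autojunk=False
    )
    return matcher.ratio()


def rank_tools(outputs: dict[str, str], reference: str) -> list[tuple[str, float]]:
    """Sort tools (hypothetical labels mapped to their extracted text) best first."""
    scores = {name: similarity(text, reference) for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling `rank_tools` with a dictionary that maps a tool label to its extracted text, plus the reference text, returns the labels ordered by score. Character-level matching is quadratic in the worst case, so for long documents it may be preferable to compare page by page.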
