NoraExtraction

Overview

Many packages exist for extracting text from PDF. Some are based on OCR-like techniques (primarily for scanned documents); others work as limited PDF interpreters, reading out a plain text stream from 'digitally born' documents. One of the more widely used packages appears to be Apache [http://incubator.apache.org/pdfbox/ PDFBox], which we will evaluate as our baseline, in parallel to much ongoing work in the international ACL community.

Other open-source tools that we should assess include [http://pdftohtml.sourceforge.net/ PDFtoHTML], [http://poppler.freedesktop.org/ Poppler], and [http://www.unixuser.org/~euske/python/pdfminer/index.html PDFMiner]. For a smaller sample of NORA documents, it may also make sense to compare against non-open tools like [http://a-pdf.com/text/index.htm A-PDF Text Extractor] and Adobe Acrobat. Some of these packages were briefly discussed at the 2009 DELPH-IN Summit; please see the [http://wiki.delph-in.net/moin/BarcelonaPreprocessing discussion notes] for details.
