NoraExtraction

Overview

Many packages exist for extracting text from PDF. Some are based on OCR-like techniques (primarily for scanned documents); others work as limited PDF interpreters, reading out a pure text stream from 'digitally born' documents. One of the more widely used packages appears to be Apache [http://incubator.apache.org/pdfbox/ PDFBox], which we will evaluate as our baseline, in parallel with much ongoing work in the international ACL community.

Other open-source tools that we should assess include [http://pdftohtml.sourceforge.net/ PDFtoHTML], [http://poppler.freedesktop.org/ Poppler], and [http://www.unixuser.org/~euske/python/pdfminer/index.html PDFMiner]. For a smaller sample of NORA documents, it may also make sense to look contrastively at non-open tools like [http://a-pdf.com/text/index.htm A-PDF Text Extractor] and Adobe Acrobat. There are some related [http://wiki.delph-in.net/moin/BarcelonaPreprocessing discussion notes] from the 2009 DELPH-IN Summit.
