NoraExtraction
Many packages exist for extracting text from PDF. Some are based on OCR-like techniques (primarily for scanned documents); others work as limited PDF interpreters, reading out a pure text stream from 'born-digital' documents. One of the more widely used packages appears to be Apache [http://incubator.apache.org/pdfbox/ PDFBox], which we will evaluate as our baseline, in parallel to much ongoing work in the international ACL community.
Other open-source tools that we should assess include [http://pdftohtml.sourceforge.net/ PDFtoHTML], [http://poppler.freedesktop.org/ Poppler], and [http://www.unixuser.org/~euske/python/pdfminer/index.html PDFMiner]. For a smaller sample of NORA documents, it may also make sense to look contrastively at non-open tools like [http://a-pdf.com/text/index.htm A-PDF Text Extractor] and Adobe Acrobat. There are some related [http://wiki.delph-in.net/moin/BarcelonaPreprocessing discussion notes] from the 2009 DELPH-IN Summit.
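
Once each tool has produced plain text for the same document, one possible way to make the contrastive assessment concrete is to score every output against the baseline. The sketch below is only an illustration: the function names (`normalize`, `similarity`, `compare_outputs`) are hypothetical, the choice of token-level similarity is an assumption rather than a method settled in these notes, and it expects the extracted text to be available already as strings, however each tool was invoked.

```python
from __future__ import annotations

import difflib


def normalize(text: str) -> list[str]:
    """Split on any whitespace so that layout differences (line breaks,
    runs of spaces) do not dominate the comparison."""
    return text.split()


def similarity(a: str, b: str) -> float:
    """Token-level similarity in [0, 1] between two extracted texts."""
    matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b), autojunk=False)
    return matcher.ratio()


def compare_outputs(outputs: dict[str, str], baseline: str) -> dict[str, float]:
    """Score every tool's output against the baseline tool's output.

    `outputs` maps a tool label (hypothetical, e.g. "pdfbox") to the text
    that tool extracted from one document.
    """
    base_text = outputs[baseline]
    return {
        name: similarity(base_text, text)
        for name, text in outputs.items()
        if name != baseline
    }
```

A low score only signals that two tools disagree, not which one is right; documents with low agreement would still need manual inspection.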