NoraInspection

Overview

To select the most effective method of text extraction and to parameterize text correction, it is relevant to distinguish the various ways in which the PDF documents in the NORA collection were produced, e.g. with LaTeX vs. Microsoft Word vs. other word processing tools. In the LaTeX world, for example, it may matter which specific approach was used to output PDF, e.g. latex plus dvips plus ps2pdf, vs. pdflatex, vs. integrated tools like MiKTeX. Likewise, when using Word, results may vary according to which specific software version was used, or depending on whether Adobe Distiller or another tool was used for PDF creation. When it comes to font choices and character encodings, it might also turn out that more basic properties of the original environment used for PDF creation are relevant, e.g. the choice of operating system (Linux vs. Windows, say) and default locale settings.

Presumably many dozens or even hundreds of distinct software environments were at play in the production of the NORA PDF files, and hopefully most of this variation will be irrelevant for the WeScience₀ effort. Furthermore, only quite limited information about the original environment is recorded in the PDF files, hence it may at times be impossible to place a document exactly along the dimensions of variation listed above. However, we need to find out what information about the production process actually is available in PDF files, and we will need a simple tool to inspect PDF meta information and extract relevant parameters. It is possible that some of the text extraction tools for PDF (see the NoraExtraction page) can be put to use in this task too.
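As a starting point for such an inspection tool, here is a minimal sketch in Python. It assumes that the pdfinfo command-line program (from Poppler or Xpdf) is installed; this page does not name that tool, so treat it as one possible choice. The function names parse_pdfinfo and inspect_pdf are made up for illustration.

```python
import subprocess


def parse_pdfinfo(text):
    """Parse 'Key:   value' lines, as printed by pdfinfo, into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as dates stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def inspect_pdf(path):
    """Return the (Creator, Producer) fields of one PDF file.

    Assumption: the pdfinfo executable is available on the PATH.
    """
    result = subprocess.run(
        ["pdfinfo", path],
        capture_output=True,
        text=True,
        errors="replace",  # tolerate oddly encoded metadata strings
        check=True,
    )
    info = parse_pdfinfo(result.stdout)
    return info.get("Creator", ""), info.get("Producer", "")
```

Calling inspect_pdf on each file in the collection and writing out the two fields would give one line per document. Other parameters that pdfinfo reports (e.g. the PDF version) could be read from the same dictionary.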

Preliminary Notes

A report on observations in the extraction of PDF documents can be downloaded here: [http://folk.uio.no/gisley/wescience0/ola-duokonvnot1.pdf]

Metadata Survey

"Producer/Creator" fields from circa 3056 documents: [http://heim.ifi.uio.no/olasba/nora/metadata-sort1.txt]

Running grep -i word metadata-sort1.txt | wc -l yields 1784 entries where Word was involved in some way. The same kind of count gives 259 entries for "tex" and 557 for "ghostscript".
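The same kind of count can be sketched in Python, which makes it easy to tally several keywords in one pass. This is only an illustration of the grep -i ... | wc -l approach: the function names are hypothetical, and, like grep, a plain substring match on "tex" would also count unrelated strings such as "text".

```python
def count_lines_containing(lines, keyword):
    """Count lines containing keyword, ignoring case (like grep -i | wc -l)."""
    needle = keyword.lower()
    return sum(1 for line in lines if needle in line.lower())


def survey(path, keywords):
    """Return {keyword: matching line count} for one metadata listing.

    Assumption: the file holds one Producer/Creator entry per line.
    """
    with open(path, encoding="utf-8", errors="replace") as handle:
        lines = handle.readlines()
    return {kw: count_lines_containing(lines, kw) for kw in keywords}
```

For example, survey("metadata-sort1.txt", ["word", "tex", "ghostscript"]) would return the three counts as a dictionary.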

A rough survey of this metadata was not obviously helpful. There is a great variety of versions, and there are significant differences even among, e.g., the four documents produced with MiKTeX. The great versatility of TeX may actually work against us, whereas the large percentage of documents from Word may make things easier.

Pathological Documents

(ID numbers follow DUO)

Documents that are impaired or unreadable due to unconventional font encoding:

[http://www.duo.uio.no/sok/work.html?WORKID=67495 67495] (Svendsen 2007) - GPL Ghostscript 8.54 (very curious gibberish, mixing two styles)
[http://www.duo.uio.no/sok/work.html?WORKID=70059 70059] (Bendiksen 2008) - TeX, AFPL Ghostscript 6.50
[http://www.duo.uio.no/sok/work.html?WORKID=78191 78191] (Ulvestad 2008) - GPL Ghostscript SVN PRE-RELEASE 8.61
[http://www.duo.uio.no/sok/work.html?WORKID=78892 78892] (Hanssen 2008) - (Unknown), AFPL Ghostscript 8.51
[http://www.duo.uio.no/sok/work.html?WORKID=86557 86557] (Brændshøi 2008) - TeX, pdfTeX-1.40.3

Ghostscript documents which only feature unconverted glyphs on their front pages:

[http://www.duo.uio.no/sok/work.html?WORKID=65770 65770] (Thoresen 2007) - "x1x14x2x24x14x9x5x11x26x27x13x14x2x4x2x8"
[http://www.duo.uio.no/sok/work.html?WORKID=74555 74555] (Furuheim & Aasen 2008) - "x27x18x24x19x24x25x17x28x10 x22x24x21x19x26x10x24x21"
[http://www.duo.uio.no/sok/work.html?WORKID=79895 79895] (Pedersen 2008) - "D4CPCRCZCTD8 CPD2CPD0DDDECTD6 D3D2 CPD2"
[http://www.duo.uio.no/sok/work.html?WORKID=61749 61749] (Johansen 2007) - Actual control characters. PScript5.dll Version 5.2.2, GNU Ghostscript 7.06

In all these cases, other documents with an identical Creator field did not suffer similar problems.
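Since the Creator field does not predict these failures, they probably have to be detected in the extracted text itself. The following is a rough sketch of two such checks, written in Python: one looks for runs of "x" followed by a number, as in the first two examples above, and the other measures the share of control characters. The function names, the regular expression and the run length of four are assumptions chosen for illustration. The hex-like residue seen in 79895 is not covered.

```python
import re
import unicodedata

# Four or more consecutive tokens like x1, x14, x27 (assumed pattern).
GLYPH_RUN = re.compile(r"(?:x\d{1,3}){4,}")


def has_glyph_residue(text):
    """True if the text contains a run of unconverted glyph codes."""
    return GLYPH_RUN.search(text) is not None


def control_char_ratio(text):
    """Fraction of characters that are control characters.

    Ordinary whitespace controls (newline, carriage return, tab,
    form feed) are not counted.
    """
    if not text:
        return 0.0
    bad = sum(
        1
        for ch in text
        if unicodedata.category(ch) == "Cc" and ch not in "\n\r\t\f"
    )
    return bad / len(text)
```

A front page could be flagged when has_glyph_residue returns True or when control_char_ratio exceeds some threshold. A suitable threshold would have to be found by experiment on the collection.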

Spacing issues:

[http://www.duo.uio.no/sok/work.html?WORKID=88173 88173] (Lungo 2008) - Every space is replaced by the string "g561". The producer is Acrobat Distiller 8.1.0.
[http://www.duo.uio.no/sok/work.html?WORKID=79764 79764] (Berg 2008) - Newline appears after most characters, resulting in an extremely vertical document.
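Both spacing problems lend themselves to simple checks on the extracted text. The Python sketch below is only illustrative. It assumes that "g561" never occurs as part of a real word, and it treats a document as "vertical" when its average non-empty line is shorter than two characters; that threshold is a guess. The function names are hypothetical.

```python
def restore_spaces(text, marker="g561"):
    """Replace every occurrence of the stray marker with a space."""
    return text.replace(marker, " ")


def mean_line_length(text):
    """Average length of the non-empty lines (0.0 if there are none)."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(len(line) for line in lines) / len(lines)


def looks_vertical(text, threshold=2.0):
    """True if most lines hold only a character or so (assumed threshold)."""
    lines_present = any(line.strip() for line in text.splitlines())
    return lines_present and mean_line_length(text) < threshold
```

For instance, restore_spaces("Thisg561isg561text") gives "This is text". Repairing a vertical document would need more than detection, since real line breaks and spurious ones have to be told apart.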

Blank/unconverted documents:

[http://www.duo.uio.no/sok/work.html?WORKID=79631 79631] (Huse 2008) - TeX, GPL Ghostscript SVN PRE-RELEASE 8.61
[http://www.duo.uio.no/sok/work.html?WORKID=82351 82351] (blomskold?) - LaTeX with hyperref package, pdfTeX-1.40.3

For a more comprehensive list, look at the last screenfuls of the output of ls -lS /logon/scratch/johanbev/raw-output/ (sorted by size, so the smallest output files come last).
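The same listing can be produced programmatically, which is convenient when the smallest output files are to be fed into further checks. A minimal Python sketch follows; the function name and the default of 20 entries are arbitrary choices for illustration.

```python
import os


def smallest_files(directory, n=20):
    """Return the n smallest regular files as (size_in_bytes, name) pairs."""
    entries = []
    with os.scandir(directory) as it:
        for entry in it:
            if entry.is_file():
                entries.append((entry.stat().st_size, entry.name))
    # Tuples sort by size first, then by name, smallest first.
    return sorted(entries)[:n]
```

Files of size zero or near zero at the top of this list are the likely blank or unconverted documents.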
