Notebook
I wrote a short script to split the manifest for the Kennan Papers (MC076) into individual containers (1,276 in number) and generate a graph of named entities for each. This is a resource-intensive process (each master TIFF must be downloaded, run through OCR, and processed with SpaCy); using my laptop connected to the Internet over a standard home FIOS connection, I was able to process 501 containers in three days.
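The splitting step can be sketched in a few lines. This is a minimal illustration, not the actual script: it assumes the finding aid is exposed as a IIIF Presentation 2 collection whose `members` (or `manifests`) array lists one manifest per container, and it omits the expensive part (downloading each canvas's master TIFF, OCR, and SpaCy). The URLs and structure below are toy stand-ins.

```python
import json

def container_manifests(collection: dict) -> list[str]:
    """Return the @id of each member manifest in a IIIF v2 collection.

    A collection may list its children under 'members' or 'manifests';
    check both (an assumption about how the collection is serialized).
    """
    members = collection.get("members") or collection.get("manifests") or []
    return [m["@id"] for m in members if m.get("@type") == "sc:Manifest"]

# Toy stand-in for the MC076 collection manifest:
collection = {
    "@type": "sc:Collection",
    "members": [
        {"@id": "https://example.org/manifests/box1", "@type": "sc:Manifest"},
        {"@id": "https://example.org/manifests/box2", "@type": "sc:Manifest"},
    ],
}
print(container_manifests(collection))
# ['https://example.org/manifests/box1', 'https://example.org/manifests/box2']
```

Each returned manifest URL then becomes one unit of work: fetch it, walk its canvases, and run the OCR-and-NER pipeline per canvas.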
I then loaded all 501 graphs into a local instance of GraphDB-Free running on my laptop, resulting in approximately 40 million statements.
Below are some exploratory SPARQL queries. Recall that named entities are represented as Symbolic Objects (Appellations, when all is said and done) which have been recorded as Inscriptions on IIIF Canvases.
How many pages are we talking about?
```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ecrm: <http://erlangen-crm.org/200717/>
PREFIX entity: <https://figgy.princeton.edu/concerns/entities/>
PREFIX etype: <https://figgy.princeton.edu/concerns/adam/>
PREFIX inscription: <https://figgy.princeton.edu/concerns/inscriptions/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select distinct ?canvas {
  ?something ecrm:P128i_is_carried_by ?canvas .
}
```

SPARQL returns 26,917 results: there are about 27,000 pages in this sample.
How many named entities did SpaCy recognize?
```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ecrm: <http://erlangen-crm.org/200717/>
PREFIX entity: <https://figgy.princeton.edu/concerns/entities/>
PREFIX etype: <https://figgy.princeton.edu/concerns/adam/>
PREFIX inscription: <https://figgy.princeton.edu/concerns/inscriptions/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select ?inscription where {
  ?inscription a ecrm:E34_Inscription .
}
```

This query returns 738,696 results in less than 0.1 seconds. That is how many “hits” SpaCy recorded, but it isn’t a very useful number: how many of these are distinct names?
```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ecrm: <http://erlangen-crm.org/200717/>
PREFIX entity: <https://figgy.princeton.edu/concerns/entities/>
PREFIX etype: <https://figgy.princeton.edu/concerns/adam/>
PREFIX inscription: <https://figgy.princeton.edu/concerns/inscriptions/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select distinct ?name {
  ?inscription a ecrm:E34_Inscription .
  ?inscription ecrm:P106_is_composed_of ?entity .
  ?entity ecrm:P190_has_symbolic_content ?name .
}
```

This query returns 254,699 distinct strings that SpaCy identified as named entities.
How many names of people are there?
```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ecrm: <http://erlangen-crm.org/200717/>
PREFIX entity: <https://figgy.princeton.edu/concerns/entities/>
PREFIX etype: <https://figgy.princeton.edu/concerns/adam/>
PREFIX inscription: <https://figgy.princeton.edu/concerns/inscriptions/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select distinct ?name {
  ?inscription a ecrm:E34_Inscription ; ecrm:E55_Type etype:PERSON .
  ?inscription ecrm:P106_is_composed_of ?entity .
  ?entity ecrm:P190_has_symbolic_content ?name .
}
```

These are the strings SpaCy, running in naive mode over dirty OCR, identified as the names of persons. SPARQL returns an astounding 95,982 results: almost 96,000 distinct names of people. Not good.
But what do these strings actually look like? The first dozen or so are promising:
```
?name
Reith
Jimmy Carter
Reith Lectures
Reagan
Wilson
Bill Casey
Ronald Reagan
Kennedy
Robert Gates
Gorbachev
"Ronald Reagan\n"
McNamara
Buddenbrooks
```
At first glance, this isn’t a bad result: SpaCy picked out strings that are clearly names of one kind or another, and its classification of them as names of persons is good, with a few exceptions. Reith Lectures is almost certainly the name of an event, not the name of a person. Buddenbrooks is harder to judge without context: it is probably the title of Thomas Mann’s novel, but it could be someone’s surname. More problematic, for a different reason, are the multiple appearances of Ronald Reagan in this list. We can be fairly sure Reagan and Ronald Reagan are the same person (though they might not be), but Ronald Reagan and “Ronald Reagan\n” certainly are. SpaCy’s tokenizer failed to strip the trailing newline from the second of those Ronald Reagans, and as a result SpaCy’s named-entity recognizer treated it as a separate name. This looks like a weakness in SpaCy, or perhaps in our configuration (or lack of one), and we should flag it for further investigation.
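A cheap post-processing pass would mop up the trailing-newline variants before the strings are counted as distinct names. A minimal sketch (plain Python, not part of the original pipeline):

```python
def normalize_name(raw: str) -> str:
    """Collapse internal whitespace (including newlines) and trim the ends,
    so 'Ronald Reagan\n' and 'Ronald Reagan' count as one string."""
    return " ".join(raw.split())

variants = ["Ronald Reagan", "Ronald Reagan\n", "Henry\nKissinger"]
print({normalize_name(v) for v in variants})
# {'Ronald Reagan', 'Henry Kissinger'}
```

This would not touch genuine orthographic variants (Reagan vs. Ronald Reagan), only the whitespace artifacts SpaCy let through.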
There is something else to note here. Ronald Reagan and “Ronald Reagan\n” are orthographical variants of the same name, while Reagan is not, though all three refer (almost certainly) to the 40th President of the United States. That is, all three refer to the same named entity even though there are two names. Our application is not interested in names (or appellations, as they are called in CIDOC-CRM) but in those named entities, so our tools must help investigators (the archivists, in this case) weed and winnow these names and assign them to identifiable entities.
Of course, for our purposes this repetition may not be a problem: our application favors recall over precision, so we’re more concerned with not missing names than we are with picking up variants. The sheer number of names, though, could create challenges. Here are all the instances of Kissinger in this partial data set (the numbers are line numbers in the output file):
```
60:Henry Kissinger
63:Kissinger
144:HENRY KISSINGER
3777:Henry A.Kissinger
3779:Henry A. Kissinger
3785:Henry A. Kissinger's
6271:Henry Kissinger's
9881:Robert H. Bork Henry Kissinger Paul W. McCracken Harry
10072:"Henry Kissinger\n"
10097:Henry Kissinger’s
10222:Nixon-Kissinger
11018:"Henry\n\nKissinger"
11138:"Kissinger pro-\n"
11143:"Henry\nKissinger's"
14237:KISSINGER
14270:"Kissinger |\n"
14353:Henry A. Kissinger Lectures
21995:"Henry\nKissinger"
22740:"Henry A.\nKissinger"
30219:H. Kissinger
30237:ALFRED M. GRUENTHER HENRY A. KISSINGER
30468:A. Kissinger
30501:"Kissinger\n"
34353:Henmry Kissinger
39728:Henry A. Kissinger Theodore M. Hesburgh
39963:"Henry A. Kissinger Richard L. Gelb\n"
40166:"Henry\nA. Kissinger"
42573:Kissinger's-
64109:Messrs Kissinger
64573:Henry kissinger
94259:Henry Kissinger eine
94593:"H. Kissinger\n"
94700:Henry A. Kissinger - Vertreter eines
```
Filtering SpaCy’s candidates into actual named entities (there are seven people intermingled in these strings) will likely require a mixture of human and machine labor.
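One machine-side contribution to that labor would be blocking: reducing each candidate to a crude key (whitespace collapsed, possessives and punctuation stripped, lowercased last token) so a human reviews one small cluster at a time rather than 96,000 rows. A sketch under those assumptions; the keying rule here is illustrative, not the project's actual method:

```python
import re
from collections import defaultdict

def block_key(raw: str) -> str:
    """Reduce a candidate string to a crude blocking key:
    the lowercased final token, with possessives and punctuation removed."""
    s = " ".join(raw.split())              # collapse newlines and space runs
    s = re.sub(r"[’']s\b", "", s)          # drop possessives: Kissinger's -> Kissinger
    s = re.sub(r"[^A-Za-z ]", " ", s)      # strip punctuation and digits
    tokens = [t for t in s.lower().split() if len(t) > 1]
    return tokens[-1] if tokens else ""

candidates = ["Henry Kissinger", "KISSINGER", "Henry A. Kissinger's",
              "Henry\nKissinger", "Kissinger\n", "Henmry Kissinger"]

blocks = defaultdict(list)
for c in candidates:
    blocks[block_key(c)].append(c)

print(dict(blocks))  # all six variants land in the single block 'kissinger'
```

Even the OCR error Henmry Kissinger lands in the right block, because only the surname is keyed; disentangling the seven distinct people inside a block would still be the human's job.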
There are not 96,000 distinct names in this sample, even though it is a sample of 27,000 pages. This is one of the places where using uncorrected (“dirty”) OCR hampers our endeavors. Past that fortuitous group at the top of the list, the entries become very dirty indeed:
```
D. Signature
"Jerzy\n"
ieeiier rrr iri rir
"Wee\n"
Wdiinad Pugh
William Peters
E. List
James E. Doyle
"Fe es ee ee eee\n"
New Yor
ak ae sald
Wolff
Li mucn
juirice Greenbaum
AL VK
MAURICE C. GREENBAUM
L. KAT
Madison Ave
Svetlana
```
There are a number of options to consider here.
- Pre-filter the pages. We know that some of the pages are too dirty to yield any recognizable text. (The purple mimeographs are an example, as, of course, are hand-written pages, drawings, poor-quality photocopies, and so on.) If we had a way to detect those, we could skip trying to find named entities in a sea of garbage.
- Train a better model.
- Use tools like OpenRefine to clean the data by hand.
A combination of techniques will probably be required.
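For the pre-filtering option, even a crude statistic computed on the OCR text itself, before NER runs, would catch the worst pages. One possibility is the vowel-to-letter ratio, which sits in a fairly narrow band for English prose but drifts out of it for OCR noise like "ieeiier rrr iri rir". The thresholds below are guesses that would need tuning against real pages from the collection:

```python
def vowel_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are vowels (a, e, i, o, u)."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in "aeiou" for c in letters) / len(letters)

def looks_like_text(ocr_text: str) -> bool:
    """Pass a page only if its vowel ratio falls in a band typical of
    English prose; the 0.25-0.50 band is an assumption to be tuned."""
    return 0.25 <= vowel_ratio(ocr_text) <= 0.50

print(looks_like_text("Dear Mr. Kennan, thank you for your letter"))  # True
print(looks_like_text("ieeiier rrr iri rir"))                         # False
```

A heuristic this simple will still pass borderline garbage like "New Yor ak ae sald", so it is a first coarse filter, not a substitute for better OCR or a better model.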