Notebook
I wrote a short script to split the manifest for the Kennan Papers (MC076) into individual containers (1,276 in number) and generate a graph of named entities for each. This is a resource-intensive process (each master TIFF must be downloaded, run through OCR, and processed with SpaCy); using my laptop connected to the Internet over a standard home FIOS connection, I was able to process 501 containers in three days.
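The splitting step can be sketched in a few lines. This is a minimal illustration, not the actual script: it assumes the finding aid is exposed as a IIIF Presentation 2 collection whose `members` (or `manifests`) array lists one manifest per container, and it omits the expensive part (downloading each canvas's master TIFF, OCR, and SpaCy). The URLs and structure below are toy stand-ins.

```python
import json

def container_manifests(collection: dict) -> list[str]:
    """Return the @id of each member manifest in a IIIF v2 collection.

    A collection may list its children under 'members' or 'manifests';
    check both (an assumption about how the collection is serialized).
    """
    members = collection.get("members") or collection.get("manifests") or []
    return [m["@id"] for m in members if m.get("@type") == "sc:Manifest"]

# Toy stand-in for the MC076 collection manifest:
collection = {
    "@type": "sc:Collection",
    "members": [
        {"@id": "https://example.org/manifests/box1", "@type": "sc:Manifest"},
        {"@id": "https://example.org/manifests/box2", "@type": "sc:Manifest"},
    ],
}
print(container_manifests(collection))
# ['https://example.org/manifests/box1', 'https://example.org/manifests/box2']
```

Each returned manifest URL then becomes one unit of work: fetch it, walk its canvases, and run the OCR-and-NER pipeline per canvas.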
I then loaded all 501 graphs into a local instance of GraphDB-Free running on my laptop, resulting in approximately 40 million statements.
Below are some exploratory SPARQL queries. Recall that named entities are represented as Symbolic Objects (Appellations, when all is said and done) which have been recorded as Inscriptions on IIIF Canvases.
How many pages are we talking about?
```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ecrm: <http://erlangen-crm.org/200717/>
PREFIX entity: <https://figgy.princeton.edu/concerns/entities/>
PREFIX etype: <https://figgy.princeton.edu/concerns/adam/>
PREFIX inscription: <https://figgy.princeton.edu/concerns/inscriptions/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select distinct ?canvas {
  ?something ecrm:P128i_is_carried_by ?canvas .
}
```

SPARQL returns 26,917 results: there are about 27,000 pages in this sample.
How many named entities did SpaCy recognize?
```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ecrm: <http://erlangen-crm.org/200717/>
PREFIX entity: <https://figgy.princeton.edu/concerns/entities/>
PREFIX etype: <https://figgy.princeton.edu/concerns/adam/>
PREFIX inscription: <https://figgy.princeton.edu/concerns/inscriptions/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select ?inscription where {
  ?inscription a ecrm:E34_Inscription .
}
```

This query returns 738,696 results in less than 0.1 seconds. That is how many “hits” SpaCy recorded, but it isn’t a very useful number: how many of these are distinct names?
```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ecrm: <http://erlangen-crm.org/200717/>
PREFIX entity: <https://figgy.princeton.edu/concerns/entities/>
PREFIX etype: <https://figgy.princeton.edu/concerns/adam/>
PREFIX inscription: <https://figgy.princeton.edu/concerns/inscriptions/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select distinct ?name {
  ?inscription a ecrm:E34_Inscription .
  ?inscription ecrm:P106_is_composed_of ?entity .
  ?entity ecrm:P190_has_symbolic_content ?name .
}
```

This query returns 254,699 distinct strings that SpaCy identified as named entities.
How many names of people are there?
```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ecrm: <http://erlangen-crm.org/200717/>
PREFIX entity: <https://figgy.princeton.edu/concerns/entities/>
PREFIX etype: <https://figgy.princeton.edu/concerns/adam/>
PREFIX inscription: <https://figgy.princeton.edu/concerns/inscriptions/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select distinct ?name {
  ?inscription a ecrm:E34_Inscription ; ecrm:E55_Type etype:PERSON .
  ?inscription ecrm:P106_is_composed_of ?entity .
  ?entity ecrm:P190_has_symbolic_content ?name .
}
```

These are the strings SpaCy, running in naive mode over dirty OCR, identified as the names of persons. SPARQL returns an astounding 95,982 results: almost 96,000 distinct names of people. Not good.
But what do these strings actually look like? The first dozen or so are promising:
```
?name
Reith
Jimmy Carter
Reith Lectures
Reagan
Wilson
Bill Casey
Ronald Reagan
Kennedy
Robert Gates
Gorbachev
"Ronald Reagan\n"
McNamara
Buddenbrooks
```
At first glance, this isn’t a bad result: SpaCy picked out strings that are clearly names of one kind or another, and its classification of them as names of persons is good, with a few exceptions. Reith Lectures is almost certainly the name of an event, not the name of a person. Buddenbrooks is harder to judge without context: it is probably the title of Thomas Mann’s novel, but it could be someone’s surname. More problematic, for a different reason, are the multiple appearances of Ronald Reagan in this list. We can be fairly sure Reagan and Ronald Reagan are the same person (though they might not be), but Ronald Reagan and “Ronald Reagan\n” certainly are. SpaCy’s tokenizer failed to strip the trailing newline from the second of those Ronald Reagans, and as a result SpaCy’s named-entity recognizer treated it as a separate name. This looks like a weakness in SpaCy, or perhaps in our configuration (or lack of one), and we should flag it for further investigation.
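A cheap post-processing pass would mop up the trailing-newline variants before the strings are counted as distinct names. A minimal sketch (plain Python, not part of the original pipeline):

```python
def normalize_name(raw: str) -> str:
    """Collapse internal whitespace (including newlines) and trim the ends,
    so 'Ronald Reagan\n' and 'Ronald Reagan' count as one string."""
    return " ".join(raw.split())

variants = ["Ronald Reagan", "Ronald Reagan\n", "Henry\nKissinger"]
print({normalize_name(v) for v in variants})
# {'Ronald Reagan', 'Henry Kissinger'}
```

This would not touch genuine orthographic variants (Reagan vs. Ronald Reagan), only the whitespace artifacts SpaCy let through.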
There is something else to note here. Ronald Reagan and “Ronald Reagan\n” are orthographical variants of the same name, while Reagan is not, though all three refer (almost certainly) to the 40th President of the United States. That is, all three refer to the same named entity even though there are two names. Our application is not interested in names (or appellations, as they are called in CIDOC-CRM) but in those named entities, so our tools must help investigators (the archivists, in this case) weed and winnow these names and assign them to identifiable entities.
Of course, for our purposes this repetition may not be a problem: our application favors recall over precision, so we’re more concerned with not missing names than we are with picking up variants. The sheer number of names, though, could create challenges. Here are all the instances of Kissinger in this partial data set (the numbers are line numbers in the output file):
```
60:Henry Kissinger
63:Kissinger
144:HENRY KISSINGER
3777:Henry A.Kissinger
3779:Henry A. Kissinger
3785:Henry A. Kissinger's
6271:Henry Kissinger's
9881:Robert H. Bork Henry Kissinger Paul W. McCracken Harry
10072:"Henry Kissinger\n"
10097:Henry Kissinger’s
10222:Nixon-Kissinger
11018:"Henry\n\nKissinger"
11138:"Kissinger pro-\n"
11143:"Henry\nKissinger's"
14237:KISSINGER
14270:"Kissinger |\n"
14353:Henry A. Kissinger Lectures
21995:"Henry\nKissinger"
22740:"Henry A.\nKissinger"
30219:H. Kissinger
30237:ALFRED M. GRUENTHER HENRY A. KISSINGER
30468:A. Kissinger
30501:"Kissinger\n"
34353:Henmry Kissinger
39728:Henry A. Kissinger Theodore M. Hesburgh
39963:"Henry A. Kissinger Richard L. Gelb\n"
40166:"Henry\nA. Kissinger"
42573:Kissinger's-
64109:Messrs Kissinger
64573:Henry kissinger
94259:Henry Kissinger eine
94593:"H. Kissinger\n"
94700:Henry A. Kissinger - Vertreter eines
```
Filtering SpaCy’s candidates into actual named entities (there are seven people intermingled in these strings) will likely require a mixture of human and machine labor.
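One machine-side contribution to that labor would be blocking: reducing each candidate to a crude key (whitespace collapsed, possessives and punctuation stripped, lowercased last token) so a human reviews one small cluster at a time rather than 96,000 rows. A sketch under those assumptions; the keying rule here is illustrative, not the project's actual method:

```python
import re
from collections import defaultdict

def block_key(raw: str) -> str:
    """Reduce a candidate string to a crude blocking key:
    the lowercased final token, with possessives and punctuation removed."""
    s = " ".join(raw.split())              # collapse newlines and space runs
    s = re.sub(r"[’']s\b", "", s)          # drop possessives: Kissinger's -> Kissinger
    s = re.sub(r"[^A-Za-z ]", " ", s)      # strip punctuation and digits
    tokens = [t for t in s.lower().split() if len(t) > 1]
    return tokens[-1] if tokens else ""

candidates = ["Henry Kissinger", "KISSINGER", "Henry A. Kissinger's",
              "Henry\nKissinger", "Kissinger\n", "Henmry Kissinger"]

blocks = defaultdict(list)
for c in candidates:
    blocks[block_key(c)].append(c)

print(dict(blocks))  # all six variants land in the single block 'kissinger'
```

Even the OCR error Henmry Kissinger lands in the right block, because only the surname is keyed; disentangling the seven distinct people inside a block would still be the human's job.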
There are not 96,000 distinct names in this sample, even though it is a sample of 27,000 pages. This is one of the places where using uncorrected (“dirty”) OCR hampers our endeavors. Past that fortuitous group at the top of the list, the entries become very dirty indeed:
```
D. Signature
"Jerzy\n"
ieeiier rrr iri rir
"Wee\n"
Wdiinad Pugh
William Peters
E. List
James E. Doyle
"Fe es ee ee eee\n"
New Yor
ak ae sald
Wolff
Li mucn
juirice Greenbaum
AL VK
MAURICE C. GREENBAUM
L. KAT
Madison Ave
Svetlana
```
There are a number of options to consider here.
- Pre-filter the pages. We know that some of the pages are too dirty to yield any recognizable text. (The purple mimeographs are an example, as, of course, are hand-written pages, drawings, poor-quality photocopies, and so on.) If we had a way to detect those, we could skip trying to find named entities in a sea of garbage.
- Train a better model.
- Use tools like OpenRefine to clean the data by hand.
A combination of techniques will probably be required.
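For the pre-filtering option, even a crude statistic computed on the OCR text itself, before NER runs, would catch the worst pages. One possibility is the vowel-to-letter ratio, which sits in a fairly narrow band for English prose but drifts out of it for OCR noise like "ieeiier rrr iri rir". The thresholds below are guesses that would need tuning against real pages from the collection:

```python
def vowel_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are vowels (a, e, i, o, u)."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in "aeiou" for c in letters) / len(letters)

def looks_like_text(ocr_text: str) -> bool:
    """Pass a page only if its vowel ratio falls in a band typical of
    English prose; the 0.25-0.50 band is an assumption to be tuned."""
    return 0.25 <= vowel_ratio(ocr_text) <= 0.50

print(looks_like_text("Dear Mr. Kennan, thank you for your letter"))  # True
print(looks_like_text("ieeiier rrr iri rir"))                         # False
```

A heuristic this simple will still pass borderline garbage like "New Yor ak ae sald", so it is a first coarse filter, not a substitute for better OCR or a better model.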