Notebook
I wrote a short script to split the manifest for the Kennan Papers (MC076) into individual containers (1276 in number) and generate a graph of named entities for each. This is a resource-intensive process (each master TIFF must be downloaded, run through OCR, and processed with SpaCy); using my laptop connected to the Internet via a standard home FIOS service, I was able to process 501 in three days.
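For the record, the expensive per-page step is roughly the following. This is a minimal sketch, not the actual script: the function name is illustrative, and it assumes a stock en_core_web_sm model and the pytesseract/Pillow/requests stack mentioned elsewhere in these notes.

import requests
import pytesseract
import spacy
from io import BytesIO
from PIL import Image

nlp = spacy.load("en_core_web_sm")  # assuming a stock English model, run "naively"

def entities_for_page(tiff_url):
    """Download one master TIFF, OCR it, and return SpaCy's named-entity guesses."""
    image = Image.open(BytesIO(requests.get(tiff_url).content))
    text = pytesseract.image_to_string(image)
    return [(ent.text, ent.label_) for ent in nlp(text).ents]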
I then loaded all 501 graphs into a local instance of GraphDB-Free running on my laptop, resulting in approximately 40 million statements.
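Loading the graphs is mechanical: GraphDB speaks the standard RDF4J REST API, so a sketch like the following would do it, assuming the per-container graphs are serialized as Turtle files on disk. The repository name kennan, the default port 7200, and the graphs/ directory are all assumptions, not the actual setup.

import glob
import requests

# POST each container graph to the repository's statements endpoint.
GRAPHDB = "http://localhost:7200/repositories/kennan/statements"

for path in glob.glob("graphs/*.ttl"):
    with open(path, "rb") as ttl:
        response = requests.post(GRAPHDB, data=ttl, headers={"Content-Type": "text/turtle"})
        response.raise_for_status()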
Below are some exploratory SPARQL queries. Recall that named entities are represented as Symbolic Objects (Appellations, when all is said and done) which have been recorded as Inscriptions on IIIF Canvases.
How many pages are we talking about?
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix ecrm: <http://erlangen-crm.org/200717/>
prefix entity: <https://figgy.princeton.edu/concerns/entities/>
prefix etype: <https://figgy.princeton.edu/concerns/adam/>
prefix inscription: <https://figgy.princeton.edu/concerns/inscriptions/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select distinct ?canvas {
?something ecrm:P128i_is_carried_by ?canvas .
}

The query returns 26,917 results: there are about 27,000 pages in this sample.
How many named entities did SpaCy recognize?
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix ecrm: <http://erlangen-crm.org/200717/>
prefix entity: <https://figgy.princeton.edu/concerns/entities/>
prefix etype: <https://figgy.princeton.edu/concerns/adam/>
prefix inscription: <https://figgy.princeton.edu/concerns/inscriptions/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?inscription where { ?inscription a ecrm:E34_Inscription . }

This query returns 738,696 results in less than 0.1 seconds. That is how many “hits” SpaCy recorded, but it isn’t a very useful number: how many of these are distinct names?
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix ecrm: <http://erlangen-crm.org/200717/>
prefix entity: <https://figgy.princeton.edu/concerns/entities/>
prefix etype: <https://figgy.princeton.edu/concerns/adam/>
prefix inscription: <https://figgy.princeton.edu/concerns/inscriptions/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select distinct ?name {
?inscription a ecrm:E34_Inscription .
?inscription ecrm:P106_is_composed_of ?entity .
?entity ecrm:P190_has_symbolic_content ?name .
}

This query returns 254,699 distinct strings that SpaCy identified as named entities.
How many names of people are there?
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix ecrm: <http://erlangen-crm.org/200717/>
prefix entity: <https://figgy.princeton.edu/concerns/entities/>
prefix etype: <https://figgy.princeton.edu/concerns/adam/>
prefix inscription: <https://figgy.princeton.edu/concerns/inscriptions/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select distinct ?name {
?inscription a ecrm:E34_Inscription ; ecrm:E55_Type etype:PERSON .
?inscription ecrm:P106_is_composed_of ?entity .
?entity ecrm:P190_has_symbolic_content ?name .
}

These are the strings that SpaCy, running in naive mode over dirty OCR, identified as the names of persons. The query returns an astounding 95,982 results: almost 96,000 distinct names of people. Not good.
But what do these strings actually look like? The first dozen or so are promising:
?name
Reith
Jimmy Carter
Reith Lectures
Reagan
Wilson
Bill Casey
Ronald Reagan
Kennedy
Robert Gates
Gorbachev
"Ronald Reagan\n"
McNamara
Buddenbrooks
At first glance, this isn’t a bad result; SpaCy picked out strings that are clearly names of one kind or another, and its classification of these names as names of persons is good, with a few exceptions: Reith Lectures is almost certainly the name of an event, not the name of a person. Buddenbrooks is harder to determine without context: it is probably the title of Thomas Mann’s novel, but it could be someone’s last name. More problematic, for a different reason, are the multiple appearances of Ronald Reagan in this list. We can be fairly sure that Reagan and Ronald Reagan are the same person (though they might not be), but Ronald Reagan and “Ronald Reagan\n” certainly are. SpaCy’s tokenizer failed to strip the trailing newline from the second of those, and as a result SpaCy’s named-entity recognizer has treated it as a separate name. This looks like a weakness in SpaCy, or perhaps in our configuration (or lack of one), and we should flag it for further investigation.
There is something else to note here. Ronald Reagan and “Ronald Reagan\n” are orthographic variants of the same name, while Reagan is a different name, though all three (almost certainly) refer to the 40th President of the United States. That is, we have two names, but all three strings refer to the same named entity. Our application is not interested in names (or appellations, as they are called in CIDOC-CRM) but in the named entities they refer to, so our tools must help investigators (the archivists, in this case) weed and winnow these names and assign them to identifiable entities.
Of course, for our purposes this repetition may not be a problem: our application favors recall over precision, so we’re more concerned with not missing names than we are with picking up variants. The sheer number of names, though, could create challenges. Here are all the instances of Kissinger in this partial data set (the numbers are line numbers in the output file):
60:Henry Kissinger
63:Kissinger
144:HENRY KISSINGER
3777:Henry A.Kissinger
3779:Henry A. Kissinger
3785:Henry A. Kissinger's
6271:Henry Kissinger's
9881:Robert H. Bork Henry Kissinger Paul W. McCracken Harry
10072:"Henry Kissinger\n"
10097:Henry Kissinger’s
10222:Nixon-Kissinger
11018:"Henry\n\nKissinger"
11138:"Kissinger pro-\n"
11143:"Henry\nKissinger's"
14237:KISSINGER
14270:"Kissinger |\n"
14353:Henry A. Kissinger Lectures
21995:"Henry\nKissinger"
22740:"Henry A.\nKissinger"
30219:H. Kissinger
30237:ALFRED M. GRUENTHER HENRY A. KISSINGER
30468:A. Kissinger
30501:"Kissinger\n"
34353:Henmry Kissinger
39728:Henry A. Kissinger Theodore M. Hesburgh
39963:"Henry A. Kissinger Richard L. Gelb\n"
40166:"Henry\nA. Kissinger"
42573:Kissinger's-
64109:Messrs Kissinger
64573:Henry kissinger
94259:Henry Kissinger eine
94593:"H. Kissinger\n"
94700:Henry A. Kissinger - Vertreter eines
Filtering SpaCy’s candidates into actual named entities (there are seven people intermingled in these strings) will likely require a mixture of human and machine labor.
There are not 96,000 distinct names in this sample, even though it is a sample of 27,000 pages. This is one of the places where using uncorrected (“dirty”) OCR hampers our endeavors. Past that fortuitous group at the top of the list, the entries become very dirty indeed:
D. Signature "Jerzy\n" ieeiier rrr iri rir "Wee\n" Wdiinad Pugh William Peters E. List James E. Doyle "Fe es ee ee eee\n" New Yor ak ae sald Wolff Li mucn juirice Greenbaum AL VK MAURICE C. GREENBAUM L. KAT Madison Ave Svetlana
There are a number of options to consider here.
- Pre-filter the pages. We know that some of the pages are too dirty to yield any recognizable text. (The purple mimeographs are an example, as, of course, are hand-written pages, drawings, poor-quality photocopies, and so on.) If we had a way to detect those, we could skip trying to find named entities in a sea of garbage.
- Train a better model.
- Use tools like OpenRefine to clean the data by hand.
A combination of techniques will probably be required.
Some simple regular-expression-based filtering whittles the list down from 96,000 to 72,000. Clustering with OpenRefine will also be powerful.
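The regular expressions themselves are nothing exotic; here is a sketch of the kind of rules involved (the specific patterns are illustrative, not the exact ones used for the 96,000-to-72,000 pass):

import re

def clean_candidate(name):
    """Normalize a candidate name string; return None if it is obvious noise."""
    name = re.sub(r"\s+", " ", name).strip()         # collapse newlines and runs of spaces
    if len(name) < 3:
        return None
    if re.search(r"[0-9@#$%|]", name):               # digits and stray symbols are a bad sign
        return None
    if not re.fullmatch(r"[A-Za-z.,'&\- ]+", name):  # crude ASCII-only check; a sketch only
        return None
    return name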
Clustering is a technique commonly used in natural language processing. It entails finding groups of strings that are similar to one another, using various algorithms to calculate similarity. For example, George Kennan and George Kennen are very similar, because they differ by only one letter; with our data, we can say with great confidence that instances of the string George Kennen should be corrected to be George Kennan, thus reducing the number of name strings from two to one.
Other comparisons are not so straightforward. Suppose we are comparing F. L. Smith with F. T. Smith: are these two distinct people, or is one of these strings a mis-spelling of the other? Sometimes, if we know our data, we can make a good guess: John P. Kennedy is almost certainly John F. Kennedy. In other cases, we cannot tell without looking at the original context.
OpenRefine lets you apply half a dozen different clustering algorithms, each of which uses a different heuristic to calculate similarity. In practice, one applies each of them successively; for our experiment so far, I’ve just used the key-collision algorithms, which bring the list down to about 22,000 entries.
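Key collision is simple enough to approximate in code: OpenRefine’s fingerprint method (roughly) lowercases each string, strips punctuation, sorts and deduplicates the tokens, and then groups strings whose keys collide. The following is a rough approximation, not OpenRefine’s exact implementation:

import re
from collections import defaultdict

def fingerprint(name):
    """Approximate OpenRefine's fingerprint key: lowercase, strip punctuation,
    then sort and deduplicate the tokens."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    return " ".join(sorted(set(tokens)))

def key_collision_clusters(names):
    """Group name strings whose fingerprints collide."""
    clusters = defaultdict(list)
    for name in names:
        clusters[fingerprint(name)].append(name)
    return [group for group in clusters.values() if len(group) > 1]

# key_collision_clusters(["Henry Kissinger", "KISSINGER, HENRY", "Henry  Kissinger\n"])
# returns a single cluster containing all three variants.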
After another round with OpenRefine, we’re down to about 22,000 name candidates. I’ve started to keep a few snapshot lists in a Google Spreadsheet.
The results, so far, are disappointing. Clustering is a very effective technique, often used in text processing, but it does take time and human labor. At this stage, in a production context, one would probably assign a student (with an archivist to consult with) to perform more painstaking iterations over the data to winnow out partial names and mis-recognized strings and produce a working list of names.
Some observations:
- there are many German words and phrases in this list. I suspect the two-capitalized-words-in-a-row heuristic is responsible for these; I will do some research to see if there are standard techniques to handle this problem, which must be a common one.
- during these clustering/merging steps with OpenRefine, we’ve lost context: the string-by-string links back to canvases. There will be ways to recover those links, but they will require more overhead than we want to spend now.
OpenRefine’s clustering algorithms are indeed powerful, but there is simply too much cruft in this data set: nonsensical strings and whatnot. Let’s see if we can improve SpaCy’s NER model to give us more accurate results to start with.
I’m using Prodigy, a companion to SpaCy, developed by the same company. Prodigy is an annotation tool that uses machine learning to train data models. It isn’t free, but I have a research license.
We’ll begin by gathering training data. I haven’t been keeping the OCR output but we can do that easily enough. In fact, we’ll use SpaCy to generate data sets in one of SpaCy’s preferred data formats. And we’ll extend our object models to include metadata about the collection, the container, and the page.
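Roughly, the data-gathering step looks like this. It is a sketch only: the metadata dictionary is assumed to come from the container’s manifest, and a plain sentencizer stands in for whatever segmentation the real code settles on.

import json
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # sentence segmentation only; no NER at this stage

def write_training_records(ocr_text, meta, out_path):
    """Write one JSONL record per sentence, carrying the container metadata along."""
    with open(out_path, "a", encoding="utf-8") as out:
        for sent in nlp(ocr_text).sents:
            record = {"text": sent.text.strip(), "meta": meta}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")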
Here’s an example of some training data in jsonl format:
{"text": "Lhe As for the rest of the Soviet Union: the situation that prevails there is both dreadful and dangerous.", "meta": {"Date Created": ["1991 February 3"], "Extent": ["1 folder"], "Identifier": ["ark:/88435/d504rt661"], "Title": ["\"If the Kremlin Can't Rule,\" Op-Ed about the Baltics, The Washington Post "], "Creator": ["Kennan, George F. (George Frost), 1904-2005."], "Language": ["English"], "Publisher": ["Kennan, George F. (George Frost), 1904-2005."], "Portion Note": ["entire component, excluding the C Section of the Washington Post, Feb 3, 1991"], "Container": ["Box 294, Folder 4"], "Rendered Holding Location": ["Mudd Manuscript Library"], "Member Of Collections": ["George F. Kennan Papers MC076"]}}
{"text": "If it is true, as it appears to be, that the supply of consumers’ goods to the larger cities cannot be assured without the wholehearted collaboration of the party apparatus and the armed units in the great rural hinterland of the country, then one could understand why Gorbachev has felt himself compelled to reach back at this time for the support of those institutions.", "meta": {"Date Created": ["1991 February 3"], "Extent": ["1 folder"], "Identifier": ["ark:/88435/d504rt661"], "Title": ["\"If the Kremlin Can't Rule,\" Op-Ed about the Baltics, The Washington Post "], "Creator": ["Kennan, George F. (George Frost), 1904-2005."], "Language": ["English"], "Publisher": ["Kennan, George F. (George Frost), 1904-2005."], "Portion Note": ["entire component, excluding the C Section of the Washington Post, Feb 3, 1991"], "Container": ["Box 294, Folder 4"], "Rendered Holding Location": ["Mudd Manuscript Library"], "Member Of Collections": ["George F. Kennan Papers MC076"]}}Let’s try training on some of this data.
prodigy ner.manual ner_cold_war_papers blank:en ~/Desktop/training2/ea9a223d-e23c-4d86-894a-4164902ffc3b.jsonl --label PERSON

Nice resource on training SpaCy models: https://www.youtube.com/channel/UC5vr5PwcXiKX_-6NTteAlXw
What have we accomplished so far?
- We have developed software that enables us to build, in an unattended fashion, datasets of candidate named entities from pages, containers, and entire collections, based on Figgy’s IIIF manifests.
- We have developed a data model that enables us to represent this (meta)data as annotations to IIIF canvases, thereby integrating it with Figgy’s underlying data model and the IIIF software base (viewers, annotation servers) already developed by ITMS.
- We have begun to analyze the data that results from naive applications of NLP software.
Unsurprisingly, the brute-force naive approach we’ve applied so far is unsatisfactory: it produces too much noise. How can we improve these results so that we can produce a useful set of infrequent names?
- Be smarter about what you look at.
- Our tools naively process every page in the collection. Some of that data may not be useful or relevant (drafts of published works; newspaper clippings; handwritten notes, which cannot yet be processed with OCR; other ephemera). In reality, an archivist would pre-select the components of the collection that are most amenable to this kind of analysis.
We also apply NER to the OCR output without checking on its quality: if we could throw out pages that were poorly recognized (again, hand-written materials; mimeographs; other bad originals), we might improve our overall NER: less garbage in, less garbage out.
- Take smaller bites.
- Archival collections are naturally sub-divided into thematically related components and sub-components. We are likely to get better results if we use those subdivisions to our advantage: to make hand-correction tractable; to train models iteratively.
- Filter out poor OCR. Use confidence thresholds produced by Tesseract. Unfortunately, that means we can’t use the OCR already produced by Figgy.
- Be selective in what we process. Use the Collection’s Indexes to produce training data. Concentrate on the Correspondence series.
- Some containers might be amenable to image cleanup to improve OCR.
- Augment our training set with more patterns. Will & Alexis have provided some name lists to help train our model, but we can expand that training set using some common NLP techniques (see the sketch after this list).
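One common way to fold the name lists into the pipeline is spaCy’s EntityRuler, which tags pattern matches before (or alongside) the statistical NER component. A minimal sketch, assuming the lists are plain text files with one name per line; the file name is illustrative.

import spacy

nlp = spacy.load("en_core_web_sm")

# Read one name per line from the supplied list (file name is an assumption).
with open("kennan_name_list.txt", encoding="utf-8") as f:
    names = [line.strip() for line in f if line.strip()]

# Insert an EntityRuler ahead of the statistical NER so exact matches
# from the curated lists are always tagged as PERSON.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([{"label": "PERSON", "pattern": name} for name in names])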
Correspondence is a good set to work with. Correspondence usually has lots of names; the names will likely vary by correspondent (the social network formed by names mentioned in correspondence would probably be interesting); and there’s a lot of it in the Kennan Papers. We’ll start with subseries 1A, because much of it has been digitized.
There are 658 files in subseries 1A, including an index:
- Index of permanent files, undated
This index is an excellent data set for training; we’ll look at that in a minute. But first, let’s work on making the base data (the OCR output) better.
OCR engines (like Tesseract) can produce plain-text output, but they can usually do much more. We’ve seen how Tesseract can serialize the text it recognizes as hOCR or ALTO, but it can also generate a detailed table of data as output, data that includes confidence scores for each word and each block of text it discovers. A confidence score is a measure of how confident the engine is that it has recognized the word (or block, or even character) correctly. We know now, from experience, that if the OCR is poor, the NER will be poor, so if we can filter out text that has been badly OCR’d, our NER accuracy should improve.
Deciding where to set the threshold may require some trial and error. Based on some research, it looks like setting the cutoff somewhere between 97.5 and 98.5 is common in real-world applications. Let’s try both ends and see what happens.
It turns out those numbers don’t work at the block level; too many blocks get rejected. Something closer to 55 seems to be in the right range, but this may not be the best way; perhaps it will be better to filter at the word level.
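Word-level filtering with Tesseract’s data output looks roughly like this (a sketch, not the production code; the default of 95 matches the word-level cutoff settled on below):

import pytesseract
from PIL import Image

def filtered_text(image_path, threshold=95):
    """OCR a page and keep only words whose Tesseract confidence meets the threshold."""
    data = pytesseract.image_to_data(Image.open(image_path),
                                     output_type=pytesseract.Output.DICT)
    words = [word for word, conf in zip(data["text"], data["conf"])
             if word.strip() and float(conf) >= threshold]
    return " ".join(words)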
After some trial and error, we have a version of our software that filters out bad OCR. (For now, I’m using a word-level threshold of 95.) And it is looking much better. SpaCy tagged 1,024 distinct names after running the program over half a dozen folders of correspondence; here’s a sample:
Kennan George Kennan Schuster Nelson Gaylord Nelson Simon Directo Nelson Suite George Kennan Papers Stalin Urban GREENBAUM Grantor Alan Singh Greenbaum Svetlana Vincent Greenbaum Kennan Peters Herrman "2329 Princeton" Zajac Consul Wes Trustees Assignment
I am sure doing a bit of training on the SpaCy model will make this even better, but this is something we can work with.
Lots of fussing with image-processing code today. OpenCV was causing some logic errors and seemed like overkill; Pillow’s tiff plugin seems to be buggy. Turns out pytesseract can read images from a file on its own, so that’s what we’re doing. Running a long job now, to process the entire correspondence subseries.
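Concretely, the simplification is just to hand pytesseract the file path instead of a decoded image object (the path here is illustrative):

import pytesseract

# pytesseract accepts a file path directly, so no OpenCV/Pillow decoding step is needed.
data = pytesseract.image_to_data("page_0001.tif", output_type=pytesseract.Output.DICT)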
While waiting for the NER job to complete, it’s time to think about the next stage: what to do with this metadata? Here’s what we have so far:
inscription:47m5R9JfBPi8YkpX975FiF a ecrm:E34_Inscription ;
ecrm:E55_Type etype:PERSON ;
ecrm:P106_is_composed_of entity:4UrFk3unCBXYpgza3Fiy7t ;
ecrm:P128i_is_carried_by <https://figgy.princeton.edu/concern/scanned_resources/49067c79-6915-4492-bd75-3554f0010ee3/manifest/canvas/0089725a-195f-42a4-8bfb-c44cf7f182d1> .
entity:4UrFk3unCBXYpgza3Fiy7t a ecrm:E90_Symbolic_Object ;
rdfs:label "DEAN BROWN" ;
ecrm:P190_has_symbolic_content "DEAN BROWN" .
We need to work on the ontology a bit.
E34 Inscription
rdfs:comment “Scope note:
This class comprises recognisable, short texts attached to instances of E24 Physical Human-Made Thing.
The transcription of the text can be documented in a note by P3 has note: E62 String. The alphabet used can be documented by P2 has type: E55 Type. This class does not intend to describe the idiosyncratic characteristics of an individual physical embodiment of an inscription, but the underlying prototype. The physical embodiment is modelled in the CIDOC CRM as instances of E24 Physical Human-Made Thing.
The relationship of a physical copy of a book to the text it contains is modelled using E18 Physical Thing P128 carries E33 Linguistic Object.
Examples:
- "keep off the grass" on a sign stuck in the lawn of the quad of Balliol College
- The text published in Corpus Inscriptionum Latinarum V 895
- Kilroy was here
Since an Inscription is a text, it cannot have a type PERSON; I invented the etype PERSON to capture SpaCy’s classification.
Minor point: The use of the term entity to refer to a Symbolic Object is confusing. I was trying to avoid committing to any classification of the string, but that might be overly cautious. At the very least, the namespace should be called symbols, but we can probably use SpaCy’s classification to enable us to say that the symbol is an E41 Appellation:
Instances of E41 Appellation may be used to identify any instance of E1 CRM Entity and sometimes are characteristic for instances of more specific subclasses of E1 CRM Entity, such as for instances of E52 Time-Span (for instance “dates”), E39 Actor, E53 Place or E28 Conceptual Object. Postal addresses and E-mail addresses are characteristic examples of identifiers used by services transporting things between clients.
The Appellation has symbolic content, which is the string.
So an Inscription can be located on a canvas, and it may be composed of an Appellation. And, ultimately, the Appellation may P1 identify an Actor, who may be an E21 Person.
The Appellation may be incorrectly recognized by the OCR, in which case it may be corrected; or the Appellation may be a misspelling, in which case it should be preserved.
@prefix ecrm: <http://erlangen-crm.org/200717/> .
@prefix appellation: <https://figgy.princeton.edu/concerns/appellations/> .
@prefix actor: <https://figgy.princeton.edu/concerns/actors/> .  # provisional namespace for actors
@prefix inscription: <https://figgy.princeton.edu/concerns/inscriptions/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
inscription:47m5R9JfBPi8YkpX975FiF a ecrm:E34_Inscription ;
ecrm:P106_is_composed_of appellation:4UrFk3unCBXYpgza3Fiy7t ;
ecrm:P128i_is_carried_by <https://figgy.princeton.edu/concern/scanned_resources/49067c79-6915-4492-bd75-3554f0010ee3/manifest/canvas/0089725a-195f-42a4-8bfb-c44cf7f182d1> .
appellation:4UrFk3unCBXYpgza3Fiy7t a ecrm:E41_Appellation ;
rdfs:label "George Kennan" ;
ecrm:P190_has_symbolic_content "George Kennan" ;
ecrm:P1i_identifies actor:xyz .
actor:xyz a ecrm:E21_Person ;
skos:prefLabel "Kennan, George Frost, 1904-2005" ;
owl:sameAs <http://viaf.org/viaf/66477608> .
This is nice!
from rdflib import Graph

# A Figgy IIIF manifest is served as JSON-LD, so rdflib can parse it directly.
manifest = 'https://figgy.princeton.edu/concern/scanned_resources/2a701cb1-33d4-4112-bf5d-65123e8aa8e7/manifest'
g = Graph()
g.parse(manifest, format='json-ld')
g.serialize(destination="2a701cb1-33d4-4112-bf5d-65123e8aa8e7.ttl", format='turtle')

So we can actually include the manifests in our graph.