Notebook

<2022-02-18 Fri>

I wrote a short script to split the manifest for the Kennan Papers (MC076) into individual containers (1276 in number) and generate a graph of named entities for each. This is a resource-intensive process (each master TIFF must be downloaded, run through OCR, and processed with SpaCy); using my laptop connected to the Internet via a standard home FIOS service, I was able to process 501 in three days.
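
A minimal sketch of what that per-container loop might look like, assuming a IIIF Presentation 2.x manifest layout, pytesseract for the OCR step, and SpaCy’s small English model. The function name and output handling are illustrative, not the actual script, which also serializes an RDF graph for each container.

# Illustrative per-container pipeline: fetch the manifest, download each
# master image, OCR it, and run spaCy NER over the recognized text.
import requests
import pytesseract
import spacy
from io import BytesIO
from PIL import Image

nlp = spacy.load("en_core_web_sm")

def entities_for_container(manifest_url):
    manifest = requests.get(manifest_url).json()
    for canvas in manifest["sequences"][0]["canvases"]:      # IIIF Presentation 2.x
        image_url = canvas["images"][0]["resource"]["@id"]
        image = Image.open(BytesIO(requests.get(image_url).content))
        text = pytesseract.image_to_string(image)
        for ent in nlp(text).ents:
            yield canvas["@id"], ent.label_, ent.text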

I then loaded all 501 graphs into a local instance of GraphDB-Free running on my laptop, resulting in approximately 40 million statements.
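
Loading can be scripted against GraphDB’s RDF4J-style REST interface; here is a sketch that assumes the per-container graphs were serialized as Turtle files and that the local repository is named kennan (both the directory and the repository name are assumptions, not the actual setup).

# Bulk-load per-container Turtle files into a local GraphDB repository
# by POSTing them to the repository's statements endpoint.
import glob
import requests

STATEMENTS = "http://localhost:7200/repositories/kennan/statements"

for path in sorted(glob.glob("graphs/*.ttl")):
    with open(path, "rb") as f:
        resp = requests.post(STATEMENTS, data=f,
                             headers={"Content-Type": "text/turtle"})
    resp.raise_for_status()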

Below are some exploratory SPARQL queries. Recall that named entities are represented as Symbolic Objects (Appellations, when all is said and done) which have been recorded as Inscriptions on IIIF Canvases.

How many pages are we talking about?

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix ecrm: <http://erlangen-crm.org/200717/> 
prefix entity: <https://figgy.princeton.edu/concerns/entities/> 
prefix etype: <https://figgy.princeton.edu/concerns/adam/> 
prefix inscription: <https://figgy.princeton.edu/concerns/inscriptions/> 
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
select distinct ?canvas { 
  ?something ecrm:P128i_is_carried_by ?canvas .
}

SPARQL returns 26,917 results: there are about 27,000 pages in this sample.
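
The same figure can be obtained directly with a COUNT aggregate instead of counting SELECT rows; a small sketch that posts such a query to the local repository’s SPARQL endpoint (the repository name is again an assumption).

# Ask GraphDB for the page count directly rather than counting result rows.
import requests

ENDPOINT = "http://localhost:7200/repositories/kennan"
QUERY = """
PREFIX ecrm: <http://erlangen-crm.org/200717/>
SELECT (COUNT(DISTINCT ?canvas) AS ?pages)
WHERE { ?something ecrm:P128i_is_carried_by ?canvas . }
"""

resp = requests.get(ENDPOINT, params={"query": QUERY},
                    headers={"Accept": "application/sparql-results+json"})
print(resp.json()["results"]["bindings"][0]["pages"]["value"])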

How many named entities did SpaCy recognize?

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix ecrm: <http://erlangen-crm.org/200717/> 
prefix entity: <https://figgy.princeton.edu/concerns/entities/> 
prefix etype: <https://figgy.princeton.edu/concerns/adam/> 
prefix inscription: <https://figgy.princeton.edu/concerns/inscriptions/> 
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> 

select ?inscription where { ?inscription a ecrm:E34_Inscription .} 

This query returns 738,696 results in less than 0.1 seconds. That is how many “hits” SpaCy recorded, but it isn’t a very useful number on its own: how many of these are distinct names?

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix ecrm: <http://erlangen-crm.org/200717/> 
prefix entity: <https://figgy.princeton.edu/concerns/entities/> 
prefix etype: <https://figgy.princeton.edu/concerns/adam/> 
prefix inscription: <https://figgy.princeton.edu/concerns/inscriptions/> 
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
select distinct ?name { 
  ?inscription a ecrm:E34_Inscription .
  ?inscription ecrm:P106_is_composed_of ?entity .
  ?entity ecrm:P190_has_symbolic_content ?name .
}

This query returns 254,699 distinct strings that SpaCy identified as named entities.

How many names of people are there?

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix ecrm: <http://erlangen-crm.org/200717/> 
prefix entity: <https://figgy.princeton.edu/concerns/entities/> 
prefix etype: <https://figgy.princeton.edu/concerns/adam/> 
prefix inscription: <https://figgy.princeton.edu/concerns/inscriptions/> 
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
select distinct ?name { 
  ?inscription a ecrm:E34_Inscription ; ecrm:E55_Type etype:PERSON .
  ?inscription ecrm:P106_is_composed_of ?entity .
  ?entity ecrm:P190_has_symbolic_content ?name .
}

These are the strings SpaCy, running in naive mode over dirty OCR, identified as the names of persons. SPARQL returns an astounding 95,982 results: almost 96,000 distinct names of people. Not good.

But what do these strings actually look like? The first dozen or so are promising:

?name
Reith
Jimmy Carter
Reith Lectures
Reagan
Wilson
Bill Casey
Ronald Reagan
Kennedy
Robert Gates
Gorbachev
"Ronald Reagan\n"
McNamara
Buddenbrooks

At first glance, this isn’t a bad result; SpaCy picked out strings that are clearly names of one kind or another, and its classification of these names as names of persons is good, with a few exceptions: Reith Lectures is almost certainly the name of an event, not the name of a person. Buddenbrooks is harder to determine without context: it is probably the title of Thomas Mann’s novel, but it could be someone’s last name. More problematic, for a different reason, are the multiple appearances of Ronald Reagan in this list. We can be fairly sure that Reagan and Ronald Reagan are the same person (though they might not be), but Ronald Reagan and “Ronald Reagan\n” are certainly the same. SpaCy’s tokenizer failed to strip the trailing newline from the second of those Ronald Reagans, and as a result SpaCy’s named-entity recognizer treated it as a separate name. This looks like a weakness in SpaCy, or perhaps in our configuration (or lack of one), and we should flag it for further investigation.
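
In the meantime, trailing whitespace is easy to normalize away when the entity text is extracted; a minimal sketch of that normalization (not the project’s actual extraction code):

# Collapse internal whitespace and strip trailing newlines before an entity
# string is recorded, so "Ronald Reagan\n" and "Ronald Reagan" coincide.
def normalize_entity_text(raw):
    return " ".join(raw.split())

assert normalize_entity_text("Ronald Reagan\n") == "Ronald Reagan"
assert normalize_entity_text("Henry\nKissinger") == "Henry Kissinger"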

There is something else to note here. Ronald Reagan and “Ronald Reagan\n” are orthographical variants of the same name, while Reagan is not, though all three refer (almost certainly) to the 40th President of the United States. That is, all three refer to the same named entity even though there are two names. Our application is not interested in names (or appellations, as they are called in CIDOC-CRM) but in those named entities, so our tools must help investigators (the archivists, in this case) weed and winnow these names and assign them to identifiable entities.

Of course, for our purposes this repetition may not be a problem: our application favors recall over precision, so we’re more concerned with not missing names than we are with picking up variants. The sheer number of names, though, could create challenges. Here are all the instances of Kissinger in this partial data set (the numbers are line numbers in the output file):

   60:Henry Kissinger
   63:Kissinger
  144:HENRY KISSINGER
 3777:Henry A.Kissinger
 3779:Henry A. Kissinger
 3785:Henry A. Kissinger's
 6271:Henry Kissinger's
 9881:Robert H. Bork Henry Kissinger Paul W. McCracken Harry
10072:"Henry Kissinger\n"
10097:Henry Kissinger’s
10222:Nixon-Kissinger
11018:"Henry\n\nKissinger"
11138:"Kissinger pro-\n"
11143:"Henry\nKissinger's"
14237:KISSINGER
14270:"Kissinger |\n"
14353:Henry A. Kissinger Lectures
21995:"Henry\nKissinger"
22740:"Henry A.\nKissinger"
30219:H. Kissinger
30237:ALFRED M. GRUENTHER HENRY A. KISSINGER
30468:A. Kissinger
30501:"Kissinger\n"
34353:Henmry Kissinger
39728:Henry A. Kissinger Theodore M. Hesburgh
39963:"Henry A. Kissinger Richard L. Gelb\n"
40166:"Henry\nA. Kissinger"
42573:Kissinger's-
64109:Messrs Kissinger
64573:Henry kissinger
94259:Henry Kissinger eine
94593:"H. Kissinger\n"
94700:Henry A. Kissinger - Vertreter eines

Filtering SpaCy’s candidates into actual named entities (there are seven people intermingled in these strings) will likely require a mixture of human and machine labor.

<2022-02-19 Sat>

There are not 96,000 distinct names in this sample, even though it is a sample of 27,000 pages. This is one of the places where using uncorrected (“dirty”) OCR hampers our endeavors. Past that fortuitous group at the top of the list, the entries become very dirty indeed:

D. Signature
"Jerzy\n"
ieeiier rrr iri rir
"Wee\n"
Wdiinad Pugh
William Peters
E. List
James E. Doyle
"Fe es ee ee eee\n"
New Yor
ak ae
sald
Wolff
Li mucn
juirice Greenbaum
AL VK
MAURICE C. GREENBAUM
L. KAT
Madison Ave
Svetlana

There are a number of options to consider here.

  1. Pre-filter the pages. We know that some of the pages are too dirty to yield any recognizable text. (The purple mimeographs are an example, as, of course, are hand-written pages, drawings, poor-quality photocopies, and so on.) If we had a way to detect those, we could skip trying to find named entities in a sea of garbage.
  2. Train a better model.
  3. Use tools like OpenRefine to clean the data by hand.

A combination of techniques will probably be required.

<2022-02-22 Tue>

Some simple regular-expression-based filtering whittles the list down from 96,000 to 72,000. Clustering with OpenRefine will also be powerful.
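
Heuristics along the following lines do this kind of whittling; these particular rules are illustrative, not the exact expressions that were applied:

# Rough filters for obvious non-names: OCR junk characters and strings
# with no word of three or more letters containing a vowel.
import re

def plausible_name(s):
    s = " ".join(s.split())                       # collapse whitespace and newlines
    if re.search(r"[0-9|_/\\\[\]{}<>@#%*=+]", s): # characters unlikely in a name
        return False
    return any(len(tok) >= 3 and re.search(r"[aeiouyAEIOUY]", tok)
               for tok in re.findall(r"[A-Za-z]+", s))

candidates = ["Henry Kissinger", "H. Kissinger", "ak ae", "AL VK", "Kissinger |\n"]
print([c for c in candidates if plausible_name(c)])   # keeps only the first two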

Clustering is a technique commonly used in natural language processing. It entails finding groups of strings that are similar to one another, using various algorithms to calculate similarity. For example, George Kennan and George Kennen are very similar, because they differ by only one letter; with our data, we can say with great confidence that instances of the string George Kennen should be corrected to be George Kennan, thus reducing the number of name strings from two to one.
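
This kind of similarity is easy to illustrate with Python’s standard library (a toy measure only; OpenRefine’s own methods are more sophisticated):

# A single-letter difference yields a very high similarity ratio.
from difflib import SequenceMatcher

print(SequenceMatcher(None, "George Kennan", "George Kennen").ratio())  # ~0.92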

Other comparisons are not so straightforward. Suppose we are comparing F. L. Smith with F. T. Smith: are these two distinct people, or is one of these strings a mis-spelling of the other? Sometimes, if we know our data, we can make a good guess: John P. Kennedy is almost certainly John F. Kennedy. In other cases, we cannot tell without looking at the original context.

OpenRefine lets you apply half a dozen different clustering algorithms, each of which uses a different heuristic to calculate similarity. In practice, one applies each of them successively; for our experiment so far, I’ve just used the key-collision algorithms, which bring the list down to about 22,000 entries.
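
OpenRefine’s key-collision clustering is built on keying functions such as its “fingerprint” method, and the idea is easy to approximate outside the tool; a sketch of the keying step only, not a replacement for OpenRefine:

# Approximate fingerprint key collision: normalize each string to a key
# (ASCII-fold, lowercase, strip punctuation, sort unique tokens) and group
# strings that share a key.
import re
import unicodedata
from collections import defaultdict

def fingerprint(s):
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()
    s = re.sub(r"[^\w\s]", "", s.lower())
    return " ".join(sorted(set(s.split())))

clusters = defaultdict(list)
for name in ["Henry Kissinger", "HENRY KISSINGER", "Kissinger, Henry", "Henry\nKissinger"]:
    clusters[fingerprint(name)].append(name)

print(dict(clusters))   # all four variants collapse onto the key 'henry kissinger'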

<2022-03-02 Wed>

After another round with OpenRefine, we’re down to about 22,000 name candidates. I’ve started to keep a few snapshot lists in a Google Spreadsheet.

The results, so far, are disappointing. Clustering is a very effective technique, often used in text processing, but it takes time and human labor. At this stage, in a production context, one would probably assign a student (consulting with an archivist) to perform more painstaking iterations over the data, winnowing out partial names and mis-recognized strings to produce a working list of names.

Some observations:

  • There are many German words and phrases in this list. I suspect the two-capitalized-words-in-a-row heuristic is responsible for these; I will do some research to see whether there are standard techniques for handling this problem, which must be a common one.
  • During these clustering/merging steps with OpenRefine, we have lost context: the string-by-string links back to the canvases. There will be ways to recover those links, but they will require more overhead than we want to spend now.

<2022-03-03 Thu>

OpenRefine’s clustering algorithms are indeed powerful, but there is simply too much cruft in this data set: nonsensical strings and the like. Let’s see if we can improve SpaCy’s NER model to give us more accurate results to start with.

I’m using Prodigy, a companion to SpaCy, developed by the same company. Prodigy is an annotation tool that uses machine learning to train data models. It isn’t free, but I have a research license.

We’ll begin by gathering training data. I haven’t been keeping the OCR output, but we can do that easily enough. In fact, we’ll use SpaCy to generate data sets in one of SpaCy’s preferred data formats. And we’ll extend our object models to include metadata about the collection, the container, and the page.

Here’s an example of some training data in jsonl format:

{"text": "Lhe As for the rest of the Soviet Union: the situation that prevails there is both dreadful and dangerous.", "meta": {"Date Created": ["1991 February 3"], "Extent": ["1 folder"], "Identifier": ["ark:/88435/d504rt661"], "Title": ["\"If the Kremlin Can't Rule,\" Op-Ed about the Baltics, The Washington Post "], "Creator": ["Kennan, George F. (George Frost), 1904-2005."], "Language": ["English"], "Publisher": ["Kennan, George F. (George Frost), 1904-2005."], "Portion Note": ["entire component, excluding the C Section of the Washington Post, Feb 3, 1991"], "Container": ["Box 294, Folder 4"], "Rendered Holding Location": ["Mudd Manuscript Library"], "Member Of Collections": ["George F. Kennan Papers MC076"]}}
{"text": "If it is true, as it appears to be, that the supply of consumers’ goods to the larger cities cannot be assured without the wholehearted collaboration of the party apparatus and the armed units in the great rural hinterland of the country, then one could understand why Gorbachev has felt himself compelled to reach back at this time for the support of those institutions.", "meta": {"Date Created": ["1991 February 3"], "Extent": ["1 folder"], "Identifier": ["ark:/88435/d504rt661"], "Title": ["\"If the Kremlin Can't Rule,\" Op-Ed about the Baltics, The Washington Post "], "Creator": ["Kennan, George F. (George Frost), 1904-2005."], "Language": ["English"], "Publisher": ["Kennan, George F. (George Frost), 1904-2005."], "Portion Note": ["entire component, excluding the C Section of the Washington Post, Feb 3, 1991"], "Container": ["Box 294, Folder 4"], "Rendered Holding Location": ["Mudd Manuscript Library"], "Member Of Collections": ["George F. Kennan Papers MC076"]}}

Let’s try training on some of this data.

prodigy ner.manual ner_cold_war_papers blank:en ~/Desktop/training2/ea9a223d-e23c-4d86-894a-4164902ffc3b.jsonl --label PERSON

<2022-03-04 Fri> Review

What have we accomplished so far?

  • We have developed software that enables us to build, in an unattended fashion, datasets of candidate named entities from pages, containers, and entire collections, based on Figgy’s IIIF manifests.
  • We have developed a data model that enables us to represent this (meta)data as annotations to IIIF canvases, thereby integrating it with Figgy’s underlying data model and the IIIF software base (viewers, annotation servers) already developed by ITMS.
  • We have begun to analyze the data that results from naive applications of NLP software.

Unsurprisingly, the brute-force naive approach we’ve applied so far is unsatisfactory: it produces too much noise. How can we improve these results so that we can produce a useful set of infrequent names?

Be smarter about what you look at.
Our tools naively process every page in the collection. Some of that data may not be useful or relevant (drafts of published works; newspaper clippings; handwritten notes, which cannot yet be processed with OCR; other ephemera). In reality, an archivist would pre-select the components of the collection that are most amenable to this kind of analysis.

We also apply NER to the OCR output without checking on its quality: if we could throw out pages that were poorly recognized (again, hand-written materials; mimeographs; other bad originals), we might improve our overall NER: less garbage in, less garbage out.

Take smaller bites.
Archival collections are naturally sub-divided into thematically related components and sub-components. We are likely to get better results if we use those subdivisions to our advantage: to make hand-correction tractable and to train models iteratively.

Next Steps

  • Filter out poor OCR. Use confidence thresholds produced by Tesseract. Unfortunately, that means we can’t use the OCR already produced by Figgy.
  • Be selective in what we process. Use the Collection’s Indexes to produce training data. Concentrate on the Correspondence series.
  • Some containers might be amenable to image cleanup to improve OCR.
  • Augment our training set with more patterns. Will & Alexis have provided some name lists to help train our model, but we can expand that training set using some common NLP techniques (a sketch follows this list).
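
One common approach is to turn those name lists into the token-pattern JSONL that SpaCy’s EntityRuler consumes and that Prodigy can take via its --patterns option. A sketch, assuming a plain-text file with one name per line; the file names are hypothetical:

# Convert a plain list of names into token-based match patterns, one JSON
# object per line. Input and output paths are illustrative.
import json

def names_to_patterns(names_path, patterns_path, label="PERSON"):
    with open(names_path, encoding="utf-8") as src, \
         open(patterns_path, "w", encoding="utf-8") as out:
        for line in src:
            name = " ".join(line.split())
            if not name:
                continue
            pattern = [{"LOWER": tok.lower()} for tok in name.split()]
            out.write(json.dumps({"label": label, "pattern": pattern}) + "\n")

# names_to_patterns("name_lists/correspondents.txt", "person_patterns.jsonl")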

<2022-03-07 Mon>

Correspondence is a good set to work with. Correspondence usually has lots of names; the names will likely vary by correspondent (the social network formed by names mentioned in correspondence would probably be interesting); and there’s a lot of it in the Kennan Papers. We’ll start with subseries 1A, because much of it has been digitized.

Series 1, Subseries 1A: Permanent Correspondence, 1918-2004

There are 658 files in subseries 1A, including an index:

  • Index of permanent files, undated

    This index is an excellent data set for training; we’ll look at that in a minute. But first, let’s work on making the base data (the OCR output) better.

    OCR engines (like Tesseract) can produce plain-text output, but they can usually do much more. We’ve seen how Tesseract can serialize the text it recognizes as hOCR or ALTO, but it can also generate a detailed table of data as output, including confidence scores for each word and each block of text it discovers. A confidence score is a measure of how certain the engine is that it has recognized the word (or block, or even character) correctly. We know now, from experience, that if the OCR is poor, the NER will be poor, so if we can filter out text that has been badly OCR’d, our NER accuracy should improve.

    Deciding where to set the threshold may require some trial and error. Based on some research, it looks like setting the cutoff somewhere between 97.5 and 98.5 is common in real-world applications. Let’s try both ends and see what happens.

    <2022-03-07 Mon> Those numbers don’t work at the block level; too many blocks get rejected. Something closer to 55 seems to be in the right range, but this may not be the best way; perhaps it will be better to process at the word level.
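
    A word-level sketch using pytesseract’s tabular output; the threshold and file handling are illustrative starting points, not tuned values:

    # Keep only words Tesseract is reasonably confident about, using the
    # per-word confidence column from image_to_data.
    import pytesseract
    from pytesseract import Output
    from PIL import Image

    def confident_text(image_path, min_conf=75):
        data = pytesseract.image_to_data(Image.open(image_path), output_type=Output.DICT)
        words = [w for w, c in zip(data["text"], data["conf"])
                 if w.strip() and float(c) >= min_conf]
        return " ".join(words)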
