This repo contains files downloaded from Transkribus together with suggested OCR improvements generated via the ChatGPT API.
It follows an editor-in-the-loop approach to using LLM assistants in Digital Humanities work.
The OCR/HTR output from Transkribus (NOSCEMUS GM4-6) is of high quality, typically containing only 5-10 minor mistakes per page (on the order of a slightly incorrect letter or a missing space).
The LLM assistance speeds up the manual correction process, which can take up to 10 minutes per page even if one tries not to actually read the page while correcting (and despite the HTR already being of really high quality).
This step is necessary to bring the text up to edition quality. However, I want to retain full control over the final output as an editor, so I came up with this workflow: it automatically generates an LLM-powered improvement suggestion for the OCR, which is then compared to the original (using `git diff` in color mode locally, or GitHub's automatic history visualization here).
This gives me a quick visual overview of where the necessary changes are located in the text. I then apply the changes I agree with in Transkribus and make any other edits needed there so that I can download suitable TEI-XML from Transkribus. (The TEI-XML export is quite good but still needs some post-processing to fully meet my needs.)
Having one of my historical books (ca. 200-300 pages) proofread like this cost ca. $0.50. Setting up the workflow took a few hours once; running the script on new data takes a while in the background but is quite comfortable.
Currently, I still need ca. 5-10 minutes per page to implement the changes manually in Transkribus. Proofreading entirely by hand takes ca. 10 minutes per page, but without the structural edits that are included in the time mentioned before.
This repository contains a Python script that automates the process of sending OCR outputs to a custom GPT model for post-correction. The script processes `.txt` files generated in Transkribus (downloaded as 'one per page') and their corresponding image files, sending them to a GPT model using OpenAI's API. The corrected text is saved as new `.txt` files in an output directory for easy comparison.
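To illustrate the core idea, here is a minimal sketch of how such a request could be assembled. The function name, prompt wording, and default model are my own placeholders, not taken from `korr-ocr.py`; note that sending images requires a vision-capable model:

```python
import base64
from pathlib import Path

def build_correction_request(txt_path: Path, img_path: Path,
                             model: str = "gpt-3.5-turbo") -> dict:
    """Assemble a chat-completion payload pairing one OCR page with its image."""
    ocr_text = txt_path.read_text(encoding="utf-8")
    image_b64 = base64.b64encode(img_path.read_bytes()).decode("ascii")
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": ("Correct minor OCR errors in the following page. "
                         "Change as little as possible.")},
            {"role": "user",
             "content": [
                 {"type": "text", "text": ocr_text},
                 # Images are passed inline as a base64 data URL
                 {"type": "image_url",
                  "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
             ]},
        ],
    }
```

The resulting dictionary can be unpacked into `client.chat.completions.create(**payload)` with the `openai` client.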
The directory structure should be organized as follows:
```
project_root/
│
├── project_keys/
│   ├── ocr-key.txt          # add OpenAI API key
│   └── OCR-Correction.txt   # add project ID (for reference, if needed)
│
├── data/
│   └── ocr-correction/
│       └── [PDF_name]/
│           ├── transkribus-output/  # Contains the OCR output text files (.txt)
│           ├── images/              # Contains the corresponding images for the text pages
│           └── gpt-output/          # Output directory for the corrected text files
│
├── korr-ocr.py              # The main Python script for processing
```
- `ocr-key.txt`: Contains your OpenAI API key.
- `OCR-Correction.txt`: Stores the project ID, though this is not required by the API call.
- `transkribus-output/`: Stores the text files generated by the OCR process. Each file represents one page of the document.
- `images/`: Contains image files corresponding to the OCR text files, with matching filenames.
- `gpt-output/`: This directory is created by the script to store the post-corrected text files, using the same filenames as the original OCR files.
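Since the pairing of text and image files relies purely on matching filenames, a helper along these lines could collect the page pairs (a sketch under that assumption; the actual script may resolve the files differently):

```python
from pathlib import Path

def paired_pages(book_dir: Path):
    """Yield (txt_file, image_file) pairs whose filename stems match."""
    txt_dir = book_dir / "transkribus-output"
    img_dir = book_dir / "images"
    for txt_file in sorted(txt_dir.glob("*.txt")):
        # Try a few common image extensions for the same stem
        for ext in (".png", ".jpg", ".jpeg", ".tif"):
            img_file = img_dir / (txt_file.stem + ext)
            if img_file.exists():
                yield txt_file, img_file
                break
```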
To run this script, you will need:
- Python 3.x
- The `openai` Python library (`os` is part of the standard library and needs no installation)

To install the required library, you can use:

```
pip install openai
```
1. **Set up your API key:** Place your OpenAI API key in a file called `ocr-key.txt` inside the `project_keys/` directory.
2. **Prepare the data:**
   - Place the OCR text files in the `transkribus-output/` directory within the subdirectory named after the PDF document.
   - Place the corresponding images for each text file in the `images/` directory, using the same filenames as the text files.
3. **Run the script:**

   ```
   python korr-ocr.py
   ```
You will be prompted to input the name of the sub-directory (representing a PDF) you want to process. For example:

```
Enter the name of the OCR sub-directory (PDF file) you want to process: my_document
```

The script will process each OCR page from the `transkribus-output/` directory, send the text and corresponding image in batches to your custom GPT, and store the corrected outputs in the `gpt-output/` directory.
For a document stored in the folder `my_document`, the structure should be as follows:

```
data/
└── ocr-correction/
    └── my_document/
        ├── transkribus-output/
        │   ├── page_1.txt
        │   ├── page_2.txt
        │   └── page_3.txt
        ├── images/
        │   ├── page_1.png
        │   ├── page_2.png
        │   └── page_3.png
        └── gpt-output/   # This will be populated after running the script
```

The processed text files will be saved in the `gpt-output/` directory within the same sub-directory as the OCR input files, with filenames matching the original OCR text files. For example, after processing, you'll see files like:

```
gpt-output/
├── page_1.txt
├── page_2.txt
└── page_3.txt
```
Then, the script `generate-diff.py` is used to automatically create the `diff` commands to compare the transcriptions for individual pages (taking the directory structure into account).
Unfortunately, two commands are required per page because the first one does not visualize missing whitespace, for instance:

```
# Comparing '0122_hab-arcana-drucke_248-quod-1s_00001-eb04-122.txt' in 'Arcana'
git diff --color-words=. data/ocr-correction/Arcana/transkribus-output/0122_hab-arcana-drucke_248-quod-1s_00001-eb04-122.txt data/ocr-correction/Arcana/gpt-output/0122_hab-arcana-drucke_248-quod-1s_00001-eb04-122.txt;
git diff --word-diff=color --word-diff-regex="[ ]+|[^ ]+" data/ocr-correction/Arcana/transkribus-output/0122_hab-arcana-drucke_248-quod-1s_00001-eb04-122.txt data/ocr-correction/Arcana/gpt-output/0122_hab-arcana-drucke_248-quod-1s_00001-eb04-122.txt;
```
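Generating these commands is mechanical. A sketch of what `generate-diff.py` presumably does, with function and parameter names of my own choosing, not the script's actual code:

```python
from pathlib import Path

def diff_commands(book: str, base: Path = Path("data/ocr-correction")) -> list[str]:
    """Emit the two git-diff commands per page, comparing Transkribus and GPT output."""
    commands = []
    for txt in sorted((base / book / "transkribus-output").glob("*.txt")):
        a = base / book / "transkribus-output" / txt.name
        b = base / book / "gpt-output" / txt.name
        commands.append(f"# Comparing '{txt.name}' in '{book}'")
        # Character-level coloring; does not show pure whitespace changes
        commands.append(f"git diff --color-words=. {a} {b};")
        # Word-diff with a regex that also treats runs of spaces as tokens
        commands.append(
            f'git diff --word-diff=color --word-diff-regex="[ ]+|[^ ]+" {a} {b};'
        )
    return commands
```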
Sometimes, the GPT output contains `^M` (carriage-return) line endings, which produce some distracting noise because `git diff` flags them.
You could fix this in the GPT prompt or just ignore them.
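Alternatively, you could normalize the line endings before diffing; a small helper like this (my own suggestion, not part of the repo's scripts) could be run over the `gpt-output/` files:

```python
def normalize_line_endings(text: str) -> str:
    """Convert CRLF and bare CR to LF so git diff no longer flags ^M characters."""
    return text.replace("\r\n", "\n").replace("\r", "\n")
```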
Here are two example images of what the result looks like:
For your reference, here is the facsimile image:
(To run the script, it would need to be in `data/ocr-correction/Arcana/images`, not where it is currently placed in the directory structure of this GitHub repo.)
And this is what it looks like if you commit the gpt-output directly over the original Transkribus `.txt` files on GitHub (to have the changes auto-visualized in the version history, see here):
The script `what-gpts-available-for-apiaccess.py` shows you which OpenAI models are available to you through the API. Unfortunately, this may not be the same set as in your Pro account (if you have one), and the billing is also separate. Thankfully, I have found that GPT-3.5 Turbo works best for my use case (and it is quite cheap).
This README and the code were created with assistance from ChatGPT 4o and o1-preview.