
LLM-powered-OCR-correction

This repo contains files downloaded from Transkribus with corresponding suggested OCR improvements (performed using ChatGPT API).

It follows an editor-in-the-loop approach to using LLM assistants in Digital Humanities work. The OCR/HTR output from Transkribus (NOSCEMUS GM4-6) is already of high quality, typically containing only 5-10 minor mistakes per page (on the order of a slightly incorrect letter or a missing space). The LLM assistance is used to speed up the manual correction process, which can take up to 10 minutes per page even if one tries not to actually read the page while correcting, despite the already high quality of the HTR. This step is necessary to bring the text up to edition quality.

However, I want to retain full control over the final output as an editor, so I came up with a workflow that automatically generates an LLM-powered improvement suggestion for the OCR and compares it to the original (using `git diff` in color mode locally, and GitHub's automatic history visualization here). This gives me a quick visual overview of where the necessary changes are located in the text. I then fix the changes I agree with in Transkribus and make any other necessary edits there, which allows me to download suitable TEI-XML from Transkribus. (The TEI-XML export is quite good but still needs some post-processing to fully meet my needs.)


Practical information

Having one of my historical books (ca. 200-300 pages) proofread like this cost ca. $0.50. Setting up the workflow was a one-time effort of a few hours; running the script on new data takes a while in the background but is quite convenient.

Currently, I still need ca. 5-10 minutes per page to implement the changes manually in Transkribus, including structural edits. Proofreading entirely manually takes ca. 10 minutes per page, and that is without the structural edits included in the time mentioned before.


OCR Post-Correction Automation with Custom GPT

This repository contains a Python script that automates the process of sending OCR outputs to a custom GPT model for post-correction. The script processes .txt files generated in Transkribus (download .txt as 'one per page') and their corresponding image files, sending them to a GPT model using OpenAI's API. The corrected text is saved as new .txt files in an output directory for easy comparison.
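The core of such a request pairs each page's OCR text with its facsimile image in a single chat message. The following is a minimal sketch of how that payload could be assembled; the prompt wording, PNG image format, and function names are illustrative assumptions, not taken from `korr-ocr.py` itself:

```python
import base64
from pathlib import Path

def build_messages(ocr_text, image_path):
    """Build a chat payload pairing the OCR text of one page with its image.

    The system prompt below is a placeholder; the actual instructions live
    in the custom GPT / the real script.
    """
    img_b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return [
        {"role": "system",
         "content": "Correct obvious OCR errors; change nothing else."},
        {"role": "user", "content": [
            {"type": "text", "text": ocr_text},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
        ]},
    ]

# With a valid key, this payload would then be sent via
# OpenAI(...).chat.completions.create(model=..., messages=build_messages(...)).
```

Embedding the image as a base64 data URL keeps the request self-contained, at the cost of larger payloads per page.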

Directory Structure

The directory structure should be organized as follows:

project_root/
│
├── project_keys/
│   ├── ocr-key.txt             # add OpenAI API key
│   └── OCR-Correction.txt      # add project ID (for reference, if needed)
│
├── data/
│   └── ocr-correction/
│       └── [PDF_name]/
│           ├── transkribus-output/   # Contains the OCR output text files (.txt)
│           ├── images/               # Contains the corresponding images for the text pages
│           └── gpt-output/           # Output directory for the corrected text files
│
├── korr-ocr.py                    # The main Python script for processing

File Details:

  • ocr-key.txt: Contains your OpenAI API key.
  • OCR-Correction.txt: Stores the project ID, though this is not required by the API call.
  • transkribus-output/: Stores the text files generated by the OCR process. Each file represents one page of the document.
  • images/: Contains image files corresponding to the OCR text files, with matching filenames.
  • gpt-output/: This directory is created by the script to store the post-corrected text files, using the same filenames as the original OCR files.

Requirements

To run this script, you will need:

  • Python 3.x
  • The openai Python library (the script also uses os, which ships with the standard library and needs no installation)

To install the required library, you can use:

pip install openai

Usage

  1. Set up your API keys: Place your OpenAI API key in a file called ocr-key.txt inside the project_keys/ directory.

  2. Prepare the data:

    • Place the OCR text files in the transkribus-output/ directory within the subdirectory named after the PDF document.
    • Place the corresponding images for each text file in the images/ directory with the same filenames as the text files.
  3. Run the Script: Run the script using Python:

    python korr-ocr.py

    You will be prompted to input the name of the sub-directory (representing a PDF) you want to process. For example:

    Enter the name of the OCR sub-directory (PDF file) you want to process: my_document

    The script will process each OCR page from the transkribus-output/ directory, send the text and corresponding image in batches to your custom GPT, and store the corrected outputs in the gpt-output/ directory.
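Given the directory layout described above, the per-page loop could look roughly like this. The function and parameter names are illustrative, not taken from `korr-ocr.py`; `correct_page` stands in for the actual GPT call, and `.png` images are assumed:

```python
from pathlib import Path

def process_document(doc_dir, correct_page):
    """Pair each page's .txt with its image and write corrected text.

    Reads from transkribus-output/, looks up the matching image in images/
    (same filename, .png assumed), and writes the result to gpt-output/
    under the original filename. `correct_page(text, image_path)` is a
    placeholder for the GPT request.
    """
    doc_dir = Path(doc_dir)
    out_dir = doc_dir / "gpt-output"
    out_dir.mkdir(exist_ok=True)
    for txt_file in sorted((doc_dir / "transkribus-output").glob("*.txt")):
        image = doc_dir / "images" / (txt_file.stem + ".png")
        corrected = correct_page(txt_file.read_text(encoding="utf-8"), image)
        (out_dir / txt_file.name).write_text(corrected, encoding="utf-8")
```

Keeping the filenames identical across the three directories is what makes the later page-by-page `git diff` comparison straightforward.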

Example

For a document stored in the folder my_document, the structure should be as follows:

data/
└── ocr-correction/
    └── my_document/
        ├── transkribus-output/
        │   ├── page_1.txt
        │   ├── page_2.txt
        │   └── page_3.txt
        ├── images/
        │   ├── page_1.png
        │   ├── page_2.png
        │   └── page_3.png
        └── gpt-output/     # This will be populated after running the script

The processed text files will be saved in the gpt-output/ directory within the same sub-directory as the OCR input files, and the filenames will match the original OCR text files. For example, after processing, you'll see files like:

gpt-output/
├── page_1.txt
├── page_2.txt
└── page_3.txt

Creating the git diff commands

Then, the script generate-diff.py automatically creates the git diff commands to compare the transcriptions for individual pages (taking the directory structure into account). Unfortunately, two commands are required per page because the first one does not visualize missing whitespace, for instance.
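A minimal sketch of how such command pairs can be generated is shown here; the function name and `base` parameter are illustrative, not taken from `generate-diff.py`:

```python
from pathlib import Path

def diff_commands(doc_name, base="data/ocr-correction"):
    """Yield the two git diff commands per page: first character-level
    (--color-words=.), then whitespace-aware (--word-diff-regex)."""
    doc_dir = Path(base) / doc_name
    for txt in sorted((doc_dir / "transkribus-output").glob("*.txt")):
        a = doc_dir / "transkribus-output" / txt.name
        b = doc_dir / "gpt-output" / txt.name
        yield f"git diff --color-words=. {a} {b};"
        yield (f'git diff --word-diff=color --word-diff-regex="[ ]+|[^ ]+" '
               f"{a} {b};")
```

The regex `[ ]+|[^ ]+` treats runs of spaces as their own "words", which is what makes missing or extra whitespace visible in the second command.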

# Comparing '0122_hab-arcana-drucke_248-quod-1s_00001-eb04-122.txt' in 'Arcana'
git diff --color-words=. data/ocr-correction/Arcana/transkribus-output/0122_hab-arcana-drucke_248-quod-1s_00001-eb04-122.txt data/ocr-correction/Arcana/gpt-output/0122_hab-arcana-drucke_248-quod-1s_00001-eb04-122.txt;
git diff --word-diff=color --word-diff-regex="[ ]+|[^ ]+" data/ocr-correction/Arcana/transkribus-output/0122_hab-arcana-drucke_248-quod-1s_00001-eb04-122.txt data/ocr-correction/Arcana/gpt-output/0122_hab-arcana-drucke_248-quod-1s_00001-eb04-122.txt;

Sometimes, the GPT output contains ^M (carriage-return) line endings that produce distracting noise in the git diff output, since git diff highlights them as changes. You could fix this in the GPT prompt, normalize the line endings before diffing, or just ignore them.
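If you want to normalize the endings rather than ignore them, a one-step cleanup (a sketch, not part of the repo's scripts) suffices:

```python
def strip_carriage_returns(text):
    """Normalize CRLF and bare CR line endings to LF so that git diff
    no longer flags ^M characters as changes."""
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

Applying this to each GPT output file before writing it to gpt-output/ keeps the diffs limited to actual text corrections.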

Here are two example images of what the result looks like:

  • character-level errors output of the first command
  • word-level errors (including whitespace errors) output of the second command (whitespace)

For your reference, here is the facsimile image: original image for which the transcription is generated. (To run the script, it would need to be in data/ocr-correction/Arcana/images, not where it is currently placed in the directory structure of this GitHub repo.)

And this is what it would look like if you directly committed the gpt-output over the original Transkribus .txt files in GitHub (to have the diff auto-visualized in the version history functionality, see here): example of diff in GitHub version history

Helper script

The script what-gpts-available-for-apiaccess.py shows you which OpenAI models are available to you through the API. Note that this set may differ from the models in your Pro account (if you have one), and the billing is also separate. I have found that GPT-3.5 Turbo works best for my use case (and it is quite cheap).
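The model listing itself is a small API call. The sketch below shows the idea; the function name is an assumption, and the `client` parameter is only there so the function can be exercised without a real key:

```python
def available_models(client=None):
    """Return the model IDs visible to this API key, sorted alphabetically.

    Without an argument, an OpenAI client is created from the
    OPENAI_API_KEY environment variable (requires `pip install openai`).
    """
    if client is None:
        from openai import OpenAI  # lazy import keeps the sketch optional
        client = OpenAI()
    return sorted(model.id for model in client.models.list())
```

Comparing this list against the models advertised in the web interface is the quickest way to see the Pro-account/API mismatch mentioned above.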


This README and the code were created with assistance from ChatGPT 4o and o1-preview.
