
LLM-powered-OCR-correction

This repo contains files downloaded from Transkribus with corresponding suggested OCR improvements (performed using ChatGPT API).

It follows an editor-in-the-loop approach to using LLM assistants in Digital Humanities work. The OCR/HTR output from Transkribus (NOSCEMUS GM4-6) is already of high quality, typically containing only 5-10 minor mistakes per page (on the order of a slightly incorrect letter or a missing space). The LLM assistance is used to speed up the manual correction process, which can take up to 10 minutes per page even if one tries not to actually read the page while correcting, despite the already high quality of the HTR. This step is necessary to bring the text up to edition quality.

However, I want to retain full control over the final output as an editor, so I came up with a workflow that automatically generates an LLM-powered improvement suggestion for the OCR and compares it to the original (using `git diff` in color mode locally, and GitHub's automatic history visualization here). This gives me a quick visual overview of where the necessary changes are located in the text. I then fix the changes I agree with in Transkribus and make any other necessary edits there, which allows me to download suitable TEI-XML from Transkribus. (The TEI-XML export is quite good but still needs some post-processing to fully meet my needs.)


Practical information

Having one of my historical books (ca. 200-300 pages) proofread like this cost ca. $0.50. Setting up the workflow was a one-time effort of a few hours; running the script on new data takes a while in the background but is quite convenient.

Currently, I still need ca. 5-10 minutes per page to implement the changes manually in Transkribus, including structural edits. Proofreading entirely manually takes ca. 10 minutes per page, and that is without the structural edits included in the time mentioned before.


OCR Post-Correction Automation with Custom GPT

This repository contains a Python script that automates the process of sending OCR outputs to a custom GPT model for post-correction. The script processes .txt files generated in Transkribus (download .txt as 'one per page') and their corresponding image files, sending them to a GPT model using OpenAI's API. The corrected text is saved as new .txt files in an output directory for easy comparison.
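The core of such a request pairs each page's OCR text with its facsimile image in a single chat message. The following is a minimal sketch of how that payload could be assembled; the prompt wording, PNG image format, and function names are illustrative assumptions, not taken from `korr-ocr.py` itself:

```python
import base64
from pathlib import Path

def build_messages(ocr_text, image_path):
    """Build a chat payload pairing the OCR text of one page with its image.

    The system prompt below is a placeholder; the actual instructions live
    in the custom GPT / the real script.
    """
    img_b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return [
        {"role": "system",
         "content": "Correct obvious OCR errors; change nothing else."},
        {"role": "user", "content": [
            {"type": "text", "text": ocr_text},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
        ]},
    ]

# With a valid key, this payload would then be sent via
# OpenAI(...).chat.completions.create(model=..., messages=build_messages(...)).
```

Embedding the image as a base64 data URL keeps the request self-contained, at the cost of larger payloads per page.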

Directory Structure

The directory structure should be organized as follows:

project_root/
│
├── project_keys/
│   ├── ocr-key.txt             # add OpenAI API key
│   └── OCR-Correction.txt      # add project ID (for reference, if needed)
│
├── data/
│   └── ocr-correction/
│       └── [PDF_name]/
│           ├── transkribus-output/   # Contains the OCR output text files (.txt)
│           ├── images/               # Contains the corresponding images for the text pages
│           └── gpt-output/           # Output directory for the corrected text files
│
├── korr-ocr.py                    # The main Python script for processing

File Details:

  • ocr-key.txt: Contains your OpenAI API key.
  • OCR-Correction.txt: Stores the project ID, though this is not required by the API call.
  • transkribus-output/: Stores the text files generated by the OCR process. Each file represents one page of the document.
  • images/: Contains image files corresponding to the OCR text files, with matching filenames.
  • gpt-output/: This directory is created by the script to store the post-corrected text files, using the same filenames as the original OCR files.

Requirements

To run this script, you will need:

  • Python 3.x
  • The openai Python library (the script also uses os, which ships with the standard library and needs no installation)

To install the required library, you can use:

pip install openai

Usage

  1. Set up your API keys: Place your OpenAI API key in a file called ocr-key.txt inside the project_keys/ directory.

  2. Prepare the data:

    • Place the OCR text files in the transkribus-output/ directory within the subdirectory named after the PDF document.
    • Place the corresponding images for each text file in the images/ directory with the same filenames as the text files.
  3. Run the Script: Run the script using Python:

    python korr-ocr.py

    You will be prompted to input the name of the sub-directory (representing a PDF) you want to process. For example:

    Enter the name of the OCR sub-directory (PDF file) you want to process: my_document

    The script will process each OCR page from the transkribus-output/ directory, send the text and corresponding image in batches to your custom GPT, and store the corrected outputs in the gpt-output/ directory.
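Given the directory layout described above, the per-page loop could look roughly like this. The function and parameter names are illustrative, not taken from `korr-ocr.py`; `correct_page` stands in for the actual GPT call, and `.png` images are assumed:

```python
from pathlib import Path

def process_document(doc_dir, correct_page):
    """Pair each page's .txt with its image and write corrected text.

    Reads from transkribus-output/, looks up the matching image in images/
    (same filename, .png assumed), and writes the result to gpt-output/
    under the original filename. `correct_page(text, image_path)` is a
    placeholder for the GPT request.
    """
    doc_dir = Path(doc_dir)
    out_dir = doc_dir / "gpt-output"
    out_dir.mkdir(exist_ok=True)
    for txt_file in sorted((doc_dir / "transkribus-output").glob("*.txt")):
        image = doc_dir / "images" / (txt_file.stem + ".png")
        corrected = correct_page(txt_file.read_text(encoding="utf-8"), image)
        (out_dir / txt_file.name).write_text(corrected, encoding="utf-8")
```

Keeping the filenames identical across the three directories is what makes the later page-by-page `git diff` comparison straightforward.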

Example

For a document stored in the folder my_document, the structure should be as follows:

data/
└── ocr-correction/
    └── my_document/
        ├── transkribus-output/
        │   ├── page_1.txt
        │   ├── page_2.txt
        │   └── page_3.txt
        ├── images/
        │   ├── page_1.png
        │   ├── page_2.png
        │   └── page_3.png
        └── gpt-output/     # This will be populated after running the script

The processed text files will be saved in the gpt-output/ directory within the same sub-directory as the OCR input files, and the filenames will match the original OCR text files. For example, after processing, you'll see files like:

gpt-output/
├── page_1.txt
├── page_2.txt
└── page_3.txt

Creating the git diff commands

Then, the script generate-diff.py automatically creates the git diff commands to compare the transcriptions for individual pages (taking the directory structure into account). Unfortunately, two commands are required per page because the first one does not visualize missing whitespace, for instance.
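A minimal sketch of how such command pairs can be generated is shown here; the function name and `base` parameter are illustrative, not taken from `generate-diff.py`:

```python
from pathlib import Path

def diff_commands(doc_name, base="data/ocr-correction"):
    """Yield the two git diff commands per page: first character-level
    (--color-words=.), then whitespace-aware (--word-diff-regex)."""
    doc_dir = Path(base) / doc_name
    for txt in sorted((doc_dir / "transkribus-output").glob("*.txt")):
        a = doc_dir / "transkribus-output" / txt.name
        b = doc_dir / "gpt-output" / txt.name
        yield f"git diff --color-words=. {a} {b};"
        yield (f'git diff --word-diff=color --word-diff-regex="[ ]+|[^ ]+" '
               f"{a} {b};")
```

The regex `[ ]+|[^ ]+` treats runs of spaces as their own "words", which is what makes missing or extra whitespace visible in the second command.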

# Comparing '0122_hab-arcana-drucke_248-quod-1s_00001-eb04-122.txt' in 'Arcana'
git diff --color-words=. data/ocr-correction/Arcana/transkribus-output/0122_hab-arcana-drucke_248-quod-1s_00001-eb04-122.txt data/ocr-correction/Arcana/gpt-output/0122_hab-arcana-drucke_248-quod-1s_00001-eb04-122.txt;
git diff --word-diff=color --word-diff-regex="[ ]+|[^ ]+" data/ocr-correction/Arcana/transkribus-output/0122_hab-arcana-drucke_248-quod-1s_00001-eb04-122.txt data/ocr-correction/Arcana/gpt-output/0122_hab-arcana-drucke_248-quod-1s_00001-eb04-122.txt;

Sometimes, the GPT output contains ^M (carriage-return) line endings that produce distracting noise in the git diff output, since git diff highlights them as changes. You could fix this in the GPT prompt, normalize the line endings before diffing, or just ignore them.
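If you want to normalize the endings rather than ignore them, a one-step cleanup (a sketch, not part of the repo's scripts) suffices:

```python
def strip_carriage_returns(text):
    """Normalize CRLF and bare CR line endings to LF so that git diff
    no longer flags ^M characters as changes."""
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

Applying this to each GPT output file before writing it to gpt-output/ keeps the diffs limited to actual text corrections.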

Here are two example images of what the result looks like:

  • character-level errors output of the first command
  • word-level errors (including whitespace errors) output of the second command (whitespace)

For your reference, here is the facsimile image: original image for which the transcription is generated. (To run the script, it would need to be in data/ocr-correction/Arcana/images, not where it is currently placed in the directory structure of this GitHub repo.)

And this is what it would look like if you directly committed the gpt-output over the original Transkribus .txt files in GitHub (to have the diff auto-visualized in the version history functionality, see here): example of diff in GitHub version history

Helper script

The script what-gpts-available-for-apiaccess.py shows you which OpenAI models are available to you through the API. Note that this set may differ from the models in your Pro account (if you have one), and the billing is also separate. I have found that GPT-3.5 Turbo works best for my use case (and it is quite cheap).
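The model listing itself is a small API call. The sketch below shows the idea; the function name is an assumption, and the `client` parameter is only there so the function can be exercised without a real key:

```python
def available_models(client=None):
    """Return the model IDs visible to this API key, sorted alphabetically.

    Without an argument, an OpenAI client is created from the
    OPENAI_API_KEY environment variable (requires `pip install openai`).
    """
    if client is None:
        from openai import OpenAI  # lazy import keeps the sketch optional
        client = OpenAI()
    return sorted(model.id for model in client.models.list())
```

Comparing this list against the models advertised in the web interface is the quickest way to see the Pro-account/API mismatch mentioned above.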


This README and the code were created with assistance from ChatGPT 4o and o1-preview.
