
Kleio

A Python package for correcting OCR and text outputs using Large Language Models (LLMs). Named after Kleio (or Clio), the muse of history, it helps make historical and scanned texts more accurate and readable.


Features

  • Intelligent OCR using Tesseract with automatic mode selection
  • OCR text correction using LLMs
  • Text normalization and cleaning
  • Support for multiple LLM backends
  • Historical text processing capabilities
  • Batch processing support
  • Structured logging with both Logfire and standard logging
  • Detailed OCR confidence scoring with PSM mode metadata
  • Semantic text chunking with configurable overlap
  • Recursive text summarization using LLMs
  • Map-reduce summarization with provenance tracking

Installation

Using Poetry (Local Development)

# Install dependencies
poetry install

# Set up configuration
cp .env.example .env
# Edit .env with your settings:
# - LOGFIRE_SEND_TO_LOGFIRE and LOGFIRE_TOKEN for logfire logging (optional)
# - TESSERACT_CMD for custom tesseract path
# - OCR_LANGUAGE for non-English OCR
# - OCR_DPI for PDF conversion quality
# - KLEIO_CHUNK_MAX_TOKENS and KLEIO_CHUNK_SENTENCE_OVERLAP for chunking
# - KLEIO_SUMMARY_TARGET_LENGTH, KLEIO_SUMMARY_MIN_CHUNKS, and 
#   KLEIO_SUMMARY_GROUP_SIZE for summarization
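
The variables above can also be inspected from Python before running anything. A minimal sketch using python-dotenv (an assumption; kleio may load its configuration through its own settings module):

from dotenv import load_dotenv  # pip install python-dotenv
import os

load_dotenv()  # read .env from the current directory
print(os.getenv("OCR_LANGUAGE", "eng"))     # the "eng" default here is illustrative
print(os.getenv("KLEIO_CHUNK_MAX_TOKENS"))  # None if unset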

Using Docker (Optional)

Docker setup is optional and provides an isolated environment with all dependencies pre-configured.

  1. Install Docker:

    # Ubuntu/Debian
    sudo apt-get update
    sudo apt-get install docker.io docker-compose-plugin
    
    # macOS (installs Docker Desktop, which bundles Compose)
    brew install --cask docker
    
    # Windows
    # Download Docker Desktop from https://www.docker.com/products/docker-desktop
  2. Build and configure:

    # Build the Docker image
    docker compose build
    
    # Create a .env file from the example
    cp .env.example .env
    # Edit .env with your settings

Logging Configuration

The system supports two logging backends:

  1. Logfire (Optional)

    • Set LOGFIRE_SEND_TO_LOGFIRE=true and LOGFIRE_TOKEN in .env
    • Provides structured logging with Pydantic integration
    • Automatic model validation logging
    • Web dashboard for log analysis
    • Gracefully falls back to standard logging if not configured
    • Robust configuration validation and error handling
  2. Standard Logging (Always available)

    • JSON-formatted logs to stdout
    • Includes context and metadata
    • No external dependencies
    • Automatically used when Logfire is disabled or misconfigured
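
For reference, a stdlib-only JSON logger along these lines can serve as such a fallback (an illustration only; kleio's actual formatter and field names may differ):

import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])
logging.getLogger("kleio").info("standard logging fallback active")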

Usage

OCR Processing

from kleio.ocr import OCRProcessor
from kleio.models import ContentType  # assumed location; used by get_elements_by_type below

# Initialize the OCR processor
processor = OCRProcessor()

# Process different types of files
# Images (PNG, JPEG, etc.)
doc = processor.process_file("document.png")

# PDFs (with or without text layer)
doc = processor.process_file("document.pdf")

# Word documents
doc = processor.process_file("document.docx")

# Plain text files
doc = processor.process_file("document.txt")

# Access extracted text with layout information
for element in doc.elements:
    print(f"Content: {element.content}")
    print(f"Position: x={element.position.x}, y={element.position.y}")
    print(f"Size: {element.position.width}x{element.position.height}")
    if element.confidence:
        print(f"OCR Confidence: {element.confidence.overall:.2%}")

# Get plain text content
text = doc.get_plain_text()

# Get elements by type
tables = doc.get_elements_by_type(ContentType.TABLE)
text_blocks = doc.get_elements_by_type(ContentType.TEXT)

# Get elements by page
page_1_elements = doc.get_elements_on_page(1)
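
The per-element confidence scores make it easy to flag blocks that may need manual review. A small sketch built on the attributes shown above (the 0.6 threshold is arbitrary):

# Collect low-confidence elements for review
suspect = [
    el for el in doc.elements
    if el.confidence and el.confidence.overall < 0.6
]
for el in suspect:
    print(f"Low confidence ({el.confidence.overall:.2%}): {el.content}")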

Text Chunking and Summarization

from kleio.chunker import TextChunker
from kleio.models import ChunkConfig
from kleio.summarizer import RecursiveSummarizer
import dspy

# Configure DSPy with LLM (supports OpenAI, Anthropic, and local LLMs)
from kleio.llm_config import get_configured_lm

# LLM will be configured based on environment variables:
# For OpenAI:
#   OPENAI_LLM_MODEL="gpt-4o-2024-08-06"
#   OPENAI_API_KEY="your-key"
# For Anthropic:
#   ANTHROPIC_LLM_MODEL="claude-3-opus-20240229"
#   ANTHROPIC_API_KEY="your-key"
# For local LLM:
#   USE_LOCAL_LLM=true
#   LOCAL_LLM_URL="http://localhost:8000/v1"
#   LOCAL_LLM_KEY="hosted_vllm"
#   LOCAL_LLM_MODEL="meta-llama/Llama-3.2-3B-Instruct"

lm = get_configured_lm()
dspy.configure(lm=lm)

# Or configure explicitly:
from kleio.llm_config import configure_lm

# For OpenAI
lm = configure_lm(
    provider='openai',
    model='gpt-4o-2024-08-06',
    key='your-key'
)

# For Anthropic
lm = configure_lm(
    provider='anthropic',
    model='claude-3-opus-20240229',
    key='your-key'
)

# For local LLM
lm = configure_lm(
    provider='local',
    model='meta-llama/Llama-3.2-3B-Instruct',
    url='http://localhost:8000/v1',
    key='hosted_vllm'
)

# Initialize chunker with custom configuration
config = ChunkConfig(
    max_tokens=500,           # Maximum tokens per chunk
    sentence_overlap=2,       # Number of sentences to overlap between chunks
    preserve_sentences=True   # Don't split sentences across chunks
)
chunker = TextChunker(config)

# Chunk your text while preserving semantic meaning
chunked_text = chunker.chunk_text(long_text)

# Access chunks and metadata
for chunk in chunked_text.chunks:
    print(f"Chunk {chunk.sequence_number}:")
    print(f"Content: {chunk.content}")
    print(f"Token count: {chunk.token_count}")
    print(f"Overlaps with chunks: {chunk.overlapped_with}")

# Get statistics
print(f"Total tokens: {chunked_text.total_tokens}")
print(f"Unique tokens: {chunked_text.unique_tokens}")

# Create recursive summarizer
summarizer = RecursiveSummarizer(
    target_summary_length=100,  # Target length for final summary
    min_chunks_to_summarize=2,  # Minimum chunks needed to trigger summarization
    chunk_group_size=3         # Number of chunks to group for summarization
)

# Generate hierarchical summary
summarized = summarizer.summarize(chunked_text)

# Access summary levels (from most detailed to most summarized)
for level in summarized.levels:
    print(f"\nLevel {level.level}:")
    print(f"Contains {len(level.chunks)} chunks")
    print(f"Total tokens: {level.total_tokens}")
    
    # Show summaries for levels above base
    if level.level > 0:
        for chunk in level.chunks:
            print(f"\nSummary: {chunk.content}")
            print(f"Source chunks: {level.source_chunks[chunk.sequence_number-1]}")
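
Because levels run from most detailed to most summarized, the most condensed summary is the last level's content. A one-line sketch (assuming the top level has collapsed to a single chunk, the usual outcome of the recursion):

# Top of the hierarchy: the most condensed summary
final_summary = summarized.levels[-1].chunks[0].content
print(final_summary)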

Text Correction

from kleio.corrector import TextCorrector

# Initialize the corrector (automatically loads example corrections)
corrector = TextCorrector()

# Simple text correction
text, modifications = corrector.correct("Tho quick br0wn f0x")
print(text)  # "The quick brown fox"
print(modifications)  # [("Tho", "The"), ("br0wn", "brown"), ("f0x", "fox")]

# Context-aware correction using document hierarchy
# (chunker and summarizer as configured in the previous section)
doc = chunker.chunk_document(document)
doc = summarizer.summarize(doc)  # Creates context hierarchy
corrected_doc = corrector.batch_correct(doc)

# View corrections with their context
for chunk in corrected_doc.chunks.chunks:
    print(f"\nChunk {chunk.sequence_number}:")
    print(f"Content: {chunk.content}")
    if chunk.metadata.get("modifications"):
        print("Corrections made:")
        for orig, corr in chunk.metadata["modifications"]:
            print(f"  {orig} -> {corr}")
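
The tracked modifications also support quick aggregate checks. A short sketch using only the fields shown above:

# Count corrections across the whole document
total = sum(
    len(chunk.metadata.get("modifications", []))
    for chunk in corrected_doc.chunks.chunks
)
print(f"{total} corrections applied")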

The corrector uses multiple sources to improve accuracy:

  • Example corrections from a curated dataset
  • Document context through summary hierarchy
  • Modification tracking for analysis

Configuration options:

corrector = TextCorrector({
    "corrections_file": "path/to/corrections.jsonl",  # Example corrections
    "model_name": "gpt-4"  # LLM to use
})

Processing Historical Documents

The library provides a complete pipeline for processing historical documents through OCR, chunking, and summarization:

import dspy
from kleio.ocr import OCRProcessor
from kleio.chunker import TextChunker
from kleio.corrector import TextCorrector
from kleio.summarizer import RecursiveSummarizer
from kleio.models import ChunkConfig
from kleio.llm_config import get_configured_lm

# Configure DSPy with environment-based LLM
dspy.configure(lm=get_configured_lm())

# Initialize processors
processor = OCRProcessor()
corrector = TextCorrector()  # needed for batch_correct below
chunker = TextChunker(
    ChunkConfig(
        max_tokens=100,
        preserve_sentences=True,
        sentence_overlap=1
    )
)
summarizer = RecursiveSummarizer(
    target_summary_length=50,
    min_chunks_to_summarize=2,
    chunk_group_size=3
)

# Process document through pipeline
doc = processor.process_file("historical_document.tif")
print(f"Extracted {len(doc.elements)} text elements")

# Apply chunking
doc = chunker.chunk_document(doc)
print(f"Chunked into {len(doc.chunks.chunks)} chunks")

# Generate hierarchical summary for context
doc = summarizer.summarize(doc)
print("\nSummary hierarchy created")

# Apply context-aware OCR correction
doc = corrector.batch_correct(doc)
print("\nCorrections made:")
for chunk in doc.chunks.chunks:
    if chunk.metadata.get("modifications"):
        print(f"\nChunk {chunk.sequence_number}:")
        for orig, corr in chunk.metadata["modifications"]:
            print(f"  {orig} -> {corr}")

# Print summary hierarchy
for level in doc.summary.levels:
    print(f"\nLevel {level.level} ({level.total_tokens} tokens):")
    for i, chunk in enumerate(level.chunks, 1):
        print(f"\nChunk {i}:")
        print(chunk.content)
        print(f"Source chunks: {level.source_chunks[i-1]}")

The pipeline supports various input formats:

  • Images (JPEG, PNG, TIFF)
  • PDFs (with or without text layer)
  • Word documents
  • Plain text files
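
For simple batch runs over a folder of mixed inputs, the supported extensions above can drive a loop like this (the scans/ directory and the str() conversion are assumptions):

from pathlib import Path

SUPPORTED = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".pdf", ".docx", ".txt"}

docs = [
    processor.process_file(str(path))  # str() in case process_file expects a plain path string
    for path in Path("scans").iterdir()
    if path.suffix.lower() in SUPPORTED
]
print(f"Processed {len(docs)} documents")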

Each stage provides detailed logging and analysis:

  • OCR: Per-block confidence scores and position information
  • Chunking: Token counts and overlap tracking
  • Correction: Modification tracking with context awareness
  • Summarization: Hierarchical summaries with provenance tracking
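
The stages compose naturally into a single helper. A sketch under the same assumptions as the pipeline example above (component construction elided):

def process_historical_document(path: str):
    """OCR, chunk, summarize, and correct a document in one pass."""
    doc = processor.process_file(path)
    doc = chunker.chunk_document(doc)
    doc = summarizer.summarize(doc)  # builds the context hierarchy
    return corrector.batch_correct(doc)

corrected = process_historical_document("historical_document.tif")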

Development

This project can be developed either locally using Poetry or with Docker.

Local Setup

  1. Install dependencies:

    poetry install
  2. Run tests:

    poetry run pytest

Docker Setup

  1. Build and start the container:

    # If you have Docker group permissions:
    docker compose build
    docker compose up
    
    # If you need sudo:
    sudo docker compose build
    sudo docker compose up
  2. Run tests in Docker:

    # If you have Docker group permissions:
    docker compose run kleio poetry run pytest
    
    # If you need sudo:
    sudo docker compose run kleio poetry run pytest
  3. Run Python shell in Docker:

    # If you have Docker group permissions:
    docker compose run kleio poetry run python
    
    # If you need sudo:
    sudo docker compose run kleio poetry run python
  4. Override default command:

    # If you have Docker group permissions:
    docker compose run kleio poetry run python your_script.py
    
    # If you need sudo:
    sudo docker compose run kleio poetry run python your_script.py

License

Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0)
