
Kleio

A Python package for correcting OCR and text outputs using Large Language Models (LLMs). Named after Kleio (or Clio), the muse of history, it helps make historical and scanned texts more accurate and readable.


Features

  • Intelligent OCR using Tesseract with automatic mode selection
  • OCR text correction using LLMs
  • Text normalization and cleaning
  • Support for multiple LLM backends
  • Historical text processing capabilities
  • Batch processing support
  • Structured logging with both Logfire and standard logging
  • Detailed OCR confidence scoring with PSM mode metadata
  • Semantic text chunking with configurable overlap
  • Recursive text summarization using LLMs
  • Map-reduce summarization with provenance tracking

Installation

Using Poetry (Local Development)

# Install dependencies
poetry install

# Set up configuration
cp .env.example .env
# Edit .env with your settings:
# - LOGFIRE_SEND_TO_LOGFIRE and LOGFIRE_TOKEN for logfire logging (optional)
# - TESSERACT_CMD for custom tesseract path
# - OCR_LANGUAGE for non-English OCR
# - OCR_DPI for PDF conversion quality
# - KLEIO_CHUNK_MAX_TOKENS and KLEIO_CHUNK_SENTENCE_OVERLAP for chunking
# - KLEIO_SUMMARY_TARGET_LENGTH, KLEIO_SUMMARY_MIN_CHUNKS, and 
#   KLEIO_SUMMARY_GROUP_SIZE for summarization
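
The variables above can also be inspected from Python before running anything. A minimal sketch using python-dotenv (an assumption; kleio may load its configuration through its own settings module):

from dotenv import load_dotenv  # pip install python-dotenv
import os

load_dotenv()  # read .env from the current directory
print(os.getenv("OCR_LANGUAGE", "eng"))     # the "eng" default here is illustrative
print(os.getenv("KLEIO_CHUNK_MAX_TOKENS"))  # None if unset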

Using Docker (Optional)

Docker setup is optional and provides an isolated environment with all dependencies pre-configured.

  1. Install Docker:

    # Ubuntu/Debian
    sudo apt-get update
    sudo apt-get install docker.io docker-compose-plugin
    
    # macOS (installs Docker Desktop, which bundles Compose)
    brew install --cask docker
    
    # Windows
    # Download Docker Desktop from https://www.docker.com/products/docker-desktop
  2. Build and configure:

    # Build the Docker image
    docker compose build
    
    # Create a .env file from the example
    cp .env.example .env
    # Edit .env with your settings

Logging Configuration

The system supports two logging backends:

  1. Logfire (Optional)

    • Set LOGFIRE_SEND_TO_LOGFIRE=true and LOGFIRE_TOKEN in .env
    • Provides structured logging with Pydantic integration
    • Automatic model validation logging
    • Web dashboard for log analysis
    • Gracefully falls back to standard logging if not configured
    • Robust configuration validation and error handling
  2. Standard Logging (Always available)

    • JSON-formatted logs to stdout
    • Includes context and metadata
    • No external dependencies
    • Automatically used when Logfire is disabled or misconfigured
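
For reference, a stdlib-only JSON logger along these lines can serve as such a fallback (an illustration only; kleio's actual formatter and field names may differ):

import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])
logging.getLogger("kleio").info("standard logging fallback active")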

Usage

OCR Processing

from kleio.ocr import OCRProcessor
from kleio.models import ContentType  # assumed location; used by get_elements_by_type below

# Initialize the OCR processor
processor = OCRProcessor()

# Process different types of files
# Images (PNG, JPEG, etc.)
doc = processor.process_file("document.png")

# PDFs (with or without text layer)
doc = processor.process_file("document.pdf")

# Word documents
doc = processor.process_file("document.docx")

# Plain text files
doc = processor.process_file("document.txt")

# Access extracted text with layout information
for element in doc.elements:
    print(f"Content: {element.content}")
    print(f"Position: x={element.position.x}, y={element.position.y}")
    print(f"Size: {element.position.width}x{element.position.height}")
    if element.confidence:
        print(f"OCR Confidence: {element.confidence.overall:.2%}")

# Get plain text content
text = doc.get_plain_text()

# Get elements by type
tables = doc.get_elements_by_type(ContentType.TABLE)
text_blocks = doc.get_elements_by_type(ContentType.TEXT)

# Get elements by page
page_1_elements = doc.get_elements_on_page(1)
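
The per-element confidence scores make it easy to flag blocks that may need manual review. A small sketch built on the attributes shown above (the 0.6 threshold is arbitrary):

# Collect low-confidence elements for review
suspect = [
    el for el in doc.elements
    if el.confidence and el.confidence.overall < 0.6
]
for el in suspect:
    print(f"Low confidence ({el.confidence.overall:.2%}): {el.content}")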

Text Chunking and Summarization

from kleio.chunker import TextChunker
from kleio.models import ChunkConfig
from kleio.summarizer import RecursiveSummarizer
import dspy

# Configure DSPy with LLM (supports OpenAI, Anthropic, and local LLMs)
from kleio.llm_config import get_configured_lm

# LLM will be configured based on environment variables:
# For OpenAI:
#   OPENAI_LLM_MODEL="gpt-4o-2024-08-06"
#   OPENAI_API_KEY="your-key"
# For Anthropic:
#   ANTHROPIC_LLM_MODEL="claude-3-opus-20240229"
#   ANTHROPIC_API_KEY="your-key"
# For local LLM:
#   USE_LOCAL_LLM=true
#   LOCAL_LLM_URL="http://localhost:8000/v1"
#   LOCAL_LLM_KEY="hosted_vllm"
#   LOCAL_LLM_MODEL="meta-llama/Llama-3.2-3B-Instruct"

lm = get_configured_lm()
dspy.configure(lm=lm)

# Or configure explicitly:
from kleio.llm_config import configure_lm

# For OpenAI
lm = configure_lm(
    provider='openai',
    model='gpt-4o-2024-08-06',
    key='your-key'
)

# For Anthropic
lm = configure_lm(
    provider='anthropic',
    model='claude-3-opus-20240229',
    key='your-key'
)

# For local LLM
lm = configure_lm(
    provider='local',
    model='meta-llama/Llama-3.2-3B-Instruct',
    url='http://localhost:8000/v1',
    key='hosted_vllm'
)

# Initialize chunker with custom configuration
config = ChunkConfig(
    max_tokens=500,           # Maximum tokens per chunk
    sentence_overlap=2,       # Number of sentences to overlap between chunks
    preserve_sentences=True   # Don't split sentences across chunks
)
chunker = TextChunker(config)

# Chunk your text while preserving semantic meaning
chunked_text = chunker.chunk_text(long_text)

# Access chunks and metadata
for chunk in chunked_text.chunks:
    print(f"Chunk {chunk.sequence_number}:")
    print(f"Content: {chunk.content}")
    print(f"Token count: {chunk.token_count}")
    print(f"Overlaps with chunks: {chunk.overlapped_with}")

# Get statistics
print(f"Total tokens: {chunked_text.total_tokens}")
print(f"Unique tokens: {chunked_text.unique_tokens}")

# Create recursive summarizer
summarizer = RecursiveSummarizer(
    target_summary_length=100,  # Target length for final summary
    min_chunks_to_summarize=2,  # Minimum chunks needed to trigger summarization
    chunk_group_size=3         # Number of chunks to group for summarization
)

# Generate hierarchical summary
summarized = summarizer.summarize(chunked_text)

# Access summary levels (from most detailed to most summarized)
for level in summarized.levels:
    print(f"\nLevel {level.level}:")
    print(f"Contains {len(level.chunks)} chunks")
    print(f"Total tokens: {level.total_tokens}")
    
    # Show summaries for levels above base
    if level.level > 0:
        for chunk in level.chunks:
            print(f"\nSummary: {chunk.content}")
            print(f"Source chunks: {level.source_chunks[chunk.sequence_number-1]}")
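
Because levels run from most detailed to most summarized, the most condensed summary is the last level's content. A one-line sketch (assuming the top level has collapsed to a single chunk, the usual outcome of the recursion):

# Top of the hierarchy: the most condensed summary
final_summary = summarized.levels[-1].chunks[0].content
print(final_summary)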

Text Correction

from kleio.corrector import TextCorrector

# Initialize the corrector (automatically loads example corrections)
corrector = TextCorrector()

# Simple text correction
text, modifications = corrector.correct("Tho quick br0wn f0x")
print(text)  # "The quick brown fox"
print(modifications)  # [("Tho", "The"), ("br0wn", "brown"), ("f0x", "fox")]

# Context-aware correction using document hierarchy
# (chunker and summarizer as configured in the previous section)
doc = chunker.chunk_document(document)
doc = summarizer.summarize(doc)  # Creates context hierarchy
corrected_doc = corrector.batch_correct(doc)

# View corrections with their context
for chunk in corrected_doc.chunks.chunks:
    print(f"\nChunk {chunk.sequence_number}:")
    print(f"Content: {chunk.content}")
    if chunk.metadata.get("modifications"):
        print("Corrections made:")
        for orig, corr in chunk.metadata["modifications"]:
            print(f"  {orig} -> {corr}")
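
The tracked modifications also support quick aggregate checks. A short sketch using only the fields shown above:

# Count corrections across the whole document
total = sum(
    len(chunk.metadata.get("modifications", []))
    for chunk in corrected_doc.chunks.chunks
)
print(f"{total} corrections applied")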

The corrector uses multiple sources to improve accuracy:

  • Example corrections from a curated dataset
  • Document context through summary hierarchy
  • Modification tracking for analysis

Configuration options:

corrector = TextCorrector({
    "corrections_file": "path/to/corrections.jsonl",  # Example corrections
    "model_name": "gpt-4"  # LLM to use
})

Processing Historical Documents

The library provides a complete pipeline for processing historical documents through OCR, chunking, and summarization:

import dspy
from kleio.ocr import OCRProcessor
from kleio.chunker import TextChunker
from kleio.corrector import TextCorrector
from kleio.summarizer import RecursiveSummarizer
from kleio.models import ChunkConfig
from kleio.llm_config import get_configured_lm

# Configure DSPy with environment-based LLM
dspy.configure(lm=get_configured_lm())

# Initialize processors
processor = OCRProcessor()
corrector = TextCorrector()  # needed for batch_correct below
chunker = TextChunker(
    ChunkConfig(
        max_tokens=100,
        preserve_sentences=True,
        sentence_overlap=1
    )
)
summarizer = RecursiveSummarizer(
    target_summary_length=50,
    min_chunks_to_summarize=2,
    chunk_group_size=3
)

# Process document through pipeline
doc = processor.process_file("historical_document.tif")
print(f"Extracted {len(doc.elements)} text elements")

# Apply chunking
doc = chunker.chunk_document(doc)
print(f"Chunked into {len(doc.chunks.chunks)} chunks")

# Generate hierarchical summary for context
doc = summarizer.summarize(doc)
print("\nSummary hierarchy created")

# Apply context-aware OCR correction
doc = corrector.batch_correct(doc)
print("\nCorrections made:")
for chunk in doc.chunks.chunks:
    if chunk.metadata.get("modifications"):
        print(f"\nChunk {chunk.sequence_number}:")
        for orig, corr in chunk.metadata["modifications"]:
            print(f"  {orig} -> {corr}")

# Print summary hierarchy
for level in doc.summary.levels:
    print(f"\nLevel {level.level} ({level.total_tokens} tokens):")
    for i, chunk in enumerate(level.chunks, 1):
        print(f"\nChunk {i}:")
        print(chunk.content)
        print(f"Source chunks: {level.source_chunks[i-1]}")

The pipeline supports various input formats:

  • Images (JPEG, PNG, TIFF)
  • PDFs (with or without text layer)
  • Word documents
  • Plain text files
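
For simple batch runs over a folder of mixed inputs, the supported extensions above can drive a loop like this (the scans/ directory and the str() conversion are assumptions):

from pathlib import Path

SUPPORTED = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".pdf", ".docx", ".txt"}

docs = [
    processor.process_file(str(path))  # str() in case process_file expects a plain path string
    for path in Path("scans").iterdir()
    if path.suffix.lower() in SUPPORTED
]
print(f"Processed {len(docs)} documents")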

Each stage provides detailed logging and analysis:

  • OCR: Per-block confidence scores and position information
  • Chunking: Token counts and overlap tracking
  • Correction: Modification tracking with context awareness
  • Summarization: Hierarchical summaries with provenance tracking
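
The stages compose naturally into a single helper. A sketch under the same assumptions as the pipeline example above (component construction elided):

def process_historical_document(path: str):
    """OCR, chunk, summarize, and correct a document in one pass."""
    doc = processor.process_file(path)
    doc = chunker.chunk_document(doc)
    doc = summarizer.summarize(doc)  # builds the context hierarchy
    return corrector.batch_correct(doc)

corrected = process_historical_document("historical_document.tif")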

Development

This project can be developed either locally using Poetry or with Docker.

Local Setup

  1. Install dependencies:

    poetry install
  2. Run tests:

    poetry run pytest

Docker Setup

  1. Build and start the container:

    # If you have Docker group permissions:
    docker compose build
    docker compose up
    
    # If you need sudo:
    sudo docker compose build
    sudo docker compose up
  2. Run tests in Docker:

    # If you have Docker group permissions:
    docker compose run kleio poetry run pytest
    
    # If you need sudo:
    sudo docker compose run kleio poetry run pytest
  3. Run Python shell in Docker:

    # If you have Docker group permissions:
    docker compose run kleio poetry run python
    
    # If you need sudo:
    sudo docker compose run kleio poetry run python
  4. Override default command:

    # If you have Docker group permissions:
    docker compose run kleio poetry run python your_script.py
    
    # If you need sudo:
    sudo docker compose run kleio poetry run python your_script.py

License

Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0)
