A Python package for correcting OCR and text outputs using Large Language Models (LLMs). Named after Kleio (or Clio), the muse of history, this package helps make historical and scanned texts more accurate and readable.
- Intelligent OCR using Tesseract with automatic mode selection
- OCR text correction using LLMs
- Text normalization and cleaning
- Support for multiple LLM backends
- Historical text processing capabilities
- Batch processing support
- Structured logging with both logfire and standard logging
- Detailed OCR confidence scoring with PSM mode metadata
- Semantic text chunking with configurable overlap
- Recursive text summarization using LLMs
- Map-reduce summarization with provenance tracking
```bash
# Install dependencies
poetry install

# Set up configuration
cp .env.example .env

# Edit .env with your settings:
# - LOGFIRE_SEND_TO_LOGFIRE and LOGFIRE_TOKEN for logfire logging (optional)
# - TESSERACT_CMD for custom tesseract path
# - OCR_LANGUAGE for non-English OCR
# - OCR_DPI for PDF conversion quality
# - KLEIO_CHUNK_MAX_TOKENS and KLEIO_CHUNK_SENTENCE_OVERLAP for chunking
# - KLEIO_SUMMARY_TARGET_LENGTH, KLEIO_SUMMARY_MIN_CHUNKS, and
#   KLEIO_SUMMARY_GROUP_SIZE for summarization
```
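As a rough illustration, settings like these are typically read at startup with `os.getenv`; the variable names below come from the configuration above, while the fallback defaults are assumptions for the sketch:

```python
import os

# Read kleio-related settings from the environment.
# Variable names match .env.example; the defaults here are illustrative only.
ocr_language = os.getenv("OCR_LANGUAGE", "eng")            # Tesseract language code
ocr_dpi = int(os.getenv("OCR_DPI", "300"))                 # DPI for PDF-to-image conversion
chunk_max_tokens = int(os.getenv("KLEIO_CHUNK_MAX_TOKENS", "500"))
chunk_overlap = int(os.getenv("KLEIO_CHUNK_SENTENCE_OVERLAP", "2"))

print(ocr_language, ocr_dpi, chunk_max_tokens, chunk_overlap)
```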
Docker setup is optional and provides an isolated environment with all dependencies pre-configured.
- Install Docker:

  ```bash
  # Ubuntu/Debian
  sudo apt-get update
  sudo apt-get install docker.io docker-compose-plugin

  # macOS
  brew install docker docker-compose

  # Windows
  # Download Docker Desktop from https://www.docker.com/products/docker-desktop
  ```

- Build and run:

  ```bash
  # Build the Docker image
  docker compose build

  # Create a .env file from the example
  cp .env.example .env
  # Edit .env with your settings
  ```
The system supports two logging backends:

- Logfire (optional)
  - Set `LOGFIRE_SEND_TO_LOGFIRE=true` and `LOGFIRE_TOKEN` in `.env`
  - Provides structured logging with Pydantic integration
  - Automatic model validation logging
  - Web dashboard for log analysis
  - Gracefully falls back to standard logging if not configured
  - Robust configuration validation and error handling

- Standard logging (always available)
  - JSON-formatted logs to stdout
  - Includes context and metadata
  - No external dependencies
  - Automatically used when Logfire is disabled or misconfigured
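The standard-logging fallback can be approximated with the stdlib alone; this is a sketch of the idea (JSON-formatted records to stdout), not kleio's actual formatter:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("kleio-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("OCR complete")  # emits {"level": "INFO", "logger": "kleio-demo", "message": "OCR complete"}
```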
```python
from kleio.ocr import OCRProcessor
from kleio.models import ContentType

# Initialize the OCR processor
processor = OCRProcessor()

# Process different types of files

# Images (PNG, JPEG, etc.)
doc = processor.process_file("document.png")

# PDFs (with or without text layer)
doc = processor.process_file("document.pdf")

# Word documents
doc = processor.process_file("document.docx")

# Plain text files
doc = processor.process_file("document.txt")

# Access extracted text with layout information
for element in doc.elements:
    print(f"Content: {element.content}")
    print(f"Position: x={element.position.x}, y={element.position.y}")
    print(f"Size: {element.position.width}x{element.position.height}")
    if element.confidence:
        print(f"OCR Confidence: {element.confidence.overall:.2%}")

# Get plain text content
text = doc.get_plain_text()

# Get elements by type
tables = doc.get_elements_by_type(ContentType.TABLE)
text_blocks = doc.get_elements_by_type(ContentType.TEXT)

# Get elements by page
page_1_elements = doc.get_elements_on_page(1)
```
```python
from kleio.chunker import TextChunker
from kleio.models import ChunkConfig
from kleio.summarizer import RecursiveSummarizer
import dspy

# Configure DSPy with an LLM (supports OpenAI, Anthropic, and local LLMs)
from kleio.llm_config import get_configured_lm

# The LLM is configured from environment variables:
# For OpenAI:
#   OPENAI_LLM_MODEL="gpt-4o-2024-08-06"
#   OPENAI_API_KEY="your-key"
# For Anthropic:
#   ANTHROPIC_LLM_MODEL="claude-3-opus-20240229"
#   ANTHROPIC_API_KEY="your-key"
# For a local LLM:
#   USE_LOCAL_LLM=true
#   LOCAL_LLM_URL="http://localhost:8000/v1"
#   LOCAL_LLM_KEY="hosted_vllm"
#   LOCAL_LLM_MODEL="meta-llama/Llama-3.2-3B-Instruct"
lm = get_configured_lm()
dspy.configure(lm=lm)

# Or configure explicitly:
from kleio.llm_config import configure_lm

# For OpenAI
lm = configure_lm(
    provider='openai',
    model='gpt-4o-2024-08-06',
    key='your-key'
)

# For Anthropic
lm = configure_lm(
    provider='anthropic',
    model='claude-3-opus-20240229',
    key='your-key'
)

# For a local LLM
lm = configure_lm(
    provider='local',
    model='meta-llama/Llama-3.2-3B-Instruct',
    url='http://localhost:8000/v1',
    key='hosted_vllm'
)

# Initialize the chunker with a custom configuration
config = ChunkConfig(
    max_tokens=500,          # Maximum tokens per chunk
    sentence_overlap=2,      # Number of sentences to overlap between chunks
    preserve_sentences=True  # Don't split sentences across chunks
)
chunker = TextChunker(config)

# Chunk your text while preserving semantic meaning
chunked_text = chunker.chunk_text(long_text)

# Access chunks and metadata
for chunk in chunked_text.chunks:
    print(f"Chunk {chunk.sequence_number}:")
    print(f"Content: {chunk.content}")
    print(f"Token count: {chunk.token_count}")
    print(f"Overlaps with chunks: {chunk.overlapped_with}")

# Get statistics
print(f"Total tokens: {chunked_text.total_tokens}")
print(f"Unique tokens: {chunked_text.unique_tokens}")

# Create a recursive summarizer
summarizer = RecursiveSummarizer(
    target_summary_length=100,   # Target length for the final summary
    min_chunks_to_summarize=2,   # Minimum chunks needed to trigger summarization
    chunk_group_size=3           # Number of chunks to group per summary
)

# Generate a hierarchical summary
summarized = summarizer.summarize(chunked_text)

# Access summary levels (from most detailed to most summarized)
for level in summarized.levels:
    print(f"\nLevel {level.level}:")
    print(f"Contains {len(level.chunks)} chunks")
    print(f"Total tokens: {level.total_tokens}")
    # Show summaries for levels above the base
    if level.level > 0:
        for chunk in level.chunks:
            print(f"\nSummary: {chunk.content}")
            print(f"Source chunks: {level.source_chunks[chunk.sequence_number - 1]}")
```
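For intuition, a sentence-preserving overlap strategy can be sketched in plain Python. This illustrates the general idea only, not kleio's implementation: the whitespace token count and regex sentence split are simplifying assumptions.

```python
import re

def chunk_sentences(text: str, max_tokens: int = 50, overlap: int = 1) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens
    whitespace tokens, carrying `overlap` trailing sentences into the
    next chunk for context."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sentence in sentences:
        tokens = sum(len(s.split()) for s in current) + len(sentence.split())
        if current and tokens > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []  # carry overlap forward
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

print(chunk_sentences("A b c. D e f. G h i. J k l.", max_tokens=6, overlap=1))
# Each chunk ends on a sentence boundary and repeats the previous chunk's last sentence.
```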
```python
from kleio.corrector import TextCorrector

# Initialize the corrector (automatically loads example corrections)
corrector = TextCorrector()

# Simple text correction
text, modifications = corrector.correct("Tho quick br0wn f0x")
print(text)           # "The quick brown fox"
print(modifications)  # [("Tho", "The"), ("br0wn", "brown")]

# Context-aware correction using the document hierarchy
doc = chunker.chunk_document(document)
doc = summarizer.summarize(doc)  # Creates the context hierarchy
corrected_doc = corrector.batch_correct(doc)

# View corrections with their context
for chunk in corrected_doc.chunks.chunks:
    print(f"\nChunk {chunk.sequence_number}:")
    print(f"Content: {chunk.content}")
    if chunk.metadata.get("modifications"):
        print("Corrections made:")
        for orig, corr in chunk.metadata["modifications"]:
            print(f"  {orig} -> {corr}")
```
The corrector uses multiple sources to improve accuracy:
- Example corrections from a curated dataset
- Document context through summary hierarchy
- Modification tracking for analysis
Configuration options:

```python
corrector = TextCorrector({
    "corrections_file": "path/to/corrections.jsonl",  # Example corrections
    "model_name": "gpt-4"                             # LLM to use
})
```
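One plausible shape for the corrections file is the JSONL convention: one JSON object per line pairing an OCR error with its fix. The field names `ocr` and `corrected` below are hypothetical, chosen for illustration; check the bundled dataset for the fields kleio actually expects.

```python
import json

# Hypothetical corrections.jsonl entries; field names are illustrative only.
examples = [
    {"ocr": "Tho quick br0wn f0x", "corrected": "The quick brown fox"},
    {"ocr": "hlstorical", "corrected": "historical"},
]

# Write one JSON object per line (JSONL)
with open("corrections.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Read them back
with open("corrections.jsonl") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded))  # 2
```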
The library provides a complete pipeline for processing historical documents through OCR, chunking, and summarization:
```python
import dspy

from kleio.ocr import OCRProcessor
from kleio.chunker import TextChunker
from kleio.summarizer import RecursiveSummarizer
from kleio.corrector import TextCorrector
from kleio.models import ChunkConfig
from kleio.llm_config import get_configured_lm

# Configure DSPy with the environment-based LLM
dspy.settings.configure(lm=get_configured_lm())

# Initialize processors
processor = OCRProcessor()
chunker = TextChunker(
    ChunkConfig(
        max_tokens=100,
        preserve_sentences=True,
        sentence_overlap=1
    )
)
summarizer = RecursiveSummarizer(
    target_summary_length=50,
    min_chunks_to_summarize=2,
    chunk_group_size=3
)
corrector = TextCorrector()

# Process the document through the pipeline
doc = processor.process_file("historical_document.tif")
print(f"Extracted {len(doc.elements)} text elements")

# Apply chunking
doc = chunker.chunk_document(doc)
print(f"Chunked into {len(doc.chunks.chunks)} chunks")

# Generate a hierarchical summary for context
doc = summarizer.summarize(doc)
print("\nSummary hierarchy created")

# Apply context-aware OCR correction
doc = corrector.batch_correct(doc)
print("\nCorrections made:")
for chunk in doc.chunks.chunks:
    if chunk.metadata.get("modifications"):
        print(f"\nChunk {chunk.sequence_number}:")
        for orig, corr in chunk.metadata["modifications"]:
            print(f"  {orig} -> {corr}")

# Print the summary hierarchy
for level in doc.summary.levels:
    print(f"\nLevel {level.level} ({level.total_tokens} tokens):")
    for i, chunk in enumerate(level.chunks, 1):
        print(f"\nChunk {i}:")
        print(chunk.content)
        print(f"Source chunks: {level.source_chunks[i - 1]}")
```
The pipeline supports various input formats:
- Images (JPEG, PNG, TIFF)
- PDFs (with or without text layer)
- Word documents
- Plain text files
Each stage provides detailed logging and analysis:
- OCR: Per-block confidence scores and position information
- Chunking: Token counts and overlap tracking
- Correction: Modification tracking with context awareness
- Summarization: Hierarchical summaries with provenance tracking
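The map-reduce summarization with provenance tracking can be sketched abstractly: group chunks, summarize each group, record which source chunks fed each summary, and repeat until few enough chunks remain. The `summarize` stub below stands in for an LLM call and is purely illustrative, not kleio's implementation:

```python
def summarize(texts: list[str]) -> str:
    """Stand-in for an LLM summarization call."""
    return " / ".join(t[:20] for t in texts)

def map_reduce(chunks: list[str], group_size: int = 3, min_chunks: int = 2):
    """Build summary levels until a level has fewer than min_chunks chunks.
    Each level records the indices of the previous-level chunks behind
    each summary (its provenance)."""
    levels = [{"chunks": chunks, "source_chunks": None}]
    while len(levels[-1]["chunks"]) >= min_chunks:
        prev = levels[-1]["chunks"]
        groups = [prev[i:i + group_size] for i in range(0, len(prev), group_size)]
        summaries = [summarize(g) for g in groups]
        provenance = [list(range(i, min(i + group_size, len(prev))))
                      for i in range(0, len(prev), group_size)]
        levels.append({"chunks": summaries, "source_chunks": provenance})
    return levels

levels = map_reduce(["a", "b", "c", "d"], group_size=3, min_chunks=2)
for level in levels:
    print(level["chunks"], level["source_chunks"])
```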
This project can be developed either locally using Poetry or with Docker.
- Install dependencies:

  ```bash
  poetry install
  ```

- Run tests:

  ```bash
  poetry run pytest
  ```

- Build and start the container:

  ```bash
  # If you have Docker group permissions:
  docker compose build
  docker compose up

  # If you need sudo:
  sudo docker compose build
  sudo docker compose up
  ```

- Run tests in Docker:

  ```bash
  # If you have Docker group permissions:
  docker compose run kleio poetry run pytest

  # If you need sudo:
  sudo docker compose run kleio poetry run pytest
  ```

- Run a Python shell in Docker:

  ```bash
  # If you have Docker group permissions:
  docker compose run kleio poetry run python

  # If you need sudo:
  sudo docker compose run kleio poetry run python
  ```

- Override the default command:

  ```bash
  # If you have Docker group permissions:
  docker compose run kleio poetry run python your_script.py

  # If you need sudo:
  sudo docker compose run kleio poetry run python your_script.py
  ```
Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0)