Kreuzberg

A document intelligence framework for Python. Extract text, metadata, and structured information from diverse document formats through a unified, extensible API. Built on established open source foundations including Pandoc, PDFium, and Tesseract.

📖 Complete Documentation

Framework Overview

Document Intelligence Capabilities

Text Extraction: High-fidelity text extraction preserving document structure and formatting
Metadata Extraction: Comprehensive metadata including author, creation date, language, and document properties
Format Support: 18 document types including PDF, Microsoft Office, images, HTML, and structured data formats
OCR Integration: Multiple OCR engines (Tesseract, EasyOCR, PaddleOCR) with automatic fallback
Table Detection: Structured table extraction with cell-level precision via GMFT integration
Document Classification: Automatic document type detection (contracts, forms, invoices, receipts, reports)

Technical Architecture

Performance: Highest throughput among Python document processing frameworks (30+ docs/second)
Resource Efficiency: 71MB installation, ~360MB runtime memory footprint
Extensibility: Plugin architecture for custom extractors via the Extractor base class
API Design: Synchronous and asynchronous APIs with consistent interfaces
Type Safety: Complete type annotations throughout the codebase

Open Source Foundation

Kreuzberg leverages established open source technologies:

Pandoc: Universal document converter for robust format support
PDFium: Google's PDF rendering engine for accurate PDF processing
Tesseract: Google's OCR engine for text recognition
Python-docx/pptx: Native Microsoft Office format support

Quick Start

Extract Text with CLI

# Extract text from any file to text format
uvx kreuzberg extract document.pdf > output.txt

# With all features (OCR, table extraction, etc.)
uvx --from "kreuzberg[all]" kreuzberg extract invoice.pdf --ocr-backend tesseract --output-format text

# Extract with rich metadata
uvx kreuzberg extract report.pdf --show-metadata --output-format json

Python Usage

Async (recommended for web apps):

from kreuzberg import extract_file

# In your async function
result = await extract_file("presentation.pptx")
print(result.content)

# Rich metadata extraction
print(f"Title: {result.metadata.title}")
print(f"Author: {result.metadata.author}")
print(f"Page count: {result.metadata.page_count}")
print(f"Created: {result.metadata.created_at}")

Sync (for scripts and CLI tools):

from kreuzberg import extract_file_sync

result = extract_file_sync("report.docx")
print(result.content)

# Access rich metadata
print(f"Language: {result.metadata.language}")
print(f"Word count: {result.metadata.word_count}")
print(f"Keywords: {result.metadata.keywords}")

Docker

# Run the REST API
docker run -p 8000:8000 goldziher/kreuzberg

# Extract via API
curl -X POST -F "file=@document.pdf" http://localhost:8000/extract

📖 Installation Guide • CLI Documentation • API Reference

Deployment Options

🤖 MCP Server (AI Integration)

Add to Claude Desktop with one command:

claude mcp add kreuzberg uvx -- --from "kreuzberg[all]" kreuzberg-mcp

Or configure manually in claude_desktop_config.json:

{
  "mcpServers": {
    "kreuzberg": {
      "command": "uvx",
      "args": ["--from", "kreuzberg[all]", "kreuzberg-mcp"]
    }
  }
}

MCP capabilities:

Extract text from PDFs, images, Office docs, and more
Full OCR support with multiple engines
Table extraction and metadata parsing

📖 MCP Documentation

Supported Formats

Category	Formats
Documents	PDF, DOCX, DOC, RTF, TXT, EPUB
Images	JPG, PNG, TIFF, BMP, GIF, WEBP
Spreadsheets	XLSX, XLS, CSV, ODS
Presentations	PPTX, PPT, ODP
Web	HTML, XML, MHTML
Archives	Support via extraction

📊 Performance Characteristics

View comprehensive benchmarks • Benchmark methodology • Detailed Analysis

Technical Specifications

Metric	Kreuzberg Sync	Kreuzberg Async	Benchmarked
Throughput (tiny files)	31.78 files/s	23.94 files/s	Highest throughput
Throughput (small files)	8.91 files/s	9.31 files/s	Highest throughput
Memory footprint	359.8 MB	395.2 MB	Lowest usage
Installation size	71 MB	71 MB	Smallest size
Success rate	100%	100%	Perfect
Supported formats	18	18	Comprehensive

Architecture Advantages

Native C extensions: Built on PDFium and Tesseract for maximum performance
Async/await support: True asynchronous processing with intelligent task scheduling
Memory efficiency: Streaming architecture minimizes memory allocation
Process pooling: Automatic multiprocessing for CPU-intensive operations
Optimized data flow: Efficient data handling with minimal transformations

Benchmark details: Tests include PDFs, Word docs, HTML, images, and spreadsheets in multiple languages (English, Hebrew, German, Chinese, Japanese, Korean) on standardized hardware.

Documentation

License

MIT License - see LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 486 Commits
.docker		.docker
.github		.github
benchmarks		benchmarks
docs		docs
kreuzberg		kreuzberg
tests		tests
.commitlintrc		.commitlintrc
.deepsource.toml		.deepsource.toml
.dockerignore		.dockerignore
.gitignore		.gitignore
.markdownlint.yaml		.markdownlint.yaml
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
README.md		README.md
ai-rulez.yaml		ai-rulez.yaml
mkdocs.yaml		mkdocs.yaml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Kreuzberg

Framework Overview

Document Intelligence Capabilities

Technical Architecture

Open Source Foundation

Quick Start

Extract Text with CLI

Python Usage

Docker

Deployment Options

🤖 MCP Server (AI Integration)

Supported Formats

📊 Performance Characteristics

Technical Specifications

Architecture Advantages

Documentation

Quick Links

License

About

Uh oh!

Releases 43

Packages

Uh oh!

Contributors 12

Languages

License

Goldziher/kreuzberg

Folders and files

Latest commit

History

Repository files navigation

Kreuzberg

Framework Overview

Document Intelligence Capabilities

Technical Architecture

Open Source Foundation

Quick Start

Extract Text with CLI

Python Usage

Docker

Deployment Options

🤖 MCP Server (AI Integration)

Supported Formats

📊 Performance Characteristics

Technical Specifications

Architecture Advantages

Documentation

Quick Links

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 43

Packages 0

Uh oh!

Contributors 12

Languages

Packages