Redoc is a modular document conversion framework that converts between PDF, HTML, XML, JSON, DOCX, and EPUB. It includes OCR, AI-assisted content generation with the mistral:7b model served by Ollama, and a bidirectional template system that generates documents from data and extracts data from documents.
- Multi-format Support: Bidirectional conversion between PDF, HTML, XML, JSON, DOCX, and EPUB
- Template System: JSON+HTML templates for dynamic document generation with bidirectional support
- OCR Integration: Extract text from scanned documents and images with Tesseract OCR
- AI-Powered: Content generation and processing with the mistral:7b model served by Ollama
- Bidirectional Processing: Convert documents to data and back with templates
- Batch Processing: Process multiple documents efficiently with parallel execution
- Template Variables: Support for dynamic content and conditional rendering
- Validation: Built-in data validation with Pydantic models
- Extensible Architecture: Plugin system for custom formats and processors
- Asynchronous Processing: Non-blocking operations for high performance
- Web Interface: Modern UI for document conversion and management
- Comprehensive API: Clean, well-documented Python API
- Command Line Interface: Intuitive CLI for quick conversions
- Interactive Shell: Built-in Python shell for exploration and debugging
- Logging & Debugging: Configurable logging and error reporting
- Type Hints: Full type annotations for better IDE support
- Docker Support: Containerized deployment with Docker and Docker Compose
- REST API: Built with FastAPI for easy integration
- Security: Input validation, sanitization, and secure defaults
- Monitoring: Built-in metrics and health checks
# Install the latest stable version
pip install redoc
# Install with all optional dependencies
pip install "redoc[all]"
# Or install specific components
pip install "redoc[cli]" # Command line interface
pip install "redoc[server]" # Web server and API
pip install "redoc[ai]" # AI features (requires Ollama)
pip install "redoc[ocr]" # OCR capabilities (Tesseract)
pip install "redoc[templates]" # Pre-built templates
# Pull the latest image
docker pull text2doc/redoc:latest
# Run a conversion
docker run -v "$(pwd)":/data text2doc/redoc convert input.pdf output.html
# Start the web interface
docker run -p 8000:8000 -v "$(pwd)/templates":/app/templates text2doc/redoc serve
git clone https://github.com/text2doc/redoc.git
cd redoc
pip install -e ".[dev]" # Install in development mode with all dependencies
pre-commit install # Install git hooks
# Convert a document
redoc convert input.pdf output.html
# Convert with a template
redoc convert --template invoice.html data.json invoice.pdf
# Start interactive shell
redoc shell
# Start web server
redoc serve
from redoc import Redoc
# Initialize with default settings
converter = Redoc()
# Convert between formats
converter.convert('document.pdf', 'document.html') # PDF to HTML
converter.convert('data.json', 'report.pdf') # JSON to PDF with template
# Process multiple files
converter.batch_convert(
    input_glob='invoices/*.json',
    output_dir='output/',
    output_format='pdf',
    template='invoice.html'
)
# Extract data from documents
data = converter.extract_data('document.pdf', 'invoice_schema.json')
# Generate documents from templates
converter.generate_document(
    template='invoice.html',
    data='data.json',
    output='invoice.pdf'
)
# Use the interactive shell
converter.shell()
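The batch_convert call above maps a glob of inputs onto one output format. How Redoc schedules that work internally is not described here, so the following is only a minimal sketch of the general fan-out pattern using the standard library. The function convert_one is a hypothetical stand-in for a single conversion and merely computes the target path.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def convert_one(src: Path, output_dir: Path, output_format: str) -> Path:
    """Hypothetical stand-in for one conversion: returns the target path only."""
    return output_dir / f"{src.stem}.{output_format}"


def batch_convert(sources, output_dir, output_format, max_workers=4):
    """Fan the inputs out over a thread pool; results keep the input order."""
    out = Path(output_dir)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(lambda s: convert_one(Path(s), out, output_format), sources)
        )


if __name__ == "__main__":
    for target in batch_convert(["invoices/a.json", "invoices/b.json"], "output", "pdf"):
        print(target)
```

A thread pool suits conversions that mostly wait on I/O or external tools; CPU-bound work would more likely call for a process pool.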
# Show help
redoc --help
# Convert a document
redoc convert input.pdf output.html
redoc convert --template invoice.html data.json invoice.pdf
# Start interactive shell
redoc shell
# Start web server
redoc serve --host 0.0.0.0 --port 8000
# Process multiple files
redoc batch "documents/*.pdf" --format html --output-dir html_output
from redoc import Redoc
converter = Redoc()
# Simple template with variables
template = {
    "template": "invoice.html",
    "data": {
        "invoice": {
            "number": "INV-2023-001",
            "date": "2023-11-15",
            "items": [
                {"description": "Web Design", "quantity": 10, "price": 100},
                {"description": "Hosting", "quantity": 1, "price": 50}
            ]
        }
    }
}
# Generate PDF from template
converter.convert(template, 'pdf', output_file='invoice.pdf')
# Extract data from document
data = converter.extract_data('invoice.pdf', template='invoice_template.html')
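The invoice data above carries a quantity and a price per item but no totals. If a template needs them, they can be derived before rendering. This is a sketch only: the line_total and total keys are assumptions for illustration, not documented Redoc template variables.

```python
from decimal import Decimal


def with_totals(invoice: dict) -> dict:
    """Return a copy of the invoice with per-line and grand totals added."""
    items = []
    for item in invoice["items"]:
        # Go through str() so float prices do not carry binary rounding noise.
        line_total = Decimal(str(item["quantity"])) * Decimal(str(item["price"]))
        items.append({**item, "line_total": line_total})
    total = sum((i["line_total"] for i in items), Decimal("0"))
    return {**invoice, "items": items, "total": total}


invoice = {
    "number": "INV-2023-001",
    "date": "2023-11-15",
    "items": [
        {"description": "Web Design", "quantity": 10, "price": 100},
        {"description": "Hosting", "quantity": 1, "price": 50},
    ],
}
print(with_totals(invoice)["total"])  # 1050
```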
From \ To | PDF | HTML | XML | JSON | DOCX | EPUB |
---|---|---|---|---|---|---|
PDF | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
HTML | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
XML | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
JSON | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
DOCX | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
EPUB | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
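The conversion matrix covers six formats, and the earlier examples pass plain file names such as document.pdf and document.html. One plausible way to turn two file names into a source/target pair is to read the extensions. Whether Redoc detects formats this way is an assumption; conversion_pair is a hypothetical helper.

```python
from pathlib import Path

# Formats listed in the conversion matrix.
FORMATS = {"pdf", "html", "xml", "json", "docx", "epub"}


def conversion_pair(src: str, dst: str) -> tuple:
    """Infer (source, target) formats from file extensions, or raise ValueError."""
    pair = tuple(Path(p).suffix.lower().lstrip(".") for p in (src, dst))
    for fmt in pair:
        if fmt not in FORMATS:
            raise ValueError(f"unsupported format: {fmt!r}")
    return pair


print(conversion_pair("document.pdf", "document.html"))  # ('pdf', 'html')
```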
- PDF Generation: High-quality PDF output with support for headers, footers, and page numbers
- HTML Processing: Clean HTML output with customizable CSS styling
- Data Extraction: Extract structured data from documents using templates
- Template Variables: Use Jinja2 syntax for dynamic content
- Batch Processing: Process multiple files in parallel
- OCR Support: Extract text from scanned documents and images
- AI-Powered: Enhance documents with AI-generated content
redoc/
├── src/
│   └── redoc/
│       ├── __init__.py          # Package initialization
│       ├── core.py              # Core conversion logic
│       ├── converters/          # Format-specific converters
│       │   ├── base.py          # Base converter class
│       │   ├── pdf_converter.py
│       │   ├── html_converter.py
│       │   ├── xml_converter.py
│       │   ├── json_converter.py
│       │   ├── docx_converter.py
│       │   └── epub_converter.py
│       ├── ocr/                 # OCR functionality
│       ├── templates/           # Default templates
│       └── utils/               # Utility functions
├── tests/                       # Test suite
├── examples/                    # Usage examples
├── docs/                        # Documentation
├── pyproject.toml               # Project configuration
└── README.md                    # This file
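The layout above places one converter module per format next to a shared base class. The contents of base.py are not shown in this README, so the sketch below only illustrates how a base class plus a registry can support pluggable formats. BaseConverter, CONVERTERS, load, and dump are hypothetical names, not Redoc's actual API.

```python
import json
from abc import ABC, abstractmethod

# Hypothetical registry: maps a format name to its converter class.
CONVERTERS: dict = {}


class BaseConverter(ABC):
    """Hypothetical base class; Redoc's real base.py may differ."""

    format: str = ""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.format:  # register every subclass that names a format
            CONVERTERS[cls.format] = cls

    @abstractmethod
    def load(self, raw: str) -> dict:
        """Parse a document into an intermediate dict."""

    @abstractmethod
    def dump(self, doc: dict) -> str:
        """Serialize the intermediate dict back into this format."""


class JsonConverter(BaseConverter):
    format = "json"

    def load(self, raw: str) -> dict:
        return json.loads(raw)

    def dump(self, doc: dict) -> str:
        return json.dumps(doc, sort_keys=True)
```

With this shape, adding a format means defining one subclass; the registry picks it up at class-definition time without further wiring.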
from redoc import Redoc
converter = Redoc()
# Convert JSON+HTML template to PDF
converter.convert(
    {
        "template": "invoice.html",
        "data": {
            "invoice_number": "INV-2023-001",
            "date": "2023-11-15",
            "items": [
                {"description": "Web Design", "quantity": 1, "price": 1200}
            ],
            "total": 1200
        }
    },
    'pdf',
    output_file='invoice.pdf'
)
from redoc import Redoc
converter = Redoc()
# Extract text from scanned PDF with OCR
result = converter.ocr('scanned_document.pdf')
print(result['text'])
# Convert scanned document to searchable PDF
converter.ocr('scanned_document.pdf', output_file='searchable.pdf')
from redoc import Redoc
converter = Redoc()
# Generate document using AI
result = converter.generate(
    "Create a professional invoice for web design services",
    format='pdf',
    style='professional',
    output_file='ai_invoice.pdf'
)
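The AI features rely on a local Ollama server running the mistral:7b model. How Redoc talks to Ollama is not documented here. To check that the model is reachable independently of Redoc, a request to Ollama's HTTP generate endpoint can be sketched as below. The helper names build_request and generate are illustrative, and the URL assumes Ollama's default local port.

```python
import json
import urllib.request

# Assumes Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "mistral:7b") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the prompt and return the generated text (needs a running Ollama)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]
```

Calling generate("Say hello") should return text only when the server is running and the model has been pulled, for example with ollama pull mistral:7b.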
We have an exciting roadmap ahead! Check out our TODO list for upcoming features and improvements. Here are some highlights:
- Fixing pyproject.toml TOML syntax error
- Resolving MkDocs build warnings
- Enhancing documentation
- More template examples
- Improved AI features
- Performance optimizations
- Additional document format support
Contributions are welcome! Please read our Contributing Guidelines for details on how to contribute to this project.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
For any questions or suggestions, please contact info@softreck.dev.