📄 Redoc - Universal Document Converter


Redoc is a modular document conversion framework that converts between PDF, HTML, XML, JSON, DOCX, and EPUB. It includes OCR for scanned input, AI-assisted content generation using the Mistral 7B model served by Ollama (model tag mistral:7b), and a bidirectional template system that both generates documents from data and extracts data from documents.

🌟 Features

Core Functionality

  • Multi-format Support: Bidirectional conversion between PDF, HTML, XML, JSON, DOCX, and EPUB
  • Template System: JSON+HTML templates for dynamic document generation with bidirectional support
  • OCR Integration: Extract text from scanned documents and images with Tesseract OCR
  • AI-Powered: Use the Mistral 7B model via Ollama (mistral:7b) for content generation and processing
  • Bidirectional Processing: Convert documents to data and back with templates
  • Batch Processing: Process multiple documents efficiently with parallel execution
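
The bidirectional idea in the list above (render data into a document, then recover the same data from the document) can be illustrated with a small standard-library sketch. It does not call Redoc: `render`, `extract`, and the `{name}` placeholder syntax are hypothetical stand-ins, not Redoc's template format.

```python
from __future__ import annotations

import re

# Hypothetical placeholder syntax: {name}
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def render(template: str, data: dict[str, str]) -> str:
    """Fill each {name} placeholder with the matching value from data."""
    return PLACEHOLDER.sub(lambda m: data[m.group(1)], template)


def extract(template: str, document: str) -> dict[str, str] | None:
    """Recover placeholder values by turning the template into a regex.

    Assumes each placeholder name appears only once in the template.
    """
    pattern = ""
    pos = 0
    for m in PLACEHOLDER.finditer(template):
        # Literal text between placeholders must match exactly.
        pattern += re.escape(template[pos:m.start()])
        # Each placeholder becomes a non-greedy named capture group.
        pattern += f"(?P<{m.group(1)}>.+?)"
        pos = m.end()
    pattern += re.escape(template[pos:])
    match = re.fullmatch(pattern, document, flags=re.DOTALL)
    return match.groupdict() if match else None


template = "Invoice {number} issued on {date}"
data = {"number": "INV-2023-001", "date": "2023-11-15"}

document = render(template, data)        # data -> document
recovered = extract(template, document)  # document -> data
```

The round trip only works when the literal text around each placeholder is distinctive enough to delimit the value, which is one reason real extraction from PDFs or scans needs more than a regex.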

Advanced Capabilities

  • Template Variables: Support for dynamic content and conditional rendering
  • Validation: Built-in data validation with Pydantic models
  • Extensible Architecture: Plugin system for custom formats and processors
  • Asynchronous Processing: Non-blocking operations for high performance
  • Web Interface: Modern UI for document conversion and management
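
One common way to build a plugin system for custom formats is a registry keyed by (source, target) format pair. The sketch below shows that pattern under stated assumptions: `register_converter`, `convert_text`, and the toy `txt` to `html` plugin are hypothetical names for illustration, and this README does not describe Redoc's actual plugin API.

```python
from __future__ import annotations

from html import escape
from typing import Callable

Converter = Callable[[str], str]

# Registry mapping (source_format, target_format) to a converter function.
_CONVERTERS: dict[tuple[str, str], Converter] = {}


def register_converter(source: str, target: str) -> Callable[[Converter], Converter]:
    """Decorator that registers a function as the converter for one format pair."""
    def decorator(func: Converter) -> Converter:
        _CONVERTERS[(source.lower(), target.lower())] = func
        return func
    return decorator


def convert_text(content: str, source: str, target: str) -> str:
    """Look up the registered converter for the pair and apply it."""
    try:
        converter = _CONVERTERS[(source.lower(), target.lower())]
    except KeyError:
        raise ValueError(f"No converter registered for {source} -> {target}") from None
    return converter(content)


@register_converter("txt", "html")
def txt_to_html(content: str) -> str:
    """Toy plugin: wrap each non-empty line in a paragraph tag, escaping HTML."""
    lines = [line for line in content.splitlines() if line.strip()]
    return "\n".join(f"<p>{escape(line)}</p>" for line in lines)


result = convert_text("Hello\n\nA < B", "txt", "html")
```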

Developer Experience

  • Comprehensive API: Clean, well-documented Python API
  • Command Line Interface: Intuitive CLI for quick conversions
  • Interactive Shell: Built-in Python shell for exploration and debugging
  • Logging & Debugging: Configurable logging and error reporting
  • Type Hints: Full type annotations for better IDE support

Enterprise Ready

  • Docker Support: Containerized deployment with Docker and Docker Compose
  • REST API: Built with FastAPI for easy integration
  • Security: Input validation, sanitization, and secure defaults
  • Monitoring: Built-in metrics and health checks

🚀 Quick Start

Installation

Using pip (recommended)

# Install the latest stable version
pip install redoc

# Install with all optional dependencies
pip install "redoc[all]"

# Or install specific components
pip install "redoc[cli]"       # Command line interface
pip install "redoc[server]"     # Web server and API
pip install "redoc[ai]"         # AI features (requires Ollama)
pip install "redoc[ocr]"        # OCR capabilities (Tesseract)
pip install "redoc[templates]"  # Pre-built templates

Using Docker (recommended for production)

# Pull the latest image
docker pull text2doc/redoc:latest

# Run a conversion (paths refer to the /data mount inside the container)
docker run --rm -v "$(pwd)":/data text2doc/redoc convert /data/input.pdf /data/output.html

# Start the web interface
docker run --rm -p 8000:8000 -v "$(pwd)/templates":/app/templates text2doc/redoc serve

Development Installation

git clone https://github.com/text2doc/redoc.git
cd redoc
pip install -e ".[dev]"  # Install in development mode with all dependencies
pre-commit install  # Install git hooks

🛠 Basic Usage

Command Line Interface

# Convert a document
redoc convert input.pdf output.html

# Convert with a template
redoc convert --template invoice.html data.json invoice.pdf

# Start interactive shell
redoc shell

# Start web server
redoc serve

Python API

from redoc import Redoc

# Initialize with default settings
converter = Redoc()

# Convert between formats
converter.convert('document.pdf', 'document.html')  # PDF to HTML
converter.convert('data.json', 'report.pdf')       # JSON to PDF with template

# Process multiple files
converter.batch_convert(
    input_glob='invoices/*.json',
    output_dir='output/',
    output_format='pdf',
    template='invoice.html'
)

# Extract data from documents
data = converter.extract_data('document.pdf', 'invoice_schema.json')

# Generate documents from templates
converter.generate_document(
    template='invoice.html',
    data='data.json',
    output='invoice.pdf'
)

# Use the interactive shell
converter.shell()
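
To make the batch call above more concrete, here is a rough sketch of how a parallel batch run over a glob pattern could be structured with `concurrent.futures`. It is not Redoc's implementation: `convert_one` is a placeholder that only copies text into a file with the new extension, and the function signature and `max_workers` parameter are assumptions.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def convert_one(src: Path, output_dir: Path, output_format: str) -> Path:
    """Placeholder conversion: copy the text into a file with the new extension."""
    dst = output_dir / f"{src.stem}.{output_format}"
    dst.write_text(src.read_text(encoding="utf-8"), encoding="utf-8")
    return dst


def batch_convert(input_dir: Path, pattern: str, output_dir: Path,
                  output_format: str, max_workers: int = 4) -> list[Path]:
    """Convert every file matching pattern in parallel; return outputs in input order."""
    output_dir.mkdir(parents=True, exist_ok=True)
    sources = sorted(input_dir.glob(pattern))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `sources`.
        return list(pool.map(
            lambda src: convert_one(src, output_dir, output_format), sources))
```

Threads suit I/O-bound work such as reading and writing files; CPU-heavy conversions would more likely call for a process pool.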

Command Line Interface

# Show help
redoc --help

# Start web server
redoc serve --host 0.0.0.0 --port 8000

# Process multiple files
redoc batch "documents/*.pdf" --format html --output-dir html_output
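
The commands above follow a subcommand layout. As an illustration only, the sketch below reproduces that layout with `argparse`; this README does not say how Redoc's own CLI is built, the program name `redoc-sketch` is made up, and the defaults shown are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with convert, serve, and batch subcommands."""
    parser = argparse.ArgumentParser(prog="redoc-sketch")
    sub = parser.add_subparsers(dest="command", required=True)

    convert = sub.add_parser("convert", help="Convert a single document")
    convert.add_argument("--template", help="Template used to render data files")
    convert.add_argument("input")
    convert.add_argument("output")

    serve = sub.add_parser("serve", help="Start the web server")
    serve.add_argument("--host", default="127.0.0.1")     # assumed default
    serve.add_argument("--port", type=int, default=8000)  # assumed default

    batch = sub.add_parser("batch", help="Convert every file matching a glob")
    batch.add_argument("pattern")
    batch.add_argument("--format", required=True)
    batch.add_argument("--output-dir", default=".")        # assumed default
    return parser


# Parse the same arguments as the batch example above.
args = build_parser().parse_args(
    ["batch", "documents/*.pdf", "--format", "html", "--output-dir", "html_output"])
```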

Using Templates

from redoc import Redoc

converter = Redoc()

# Simple template with variables
template = {
    "template": "invoice.html",
    "data": {
        "invoice": {
            "number": "INV-2023-001",
            "date": "2023-11-15",
            "items": [
                {"description": "Web Design", "quantity": 10, "price": 100},
                {"description": "Hosting", "quantity": 1, "price": 50}
            ]
        }
    }
}

# Generate PDF from template
converter.convert(template, 'pdf', output_file='invoice.pdf')

# Extract data from document
data = converter.extract_data('invoice.pdf', template='invoice_template.html')
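
To show what happens between the data dictionary and the rendered document, the following sketch fills an HTML fragment from the same invoice data and computes the line totals. It deliberately uses `string.Template` from the standard library instead of Jinja2 to stay dependency-free; the HTML markup and the `line_total` and `total` fields are hypothetical and are not taken from Redoc's `invoice.html`.

```python
from string import Template

invoice = {
    "number": "INV-2023-001",
    "date": "2023-11-15",
    "items": [
        {"description": "Web Design", "quantity": 10, "price": 100},
        {"description": "Hosting", "quantity": 1, "price": 50},
    ],
}

ROW = Template("<tr><td>$description</td><td>$quantity</td><td>$line_total</td></tr>")
PAGE = Template("<h1>Invoice $number</h1>\n<p>Date: $date</p>\n"
                "<table>\n$rows\n</table>\n<p>Total: $total</p>")

rows = []
total = 0
for item in invoice["items"]:
    # Line total is quantity times unit price.
    line_total = item["quantity"] * item["price"]
    total += line_total
    # Extra keys in `item` (such as "price") are ignored by substitute().
    rows.append(ROW.substitute(item, line_total=line_total))

html = PAGE.substitute(number=invoice["number"], date=invoice["date"],
                       rows="\n".join(rows), total=total)
```

For this data the total is 10 × 100 + 1 × 50 = 1050. Unlike a full template engine, this sketch does not escape HTML in the values.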

📚 Supported Conversions

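| From \ To | PDF | HTML | XML | JSON | DOCX | EPUB |
|-----------|-----|------|-----|------|------|------|
| PDF       | ❌  | ✅   | ✅  | ✅   | ✅   | ✅   |
| HTML      | ✅  | ❌   | ✅  | ✅   | ✅   | ✅   |
| XML       | ✅  | ✅   | ❌  | ✅   | ✅   | ✅   |
| JSON      | ✅  | ✅   | ✅  | ❌   | ✅   | ✅   |
| DOCX      | ✅  | ✅   | ✅  | ✅   | ❌   | ✅   |
| EPUB      | ✅  | ✅   | ✅  | ✅   | ✅   | ❌   |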
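
The matrix above marks every pair of distinct formats as supported and every same-format pair as unsupported. A minimal sketch of that rule as code, useful for validating a request before attempting a conversion; `FORMATS` and `is_supported` are illustrative names and not part of Redoc's API.

```python
FORMATS = ("pdf", "html", "xml", "json", "docx", "epub")


def is_supported(source: str, target: str) -> bool:
    """True when the table lists source -> target as a supported conversion."""
    source, target = source.lower(), target.lower()
    return source in FORMATS and target in FORMATS and source != target


# Every ordered pair of distinct formats: 6 * 5 = 30 conversions.
supported_pairs = [(s, t) for s in FORMATS for t in FORMATS if is_supported(s, t)]
```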

Conversion Features

  • PDF Generation: High-quality PDF output with support for headers, footers, and page numbers
  • HTML Processing: Clean HTML output with customizable CSS styling
  • Data Extraction: Extract structured data from documents using templates
  • Template Variables: Use Jinja2 syntax for dynamic content
  • Batch Processing: Process multiple files in parallel
  • OCR Support: Extract text from scanned documents and images
  • AI-Powered: Enhance documents with AI-generated content

🏗️ Project Structure

redoc/
├── src/
│   └── redoc/
│       ├── __init__.py         # Package initialization
│       ├── core.py             # Core conversion logic
│       ├── converters/         # Format-specific converters
│       │   ├── base.py         # Base converter class
│       │   ├── pdf_converter.py
│       │   ├── html_converter.py
│       │   ├── xml_converter.py
│       │   ├── json_converter.py
│       │   ├── docx_converter.py
│       │   └── epub_converter.py
│       ├── ocr/                # OCR functionality
│       ├── templates/          # Default templates
│       └── utils/              # Utility functions
├── tests/                      # Test suite
├── examples/                   # Usage examples
├── docs/                       # Documentation
├── pyproject.toml              # Project configuration
└── README.md                   # This file

🔧 Advanced Usage

Using Templates

from redoc import Redoc

converter = Redoc()

# Convert JSON+HTML template to PDF
converter.convert(
    {
        "template": "invoice.html",
        "data": {
            "invoice_number": "INV-2023-001",
            "date": "2023-11-15",
            "items": [
                {"description": "Web Design", "quantity": 1, "price": 1200}
            ],
            "total": 1200
        }
    },
    'pdf',
    output_file='invoice.pdf'
)

OCR Processing

from redoc import Redoc

converter = Redoc()

# Extract text from scanned PDF with OCR
result = converter.ocr('scanned_document.pdf')
print(result['text'])

# Convert scanned document to searchable PDF
converter.ocr('scanned_document.pdf', output_file='searchable.pdf')

AI-Powered Content Generation

from redoc import Redoc

converter = Redoc()

# Generate document using AI
result = converter.generate(
    "Create a professional invoice for web design services",
    format='pdf',
    style='professional',
    output_file='ai_invoice.pdf'
)
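
The AI features depend on a running Ollama server with the mistral:7b model. For orientation, this sketch shows a direct, non-streaming call to Ollama's `/api/generate` endpoint using only the standard library. It is independent of Redoc: how Redoc itself talks to Ollama is not described here, and `build_request` and `generate` are illustrative helper names. The URL is Ollama's default local address.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_request(prompt: str, model: str = "mistral:7b") -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "mistral:7b", timeout: float = 120.0) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


# Building the request needs no server; calling generate() does.
request = build_request("Create a professional invoice for web design services")
```

Calling `generate(...)` requires Ollama to be running locally with the model already pulled (`ollama pull mistral:7b`).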

🚧 Next Steps

Planned features and improvements are tracked in our TODO list. Some highlights:

In Progress

  • Fixing pyproject.toml TOML syntax error
  • Resolving MkDocs build warnings
  • Enhancing documentation

Coming Soon

  • More template examples
  • Improved AI features
  • Performance optimizations
  • Additional document format support

🀝 Contributing

Contributions are welcome! Please read our Contributing Guidelines for details on how to contribute to this project.

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

📧 Contact

For any questions or suggestions, please contact info@softreck.dev.


Made with ❤️ by the Text2Doc Team