# Vivre

A Python library for parsing EPUB files and aligning parallel texts.

Vivre provides tools for processing parallel texts through a complete pipeline: parsing EPUB files, segmenting text into sentences, and aligning sentences across languages with the Gale-Church algorithm. The library offers both a simple API for programmatic use and a clean command-line interface.
## Features

- EPUB Parsing: Robust parsing with content filtering and chapter extraction
- Sentence Segmentation: Multi-language sentence segmentation using spaCy
- Text Alignment: Statistical text alignment using the Gale-Church algorithm
- Multiple Output Formats: JSON, CSV, XML, text, and dictionary formats
- Language Support: English, Spanish, French, German, Italian, Portuguese, and more
- Simple API: Easy-to-use top-level functions for common tasks
- Command-Line Interface: Clean CLI with two commands, `parse` and `align`
- Error Handling: Comprehensive error handling with helpful messages
- Type Safety: Full type hints and validation
## Installation

### Prerequisites

- Python 3.11 or higher
- pip (Python package installer)

### From Source

1. Clone the repository:

   ```bash
   git clone https://github.com/anidixit64/vivre.git
   cd vivre
   ```

2. Install the package:

   ```bash
   pip install -e .
   ```

3. Install the required spaCy models:

   ```bash
   python -m spacy download en_core_web_sm
   python -m spacy download es_core_news_sm
   python -m spacy download fr_core_news_sm
   python -m spacy download it_core_news_sm
   ```
### Docker

1. Clone the repository:

   ```bash
   git clone https://github.com/anidixit64/vivre.git
   cd vivre
   ```

2. Build the Docker image:

   ```bash
   docker build -t vivre .
   ```

3. Use the helper script for common operations:

   ```bash
   # Run the test suite (default)
   ./docker-run.sh

   # Drop into an interactive shell
   ./docker-run.sh shell

   # Show CLI help
   ./docker-run.sh cli

   # Get help on available options
   ./docker-run.sh help
   ```

The Docker image comes with all dependencies and spaCy models pre-installed.
## Command-Line Interface

Vivre provides a clean CLI with two commands, `parse` and `align`:

```bash
# Parse and analyze an EPUB file
vivre parse book.epub --verbose

# Parse with content display and segmentation
vivre parse book.epub --show-content --segment --language en

# Parse with a custom output format
vivre parse book.epub --format csv --output analysis.csv

# Align two EPUB files (the language pair is required)
vivre align english.epub french.epub en-fr

# Align with different output formats
vivre align english.epub french.epub en-fr --format json
vivre align english.epub french.epub en-fr --format csv --output alignments.csv
vivre align english.epub french.epub en-fr --format xml --output alignments.xml

# Align with custom Gale-Church parameters
vivre align english.epub french.epub en-fr --c 1.1 --s2 7.0 --gap-penalty 2.5

# Get help
vivre --help
vivre align --help
vivre parse --help
```
Quick start examples:

```bash
# Parse a book and inspect its structure
vivre parse sample.epub --verbose

# Align English and French versions of the same book
vivre align english_book.epub french_book.epub en-fr --format json --output alignment.json

# Parse with sentence segmentation
vivre parse sample.epub --segment --language en --format csv --output sentences.csv
```
## Python API

Vivre provides easy-to-use top-level functions for common tasks:

```python
import vivre

# Parse an EPUB and extract chapters
chapters = vivre.read('path/to/epub')
print(f"Found {len(chapters)} chapters")

# Segment chapters into sentences
segmented = chapters.segment('en')  # specify the language for better accuracy
sentences = segmented.get_segmented()

# Quick alignment: returns simple sentence pairs
pairs = vivre.quick_align('english.epub', 'french.epub', 'en-fr')
for source, target in pairs[:5]:
    print(f"EN: {source}")
    print(f"FR: {target}")

# Full alignment with rich output
result = vivre.align('english.epub', 'french.epub', 'en-fr')
print(result.to_json())  # JSON output
print(result.to_csv())   # CSV output
print(result.to_text())  # formatted text
print(result.to_xml())   # XML output
print(result.to_dict())  # Python dictionary

# align() also accepts Chapters objects directly
source_chapters = vivre.read('english.epub')
target_chapters = vivre.read('french.epub')
result = vivre.align(source_chapters, target_chapters, 'en-fr')

# List supported languages
languages = vivre.get_supported_languages()
print(f"Supported languages: {languages}")
```
Quick start examples:

```python
import vivre

# Parse a book
chapters = vivre.read('sample.epub')
print(f"Book has {len(chapters)} chapters")

# Align two books
result = vivre.align('english.epub', 'french.epub', 'en-fr')
print(result.to_json())

# Get sentence pairs
pairs = vivre.quick_align('english.epub', 'french.epub', 'en-fr')
for en, fr in pairs[:3]:
    print(f"EN: {en}")
    print(f"FR: {fr}")
    print()
```
### Advanced Usage

For more control, you can use the individual components:

```python
from vivre import VivreParser, Segmenter, Aligner, VivrePipeline

# Parse an EPUB
parser = VivreParser()
chapters = parser.parse_epub('book.epub')

# Segment text
segmenter = Segmenter()
sentences = segmenter.segment('Hello world!', 'en')

# Align texts
aligner = Aligner()
alignments = aligner.align(['Hello'], ['Bonjour'])

# Pipeline for complex workflows
pipeline = VivrePipeline('en-fr')
result = pipeline.process_parallel_epubs('english.epub', 'french.epub')
```
### Core Functions

- `read(epub_path)`: Parse an EPUB and return a `Chapters` object
- `align(source, target, language_pair)`: Align parallel texts; returns an `AlignmentResult`
- `quick_align(source_epub, target_epub, language_pair)`: Simple alignment; returns sentence pairs
- `get_supported_languages()`: Return the list of supported language codes
### Core Classes

- `Chapters`: Container for parsed EPUB chapters, with segmentation support
- `AlignmentResult`: Container for alignment results, with multiple output formats
- `VivreParser`: Low-level EPUB parser
- `Segmenter`: Sentence segmentation using spaCy
- `Aligner`: Text alignment using the Gale-Church algorithm
- `VivrePipeline`: High-level pipeline for complete workflows
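The Gale-Church algorithm pairs sentences by comparing their lengths rather than their words: translations tend to have proportional lengths, and improbable length ratios are penalized. Below is a minimal, self-contained sketch of the length cost at the heart of the algorithm. It is illustrative only, not vivre's internal implementation; the `c` and `s2` parameters play the same role as the `--c` and `--s2` knobs the CLI exposes.

```python
import math

# Default parameters from Gale & Church (1993)
C = 1.0    # expected target/source character-length ratio
S2 = 6.8   # variance of the length difference

def _norm_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def length_cost(len_src: int, len_tgt: int, c: float = C, s2: float = S2) -> float:
    """Negative log-likelihood of pairing two sentences by length.

    Low when the target length is close to c * source length,
    high when the lengths diverge.
    """
    mean = (len_src + len_tgt) / 2.0
    delta = (len_tgt - len_src * c) / math.sqrt(mean * s2)
    # Two-tailed probability of a deviation at least this large
    p = 2.0 * (1.0 - _norm_cdf(abs(delta)))
    return -math.log(max(p, 1e-300))
```

The full algorithm runs dynamic programming over these costs to choose between 1-1, 1-0, 0-1, 2-1, 1-2, and 2-2 sentence matches, with a gap penalty (cf. the CLI's `--gap-penalty`) for insertions and deletions.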
## Output Formats

The library supports multiple output formats:
- JSON: Structured data for programmatic use
- CSV: Tabular data for spreadsheet applications
- XML: Hierarchical data for document processing
- Text: Human-readable formatted output
- Dict: Python dictionary for direct manipulation
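To give a feel for what the JSON and CSV formats are for, here is a small sketch that serializes a hypothetical alignment result. The `pairs` structure and helper functions below are illustrative only; vivre's actual `AlignmentResult` layout may differ.

```python
import csv
import io
import json

# Hypothetical alignment result: (source, target) sentence pairs
pairs = [
    ("Hello world.", "Bonjour le monde."),
    ("How are you?", "Comment allez-vous ?"),
]

def to_json(pairs) -> str:
    """Structured output for programmatic use."""
    return json.dumps(
        [{"source": s, "target": t} for s, t in pairs],
        ensure_ascii=False, indent=2,
    )

def to_csv(pairs) -> str:
    """Tabular output for spreadsheet applications."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["source", "target"])
    writer.writerows(pairs)
    return buf.getvalue()
```

JSON preserves nesting and Unicode faithfully, while CSV flattens each pair to one row, which is usually the most convenient shape for spreadsheet review of alignments.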
## Supported Languages

Vivre supports the following languages through spaCy models:

- English (`en_core_web_sm`)
- Spanish (`es_core_news_sm`)
- French (`fr_core_news_sm`)
- Italian (`it_core_news_sm`)

These are the languages whose spaCy models come pre-installed and ready to use for EPUB parsing and text segmentation.
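The mapping from language codes to model names above can be captured in a small lookup; the `model_for` helper here is illustrative, not part of vivre's API.

```python
# spaCy models bundled per language code (from the list above)
SPACY_MODELS = {
    "en": "en_core_web_sm",
    "es": "es_core_news_sm",
    "fr": "fr_core_news_sm",
    "it": "it_core_news_sm",
}

def model_for(language_code: str) -> str:
    """Return the spaCy model name for a supported language code."""
    try:
        return SPACY_MODELS[language_code]
    except KeyError:
        raise ValueError(
            f"Unsupported language {language_code!r}; "
            f"choose one of {sorted(SPACY_MODELS)}"
        ) from None
```

Other languages would require downloading the corresponding spaCy model first (e.g. via `python -m spacy download`), as shown in the installation steps.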
## Testing

```bash
# Run all tests
pytest tests/

# Run with coverage
pytest tests/ --cov=vivre --cov-report=html

# Run specific test files
pytest tests/test_api.py
pytest tests/test_parser.py
```
### Docker Development

For a consistent development environment, use Docker:

```bash
# Build the development image
docker build -t vivre .

# Run tests in Docker
docker run --rm vivre python -m pytest tests/ -v

# Interactive development shell
docker run --rm -it vivre /bin/bash

# Run a specific test with coverage
docker run --rm vivre python -m pytest tests/test_api.py --cov=src/vivre/api --cov-report=term-missing
```
### Pre-commit Hooks

The project uses pre-commit hooks for code quality:

```bash
# Install the pre-commit hooks
pre-commit install

# Run the hooks manually
pre-commit run --all-files
```
## Contributing

We welcome contributions! Please see our Contributing Guide for detailed information on how to contribute to this project.

1. Fork the repository on GitHub.
2. Clone your fork locally:

   ```bash
   git clone https://github.com/your-username/vivre.git
   cd vivre
   ```

3. Create a feature branch:

   ```bash
   git checkout -b feature/your-feature-name
   ```

4. Set up the development environment:

   ```bash
   # Install dependencies
   poetry install

   # Install pre-commit hooks
   pre-commit install

   # Install spaCy models
   poetry run python -m spacy download en_core_web_sm
   poetry run python -m spacy download es_core_news_sm
   poetry run python -m spacy download fr_core_news_sm
   poetry run python -m spacy download it_core_news_sm
   ```

5. Make your changes and add tests for new functionality.
6. Run tests and quality checks:

   ```bash
   # Run all tests
   poetry run pytest tests/

   # Run with coverage
   poetry run pytest tests/ --cov=vivre --cov-report=html

   # Run linting and formatting
   poetry run ruff check .
   poetry run ruff format --check .

   # Run type checking
   poetry run mypy src/ tests/
   ```

7. Ensure all tests pass and coverage remains above 90%.
8. Commit your changes with clear commit messages.
9. Push to your fork and submit a pull request.
### Guidelines

- Follow the existing code style and conventions
- Add type hints to all new functions
- Include docstrings for all public functions and classes
- Write tests for new functionality
- Update documentation as needed
- Ensure all pre-commit hooks pass
For more detailed information, please see our Contributing Guide.
## License

This project is licensed under the Apache License 2.0; see the LICENSE file for details.

- License: Apache License 2.0
- SPDX Identifier: Apache-2.0
- Permissions: Commercial use, modification, distribution, patent use, private use
- Limitations: Liability, warranty
- Conditions: License and copyright notice preservation

The Apache License 2.0 is a permissive license: it allows commercial use, modification, distribution, patent use, and private use, while disclaiming liability and warranty and requiring that the license and copyright notice be preserved. For the complete license text, see the LICENSE file in this repository.