A Python project for translating EPUB files, with utilities for extracting and writing EPUB content while preserving formatting and images.
- Create and activate a virtual environment (PowerShell):

  ```shell
  python -m venv .venv
  .venv\Scripts\Activate.ps1
  ```
- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```
- EPUB Parsing: `zipfile`, `beautifulsoup4`
- Translation: `translate` (MVP), HuggingFace Transformers (MarianMT), extensible to LLMs/APIs
- CLI: `argparse`
- Testing: `unittest` or `pytest`
- Progress Bar: `tqdm`
- Fine-tuning: `transformers`, `datasets`, `torch`, `sentencepiece`
- Sentence extraction: `nltk`, `csv`, `pandas`
Run the CLI using:

```shell
python -m epub_translator.cli <command> [options]
```
Extracts all HTML/XHTML content and images from an EPUB file (preserving formatting):

```shell
python -m epub_translator.cli extract --input <input.epub> --output <output_dir>
```

- `--input`: Path to the EPUB file to extract from
- `--output`: Directory to save extracted HTML and images
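The extract step can be sketched with the standard-library `zipfile` module, since an EPUB is a ZIP archive. This is a minimal illustration, not the project's actual implementation; the function name and flat output layout are assumptions.

```python
import os
import zipfile

# File types the extract step cares about; SVG covers vector cover art.
IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".gif", ".svg")


def extract_epub(epub_path: str, out_dir: str) -> list:
    """Copy all HTML/XHTML documents and images out of an EPUB (ZIP) file."""
    extracted = []
    os.makedirs(out_dir, exist_ok=True)
    with zipfile.ZipFile(epub_path) as zf:
        for name in zf.namelist():
            lower = name.lower()
            if lower.endswith((".html", ".xhtml")) or lower.endswith(IMAGE_EXTS):
                # Flatten the archive path so every file lands in out_dir.
                target = os.path.join(out_dir, os.path.basename(name))
                with open(target, "wb") as fh:
                    fh.write(zf.read(name))
                extracted.append(target)
    return extracted
```

Package files such as `content.opf` and the TOC are deliberately skipped here; they are only needed again at write time, via the template EPUB.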
Writes one or more HTML/XHTML files (or a directory) to a new EPUB, using a template EPUB for structure:

```shell
python -m epub_translator.cli write --input <input_dir> --output <output.epub> --template <template.epub>
```

- `--input`: Directory containing .html/.xhtml files
- `--output`: Path to the output EPUB file
- `--template`: Path to the template EPUB (required)
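One way the template-based write step can work is to copy every entry from the template archive, substituting the translated HTML/XHTML bodies by filename. This sketch assumes that approach; the function name and matching-by-basename rule are illustrative, not the project's API.

```python
import os
import zipfile


def write_epub(input_dir: str, template_path: str, output_path: str) -> None:
    """Rebuild an EPUB from a template, swapping in HTML/XHTML files by name."""
    replacements = {
        name: os.path.join(input_dir, name)
        for name in os.listdir(input_dir)
        if name.lower().endswith((".html", ".xhtml"))
    }
    with zipfile.ZipFile(template_path) as src, \
         zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            base = os.path.basename(item.filename)
            if base in replacements:
                # Translated document replaces the template's version.
                with open(replacements[base], "rb") as fh:
                    dst.writestr(item.filename, fh.read())
            else:
                # Everything else (OPF, TOC, images, CSS) is copied verbatim.
                dst.writestr(item.filename, src.read(item.filename))
```

Note that a strictly conformant EPUB writer must also store the `mimetype` entry first and uncompressed; this sketch glosses over that detail.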
Translates all HTML/XHTML content in an EPUB, preserving all structure and metadata. Supports checkpointing (pause/resume) and a HuggingFace MarianMT backend with automatic device selection:

```shell
python -m epub_translator.cli translate --input <input.epub> --output <output.epub> --to-lang <lang>
```

- `--input`: Path to the EPUB file to translate
- `--output`: Path to the output EPUB file
- `--to-lang`: Target language (default: `es`)
- Checkpointing: If translation is interrupted, rerunning will resume from the last completed file.
- HuggingFace MarianMT backend: Uses GPU if available, otherwise CPU.
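The resume behaviour can be sketched as a JSON sidecar file that records which documents are done; the project's actual checkpoint format may differ, and the function and file names here are assumptions. (Device selection in the MarianMT backend would typically follow the usual PyTorch pattern of picking `"cuda"` when `torch.cuda.is_available()` and `"cpu"` otherwise.)

```python
import json
import os


def translate_with_checkpoint(files, translate_file, checkpoint_path):
    """Translate each file once, skipping files recorded as already done."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = set(json.load(fh))
    for name in files:
        if name in done:
            continue  # completed on a previous run
        translate_file(name)
        done.add(name)
        # Persist after every file so an interrupt loses at most one document.
        with open(checkpoint_path, "w") as fh:
            json.dump(sorted(done), fh)
```

Writing the checkpoint after each file, rather than once at the end, is what makes an interrupted run resumable at file granularity.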
A script for fine-tuning the Helsinki-NLP/opus-mt-en-sv model on your own parallel data is included:

- Edit `finetune_hf_en_sv.py` to set your data file path and parameters.
- Requires: `transformers`, `datasets`, `torch`, `sentencepiece`, `pandas`.
- Run with:

  ```shell
  python finetune_hf_en_sv.py
  ```

- Outputs a directory with your fine-tuned model.
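Whatever the script's internals, the parallel data has to reach the trainer in the nested `translation` format that HuggingFace `datasets` uses for translation tasks. A minimal loader sketch, assuming a CSV with `en` and `sv` columns (the column names are an assumption, not the script's contract):

```python
import csv


def load_parallel_corpus(csv_path):
    """Read an en/sv CSV into the HuggingFace translation-example format."""
    examples = []
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            # Skip rows where either side is missing or empty.
            if row.get("en") and row.get("sv"):
                examples.append(
                    {"translation": {"en": row["en"], "sv": row["sv"]}}
                )
    return examples
```

A list in this shape can be wrapped with `datasets.Dataset.from_list(...)` before tokenization and training.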
A script for extracting aligned sentence pairs from English and Swedish EPUBs is included:

- Edit and run `extract_epub_sentence_pairs.py` to generate a parallel corpus CSV.
- Requires: `nltk`, `beautifulsoup4`, `csv`, `pandas` (for TSV/CSV handling).
- Run with:

  ```shell
  python extract_epub_sentence_pairs.py --en_epub <english.epub> --sv_epubs <swedish1.epub> [<swedish2.epub> ...] --output_csv <output.csv>
  ```

- Output: CSV file with aligned English/Swedish sentence pairs for training or evaluation.
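The final write-out step can be sketched as pairing sentences by position and emitting a two-column CSV. This is a naive illustration under the assumption that both texts segment into the same number of sentences; the actual script relies on `nltk` for tokenization and may align more robustly.

```python
import csv


def write_sentence_pairs(en_sentences, sv_sentences, output_csv):
    """Pair sentences by position and write them as a two-column CSV."""
    with open(output_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["en", "sv"])
        # zip truncates to the shorter list; a real aligner should check
        # that sentence counts match before trusting positional pairing.
        for en, sv in zip(en_sentences, sv_sentences):
            writer.writerow([en, sv])
```

The resulting file matches the `en`/`sv` column layout assumed by the fine-tuning data loader above.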
Run all tests:

```shell
.venv\Scripts\python.exe -m unittest discover
```
See `project_plan.md` for details.
This project is in active development. Contributions and feedback are welcome!