kirre-bylund/epub-translator

epub-translator

A Python project for translating EPUB files, with utilities for extracting and writing EPUB content while preserving formatting and images.

Setup

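  1. Create and activate a virtual environment (the activation command shown is for Windows PowerShell; on macOS/Linux use source .venv/bin/activate):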
    python -m venv .venv
    .venv\Scripts\Activate.ps1
  2. Install dependencies:
    pip install -r requirements.txt

Tools & Libraries

  • EPUB Parsing: zipfile, beautifulsoup4
  • Translation: translate (MVP), HuggingFace Transformers (MarianMT), extensible to LLMs/APIs
  • CLI: argparse
  • Testing: unittest or pytest
  • Progress Bar: tqdm
  • Fine-tuning: transformers, datasets, torch, sentencepiece
  • Sentence extraction: nltk, csv, pandas

Command Line Utilities

Run the CLI using:

python -m epub_translator.cli <command> [options]

Extract HTML and Images from EPUB

Extracts all HTML/XHTML content and images from an EPUB file (preserving formatting):

python -m epub_translator.cli extract --input <input.epub> --output <output_dir>
  • --input: Path to the EPUB file to extract from
  • --output: Directory to save extracted HTML and images
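
As a rough illustration of what the extract step involves, the sketch below copies HTML/XHTML and image entries out of an EPUB using only zipfile. It is not the project's actual implementation: the function name extract_html_and_images and the extension sets are assumptions made for this example.

```python
import zipfile
from pathlib import Path

# Assumed extension sets; the project's real filter may differ.
HTML_EXTS = {".html", ".xhtml", ".htm"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".svg"}


def extract_html_and_images(epub_path, output_dir):
    """Copy HTML/XHTML and image entries out of an EPUB, keeping relative paths.

    Hypothetical helper: returns the archive names that were extracted.
    """
    extracted = []
    with zipfile.ZipFile(epub_path) as epub:
        for info in epub.infolist():
            if info.is_dir():
                continue
            suffix = Path(info.filename).suffix.lower()
            if suffix in HTML_EXTS or suffix in IMAGE_EXTS:
                # extract() recreates the entry's directory structure under output_dir
                epub.extract(info, output_dir)
                extracted.append(info.filename)
    return extracted
```

Keeping the archive's relative paths means image references inside the extracted HTML continue to resolve.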

Write HTML(s) to EPUB

Writes one or more HTML/XHTML files (or a directory) to a new EPUB, using a template EPUB for structure:

python -m epub_translator.cli write --input <input_dir> --output <output.epub> --template <template.epub>
  • --input: Directory containing .html/.xhtml files
  • --output: Path to the output EPUB file
  • --template: Path to the template EPUB (required)
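
One way a template-based write could work is sketched below: every entry of the template EPUB is copied to the output, and any HTML/XHTML entry whose base file name matches a file in the input directory is replaced. The function name write_epub and the match-by-base-name rule are assumptions for illustration; the project may map files differently.

```python
import zipfile
from pathlib import Path


def write_epub(input_dir, output_path, template_path):
    """Rebuild an EPUB from a template, swapping in matching HTML/XHTML files.

    Hypothetical helper: assumes base file names are unique within the book.
    """
    replacements = {
        p.name: p
        for p in Path(input_dir).rglob("*")
        if p.suffix.lower() in {".html", ".xhtml"}
    }
    with zipfile.ZipFile(template_path) as template:
        with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as out:
            for info in template.infolist():
                if info.is_dir():
                    continue
                data = template.read(info.filename)
                name = Path(info.filename).name
                if name in replacements:
                    data = replacements[name].read_bytes()
                # The EPUB container format expects "mimetype" to be stored
                # uncompressed; copying in template order keeps its position.
                if info.filename == "mimetype":
                    compress = zipfile.ZIP_STORED
                else:
                    compress = zipfile.ZIP_DEFLATED
                out.writestr(info.filename, data, compress_type=compress)
```

Copying everything else from the template unchanged is what carries over the manifest, metadata, stylesheets, and images.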

Translate EPUB (structure-preserving)

Translates all HTML/XHTML content in an EPUB while preserving its structure and metadata. Supports checkpointing (pause/resume) and a HuggingFace MarianMT backend with automatic device selection:

python -m epub_translator.cli translate --input <input.epub> --output <output.epub> --to-lang <lang>
  • --input: Path to the EPUB file to translate
  • --output: Path to the output EPUB file
  • --to-lang: Target language (default: es)
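
For orientation, the three subcommands and their documented options could be wired up with argparse roughly as follows. This is a sketch, not the contents of epub_translator.cli: the function name build_parser and the choice to mark --input and --output as required are assumptions. Only --template is documented as required, and es is the documented default for --to-lang.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented extract/write/translate commands."""
    parser = argparse.ArgumentParser(prog="epub_translator.cli")
    sub = parser.add_subparsers(dest="command", required=True)

    extract = sub.add_parser("extract", help="Extract HTML and images from an EPUB")
    extract.add_argument("--input", required=True)   # assumed required
    extract.add_argument("--output", required=True)  # assumed required

    write = sub.add_parser("write", help="Write HTML files to a new EPUB")
    write.add_argument("--input", required=True)     # assumed required
    write.add_argument("--output", required=True)    # assumed required
    write.add_argument("--template", required=True)  # documented as required

    translate = sub.add_parser("translate", help="Translate an EPUB")
    translate.add_argument("--input", required=True)   # assumed required
    translate.add_argument("--output", required=True)  # assumed required
    translate.add_argument("--to-lang", default="es")  # documented default
    return parser
```

argparse exposes --to-lang as the attribute to_lang on the parsed namespace.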

Features

  • Checkpointing: If translation is interrupted, rerunning the same command resumes from the last completed file.
  • HuggingFace MarianMT backend: Uses GPU if available, otherwise CPU.
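
File-level checkpointing of the kind described above can be sketched as follows, assuming progress is kept as a JSON list of completed file names. The function translate_files, the translate_one callback, and the checkpoint format are all hypothetical; the project's actual checkpoint layout is not documented here.

```python
import json
from pathlib import Path


def translate_files(file_names, translate_one, checkpoint_path):
    """Translate files in order, skipping any already recorded in the checkpoint.

    translate_one is a caller-supplied function that translates a single file.
    """
    checkpoint = Path(checkpoint_path)
    if checkpoint.exists():
        done = set(json.loads(checkpoint.read_text(encoding="utf-8")))
    else:
        done = set()
    for name in file_names:
        if name in done:
            continue  # finished in an earlier run
        translate_one(name)
        done.add(name)
        # Persist after every file so an interruption loses at most one file of work.
        checkpoint.write_text(json.dumps(sorted(done)), encoding="utf-8")
    return done
```

Because the checkpoint is written only after a file finishes, a file that was in progress when the run stopped is translated again from the start on the next run.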

Fine-tuning MarianMT (English-Swedish)

A script for fine-tuning the Helsinki-NLP/opus-mt-en-sv model on your own parallel data is included:

  • Edit finetune_hf_en_sv.py to set your data file path and parameters.
  • Requires: transformers, datasets, torch, sentencepiece, pandas.
  • Run with:
    python finetune_hf_en_sv.py
  • Outputs a directory with your fine-tuned model.

Extracting Sentence Pairs from EPUBs

A script for extracting aligned sentence pairs from English and Swedish EPUBs is included:

  • Edit and run extract_epub_sentence_pairs.py to generate a parallel corpus CSV.
  • Requires: nltk, beautifulsoup4, pandas, and the standard-library csv module (for TSV/CSV handling).
  • Run with:
    python extract_epub_sentence_pairs.py --en_epub <english.epub> --sv_epubs <swedish1.epub> [<swedish2.epub> ...] --output_csv <output.csv>
  • Output: CSV file with aligned English/Swedish sentence pairs for training or evaluation.
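
To make the output format concrete, here is a minimal sketch that pairs sentences by position and writes them to a CSV. It is deliberately simplified and does not reproduce extract_epub_sentence_pairs.py: the regex splitter is a stand-in for nltk, positional pairing is an assumed alignment strategy, and the column names en and sv are hypothetical.

```python
import csv
import re


def split_sentences(text):
    """Naive splitter on ., ! or ? followed by whitespace (stand-in for nltk)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def write_pairs(en_text, sv_text, output_csv):
    """Pair sentences by position and write them to a CSV; returns the pair count."""
    # zip() stops at the shorter list, so unmatched trailing sentences are dropped.
    pairs = list(zip(split_sentences(en_text), split_sentences(sv_text)))
    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["en", "sv"])  # hypothetical column names
        writer.writerows(pairs)
    return len(pairs)
```

Positional pairing only holds when both editions segment into the same number of sentences in the same order, so real corpora generally need a more robust alignment step.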

Testing

Run all tests:

.venv\Scripts\python.exe -m unittest discover

Project Structure

See project_plan.md for details.


This project is in active development. Contributions and feedback are welcome!
