A Python project for translating EPUB files, with utilities for extracting and writing EPUB content while preserving formatting and images.
- Create and activate a virtual environment (PowerShell):

  ```shell
  python -m venv .venv
  .venv\Scripts\Activate.ps1
  ```
- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```
- EPUB Parsing: `zipfile`, `beautifulsoup4`
- Translation: `translate` (MVP), HuggingFace Transformers (MarianMT), extensible to LLMs/APIs
- CLI: `argparse`
- Testing: `unittest` or `pytest`
- Progress Bar: `tqdm`
- Fine-tuning: `transformers`, `datasets`, `torch`, `sentencepiece`
- Sentence extraction: `nltk`, `csv`, `pandas`
Run the CLI using:

```shell
python -m epub_translator.cli <command> [options]
```
Extracts all HTML/XHTML content and images from an EPUB file (preserving formatting):

```shell
python -m epub_translator.cli extract --input <input.epub> --output <output_dir>
```

- `--input`: Path to the EPUB file to extract from
- `--output`: Directory to save extracted HTML and images
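The extract step can be sketched with the standard-library `zipfile` module, since an EPUB is a ZIP archive. This is a minimal illustration, not the project's actual implementation; the function name and flat output layout are assumptions.

```python
import os
import zipfile

# File types the extract step cares about; SVG covers vector cover art.
IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".gif", ".svg")


def extract_epub(epub_path: str, out_dir: str) -> list:
    """Copy all HTML/XHTML documents and images out of an EPUB (ZIP) file."""
    extracted = []
    os.makedirs(out_dir, exist_ok=True)
    with zipfile.ZipFile(epub_path) as zf:
        for name in zf.namelist():
            lower = name.lower()
            if lower.endswith((".html", ".xhtml")) or lower.endswith(IMAGE_EXTS):
                # Flatten the archive path so every file lands in out_dir.
                target = os.path.join(out_dir, os.path.basename(name))
                with open(target, "wb") as fh:
                    fh.write(zf.read(name))
                extracted.append(target)
    return extracted
```

Package files such as `content.opf` and the TOC are deliberately skipped here; they are only needed again at write time, via the template EPUB.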
Writes one or more HTML/XHTML files (or a directory) to a new EPUB, using a template EPUB for structure:

```shell
python -m epub_translator.cli write --input <input_dir> --output <output.epub> --template <template.epub>
```

- `--input`: Directory containing .html/.xhtml files
- `--output`: Path to the output EPUB file
- `--template`: Path to the template EPUB (required)
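One way the template-based write step can work is to copy every entry from the template archive, substituting the translated HTML/XHTML bodies by filename. This sketch assumes that approach; the function name and matching-by-basename rule are illustrative, not the project's API.

```python
import os
import zipfile


def write_epub(input_dir: str, template_path: str, output_path: str) -> None:
    """Rebuild an EPUB from a template, swapping in HTML/XHTML files by name."""
    replacements = {
        name: os.path.join(input_dir, name)
        for name in os.listdir(input_dir)
        if name.lower().endswith((".html", ".xhtml"))
    }
    with zipfile.ZipFile(template_path) as src, \
         zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            base = os.path.basename(item.filename)
            if base in replacements:
                # Translated document replaces the template's version.
                with open(replacements[base], "rb") as fh:
                    dst.writestr(item.filename, fh.read())
            else:
                # Everything else (OPF, TOC, images, CSS) is copied verbatim.
                dst.writestr(item.filename, src.read(item.filename))
```

Note that a strictly conformant EPUB writer must also store the `mimetype` entry first and uncompressed; this sketch glosses over that detail.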
Translates all HTML/XHTML content in an EPUB, preserving all structure and metadata. Supports checkpointing (pause/resume) and a HuggingFace MarianMT backend with automatic device selection:

```shell
python -m epub_translator.cli translate --input <input.epub> --output <output.epub> --to-lang <lang>
```

- `--input`: Path to the EPUB file to translate
- `--output`: Path to the output EPUB file
- `--to-lang`: Target language (default: `es`)
- Checkpointing: If translation is interrupted, rerunning will resume from the last completed file.
- HuggingFace MarianMT backend: Uses GPU if available, otherwise CPU.
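The resume behaviour can be sketched as a JSON sidecar file that records which documents are done; the project's actual checkpoint format may differ, and the function and file names here are assumptions. (Device selection in the MarianMT backend would typically follow the usual PyTorch pattern of picking `"cuda"` when `torch.cuda.is_available()` and `"cpu"` otherwise.)

```python
import json
import os


def translate_with_checkpoint(files, translate_file, checkpoint_path):
    """Translate each file once, skipping files recorded as already done."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = set(json.load(fh))
    for name in files:
        if name in done:
            continue  # completed on a previous run
        translate_file(name)
        done.add(name)
        # Persist after every file so an interrupt loses at most one document.
        with open(checkpoint_path, "w") as fh:
            json.dump(sorted(done), fh)
```

Writing the checkpoint after each file, rather than once at the end, is what makes an interrupted run resumable at file granularity.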
A script for fine-tuning the Helsinki-NLP/opus-mt-en-sv model on your own parallel data is included:

- Edit `finetune_hf_en_sv.py` to set your data file path and parameters.
- Requires: `transformers`, `datasets`, `torch`, `sentencepiece`, `pandas`.
- Run with:

  ```shell
  python finetune_hf_en_sv.py
  ```

- Outputs a directory with your fine-tuned model.
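Whatever the script's internals, the parallel data has to reach the trainer in the nested `translation` format that HuggingFace `datasets` uses for translation tasks. A minimal loader sketch, assuming a CSV with `en` and `sv` columns (the column names are an assumption, not the script's contract):

```python
import csv


def load_parallel_corpus(csv_path):
    """Read an en/sv CSV into the HuggingFace translation-example format."""
    examples = []
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            # Skip rows where either side is missing or empty.
            if row.get("en") and row.get("sv"):
                examples.append(
                    {"translation": {"en": row["en"], "sv": row["sv"]}}
                )
    return examples
```

A list in this shape can be wrapped with `datasets.Dataset.from_list(...)` before tokenization and training.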
A script for extracting aligned sentence pairs from English and Swedish EPUBs is included:

- Edit and run `extract_epub_sentence_pairs.py` to generate a parallel corpus CSV.
- Requires: `nltk`, `beautifulsoup4`, `csv`, `pandas` (for TSV/CSV handling).
- Run with:

  ```shell
  python extract_epub_sentence_pairs.py --en_epub <english.epub> --sv_epubs <swedish1.epub> [<swedish2.epub> ...] --output_csv <output.csv>
  ```

- Output: CSV file with aligned English/Swedish sentence pairs for training or evaluation.
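The final write-out step can be sketched as pairing sentences by position and emitting a two-column CSV. This is a naive illustration under the assumption that both texts segment into the same number of sentences; the actual script relies on `nltk` for tokenization and may align more robustly.

```python
import csv


def write_sentence_pairs(en_sentences, sv_sentences, output_csv):
    """Pair sentences by position and write them as a two-column CSV."""
    with open(output_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["en", "sv"])
        # zip truncates to the shorter list; a real aligner should check
        # that sentence counts match before trusting positional pairing.
        for en, sv in zip(en_sentences, sv_sentences):
            writer.writerow([en, sv])
```

The resulting file matches the `en`/`sv` column layout assumed by the fine-tuning data loader above.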
Run all tests:

```shell
.venv\Scripts\python.exe -m unittest discover
```
See `project_plan.md` for details.
This project is in active development. Contributions and feedback are welcome!