PDF Translate — Automated PDF Translation & Redaction (Python)

DEPLOYED LINK

Automate high-quality translation and selective redaction of PDFs while preserving layout, font sizing, and colors. The project blends:

OCR (via ocrmypdf/Tesseract) for scanned or low-quality PDFs
Text-layer analysis (PyMuPDF/fitz) for precise boxes, spans, lines, and blocks
AI translation (Google Translate via googletrans)
Overlay & drawing logic to put translated text back exactly where it belongs
Redaction/masking that adapts to background/foreground contrast

It supports English ↔︎ Hindi out of the box and can be extended to other scripts.

Key features

Language translation: English ↔︎ Hindi supported; auto direction detection available. Extendable by swapping fonts & translation parameters.
Text layer analysis: Extracts spans, lines, blocks, and a hybrid (column/table-aware) mode to keep text where it belongs even in multi-column pages and tables.
OCR for scanned PDFs: Uses ocrmypdf to produce a clean, searchable PDF prior to analysis.
Style preservation: Transfers font size & color from the original objects to the translated overlay so the result looks native.
Smart redaction / masking:
- redact (true PDF redactions) or mask (draw filled rectangles).
- Fill color is chosen dynamically from surrounding luminance (dark text → white fill, light text → black fill) to maintain visual consistency.
Overlay options:
- Generate overlays automatically from the current document, or
- Drive from JSON (page + bbox + translated_text) to paint exactly what you want.
- Render as real text or high-DPI images (for bulletproof glyph coverage).
CLI & Python API: A single unified script provides modes for span, line, block, hybrid, overlay, and all (batch all modes + zip).
Error correction helpers: Normalizes whitespace and punctuation spacing; de-noises OCR artifacts where possible.
Multiple input formats: Any format PyMuPDF can open (primarily PDF; images should be wrapped in a PDF before processing, which ocrmypdf handles).
Security & compliance: Use local OCR and redaction; redact before writing translated text to prevent data leaks in sensitive areas.
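
The error correction helpers are not shown in this README. As a rough illustration of whitespace and punctuation normalization, a minimal sketch might look like the following. The name normalize_text and the exact rules are assumptions made for this example, not the project's actual code.

```python
import re


def normalize_text(text: str) -> str:
    """Collapse whitespace and tidy spacing around common punctuation.

    Hypothetical helper for illustration only.
    """
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Drop spaces before closing punctuation, including the Devanagari danda.
    text = re.sub(r"\s+([,.;:!?।])", r"\1", text)
    # Add one space after punctuation when a letter follows directly.
    # "." and ":" are left alone so decimals, URLs, and times survive.
    text = re.sub(r"([,;!?।])(?=[^\W\d_])", r"\1 ", text)
    return text
```

A helper like this would typically run on OCR output before translation, so the translator sees clean sentences rather than layout debris.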

How it works

OCR pass (optional but recommended): ocrmypdf runs with language packs (e.g., hin+eng) and deskew/rotate to create a clean text layer.
Text extraction & structure building: PyMuPDF extracts raw dicts of blocks/lines/spans; the code constructs:
- basic spans, lines, blocks
- hybrid blocks that split each raw line into segments by significant X-gaps (detects table cells / columns)
Style sampling: A lightweight index of original color & font size is built and transferred to translated objects using IoU/nearest heuristics.
Translation: Uses googletrans (Google Translate) with direction:
- hi->en, en->hi, or auto (detect from dominant script).
Erasure / Redaction: Depending on mode:
- mask: draw filled rectangles (per-box adaptive fill)
- redact: actual redaction annotations applied page-wide
Overlay: The translated text is written back using either:
- Text boxes (insert_textbox with font fallback), or
- High-DPI image tiles rendered via PIL for maximum glyph fidelity.
All-mode: Runs span, line, block, hybrid, and optionally overlay, writing separate PDFs and a combined ZIP.
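
The style sampling step mentions IoU/nearest heuristics without showing them. One plausible reading is: pick the original object whose box overlaps the translated box most, and fall back to the nearest box center when nothing overlaps. The sketch below follows that reading. The names iou and match_style, the (bbox, style) pair layout, and the min_iou threshold are assumptions for illustration and are not taken from the project.

```python
from typing import Dict, List, Sequence, Tuple

BBox = Sequence[float]  # (x0, y0, x1, y1) in PDF points


def iou(a: BBox, b: BBox) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_style(target: BBox, originals: List[Tuple[BBox, Dict]],
                min_iou: float = 0.1) -> Dict:
    """Return the style of the best-overlapping original, else the nearest one."""
    if not originals:
        return {}
    best_box, best_style = max(originals, key=lambda item: iou(target, item[0]))
    if iou(target, best_box) >= min_iou:
        return best_style

    def center_dist_sq(box: BBox) -> float:
        # Squared distance between box centers (no sqrt needed for ranking).
        dx = (box[0] + box[2]) / 2 - (target[0] + target[2]) / 2
        dy = (box[1] + box[3]) / 2 - (target[1] + target[3]) / 2
        return dx * dx + dy * dy

    return min(originals, key=lambda item: center_dist_sq(item[0]))[1]
```

The nearest-center fallback matters after OCR, where a re-generated text layer can shift boxes enough that they no longer overlap the originals.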

Project structure

PDF-TRANSLATOR
├── app.py                      # (Optional) app entry (e.g., Streamlit)
├── PDF_Translate/              # Modular library
│   ├── __init__.py
│   ├── cli.py
│   ├── constants.py
│   ├── hybrid.py
│   ├── ocr.py
│   ├── overlay.py
│   ├── pipeline.py
│   ├── textlayer.py
│   └── utils.py
├── pdf_translate_unified.py    # Unified CLI/API (span/line/block/hybrid/overlay/all)
├── assets/
│   └── fonts/                  # Pre-bundled font files (English & Devanagari)
│       ├── NotoSans-Regular.ttf
│       ├── NotoSans-Bold.ttf
│       ├── NotoSansDevanagari-Regular.ttf
│       ├── NotoSansDevanagari-Bold.ttf
│       ├── TiroDevanagariHindi-Regular.ttf
│       ├── Hind-Regular.ttf
│       ├── Karma-Regular.ttf
│       └── Mukta-Regular.ttf
├── samples/
│   ├── Test1.pdf
│   ├── Test1_translated.pdf
│   ├── Test2.pdf
│   ├── Test2_translated.pdf
│   ├── Test3.pdf
│   └── Test3_translated.pdf
├── output_pdfs/                # Generated outputs land here
├── temp/                       # OCR/rasterization scratch (e.g., ocr_fixed.pdf)
├── requirements.txt
├── Dockerfile
└── Readme.md                   # (this document)

Installation

1) System prerequisites

Python: 3.12 recommended
Tesseract & ocrmypdf: required for OCR
Ghostscript + qpdf: required by ocrmypdf

Ubuntu/Debian

sudo apt update
sudo apt install -y tesseract-ocr tesseract-ocr-hin ocrmypdf ghostscript qpdf

macOS (Homebrew)

brew install tesseract ocrmypdf ghostscript qpdf

Windows

Install Tesseract (UB Mannheim build recommended) and make sure tesseract.exe is on PATH.
Install Ghostscript and qpdf; add to PATH.
Install ocrmypdf via pip (will use the system binaries above).

2) Python packages

python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -r requirements.txt

Quick start

Translate a PDF (English → Hindi) using all modes:

python pdf_translate_unified.py \
  --input samples/Test3.pdf \
  --output output_pdfs/result.pdf \
  --mode all \
  --translate en->hi

What you get:

result.span.pdf
result.line.pdf
result.block.pdf
result.hybrid.pdf
result.overlay.pdf
result_all_methods.zip bundling the above
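
For orientation, the bundling step can be approximated with the standard library alone. This is a minimal sketch, assuming the per-mode files follow the result.<mode>.pdf naming shown above. The function name bundle_outputs is hypothetical and the project's own implementation may differ.

```python
import zipfile
from pathlib import Path

MODES = ("span", "line", "block", "hybrid", "overlay")


def bundle_outputs(base_output: str, modes=MODES) -> Path:
    """Zip every existing <stem>.<mode>.pdf next to base_output."""
    base = Path(base_output)
    zip_path = base.parent / f"{base.stem}_all_methods.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for mode in modes:
            pdf = base.parent / f"{base.stem}.{mode}.pdf"
            # Skip modes that were not produced (e.g., overlay is optional).
            if pdf.exists():
                zf.write(pdf, arcname=pdf.name)
    return zip_path
```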

Command-line usage

python pdf_translate_unified.py --help

Required

--input / -i: path to your source PDF
--output / -o: output path (for --mode all, this is the base name)

Modes

--mode {span,line,block,hybrid,overlay,all} (default: all)

When to use which

span – ultra-fine placement, best for mixed inline styles; can look busy
line – per line; balances fidelity & readability
block – per paragraph/block; often the cleanest look
hybrid – column/table-aware; great for multi-column layouts and tabular data
overlay – paint from a JSON (see below) or from --auto-overlay
all – run several modes and zip them for comparison
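
To make the hybrid mode more concrete: it splits each raw line into segments wherever a significant horizontal gap appears, which is what keeps table cells and columns apart. The sketch below shows that idea on plain tuples. The function name split_line_by_gaps, the (x0, x1, text) span layout, and the 18-point threshold are assumptions for illustration, not values from the project.

```python
from typing import List, Tuple

Span = Tuple[float, float, str]  # (x0, x1, text), coordinates in PDF points


def split_line_by_gaps(spans: List[Span],
                       gap_threshold: float = 18.0) -> List[List[Span]]:
    """Group one line's spans into segments separated by large X-gaps."""
    segments: List[List[Span]] = []
    right_edge = 0.0  # furthest right edge seen in the current segment
    for span in sorted(spans, key=lambda s: s[0]):
        x0, x1, _ = span
        if segments and x0 - right_edge <= gap_threshold:
            # Small gap: same cell or column, keep extending the segment.
            segments[-1].append(span)
            right_edge = max(right_edge, x1)
        else:
            # Large gap (or first span): start a new segment.
            segments.append([span])
            right_edge = x1
    return segments
```

Each resulting segment can then be translated and painted as its own box, so text from neighbouring columns is never merged into one sentence.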

OCR options

--lang (default: hin+eng) – languages passed to ocrmypdf
--dpi (default: 1000) – resolution passed to ocrmypdf as --image-dpi/--oversample
--optimize (default: 3) – ocrmypdf --optimize level
--skip-ocr – use the input PDF as-is (not recommended for scanned PDFs)

Translation direction

--translate {hi->en,en->hi,auto} (default: hi->en)
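
The auto setting picks the direction from the dominant script. A minimal sketch of that kind of check counts Devanagari code points (Unicode block U+0900 to U+097F) against ASCII letters. The function name detect_direction and the tie-breaking rule are assumptions for this example. The project's heuristic may be more involved.

```python
def detect_direction(text: str) -> str:
    """Return 'hi->en' when Devanagari dominates, otherwise 'en->hi'."""
    # Devanagari occupies the Unicode block U+0900..U+097F.
    devanagari = sum(1 for ch in text if "\u0900" <= ch <= "\u097F")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    # Ties (including empty input) fall back to English as the source.
    return "hi->en" if devanagari > latin else "en->hi"
```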

Redaction / masking

--erase {redact,mask,none} (default: redact)
--redact-color r,g,b – only used when a fixed color is required; otherwise the tool automatically picks black or white from context.
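
The automatic black-or-white choice can be pictured as a luminance threshold on the text color: dark text gets a white fill, light text gets a black fill. The sketch below uses Rec. 709 luma weights and a 0.5 cutoff on RGB components in the 0 to 1 range. The weights, the cutoff, and the name pick_fill_color are assumptions for illustration, and the tool itself may sample the surrounding pixels instead.

```python
from typing import Tuple

RGB = Tuple[float, float, float]  # components in the range 0.0 to 1.0


def pick_fill_color(text_rgb: RGB, threshold: float = 0.5) -> RGB:
    """Choose a white fill for dark text and a black fill for light text."""
    r, g, b = text_rgb
    # Rec. 709 luma weights: green contributes most to perceived brightness.
    luminance = 0.2126 * r + 0.7152 * g + 0.0722 * b
    return (1.0, 1.0, 1.0) if luminance < threshold else (0.0, 0.0, 0.0)
```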

Fonts

--font-en-name (logical name; default NotoSans)
--font-en-path (path to TTF; default bundled Noto Sans)
--font-hi-name (default NotoSansDevanagari)
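--font-hi-path (path to Devanagari TTF; falls back to Base14 helv if the file is missing. Helvetica has no Devanagari glyphs, so make sure this path resolves when producing Hindi output)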

Overlay-specific knobs

--overlay-json /path/to/text_data.json
--auto-overlay – build overlay items from the doc and chosen --translate
--overlay-render {image,textbox} (default image)
--overlay-align {0,1,2,3} – left/center/right/justify (justify only for textbox)
--overlay-line-spacing (default 1.10)
--overlay-margin-px (default 0.1)
--overlay-target-dpi (default 600)
--overlay-scale-x|y, --overlay-off-x|y – fix geometry if the JSON was created on a near-duplicate PDF

Example commands

1) English → Hindi (hybrid mode)

python pdf_translate_unified.py -i samples/Test1.pdf -o output_pdfs/t1.hybrid.pdf \
  --mode hybrid --translate en->hi

2) Hindi → English (block mode, masking)

python pdf_translate_unified.py -i samples/Test2.pdf -o output_pdfs/t2.block.pdf \
  --mode block --translate hi->en --erase mask

3) Overlay from JSON with real text (keep searchable layer)

python pdf_translate_unified.py -i samples/Test3.pdf -o output_pdfs/t3.overlay.pdf \
  --mode overlay --overlay-json text_data.json --overlay-render textbox \
  --overlay-align 0 --overlay-line-spacing 1.15

4) Auto-overlay (no JSON; build from doc)

python pdf_translate_unified.py -i samples/Test3.pdf -o output_pdfs/t3.overlay.pdf \
  --mode overlay --auto-overlay --translate en->hi

Fonts

For Devanagari, the bundled fonts work well:

NotoSansDevanagari-Regular.ttf
TiroDevanagariHindi-Regular.ttf
Others: Hind, Mukta, Karma

Specify alternatives via --font-hi-path. For English, NotoSans is the default.

Overlay JSON

You can drive the overlay precisely with a JSON file:

[
  {
    "page": 0,
    "bbox": [72.0, 144.0, 270.0, 180.0],
    "translated_text": "Hello world",
    "fontsize": 11.5
  }
]

Required: page, bbox ([x0,y0,x1,y1] in PDF points), translated_text
Optional: fontsize (used as a base; the renderer will fit it)
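
Before handing a JSON file to the overlay mode, it can help to check it against the required fields listed above. The following is a small stand-alone validator, offered as a sketch. The function names validate_overlay_items and load_overlay_items are hypothetical and not part of the project.

```python
import json
from typing import Any, Dict, List

REQUIRED_KEYS = ("page", "bbox", "translated_text")


def validate_overlay_items(items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Raise ValueError if an item lacks a required key or has a bad bbox."""
    for i, item in enumerate(items):
        missing = [key for key in REQUIRED_KEYS if key not in item]
        if missing:
            raise ValueError(f"item {i}: missing keys {missing}")
        bbox = item["bbox"]
        if len(bbox) != 4:
            raise ValueError(f"item {i}: bbox must be [x0, y0, x1, y1]")
        x0, y0, x1, y1 = bbox
        if x1 <= x0 or y1 <= y0:
            raise ValueError(f"item {i}: bbox has no area: {bbox}")
    return items


def load_overlay_items(path: str) -> List[Dict[str, Any]]:
    """Read an overlay JSON file and validate its items."""
    with open(path, encoding="utf-8") as fh:
        return validate_overlay_items(json.load(fh))
```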

Run:

python pdf_translate_unified.py -i in.pdf -o out.pdf \
  --mode overlay --overlay-json text_data.json

Geometry mismatch? If your JSON came from a slightly different source PDF:

--overlay-scale-x|y to scale all boxes
--overlay-off-x|y to shift them
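
The effect of these four knobs can be pictured as a simple affine adjustment of each box. The sketch below assumes scaling is applied about the page origin before the offset is added. That ordering, and the name adjust_bbox, are assumptions for illustration, so check the script's behavior before relying on it.

```python
from typing import Sequence, Tuple


def adjust_bbox(bbox: Sequence[float],
                scale_x: float = 1.0, scale_y: float = 1.0,
                off_x: float = 0.0, off_y: float = 0.0
                ) -> Tuple[float, float, float, float]:
    """Scale a box about the origin, then shift it (units: PDF points)."""
    x0, y0, x1, y1 = bbox
    return (x0 * scale_x + off_x, y0 * scale_y + off_y,
            x1 * scale_x + off_x, y1 * scale_y + off_y)
```

For example, a JSON produced from a page that was half the width of the target would use scale_x=2.0, while a page with an extra 10-point top margin would use off_y=10.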

Python API (modular usage)

You can call the building blocks directly from Python for custom pipelines.

from pdf_translate_unified import (
    extract_original_page_objects, ocr_fix_pdf, build_base,
    resolve_font, run_mode, build_overlay_items_from_doc
)

input_pdf = "samples/Test3.pdf"
output_pdf = "output_pdfs/demo_all.pdf"
translate_direction = "en->hi"

# 1) Style index from original (pre-OCR) for accurate color/size
orig_index = extract_original_page_objects(input_pdf)

# 2) OCR pass
src_fixed = ocr_fix_pdf(input_pdf, lang="hin+eng", dpi="1000", optimize="3")

# 3) Create source/output documents with background preserved
src, out = build_base(src_fixed)

# 4) Configure fonts
en_name, en_file = resolve_font("NotoSans", "assets/fonts/NotoSans-Regular.ttf")
hi_name, hi_file = resolve_font("NotoSansDevanagari", "assets/fonts/TiroDevanagariHindi-Regular.ttf")

# 5) Optional: auto-build overlay items
overlay_items = build_overlay_items_from_doc(src, translate_direction)

# 6) Run any mode (or "all")
run_mode(
    mode="all",
    src=src, out=out,
    orig_index=orig_index,
    translate_dir=translate_direction,
    erase_mode="redact",
    redact_color=(1,1,1),
    font_en_name=en_name, font_en_file=en_file,
    font_hi_name=hi_name, font_hi_file=hi_file,
    output_pdf=output_pdf,
    overlay_items=overlay_items,
    overlay_render="image",
    overlay_target_dpi=600
)

Docker

Build:

docker build -t pdf-translate .

Run (mount your PDFs):

docker run --rm -v "$PWD:/work" pdf-translate \
  python pdf_translate_unified.py -i /work/samples/Test3.pdf \
  -o /work/output_pdfs/result.pdf --mode all --translate en->hi

Samples & outputs

See samples/ for input PDFs and _translated.pdf examples. Recent runs create files under output_pdfs/, including individual mode outputs and a zipped bundle like:

result_YYYYMMDD-HHMMSS.all.block.pdf
result_YYYYMMDD-HHMMSS.all.hybrid.pdf
result_YYYYMMDD-HHMMSS.all.line.pdf
result_YYYYMMDD-HHMMSS.all.overlay.pdf
result_YYYYMMDD-HHMMSS.all.span.pdf
result_YYYYMMDD-HHMMSS_all_methods.zip

Limitations & notes

googletrans relies on unofficial endpoints; for production, consider swapping in an official translation API (e.g., Google Cloud Translate, Azure, DeepL).
OCR quality determines downstream accuracy; garbage in → garbage out.
Complex vector art or text on curves isn’t reflowed; overlay is rectangular.
True layout editing (re-wrapping across pages) is out of scope by design.

Contributing

Issues and PRs are welcome! Ideas for contributions:

New language/font packs & font auto-selection rules
Pluggable translator backends
Better table detection & alignment heuristics
Streamlit UX in app.py for drag-and-drop PDFs

Please run ruff/black (if configured) and include before/after sample PDFs for visual changes.

Acknowledgements

PyMuPDF (fitz) for robust PDF parsing/rendering
ocrmypdf + Tesseract for OCR
Pillow (PIL) for high-DPI text rendering in image overlays
Google Translate (via googletrans) for quick translation prototyping

Author

Mohit Gupta

MohitGupta0123/HIN_EN_PDF_Translator
