📖 Turjuman: Your Smart Book Translation System - Locally and privately hosted 🌍

Welcome to Turjuman (ترجمان - Interpreter/Translator in Arabic)! 👋

Ever felt daunted by translating a massive book (like 500 pages and over 150,000 words)? Turjuman is here to help! It uses LLMs to magically translate large documents while smartly keeping the original meaning and style intact. Turjuman currently supports Markdown (.md) and plain-text (.txt) files; other formats such as PDF, DOCX, EPUB, HTML, and subtitles are coming soon.

✨ How Turjuman Works

Turjuman uses a smart pipeline powered by LangGraph 🦜🔗 with two translation modes:

🔄 Translation Modes

  • 🧠 Deep Translation Mode (Default): The comprehensive workflow with terminology unification, critique, and revision steps for higher quality and consistency. Best for professional or publication-ready translations.

  • ⚡ Quick Translation Mode: A streamlined workflow that bypasses terminology unification, critique, and revision steps for faster processing and lower token usage. Ideal for drafts or when speed is more important than perfect quality.

✂️ Smart Chunking Options

Turjuman offers four intelligent chunking strategies to optimize your translation process:

🧠 Smart Mode (Default)

This mode is great for Markdown or technical documents. It intelligently identifies and preserves special elements such as code blocks, images, URLs, and footnotes, splitting the text into optimal chunks while keeping related content together and ensuring non-translatable elements remain intact.

  • Perfect for technical documents, programming tutorials, or content with mixed elements
  • Preserves formatting and structure while optimizing for translation quality
  • Automatically handles bullet points, inline code, and other complex formatting
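The protect-then-chunk idea behind Smart Mode can be sketched as a mask-and-restore pass. This is illustrative only: the placeholder scheme, regex, and function names below are assumptions, not Turjuman's actual implementation.

```python
import re

# Hypothetical sketch: mask non-translatable elements (inline code, URLs)
# with placeholders before chunking/translation, then restore them after.
PROTECTED = re.compile(r"`[^`]+`|https?://\S+")

def mask_protected(text: str):
    """Replace protected spans with placeholders; return masked text and a mapping."""
    mapping = {}
    def repl(match):
        key = f"§{len(mapping)}§"
        mapping[key] = match.group(0)
        return key
    return PROTECTED.sub(repl, text), mapping

def unmask(text: str, mapping: dict) -> str:
    """Restore the original protected spans after translation."""
    for key, original in mapping.items():
        text = text.replace(key, original)
    return text
```

Because the translator only ever sees the placeholders, code and URLs survive the round trip byte-for-byte.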

📏 Line Mode

Splits text by line breaks, making each line a separate chunk. All chunks are considered translatable.

  • Ideal for poetry, lyrics, or content where line breaks have semantic meaning
  • Preserves the exact line structure of the original document
  • Simple and predictable chunking pattern

🔣 Symbol Mode

Divides text based on specific separator symbols (like periods, commas, or custom separators).

  • Great for content with specific delimiter patterns
  • Allows customization of separator symbols
  • Useful for specialized formats with unique separation needs
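Symbol Mode's behaviour can be approximated as follows (a sketch under assumptions; the real chunker's handling of separators and whitespace may differ):

```python
def split_on_symbols(text: str, separators: str = ".!?") -> list[str]:
    """Split text into chunks at any of the given separator symbols,
    keeping each separator attached to the preceding chunk."""
    chunks, current = [], ""
    for ch in text:
        current += ch
        if ch in separators:
            chunks.append(current.strip())
            current = ""
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

Passing a custom `separators` string is how the "customization of separator symbols" bullet above would map onto code.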

🎬 Subtitle Mode

Specially designed for .srt subtitle files, separating timing information (non-translatable) from content (translatable).

  • Perfect for subtitle translation projects
  • Preserves exact subtitle timing and formatting
  • Handles subtitle-specific formatting and structure
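The timing/content split that Subtitle Mode performs can be sketched like this (illustrative; the real parser and its data structures are not shown in this README):

```python
def parse_srt(srt_text: str) -> list[dict]:
    """Split an .srt file into blocks, keeping the index and timing line
    as non-translatable metadata and the caption lines as translatable text."""
    blocks = []
    for raw in srt_text.strip().split("\n\n"):
        lines = raw.strip().split("\n")
        if len(lines) >= 3:
            index, timing = lines[0], lines[1]
            blocks.append({"meta": (index, timing), "text": "\n".join(lines[2:])})
    return blocks
```

Only the `"text"` field would be sent to the LLM; the `"meta"` tuple is copied through unchanged when the file is reassembled.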

📋 Translation Pipeline

  1. 🚀 init_translation: Start the translation job
  2. 🧐 terminology_unification: Find and unify key terms. Users can optionally provide a manual glossary or dictionary of preferred word pairs (Deep Mode only)
  3. ✂️ chunk_document: Split the book into chunks using one of the available chunking strategies
  4. 🌐 initial_translation: Translate chunks in parallel
  5. 🤔 critique_stage: Review translations, catch errors (Deep Mode only)
  6. ✨ final_translation: Refine translations (Deep Mode only)
  7. 📜 assemble_document: Stitch everything back together
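Conceptually, the Deep Mode stages above compose like this. This is a plain-Python sketch: the real pipeline is wired together with LangGraph, and the function names and signatures here are illustrative stand-ins, not the project's API.

```python
# Illustrative only: the deep-mode pipeline expressed as plain functions.
# Stage comments mirror the numbered list above.
def run_deep_pipeline(document, translate, critique, refine, chunker):
    chunks = chunker(document)                       # 3. chunk_document
    drafts = [translate(c) for c in chunks]          # 4. initial_translation (parallel in practice)
    reviews = [critique(d) for d in drafts]          # 5. critique_stage
    finals = [refine(d, r) for d, r in zip(drafts, reviews)]  # 6. final_translation
    return "\n".join(finals)                         # 7. assemble_document
```

Quick Mode is the same composition with the critique and refine stages skipped, which is why it is cheaper in tokens.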

📊 Translation Flow

flowchart TD
    A([🚀 init_translation<br><sub>Initialize translation state and configs</sub>]) --> Mode{Translation Mode?}
    
    %% Mode decision
    Mode -->|Quick Mode| C([✂️ chunk_document<br><sub>Split the book into manageable chunks</sub>])
    Mode -->|Deep Mode| AA{User Glossary?}
    
    %% Glossary path decision (Deep Mode only)
    AA -->|Yes| AB([📘 User Glossary<br><sub>Use provided glossary terms</sub>])
    AA -->|No| AC([🔍 Auto Extract<br><sub>Extract key terms from document</sub>])
    
    %% Both glossary paths lead to terminology unification
    AB --> B([🧐 terminology_unification<br><sub>Unify glossary, prepare context</sub>])
    AC --> B
    
    B --> C

    %% Chunking produces multiple chunks
    C --> D1([📦 Chunk 1])
    C --> D2([📦 Chunk 2])
    C --> D3([📦 Chunk N])

    %% Parallel translation workers
    D1 --> E1([🌐 initial_translation<br><sub>Translate chunk 1 in parallel</sub>])
    D2 --> E2([🌐 initial_translation<br><sub>Translate chunk 2 in parallel</sub>])
    D3 --> E3([🌐 initial_translation<br><sub>Translate chunk N in parallel</sub>])

    %% Mode-based path after translation
    E1 --> ModeAfter{Translation Mode?}
    E2 --> ModeAfter
    E3 --> ModeAfter
    
    %% Quick Mode path
    ModeAfter -->|Quick Mode| I([📜 assemble_document<br><sub>Merge all chunks into final output</sub>])
    
    %% Deep Mode path
    ModeAfter -->|Deep Mode| F([🤔 critique_stage<br><sub>Review translations, check quality and consistency</sub>])

    %% Decision after critique
    F --> |No critical errors| G([✨ final_translation<br><sub>Refine translations based on feedback</sub>])
    F --> |Critical error| H([🛑 End<br><sub>Stop translation due to errors</sub>])

    G --> I
    I --> J([🏁 Done<br><sub>Translation complete!</sub>])

    H --> J
    


🛠️ Setup & Installation using conda or venv (for development)

  1. Prerequisites
  • Conda: Install Miniconda or Anaconda
  • API Keys: Get your API keys for OpenAI, Anthropic, etc.
  • Ollama: You can use Turjuman locally, without paying for LLM APIs, by installing Ollama or any local inference server such as LM Studio, vLLM, or llama.cpp; take a look at sample.env for details
  2. Clone the Repository
git clone <your-repo-url>
cd turjuman-book-translator
  3. Create a Conda Environment (or use a Python venv)
conda create -n turjuman_env python=3.12 -y
conda activate turjuman_env
  4. Install Dependencies
# Install all required libraries
pip install -r requirements.txt
  5. Configure Environment Variables
cp sample.env.file .env
# Edit .env and add your API keys

Recommended LLM Models

  • Online: Gemini Flash/Pro
  • Local: Gemma3 / Aya / Mistral

  6. Run the Backend Server
uvicorn src.server:app --host 0.0.0.0 --port 8051 --reload
  7. Run the Web UI

The application will now be accessible at http://localhost:8051.


🚀 Using Turjuman via integrated web UI

Visit http://localhost:8051

  • Go to "Configuration" tab and create a new default LLM configurations (LLM provider / model / translation mode, etc.)
  • Save the configuration profile (optional: you can create multiple profiles and select one as the default)
  • Select "New Translation" then upload a file to translate or paste text
  • Modify the source and target language
  • Modify the "Accent and style" if needed (this option can make translation more funny, spicy or professional by default)
  • Start translation. After a few seconds, both logs and text chunks will update dynamically
  • After translation progress reaches 100%, you can view or download the translated file or text
  • You can change the theme from the top drop menu (7 themes available)
  • You can switch the view between chunk and full-document modes to review the translated content chunk by chunk

🔄 Job Queue & History

Turjuman includes a robust job management system:

  • Track all translation jobs with detailed status information (completed, processing, pending, failed)
  • View comprehensive job details including languages, duration, and timestamps
  • Download completed translations directly from the history view
  • Access job-specific glossaries generated during translation
  • View detailed logs and progress information for each job

📚 Glossary Management

Create and manage custom glossaries to ensure consistent terminology:

  • Build custom glossary tables with source and target term pairs
  • Upload glossary files in JSON format
  • Add individual terms through the user interface
  • Set default glossaries for automatic use in translations
  • Download, edit, and delete glossaries as needed
  • Option for automatic terminology extraction during translation
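The exact schema for uploaded glossary files is not documented here; a minimal file of source→target word pairs might look like the following (illustrative only — download a glossary from the UI to see the real format):

```json
[
  { "source": "machine learning", "target": "تعلم الآلة" },
  { "source": "neural network", "target": "شبكة عصبية" }
]
```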

⚙️ Configuration Management

Manage LLM settings and environment variables directly from the UI:

  • Configure multiple LLM providers and models
  • Select translation mode (Deep or Quick) for each configuration
  • Create and save different configuration profiles
  • Set default configurations for quick access
  • Securely manage environment variables (API keys, etc.)
  • Filter available models by keyword
  • Duplicate existing configurations for easy modification

BASH Script Client

A convenient command-line client script (translate_over_api_terminal.sh) is provided for interacting with the backend API.

Prerequisites: curl, jq

Getting Help:

The script includes detailed usage instructions. To view them, run:

./translate_over_api_terminal.sh --help

or

./translate_over_api_terminal.sh -h

Basic Usage:

The only required argument is the input file (-i or --input). Other options allow you to specify languages, provider, model, API URL, and output file path.

# Translate a file using default settings (English->Arabic, OpenAI provider, default model)
# Ensure OPENAI_API_KEY is set in .env if using openai
./translate_over_api_terminal.sh -i path/to/your/document.md

# Specify languages, provider, model, and save response to a specific file
./translate_over_api_terminal.sh \
  --input my_book.md \
  --output results/my_book_translated.json \
  --source english \
  --target french \
  --provider ollama \
  --model llama3

# Use a different API endpoint
./translate_over_api_terminal.sh -i chapter1.md -u http://192.168.1.100:8051

# List available models fetched from the backend API
./translate_over_api_terminal.sh --list-models

The script submits the job via the API. Since the API call is synchronous, the script waits for completion and saves the full JSON response (containing the final state and the translated document in output.final_document) to a file (default: <input_name>_<job_id>.json or the path specified with --output). It also provides the curl command to retrieve the final state again using the job ID.
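As a small illustration of the default naming convention described above, the response filename can be derived like this (a helper sketch, not part of the project):

```python
from pathlib import Path

def default_output_path(input_path: str, job_id: str) -> str:
    """Build the default response filename the script uses:
    <input_name>_<job_id>.json (per the behaviour described above)."""
    stem = Path(input_path).stem
    return f"{stem}_{job_id}.json"
```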



🗺️ Future Plans

  • Support for PDF, DOCX, and other formats
  • Further enhancements to glossary and terminology management
  • Interactive editing and feedback loop
  • Advanced customization options for translation styles
  • Additional translation modes with different quality/speed tradeoffs
  • Batch processing capabilities for multiple documents

🤝 Contributing

Pull requests welcome! For major changes, open an issue first.


📄 License

MIT


Enjoy translating your books with Turjuman! 🚀📚🌍
