These scripts are primarily for my personal use, but I'm sharing them here as a backup and for anyone who might find them useful. Apologies for any messy code!
- `token_chunk.py` – Splits text into smaller chunks for AI translation, adding line numbers for easier reference.
- `chunk_copier.py` – Utility to process and copy XML-style text chunks to the clipboard with a system prompt.
- `check_translate.py` – Verifies translation results.
- `join_translations.py` – Merges multiple translations into a bilingual or trilingual file.
- `translate_dir_gemini.py` – Translates all files in a directory using the Gemini API.
- etc.
AI/LLM-generated translations can be inaccurate or misleading. They should be used as reference only, not as definitive translations.
However, they are useful for keyword searching in full-text searches. By identifying where a topic appears in the text, you can refer to the original pāḷi/text for precise understanding, saving time in locating key passages.
To preserve formatting (bold, italics, etc.), use Markdown as input for AI translation.
There are many ways to convert your source files into Markdown format. One somewhat involved way is:
- Create a document in Google Docs.
- Copy and paste the formatted text (e.g., from Kaṅkhāvitaraṇī-aṭṭhakathā) into Google Docs:
  - Windows/Linux: `Ctrl + C` → `Ctrl + V`
  - Mac: `Cmd + C` → `Cmd + V`
- In Google Docs, go to File > Download > Markdown (.md).
- Rename the `.md` file extension to `.txt`, then open it in VS Code or another text editor.
- Clean up the text:
  - Replace `¶` with a space.
  - Normalize spacing.
  - Use regex to format elements like headings (`#`, `##`, etc.).
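The cleanup steps above can be sketched in Python (a minimal sketch; the all-caps heading rule at the end is a hypothetical example — adapt the patterns to your own text):

```python
import re

def clean_markdown(text: str) -> str:
    # Replace pilcrow marks (¶) with a space.
    text = text.replace("\u00b6", " ")
    # Normalize spacing: collapse runs of spaces/tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Strip trailing spaces at line ends.
    text = re.sub(r" +\n", "\n", text)
    # Hypothetical regex rule: turn all-caps lines into level-2 headings.
    text = re.sub(r"(?m)^([A-ZĀĪŪṄÑṬḌṆḶ ]{4,})$", r"## \1", text)
    return text
```

Running `clean_markdown("Namo  tassa\u00b6bhagavato  \n")` yields `"Namo tassa bhagavato\n"`.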
Since LLMs have input limits, large texts must be split into smaller chunks.
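The splitting idea can be sketched as follows (a simplified sketch that approximates token counts by whitespace words; the real `token_chunk.py` uses `tiktoken` for exact counts, and the `ID` line format here is only illustrative):

```python
def chunk_lines(lines, max_tokens=6000):
    """Group numbered lines into chunks that stay under a token budget."""
    chunks, current, current_tokens = [], [], 0
    for i, line in enumerate(lines, start=1):
        tokens = len(line.split())  # crude stand-in for a real tokenizer
        if current and current_tokens + tokens > max_tokens:
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(f"ID{i}={line}")  # prefix each line with a stable ID
        current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks
```

Each line keeps a stable ID across chunks, which is what later makes it possible to detect lines the LLM skipped.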
- Create a virtual environment and install dependencies:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install tiktoken pyperclip google-genai bs4 lxml prompt_toolkit ratelimit pandoc pypandoc unidecode
```
Adjust `--max-tokens` based on the LLM's input limit.

```bash
# Using the default --max-tokens 6000
python3 token_chunk.py -f your_text_file.txt

# Using a custom --max-tokens value
python3 token_chunk.py -f your_text_file.txt --max-tokens 5000

# Process all .txt files in a directory
python3 token_chunk.py -d your_text_file_directory
```
This generates chunked files (do not rename them; they are needed for later steps):

- `your_text_file_{number}_chunks.xml` – Chunked text with line IDs
- `your_text_file_{number}_chunks_translated_1.xml` – AI 1 translation
- `your_text_file_{number}_chunks_translated_2.xml` – AI 2 translation
- `your_text_file_{number}_chunks_translated_3.xml` – AI 3 translation
- ...
By checking line IDs, you can verify if the AI skipped any lines.
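Such a check can be sketched as follows (a minimal sketch; `check_translate.py`'s actual logic may differ, and the `ID` pattern is assumed from the example stanzas shown later):

```python
import re

def missing_ids(source_text: str, translated_text: str):
    """Return source line IDs that do not appear in the translation."""
    src_ids = set(re.findall(r"ID(\d+)=", source_text))
    trg_ids = set(re.findall(r"ID(\d+)=", translated_text))
    return sorted(src_ids - trg_ids, key=int)
```

For example, if the source has `ID958`–`ID960` but the translation only contains `ID958` and `ID960`, the function returns `['959']`.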
If you are using LLMs via their Web UI, this script will save you a lot of time:
```bash
python3 chunk_copier.py
```
Follow the prompts to:
- Enter the system prompt file path.
- Specify the chunked file path.
- Define the number of chunks to copy at once.
- Optionally, provide a website URL to open after copying.
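The batching step can be sketched as follows (a simplified sketch that only assembles the text to copy; the real `chunk_copier.py` also sends it to the clipboard via `pyperclip` and can open a website, which is omitted here):

```python
def build_clipboard_payloads(system_prompt: str, chunks, batch_size=2):
    """Yield prompt+chunks payloads, batch_size chunks per copy."""
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        # Each payload is the system prompt followed by a batch of chunks.
        yield system_prompt + "\n\n" + "\n\n".join(batch)
```

With three chunks and `batch_size=2`, this yields two payloads: the first with chunks 1–2, the second with chunk 3.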
After translation, verify for missing lines:
```bash
python3 check_translate.py
```
If any lines are missing, manually translate them and run the check again.
⚠ Note: LLMs often merge stanzas or meaning-related lines together, which can result in missing IDs. Manual correction is required in such cases; `check_translate.py` can help list the missing IDs.
Example stanzas:

```
ID958=‘‘Āpattidassanussāho, na kattabbo kudācanaṃ;
ID959=Passissāmi anāpatti-miti kayirātha mānasaṃ.
ID960=‘‘Passitvāpi ca āpattiṃ, avatvāva punappunaṃ;
ID961=Vīmaṃsitvātha viññūhi, saṃsanditvā ca taṃ vade.
```
To merge translations into bilingual/trilingual files, run:
```bash
python3 join_translations.py
```
Using multiple LLM outputs allows better comparison and verification.
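Merging by line ID can be sketched as follows (a simplified sketch; `join_translations.py`'s actual file handling and XML parsing are not shown, and `[missing]` is an illustrative placeholder):

```python
import re

def join_bilingual(source: str, *translations: str) -> str:
    """Interleave source lines with one or more translations, matched by ID."""
    def to_map(text):
        return dict(re.findall(r"ID(\d+)=(.*)", text))
    src = to_map(source)
    trs = [to_map(t) for t in translations]
    out = []
    for line_id in sorted(src, key=int):
        out.append(src[line_id])
        for tr in trs:
            # Flag lines the LLM skipped instead of silently dropping them.
            out.append(tr.get(line_id, "[missing]"))
    return "\n".join(out)
```

Passing two translation files instead of one yields a trilingual file, since each source line is followed by one line per translation.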
Translate chunks sequentially using the system prompts:
- Change Vinayasaṅgaha-aṭṭhakathā to your specific text.
- Modify source and target languages as needed.
- Tip: Translating Pāli → English → Other Language is often more accurate than translating Pāli → Other Language directly.
- Model: `gemini-2.0-pro-exp-02-05` (2M token limit)
- Temperature: `1.0` (or `1.3` for creativity)
- Output length: `8192`
- Top P: `0.8`
- Safety settings: All set to "Block none"
⚠ Gemini may still block translations due to safety filters, even when disabled.
🔗 Grok-3
- Better prompt adherence
- (So far) never blocks translations due to safety reasons
For the latest top-performing models, check: 🔗 LM Arena Leaderboard