These scripts are primarily for my personal use, but I'm sharing them here as a backup and for anyone who might find them useful. Apologies for any messy code!
- `token_chunk.py` – Splits text into smaller chunks for AI translation, adding line numbers for easier reference.
- `chunk_copier.py` – Utility to process and copy XML-style text chunks to the clipboard with a system prompt.
- `check_translate.py` – Verifies translation results.
- `join_translations.py` – Merges multiple translations into a bilingual or trilingual file.
- `translate_dir_gemini.py` – Translates all files in a directory using the Gemini API.
- etc.
AI/LLM-generated translations can be inaccurate or misleading. They should be used as reference only, not as definitive translations.
However, they are useful for keyword searching in full-text searches. By identifying where a topic appears in the text, you can refer to the original pāḷi/text for precise understanding, saving time in locating key passages.
To preserve formatting (bold, italics, etc.), use Markdown as input for AI translation.
There are many ways to convert your source files into Markdown format. One somewhat involved way is:
- Create a document in Google Docs.
- Copy and paste the formatted text (e.g., from Kaṅkhāvitaraṇī-aṭṭhakathā) into Google Docs:
  - Windows/Linux: `Ctrl + C` → `Ctrl + V`
  - Mac: `Cmd + C` → `Cmd + V`
- In Google Docs, go to File > Download > Markdown (.md).
- Rename the `.md` file extension to `.txt`, then open it in VS Code or another text editor.
- Clean up the text:
  - Replace `¶` with a space.
  - Normalize spacing.
  - Use regex to format elements like headings (`#`, `##`, etc.).
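The cleanup steps above can be sketched in Python (a minimal sketch; the all-caps heading rule at the end is a hypothetical example — adapt the patterns to your own text):

```python
import re

def clean_markdown(text: str) -> str:
    # Replace pilcrow marks (¶) with a space.
    text = text.replace("\u00b6", " ")
    # Normalize spacing: collapse runs of spaces/tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Strip trailing spaces at line ends.
    text = re.sub(r" +\n", "\n", text)
    # Hypothetical regex rule: turn all-caps lines into level-2 headings.
    text = re.sub(r"(?m)^([A-ZĀĪŪṄÑṬḌṆḶ ]{4,})$", r"## \1", text)
    return text
```

Running `clean_markdown("Namo  tassa\u00b6bhagavato  \n")` yields `"Namo tassa bhagavato\n"`.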
Since LLMs have input limits, large texts must be split into smaller chunks.
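The splitting idea can be sketched as follows (a simplified sketch that approximates token counts by whitespace words; the real `token_chunk.py` uses `tiktoken` for exact counts, and the `ID` line format here is only illustrative):

```python
def chunk_lines(lines, max_tokens=6000):
    """Group numbered lines into chunks that stay under a token budget."""
    chunks, current, current_tokens = [], [], 0
    for i, line in enumerate(lines, start=1):
        tokens = len(line.split())  # crude stand-in for a real tokenizer
        if current and current_tokens + tokens > max_tokens:
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(f"ID{i}={line}")  # prefix each line with a stable ID
        current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks
```

Each line keeps a stable ID across chunks, which is what later makes it possible to detect lines the LLM skipped.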
- Create a virtual environment and install dependencies:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install tiktoken pyperclip google-genai bs4 lxml prompt_toolkit ratelimit pandoc pypandoc unidecode
```
Adjust `--max-tokens` based on the LLM's input limit.

```bash
# Using the default --max-tokens 6000
python3 token_chunk.py -f your_text_file.txt

# Using a custom --max-tokens value
python3 token_chunk.py -f your_text_file.txt --max-tokens 5000

# Process all .txt files in a directory
python3 token_chunk.py -d your_text_file_directory
```
This generates chunked files (do not rename them; they are needed for later steps):

- `your_text_file_{number}_chunks.xml` – Chunked text with line IDs
- `your_text_file_{number}_chunks_translated_1.xml` – AI 1 translation
- `your_text_file_{number}_chunks_translated_2.xml` – AI 2 translation
- `your_text_file_{number}_chunks_translated_3.xml` – AI 3 translation
- ...
By checking line IDs, you can verify if the AI skipped any lines.
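Such a check can be sketched as follows (a minimal sketch; `check_translate.py`'s actual logic may differ, and the `ID` pattern is assumed from the example stanzas shown later):

```python
import re

def missing_ids(source_text: str, translated_text: str):
    """Return source line IDs that do not appear in the translation."""
    src_ids = set(re.findall(r"ID(\d+)=", source_text))
    trg_ids = set(re.findall(r"ID(\d+)=", translated_text))
    return sorted(src_ids - trg_ids, key=int)
```

For example, if the source has `ID958`–`ID960` but the translation only contains `ID958` and `ID960`, the function returns `['959']`.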
If you are using LLMs via their Web UI, this script will save you a lot of time:
```bash
python3 chunk_copier.py
```
Follow the prompts to:
- Enter the system prompt file path.
- Specify the chunked file path.
- Define the number of chunks to copy at once.
- Optionally, provide a website URL to open after copying.
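The batching step can be sketched as follows (a simplified sketch that only assembles the text to copy; the real `chunk_copier.py` also sends it to the clipboard via `pyperclip` and can open a website, which is omitted here):

```python
def build_clipboard_payloads(system_prompt: str, chunks, batch_size=2):
    """Yield prompt+chunks payloads, batch_size chunks per copy."""
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        # Each payload is the system prompt followed by a batch of chunks.
        yield system_prompt + "\n\n" + "\n\n".join(batch)
```

With three chunks and `batch_size=2`, this yields two payloads: the first with chunks 1–2, the second with chunk 3.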
After translation, verify for missing lines:
```bash
python3 check_translate.py
```
If any lines are missing, manually translate them and run the check again.
⚠ Note: LLMs often merge stanzas or meaning-related lines together, which can result in missing IDs. Manual correction is required in such cases; `check_translate.py` can help list the missing IDs.
Example stanzas:

```
ID958=‘‘Āpattidassanussāho, na kattabbo kudācanaṃ;
ID959=Passissāmi anāpatti-miti kayirātha mānasaṃ.
ID960=‘‘Passitvāpi ca āpattiṃ, avatvāva punappunaṃ;
ID961=Vīmaṃsitvātha viññūhi, saṃsanditvā ca taṃ vade.
```
To merge translations into bilingual/trilingual files, run:
```bash
python3 join_translations.py
```
Using multiple LLM outputs allows better comparison and verification.
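Merging by line ID can be sketched as follows (a simplified sketch; `join_translations.py`'s actual file handling and XML parsing are not shown, and `[missing]` is an illustrative placeholder):

```python
import re

def join_bilingual(source: str, *translations: str) -> str:
    """Interleave source lines with one or more translations, matched by ID."""
    def to_map(text):
        return dict(re.findall(r"ID(\d+)=(.*)", text))
    src = to_map(source)
    trs = [to_map(t) for t in translations]
    out = []
    for line_id in sorted(src, key=int):
        out.append(src[line_id])
        for tr in trs:
            # Flag lines the LLM skipped instead of silently dropping them.
            out.append(tr.get(line_id, "[missing]"))
    return "\n".join(out)
```

Passing two translation files instead of one yields a trilingual file, since each source line is followed by one line per translation.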
Translate chunks sequentially using the system prompts:
- Change Vinayasaṅgaha-aṭṭhakathā to your specific text.
- Modify source and target languages as needed.
- Tip: Translating Pāli → English → Other Language is often more accurate than translating Pāli → Other Language directly.
- Model: `gemini-2.0-pro-exp-02-05` (2M token limit)
- Temperature: `1.0` (or `1.3` for creativity)
- Output length: `8192`
- Top P: `0.8`
- Safety settings: All set to "Block none"
⚠ Gemini may still block translations due to safety filters, even when disabled.
🔗 Grok-3
- Better prompt adherence
- (So far) never blocks translations due to safety reasons
For the latest top-performing models, check: 🔗 LM Arena Leaderboard