Chinese words leftover detection. #437

AkamashiDesu · 2025-04-23T04:26:07Z

Anyone who uses an AI such as Gemini to translate an EPUB from Chinese to English has probably run into Chinese words left over in the output, like this:

"I'm still a little confused. I suddenly became a駙馬(imperial consort)?"
or "Tong Xinya had already set off last night and hired dozens of guards from镖局to scout the road ahead...", etc.

Re-checking every block of text for this error is tedious, so I (asked ChatGPT to) create a Python script that detects these leftovers.

ChatGPT link: https://chatgpt.com/canvas/shared/68086c9e4fe48191a9be21cc155ac1d9

```python
import zipfile
import re
import sys
import csv

# Regular expression to match CJK Unified Ideographs (common Chinese characters)
CHINESE_CHAR_RE = re.compile(r"[\u4e00-\u9fff]")

# Patterns to detect <p> tags. The opening pattern accepts "<p>", "<p attr...>"
# and a bare "<p" at the end of a line, but not other tags such as <pre>.
P_START_RE = re.compile(r"<p(?=[\s>]|$)", re.IGNORECASE)
P_END_RE = re.compile(r"</p>", re.IGNORECASE)


def find_chinese_in_epub(epub_path):
    """
    Scan an EPUB file for Chinese characters inside <p> tags and return a list of results.

    Each result is a tuple: (file_path, start_line, end_line, paragraph_text)
    """
    results = []
    try:
        zf = zipfile.ZipFile(epub_path, 'r')
    except (OSError, zipfile.BadZipFile) as e:
        sys.exit(f"Error opening EPUB: {e}")

    with zf:
        for name in zf.namelist():
            # Only scan the (X)HTML content documents inside the archive
            if not name.lower().endswith(('.xhtml', '.html', '.htm')):
                continue

            raw = zf.read(name)
            try:
                text = raw.decode('utf-8')
            except UnicodeDecodeError:
                # latin-1 maps every byte value, so this fallback cannot fail
                text = raw.decode('latin-1')

            inside_p = False
            buffer = []
            start_line = 0

            for lineno, line in enumerate(text.splitlines(), start=1):
                if not inside_p:
                    if not P_START_RE.search(line):
                        continue
                    # A new paragraph starts on this line
                    inside_p = True
                    buffer = []
                    start_line = lineno
                buffer.append(line)

                # The paragraph ends on the line that contains </p>
                # (this also covers single-line <p>...</p> blocks)
                if P_END_RE.search(line):
                    inside_p = False
                    paragraph = '\n'.join(buffer)
                    if CHINESE_CHAR_RE.search(paragraph):
                        # Strip tags, including ones that span several lines
                        text_only = re.sub(r'<.*?>', '', paragraph, flags=re.DOTALL).strip()
                        results.append((name, start_line, lineno, text_only))

    return results


def write_results_to_csv(results, csv_path):
    """
    Write the scan results to a CSV file for easier viewing.

    Columns: file, start_line, end_line, paragraph
    """
    with open(csv_path, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['file', 'start_line', 'end_line', 'paragraph'])
        for file_path, start, end, para in results:
            writer.writerow([file_path, start, end, para])
    print(f"Results written to '{csv_path}' ({len(results)} entries).")


if __name__ == '__main__':
    if len(sys.argv) < 2:
        print(f"Usage: {sys.argv[0]} <path_to_epub> [output_csv] (default: chinese_leftovers.csv)")
        sys.exit(1)

    epub_file = sys.argv[1]
    output_csv = sys.argv[2] if len(sys.argv) >= 3 else 'chinese_leftovers.csv'
    results = find_chinese_in_epub(epub_file)

    if results:
        write_results_to_csv(results, output_csv)
    else:
        print("No Chinese leftovers found in any <p> tags.")
```
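
The script reports the whole paragraph, so you still have to spot the leftover word yourself. As a rough sketch (not part of the original script), a small helper could list the exact runs of Chinese characters in a paragraph. The name `find_cjk_runs` and the wider character range are my own assumptions: besides the basic block it also covers CJK Extension A and the CJK Compatibility Ideographs, which the original pattern does not match.

```python
import re

# Assumption: in addition to the basic block (U+4E00-U+9FFF) used by the script,
# also match CJK Extension A (U+3400-U+4DBF) and the
# CJK Compatibility Ideographs (U+F900-U+FAFF).
CJK_RUN_RE = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]+")


def find_cjk_runs(text):
    """Return every consecutive run of Chinese characters found in text."""
    return CJK_RUN_RE.findall(text)


sample = "I suddenly became a駙馬(imperial consort)?"
print(find_cjk_runs(sample))  # prints ['駙馬']
```

The returned runs could be written to an extra CSV column, which makes it easier to search for the same untranslated term across the book.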

Save the script as detect_chinese_epub.py, then open a terminal and run it (change the paths to where your script and EPUB are):

python detect_chinese_epub.py D:/Translate/detect_leftover/greenmanor.epub

Or specify a custom output filename:

python detect_chinese_epub.py D:/Translate/detect_leftover/greenmanor.epub my_results.csv
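
For a large book it may help to see which chapters have the most leftovers before opening the CSV. The following is only a sketch under the assumption that the CSV has the `file` column written by the script above; the function name `count_leftovers_per_file` is hypothetical.

```python
import csv
from collections import Counter


def count_leftovers_per_file(csv_path):
    """Count flagged paragraphs per EPUB-internal file in the results CSV."""
    counts = Counter()
    with open(csv_path, newline='', encoding='utf-8') as csvfile:
        # DictReader uses the header row (file, start_line, end_line, paragraph)
        for row in csv.DictReader(csvfile):
            counts[row['file']] += 1
    return counts


# Example usage (assumes the default output name used by the script):
# for file_path, n in count_leftovers_per_file('chinese_leftovers.csv').most_common():
#     print(f"{n:4d}  {file_path}")
```

`most_common()` sorts the files by descending count, so the chapters that need the most attention come first.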
