Ready to Translate, Not to Represent? Bias and Performance Gaps in Multilingual LLMs Across Language Families and Domains

Abstract: The rise of Large Language Models (LLMs) has redefined Machine Translation (MT), enabling context-aware and fluent translations across hundreds of languages and textual domains. Despite these capabilities, LLMs often perform unevenly across language families and specialized domains. Moreover, recent evidence shows that these models can encode and amplify biases present in their training data, which raises serious fairness concerns, especially for low-resource languages. To address these gaps, we introduce Translation Tangles, a unified framework and dataset for evaluating the translation quality and fairness of open-source LLMs. Our approach benchmarks 24 translation directions (12 language pairs, each evaluated in both directions) across multiple domains using lexical-overlap, edit-distance, and embedding-based metrics. We further propose a hybrid bias detection pipeline that integrates rule-based heuristics, semantic similarity filtering, and LLM-based validation. We also introduce a high-quality, bias-annotated dataset based on human evaluations of 1,439 translation-reference pairs.

Fig: Our framework comprises two key components: (a) LLM Benchmarking, where LLM translations (T) are evaluated against references (R) across diverse language families and domains; and (b) Uncovering Bias Pattern with LLM-as-a-Judge Evaluation, where potential biases are flagged using linguistic heuristics and semantic analysis, and then verified by LLMs and human annotators. Here, S = Source, R = Reference, T = Translation.
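
The control flow of component (b) can be sketched as below. This is only an illustration under stated assumptions, not the repository's implementation: `run_pipeline`, `BiasResult`, and the three `toy_*` stand-ins are hypothetical names, the pronoun check is a deliberately crude placeholder for the linguistic heuristics, and the semantic filter and judge are stubs where an embedding-similarity check and an LLM call would go.

```python
from dataclasses import dataclass
from typing import Callable

# Each stage takes (source, reference, translation) and returns a decision.
Check = Callable[[str, str, str], bool]


@dataclass
class BiasResult:
    flagged: bool   # heuristics and semantic analysis both raised a flag
    verified: bool  # the judge confirmed the flag


def run_pipeline(source: str, reference: str, translation: str,
                 heuristic: Check, semantic_filter: Check, judge: Check) -> BiasResult:
    """Flag with heuristics plus semantic analysis, then verify with a judge."""
    flagged = (heuristic(source, reference, translation)
               and semantic_filter(source, reference, translation))
    # Assumption: the judge is only consulted for flagged candidates.
    verified = flagged and judge(source, reference, translation)
    return BiasResult(flagged, verified)


# --- Hypothetical stand-ins, for illustration only ---
def toy_heuristic(source: str, reference: str, translation: str) -> bool:
    pronouns = {"he", "she", "they"}
    ref = pronouns & set(reference.lower().split())
    hyp = pronouns & set(translation.lower().split())
    return ref != hyp  # pronoun shift between reference and translation


def toy_semantic_filter(source: str, reference: str, translation: str) -> bool:
    return True  # placeholder: a similarity check would go here


def toy_judge(source: str, reference: str, translation: str) -> bool:
    return True  # placeholder: an LLM-as-a-Judge call would go here


result = run_pipeline("O bir doktor.", "She is a doctor.", "He is a doctor.",
                      toy_heuristic, toy_semantic_filter, toy_judge)
print(result)
```

Passing the stages in as callables keeps the sketch independent of any particular heuristic set, embedding model, or judge model.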

🧠 Models

Model	Provider	Context Length
gemma2-9b	Google	8192 tokens
gemma-7b	Google	8192 tokens
llama-3-70b	Meta	8192 tokens
llama-3-8b	Meta	8192 tokens
llama-3.1-70b	Meta	8192 tokens
llama-3.1-8b	Meta	8192 tokens
llama-3.2-90b-vision	Meta	128000 tokens
mixtral-8x7b	Mistral	32768 tokens
OLMo-1B	AI2	8192 tokens
Phi-3.5-mini	Microsoft	8192 tokens
Phi-2	Microsoft	4096 tokens
Qwen-2.5-0.5B	Alibaba	8192 tokens
Qwen-2.5-1.5B	Alibaba	8192 tokens
Qwen-2.5-3B	Alibaba	8192 tokens

📏 Evaluation Metrics

Metric	Description	Direction
BLEU	N-gram overlap with reference	↑
chrF	Character-level F-score	↑
TER	Translation Edit Rate	↓
BERTScore	Semantic similarity using BERT embeddings	↑
WER	Word Error Rate	↓
CER	Character Error Rate	↓
ROUGE	Overlapping n-grams: ROUGE-1, ROUGE-2, ROUGE-L	↑

Legend: ↑ Higher is better, ↓ Lower is better
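
To make the two error-rate metrics concrete, here is a minimal pure-Python sketch of WER and CER as Levenshtein distance normalized by reference length. It assumes whitespace tokenization for WER and no text normalization; the function names are illustrative, and the repository may well rely on an existing metrics library instead.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deleted word out of six
print(cer("kitten", "sitting"))                             # three edits over six characters
```

Both scores are 0 for an exact match and can exceed 1 when the hypothesis is much longer than the reference, which is why lower is better for these rows.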

📚 Translation Performance Evaluation Datasets

We use a combination of general-purpose and domain-specific multilingual benchmark datasets to evaluate translation quality across diverse linguistic and contextual settings:

Dataset	Languages	Size	Domain	Fields	Splits
ELRCMedical	English + 21 EU languages	100K–1M	Medical	`doc_id`, `lang`, `source_text`, `target_text`	None (manual)
MultiEURLEX	23 EU languages	65K docs	Legal	`doc_id`, `text`, `labels`	Train (55K), Dev/Test (5K each)
Lit-Corpus	Kazakh, Russian, English	71K pairs	Literature	`source_text`, `target_text`, `X_lang`, `y_lang`	None
BanglaNMT	Bangla, English	2.38M pairs	General	`bn`, `en`	Train (2.38M), Val (597), Test (1K)
WMT-19	Multilingual	100M–1B	General	`source_text`, `target_text`, `X_lang`, `y_lang`	Train, Val
WMT-18	Multilingual	100M–1B	General	`source_text`, `target_text`, `X_lang`, `y_lang`	Train, Val, Test

🌐 Language Pairs

Code Pair	Language Names
cs-en / en-cs	Czech ↔ English
de-en / en-de	German ↔ English
fi-en / en-fi	Finnish ↔ English
fr-de / de-fr	French ↔ German
gu-en / en-gu	Gujarati ↔ English
kk-en / en-kk	Kazakh ↔ English
lt-en / en-lt	Lithuanian ↔ English
ru-en / en-ru	Russian ↔ English
zh-en / en-zh	Chinese ↔ English
et-en / en-et	Estonian ↔ English
tr-en / en-tr	Turkish ↔ English
bn-en / en-bn	Bangla ↔ English
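
The table lists 12 language pairs, each evaluated in both directions, which yields 24 translation directions. A short sketch of that expansion follows; the `PAIRS` list is copied from the table, while the helper name `directions` is only illustrative.

```python
# Language pairs from the table above, one code per pair.
PAIRS = ["cs-en", "de-en", "fi-en", "fr-de", "gu-en", "kk-en",
         "lt-en", "ru-en", "zh-en", "et-en", "tr-en", "bn-en"]


def directions(pairs):
    """Expand each 'src-tgt' pair into both (src, tgt) and (tgt, src)."""
    result = []
    for pair in pairs:
        src, tgt = pair.split("-")
        result.append((src, tgt))
        result.append((tgt, src))
    return result


DIRECTIONS = directions(PAIRS)
print(len(DIRECTIONS))  # 24
```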

🧪 Human Evaluation and Our Dataset Contribution

To strengthen the evaluation beyond automated metrics, we conducted structured human annotation of 1,439 translation-reference pairs. Each instance was annotated along three axes: (i) bias flags from our heuristic-semantic framework, (ii) bias assessments by an LLM-as-a-Judge module, and (iii) gold-standard decisions by independent human reviewers. Each record includes the source sentence, reference translation, model output, and categorical bias labels (gender, cultural, sociocultural, racial, religious), along with common translation issues such as grammatical inconsistencies, pronoun shifts, semantic distortions, and hallucinated biases.

These examples are stratified into: (i) 294 undetected bias cases where no system flagged bias, (ii) 294 disagreement cases where only the heuristic flagged bias, and (iii) 851 agreement cases where both systems confirmed bias. This dataset provides a robust resource for bias-aware translation benchmarking, model comparison, and interpretability research in multilingual NLP.
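
The stratification can be read as a function of two boolean flags per record. The sketch below assumes hypothetical field names (`heuristic_flag`, `judge_flag`); check the dataset files for the actual column names. The text does not describe a case where only the LLM judge flags bias, so the sketch puts that case in a separate bucket rather than guessing.

```python
from collections import Counter


def stratum(heuristic_flag: bool, judge_flag: bool) -> str:
    """Map the two system decisions onto the strata described above."""
    if heuristic_flag and judge_flag:
        return "agreement"      # both systems confirmed bias
    if heuristic_flag:
        return "disagreement"   # only the heuristic flagged bias
    if judge_flag:
        return "judge_only"     # assumption: not described in the text
    return "undetected"         # no system flagged bias


def stratify(records) -> Counter:
    """Count records per stratum; the field names are hypothetical."""
    return Counter(stratum(r["heuristic_flag"], r["judge_flag"]) for r in records)


# Toy records, for illustration only (not taken from the dataset).
toy = [
    {"heuristic_flag": True, "judge_flag": True},
    {"heuristic_flag": True, "judge_flag": False},
    {"heuristic_flag": False, "judge_flag": False},
]
print(stratify(toy))
```

As a consistency check on the reported numbers, 294 + 294 + 851 = 1,439, which matches the total number of annotated pairs.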

📂 Download Human-Annotated Dataset

faiyazabdullah/TranslationTangles