Ready to Translate, Not to Represent? Bias and Performance Gaps in Multilingual LLMs Across Language Families and Domains
Abstract: The rise of Large Language Models (LLMs) has redefined Machine Translation (MT), enabling context-aware and fluent translations across hundreds of languages and textual domains. Despite their remarkable capabilities, LLMs often exhibit uneven performance across language families and specialized domains. Moreover, recent evidence reveals that these models can encode and amplify biases present in their training data, posing serious concerns for fairness, especially in low-resource languages. To address these gaps, we introduce Translation Tangles, a unified framework and dataset for evaluating the translation quality and fairness of open-source LLMs. Our approach benchmarks 24 bidirectional language pairs across multiple domains using lexical, character-level, edit-based, and embedding-based metrics (BLEU, chrF, TER, BERTScore, WER, CER, and ROUGE). We further propose a hybrid bias detection pipeline that integrates rule-based heuristics, semantic similarity filtering, and LLM-based validation. We also introduce a high-quality, bias-annotated dataset based on human evaluations of 1,439 translation-reference pairs.
Fig: Our framework comprises two key components: (a) LLM Benchmarking, where T are evaluated against R using LLMs across diverse language families and domains; and (b) Uncovering Bias Pattern with LLM-as-a-Judge Evaluation, where potential biases are flagged using linguistic heuristics and semantic analysis, and then verified through LLMs and human annotators. Here, S = Source, R = Reference, T = Translation.
Model | Provider | Context Length |
---|---|---|
gemma2-9b | Google | 8192 tokens |
gemma-7b | Google | 8192 tokens |
llama-3-70b | Meta | 8192 tokens |
llama-3-8b | Meta | 8192 tokens |
llama-3.1-70b | Meta | 8192 tokens |
llama-3.1-8b | Meta | 8192 tokens |
llama-3.2-90b-vision | Meta | 128000 tokens |
mixtral-8x7b | Mistral | 32768 tokens |
OLMo-1B | AI2 | 8192 tokens |
Phi-3.5-mini | Microsoft | 8192 tokens |
Phi-2 | Microsoft | 4096 tokens |
Qwen-2.5-0.5B | Alibaba | 8192 tokens |
Qwen-2.5-1.5B | Alibaba | 8192 tokens |
Qwen-2.5-3B | Alibaba | 8192 tokens |
Metric | Description | Direction |
---|---|---|
BLEU | N-gram overlap with reference | ↑ |
chrF | Character-level F-score | ↑ |
TER | Translation Edit Rate | ↓ |
BERTScore | Semantic similarity using BERT embeddings | ↑ |
WER | Word Error Rate | ↓ |
CER | Character Error Rate | ↓ |
ROUGE | Overlapping n-grams: ROUGE-1, ROUGE-2, ROUGE-L | ↑ |
Legend: ↑ Higher is better, ↓ Lower is better
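The two error-rate metrics in the table, WER and CER, are both edit distances normalized by reference length, which is why lower is better. The following is a minimal sketch under stated assumptions, not the evaluation code used in this work: it assumes whitespace tokenization for WER and raw characters for CER, and the function names `edit_distance`, `wer`, and `cer` are illustrative.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference item
                curr[j - 1] + 1,     # insert a hypothesis item
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate, assuming whitespace tokenization (an assumption)."""
    ref_tokens = reference.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hypothesis.split()) / len(ref_tokens)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate over raw characters, spaces included."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sat on mat"
    print(f"WER = {wer(ref, hyp):.3f}, CER = {cer(ref, hyp):.3f}")
```

Because both scores divide by the reference length, a hypothesis much longer than its reference can score above 1.0.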
We use a combination of general-purpose and domain-specific multilingual benchmark datasets to evaluate translation quality across diverse linguistic and contextual settings:
Dataset | Languages | Size | Domain | Fields | Splits |
---|---|---|---|---|---|
ELRCMedical | English + 21 EU languages | 100K–1M | Medical | doc_id, lang, source_text, target_text | None (manual) |
MultiEURLEX | 23 EU languages | 65K docs | Legal | doc_id, text, labels | Train (55K), Dev/Test (5K each) |
Lit-Corpus | Kazakh, Russian, English | 71K pairs | Literature | source_text, target_text, X_lang, y_lang | None |
BanglaNMT | Bangla, English | 2.38M pairs | General | bn, en | Train (2.38M), Val (597), Test (1K) |
WMT-19 | Multilingual | 100M–1B | General | source_text, target_text, X_lang, y_lang | Train, Val |
WMT-18 | Multilingual | 100M–1B | General | source_text, target_text, X_lang, y_lang | Train, Val, Test |
Code Pair | Language Names |
---|---|
cs-en / en-cs | Czech ↔ English |
de-en / en-de | German ↔ English |
fi-en / en-fi | Finnish ↔ English |
fr-de / de-fr | French ↔ German |
gu-en / en-gu | Gujarati ↔ English |
kk-en / en-kk | Kazakh ↔ English |
lt-en / en-lt | Lithuanian ↔ English |
ru-en / en-ru | Russian ↔ English |
zh-en / en-zh | Chinese ↔ English |
et-en / en-et | Estonian ↔ English |
tr-en / en-tr | Turkish ↔ English |
bn-en / en-bn | Bangla ↔ English |
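The table lists 12 language pairs, each evaluated in both directions, which gives 24 translation directions. As a small illustration, the sketch below expands the pairs into direction codes; the names `LANGUAGE_PAIRS` and `translation_directions` are hypothetical and not taken from the framework.

```python
# The 12 pairs from the table above, written as (first, second) language codes.
LANGUAGE_PAIRS = [
    ("cs", "en"), ("de", "en"), ("fi", "en"), ("fr", "de"),
    ("gu", "en"), ("kk", "en"), ("lt", "en"), ("ru", "en"),
    ("zh", "en"), ("et", "en"), ("tr", "en"), ("bn", "en"),
]


def translation_directions(pairs):
    """Expand each unordered pair into its two source-target direction codes."""
    directions = []
    for first, second in pairs:
        directions.append(f"{first}-{second}")
        directions.append(f"{second}-{first}")
    return directions


if __name__ == "__main__":
    directions = translation_directions(LANGUAGE_PAIRS)
    print(len(directions), directions)
```

Note that fr-de / de-fr is the only pair that does not involve English.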
To strengthen the evaluation beyond automated metrics, we conducted structured human annotation of 1,439 translation-reference pairs. Each instance was annotated along three axes: (i) bias flags from our heuristic-semantic framework, (ii) bias assessments by an LLM-as-a-Judge module, and (iii) gold-standard decisions by independent human reviewers. Each record includes the source sentence, reference translation, model output, and categorical bias labels (gender, cultural, sociocultural, racial, religious), along with common translation issues such as grammatical inconsistencies, pronoun shifts, semantic distortions, and hallucinated biases.
These examples are stratified into: (i) 294 undetected bias cases where no system flagged bias, (ii) 294 disagreement cases where only the heuristic flagged bias, and (iii) 851 agreement cases where both systems confirmed bias. This dataset provides a robust resource for bias-aware translation benchmarking, model comparison, and interpretability research in multilingual NLP.
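The stratification can be read as a function of two boolean flags per record: whether the heuristic-semantic framework flagged bias and whether the LLM-as-a-Judge module did. The sketch below is one possible encoding under that assumption. The `AnnotatedPair` class and its field names are hypothetical and do not reflect the released dataset schema, and the `judge_only` label covers a combination that the text does not report.

```python
from collections import Counter
from dataclasses import dataclass

# Counts reported in the text for the three strata.
REPORTED_COUNTS = {"undetected": 294, "disagreement": 294, "agreement": 851}


@dataclass(frozen=True)
class AnnotatedPair:
    """One annotated instance; field names are illustrative assumptions."""
    source: str
    reference: str
    translation: str
    heuristic_flag: bool  # bias flagged by the heuristic-semantic framework
    judge_flag: bool      # bias flagged by the LLM-as-a-Judge module


def stratum(record: AnnotatedPair) -> str:
    """Map the two system flags to one of the strata described above."""
    if record.heuristic_flag and record.judge_flag:
        return "agreement"      # both systems flagged bias
    if record.heuristic_flag:
        return "disagreement"   # only the heuristic flagged bias
    if not record.judge_flag:
        return "undetected"     # neither system flagged bias
    return "judge_only"         # not reported in the text; kept for completeness


def count_strata(records) -> Counter:
    """Tally how many records fall into each stratum."""
    return Counter(stratum(r) for r in records)


if __name__ == "__main__":
    # The three reported strata account for all 1,439 annotated pairs.
    print(sum(REPORTED_COUNTS.values()))
```

Human gold-standard decisions are deliberately left out of `stratum`, since the three strata are defined only by what the two automatic systems flagged.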