Authors: Yihong Liu, Mingyang Wang, Amir Hossein Kargaran, Felicia Körner, Ercong Nie, Barbara Plank, François Yvon, Hinrich Schütze
Preprint: arXiv
Large Language Models (LLMs) are known to memorize and recall factual knowledge across multiple languages. However, the process through which this knowledge emerges during pretraining remains unclear.
In this work, we investigate how multilingual factual recall and crosslingual consistency evolve over the course of pretraining, using OLMo-7B and its accompanying checkpoints as a case study. We present empirical results demonstrating two key mechanisms of factual knowledge acquisition:
- Frequency-driven learning (dominant and language-agnostic)
- Crosslingual transfer (most notable early in pretraining; some low-frequency non-English facts benefit from it)
.
├── README.md
├── data/
│ ├── klar_variant_full/
│ │ ├── ara_Arab.txt
│ │ ├── cat_Latn.txt
│ │ ├── ell_Grek.txt
│ │ ├── eng_Latn.txt
│ │ ├── fra_Latn.txt
│ │ ├── jpn_Jpan.txt
│ │ ├── kor_Kore.txt
│ │ ├── rus_Cyrl.txt
│ │ ├── spa_Latn.txt
│ │ ├── tur_Latn.txt
│ │ ├── ukr_Cyrl.txt
│ │ └── zho_Hans.txt
│ ├── lang_rel_frequencies.json
│ └── lang_rel_frequencies_infini.json
└── scripts/
├── analysis/
│ ├── embed_extractor.py
│ ├── fact_similarity_compute.py
│ ├── frequency_classifier.py
│ ├── frequency_correctness.py
│ └── dolma-lang-stat.py
├── factual_recall_computation/
│ └── factual_recall_vllm_with_details.py
└── frequency_computation/
├── frequency_computation_infini.py
└── frequency_computation_wimbd.py
We provide the frequencies of the facts in KLAR computed by WIMBD and by infini-gram. We use the WIMBD counts in our project since those statistics are much more reliable.
- lang_rel_frequencies.json: computed by WIMBD
- lang_rel_frequencies_infini.json: computed by infini-gram
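For orientation, a minimal sketch for loading and inspecting one of these files (the internal nesting of the JSON is not documented here, so the snippet only peeks at the top level):

```python
import json

# Peek at the top-level structure of the WIMBD-based frequency file;
# the internal nesting is not documented here, so we only inspect keys.
with open("data/lang_rel_frequencies.json", encoding="utf-8") as f:
    freqs = json.load(f)

print(type(freqs))
if isinstance(freqs, dict):
    print(list(freqs)[:5])  # e.g. language or relation identifiers
```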
The texts in klar_variant_full are used to compute sentence representations and cosine similarities among facts across the 12 languages.
- klar_variant_full: each line in a language's file is the translation of the corresponding line in every other language's file
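Since the files are line-aligned, parallel translations of a fact can be recovered by zipping the per-language files. A minimal sketch (language codes follow the file names in the tree above):

```python
from pathlib import Path

data_dir = Path("data/klar_variant_full")
langs = ["eng_Latn", "fra_Latn", "zho_Hans"]  # any subset of the 12 files

# Line i of every file expresses the same fact, so zipping aligns translations.
lines = {l: (data_dir / f"{l}.txt").read_text(encoding="utf-8").splitlines()
         for l in langs}

for eng, fra, zho in list(zip(*(lines[l] for l in langs)))[:3]:
    print(eng, "|", fra, "|", zho)
```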
- Python 3.9+
- KLAR dataset for tracing multilingual factual recall (please refer to the corresponding GitHub repository).
- vLLM is used to obtain the factual recall responses.
- WIMBD is used to obtain the fact frequencies across languages.
Example: evaluate the allenai/OLMo-7B-0424-hf model at every 1K step between checkpoints 0 and 50K:
python factual_recall_vllm_with_details.py \
    --model_name allenai/OLMo-7B-0424-hf \
    --step_start 0 \
    --step_end 50000 \
    --multiple_of 1000
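Internally, the script relies on vLLM to load each checkpoint and generate the recall responses. A stripped-down sketch of that pattern; the revision string below is a placeholder, and the actual checkpoint branch names are listed on the model's Hugging Face page:

```python
from vllm import LLM, SamplingParams

# Load one intermediate OLMo checkpoint via its Hugging Face revision.
# "step1000" is a placeholder -- the actual branch names are listed on
# the allenai/OLMo-7B-0424-hf model page.
llm = LLM(model="allenai/OLMo-7B-0424-hf", revision="step1000")

# Greedy decoding, since we only probe what the model recalls.
params = SamplingParams(temperature=0.0, max_tokens=16)

# A cloze-style factual prompt; the real prompts come from KLAR.
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```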
Using WIMBD (recommended, since it does not perform approximation):
python frequency_computation_wimbd.py
Using infini-gram (when it performs approximation, the results can be highly unreliable):
python frequency_computation_infini.py
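For orientation, an infini-gram count query via its public web API looks roughly like the sketch below; the index name is an assumption, so check the infini-gram API documentation for the indices that actually match the OLMo pretraining corpus:

```python
import requests

# Count how often a surface form appears in a corpus indexed by
# infini-gram. The index name below is an assumption; see
# https://infini-gram.io/api_doc for the indices actually available.
payload = {
    "index": "v4_dolma-v1_7_llama",
    "query_type": "count",
    "query": "Paris",
}
result = requests.post("https://api.infini-gram.io/", json=payload, timeout=30).json()
print(result)  # the response also flags whether the count is approximate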
To build a simple frequency-based classifier for each language and obtain the error breakdown:
python frequency_classifier.py
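The underlying idea is that fact frequency alone is already predictive of recall correctness. A minimal sketch of such a threshold classifier, using hypothetical per-fact frequencies and 0/1 correctness labels:

```python
import numpy as np

def best_threshold(freqs, correct):
    """Pick the frequency cutoff that best separates correctly
    recalled facts (label 1) from incorrect ones (label 0)."""
    order = np.argsort(freqs)
    freqs, correct = np.asarray(freqs)[order], np.asarray(correct)[order]
    best_acc, best_t = 0.0, freqs[0]
    for t in np.unique(freqs):
        pred = (freqs >= t).astype(int)  # predict "correct" above the cutoff
        acc = (pred == correct).mean()
        if acc > best_acc:
            best_acc, best_t = acc, t
    return best_t, best_acc

# Toy usage with hypothetical per-fact statistics for one language.
t, acc = best_threshold([3, 120, 5, 900, 40], [0, 1, 0, 1, 1])
print(f"cutoff={t}, accuracy={acc:.2f}")
```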
To generate some visualizations and save the indices for different types of facts and their associated statistics:
python frequency_correctness.py
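As a rough illustration of the kind of statistic this produces, the sketch below bins hypothetical facts by log-spaced frequency and plots recall accuracy per bin (the data and output file name are made up):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical per-fact statistics: corpus frequency and 0/1 recall correctness.
freqs = np.array([3, 120, 5, 900, 40, 15, 2000, 8])
correct = np.array([0, 1, 0, 1, 1, 0, 1, 0])

# Bin facts by log-spaced frequency and compute recall accuracy per bin.
bins = np.logspace(0, np.log10(freqs.max()), 5)
idx = np.digitize(freqs, bins)
accs = [correct[idx == b].mean() for b in np.unique(idx)]

plt.bar(range(len(accs)), accs)
plt.xlabel("frequency bin (log-spaced)")
plt.ylabel("recall accuracy")
plt.savefig("freq_vs_correctness.png")
```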
To compute the sentence-level embedding of each fact in each language across checkpoints, e.g., at every 1K step between checkpoints 0 and 50K (this script is adapted from MEXA):
python embed_extractor.py \
    --model_name allenai/OLMo-7B-0424-hf \
    --step_start 0 \
    --step_end 50000 \
    --multiple_of 1000 \
    --data_path ./datasets \
    --dataset_names klar_variant_full \
    --gpus '0' \
    --num_sents 1500 \
    --save_path ./embd_olmo/ \
    --cache_dir /nfs/datz/olmo_models \
    --file_ext .txt
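The embd_lasttoken embeddings referenced below are hidden states of a sentence's final token. A minimal sketch of that extraction with transformers; the real script additionally iterates over checkpoints and batches the 1500 sentences:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "allenai/OLMo-7B-0424-hf"  # add revision=... for a specific checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)
model.eval()

# Encode one fact sentence and keep the final layer's hidden state of
# the last token as the sentence-level embedding.
inputs = tok("The capital of France is Paris.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)
embedding = out.hidden_states[-1][0, -1]  # shape: (hidden_size,)
print(embedding.shape)
```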
To compute the cosine similarity between facts in each language and their English counterparts across checkpoints, e.g., at every 1K step between checkpoints 0 and 50K (this script is adapted from MEXA):
python fact_similarity_compute.py \
    --embedding_path ./embd_olmo/ \
    --dataset_names klar_variant_full \
    --step_start 0 \
    --step_end 50000 \
    --multiple_of 1000 \
    --save_path ./results \
    --num_sents 1500 \
    --embedding_type embd_lasttoken \
    --pivot eng_Latn \
    --file_ext .pkl
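Per checkpoint, the comparison boils down to cosine similarity between each language's fact embeddings and the aligned eng_Latn embeddings. A minimal sketch over assumed embedding matrices (rows are facts):

```python
import numpy as np

def mean_cosine_to_pivot(emb_lang, emb_pivot):
    """Average cosine similarity between aligned fact embeddings of one
    language and the eng_Latn pivot; rows are facts."""
    a = emb_lang / np.linalg.norm(emb_lang, axis=1, keepdims=True)
    b = emb_pivot / np.linalg.norm(emb_pivot, axis=1, keepdims=True)
    return float((a * b).sum(axis=1).mean())

# Toy usage: 1500 facts with 4096-dim embeddings (OLMo-7B's hidden size).
rng = np.random.default_rng(0)
print(mean_cosine_to_pivot(rng.normal(size=(1500, 4096)),
                           rng.normal(size=(1500, 4096))))
```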
If you find our method, code, and scores useful for your research, please consider citing:
KLAR dataset:
@misc{wang2025lost,
  title={Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models},
  author={Mingyang Wang and Heike Adel and Lukas Lange and Yihong Liu and Ercong Nie and Jannik Strötgen and Hinrich Schütze},
  year={2025},
  eprint={2504.04264},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.04264}
}
and this paper:
@misc{liu2025tracing,
  title={Tracing Multilingual Factual Knowledge Acquisition in Pretraining},
  author={Yihong Liu and Mingyang Wang and Amir Hossein Kargaran and Felicia Körner and Ercong Nie and Barbara Plank and François Yvon and Hinrich Schütze},
  year={2025},
  eprint={2505.14824},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.14824}
}