GitHub · Datasets · Paper · Blog
3LM (علم) is the first Arabic-native benchmark suite dedicated to scientific reasoning and programming. It contains three sub-benchmarks:
- Native STEM: 865 multiple-choice questions (MCQs) from real Arabic educational materials (biology, physics, chemistry, math, and geography).
- Synthetic STEM: 1,744 high-difficulty MCQs generated using the YourBench pipeline from Arabic educational texts.
- Arabic Code Benchmarks: HumanEval and MBPP datasets translated into Arabic via GPT-4o and validated with backtranslation and human review.
These datasets are designed to evaluate Arabic LLMs in domains that require structured reasoning and formal knowledge, areas that are often overlooked in existing benchmarks.
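For a concrete picture of what the MCQ benchmarks contain, the sketch below shows what a single Native STEM item might look like; the field names and the Arabic text are illustrative assumptions, not the released schema.

```python
# Hypothetical Native STEM MCQ record: field names and content are
# illustrative only, not the exact schema of the released dataset.
sample_item = {
    "subject": "physics",  # biology, physics, chemistry, math, or geography
    "question": "ما هي وحدة قياس القوة؟",          # "What is the unit of force?"
    "choices": ["نيوتن", "جول", "واط", "باسكال"],   # Newton, Joule, Watt, Pascal
    "answer": "نيوتن",                              # gold choice (Newton)
}

def is_correct(model_choice: str, item: dict) -> bool:
    """Accuracy is simply whether the model picked the gold choice."""
    return model_choice.strip() == item["answer"]
```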
We evaluated over 40 LLMs, including Arabic-centric, multilingual, and bilingual models. Here are some highlights:
- Gemma-3-27B achieved the highest average performance on the STEM completion benchmark.
- Qwen2.5-72B excelled in MCQ-based evaluation across all domains.
- Arabic code generation performance strongly correlates with English code generation (r ≈ 0.97; see the sketch below).
- Robustness tests (e.g., distractor perturbation) revealed that instruct-tuned models are more resilient than base models.
You can explore full results and correlation analyses in the paper and via our scripts.
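The Arabic–English correlation highlighted above is straightforward to reproduce once per-model pass@1 scores are collected; the snippet below is a minimal sketch with placeholder numbers, not the actual scores from the paper.

```python
# Minimal sketch: Pearson correlation between Arabic and English code-generation
# pass@1 scores across a set of models. The values below are placeholders for
# illustration; substitute the scores produced by your own evaluation runs.
import numpy as np

arabic_pass1 = np.array([0.31, 0.45, 0.52, 0.64, 0.70])   # placeholder scores
english_pass1 = np.array([0.35, 0.50, 0.55, 0.68, 0.74])  # placeholder scores

r = np.corrcoef(arabic_pass1, english_pass1)[0, 1]
print(f"Pearson r = {r:.2f}")
```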
- Clone the 3LM-benchmark repository:
git clone https://github.com/tiiuae/3LM-benchmark.git
- Set up your conda environment and install the required packages:
conda create -n 3lm_eval python=3.11
conda activate 3lm_eval
pip install -e frameworks/lighteval  # add [vllm] if you intend to use vLLM as the backend
pip install -e frameworks/evalplus-arabic
huggingface-cli login --token <HF-token>  # skip if the evaluated model is stored locally
- Launch the evaluation scripts:
python launch_eval.py -c examples/lighteval_3lm.yaml ## synthetic + native
python launch_eval.py -c examples/lighteval_native.yaml ## native
python launch_eval.py -c examples/lighteval_synthetic.yaml ## synthetic
python launch_eval.py -c examples/evalplus_arabic_code.yaml ## code
Modify the configs at the paths above to specify the model and other evaluation parameters (e.g., chat_template, engine).
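As a rough sketch of that step (the key names `model`, `chat_template`, and `engine` are assumptions here; check the example configs in `examples/` for the actual schema), a config can be inspected and overridden programmatically:

```python
# Rough sketch: read an example config, inspect the parameters it exposes, and
# write a modified copy pointing at a different model. Key names below are
# assumptions; consult the shipped example configs for the real schema.
import yaml

with open("examples/lighteval_3lm.yaml") as f:
    cfg = yaml.safe_load(f)

print(sorted(cfg))  # which parameters does this config actually expose?

cfg["model"] = "Qwen/Qwen2.5-72B-Instruct"  # hypothetical override
cfg["chat_template"] = True
cfg["engine"] = "vllm"

with open("examples/my_eval.yaml", "w") as f:
    yaml.safe_dump(cfg, f, allow_unicode=True)

# Then: python launch_eval.py -c examples/my_eval.yaml
```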
All evaluation is built on:
- lighteval for STEM QA
- evalplus for code pass@1 metrics
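For reference, pass@1 is the standard unbiased pass@k estimator with k = 1, i.e. the fraction of generated samples that pass all tests; the sketch below shows the general formula and is illustrative rather than the exact evalplus code path.

```python
# Standard unbiased pass@k estimator (illustrative; not necessarily the exact
# implementation inside evalplus). With n samples per task, c of which pass all
# tests: pass@k = 1 - C(n-c, k) / C(n, k), which reduces to c/n for k = 1.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for a single task."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 10 samples for one task, 3 pass all tests -> pass@1 = 0.3
print(pass_at_k(n=10, c=3, k=1))
```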
📁 The datasets are available on HuggingFace:
- SyntheticQA: https://huggingface.co/datasets/tiiuae/SyntheticQA
- NativeQA: https://huggingface.co/datasets/tiiuae/NativeQA
- NativeQA-RDP: https://huggingface.co/datasets/tiiuae/NativeQA-RDP
- Evalplus-Arabic: https://huggingface.co/datasets/tiiuae/evalplus-arabic
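The datasets can be pulled straight from the Hub with the `datasets` library; a minimal sketch follows (split and configuration names may differ, so check each dataset card).

```python
# Minimal sketch: fetch the 3LM datasets from the Hugging Face Hub.
# Split/configuration names are not assumed here; check each dataset card.
from datasets import load_dataset

native = load_dataset("tiiuae/NativeQA")
synthetic = load_dataset("tiiuae/SyntheticQA")

print(native)                     # available splits and row counts
first_split = next(iter(native))  # name of the first split
print(native[first_split][0])     # inspect one record
```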
📁 Code datasets are in frameworks/evalplus-arabic/evalplus/data/data_files/ and include:
- HumanEvalPlus.jsonl.gz
- MBPPPlus.jsonl.gz
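Each file is a gzipped JSON Lines archive with one task per line; the short sketch below simply streams the records and prints the keys of the first one, since the exact field names are whatever the archive contains.

```python
# Short sketch: stream a gzipped JSON Lines file and inspect the first task.
# No field names are assumed; the keys of the first record are printed instead.
import gzip
import json

path = "frameworks/evalplus-arabic/evalplus/data/data_files/HumanEvalPlus.jsonl.gz"

with gzip.open(path, "rt", encoding="utf-8") as f:
    tasks = [json.loads(line) for line in f if line.strip()]

print(len(tasks), "tasks")
print(sorted(tasks[0].keys()))
```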
If you use 3LM in your research, please cite our paper:
@article{boussaha2025threeLM,
  title={3LM: Bridging Arabic, STEM, and Code through Benchmarking},
  author={Boussaha, Basma El Amel and AlQadi, Leen and Farooq, Mugariya and Alsuwaidi, Shaikha and Campesan, Giulia and Alzubaidi, Ahmed and Alyafeai, Mohammed and Hacid, Hakim},
  journal={arXiv preprint arXiv:2507.15850},
  year={2025}
}