3LM Logo

3LM Benchmark

Arabic Evaluation in STEM and Code

GitHub · Datasets · Paper · Blog


🧠 What is 3LM?

3LM (علم, Arabic for "knowledge") is the first Arabic-native benchmark suite dedicated to scientific reasoning and programming. It contains three sub-benchmarks:

  1. Native STEM: 865 multiple-choice questions (MCQs) from real Arabic educational materials (biology, physics, chemistry, math, and geography).
  2. Synthetic STEM: 1,744 high-difficulty MCQs generated using the YourBench pipeline from Arabic educational texts.
  3. Arabic Code Benchmarks: HumanEval and MBPP datasets translated into Arabic via GPT-4o and validated with backtranslation and human review.

These datasets are designed to evaluate Arabic LLMs in domains that require structured reasoning and formal knowledge, areas that are often overlooked in existing benchmarks.


🚀 Key Results

We evaluated over 40 LLMs, including Arabic-centric, multilingual, and bilingual models. Here are some highlights:

  • Gemma-3-27B achieved the highest average performance on the STEM completion benchmark.
  • Qwen2.5-72B excelled in MCQ-based evaluation across all domains.
  • Arabic code generation performance strongly correlates with English code generation (r ≈ 0.97; see the sketch below).
  • Robustness tests (e.g., distractor perturbation) revealed instruct-tuned models are more resilient than base models.

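To make the correlation point above concrete, here is a minimal sketch (not the authors' analysis script) of computing a Pearson r between per-model Arabic and English code-generation scores; the score lists below are placeholders, not results from the paper.

# Hypothetical sketch: Pearson correlation between Arabic and English
# code-generation scores across models. The numbers are placeholders,
# not the benchmark's actual results.
from scipy.stats import pearsonr

arabic_pass_at_1 = [0.42, 0.55, 0.61, 0.70, 0.78]   # placeholder per-model scores
english_pass_at_1 = [0.45, 0.58, 0.66, 0.74, 0.81]  # placeholder per-model scores

r, p_value = pearsonr(arabic_pass_at_1, english_pass_at_1)
print(f"Pearson r = {r:.2f} (p = {p_value:.3g})")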

You can explore full results and correlation analyses in the paper and via our scripts.


📦 Evaluate Your Model

  1. Clone the 3LM-benchmark repository
git clone https://github.com/tiiuae/3LM-benchmark.git
  2. Set up your conda environment and install the required modules
conda create -n 3lm_eval python==3.11
conda activate 3lm_eval
pip install -e frameworks/lighteval  # add [vllm] if you intend to use vLLM as the backend
pip install -e frameworks/evalplus-arabic
huggingface-cli login --token <HF-token>  # skip if the model under evaluation is stored locally
  3. Launch the evaluation scripts
python launch_eval.py -c examples/lighteval_3lm.yaml        # synthetic + native
python launch_eval.py -c examples/lighteval_native.yaml     # native
python launch_eval.py -c examples/lighteval_synthetic.yaml  # synthetic
python launch_eval.py -c examples/evalplus_arabic_code.yaml # code

Modify the configs at the above paths to specify the model and other evaluation parameters (e.g., chat_template, engine).
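For example, here is a minimal sketch of adjusting one of the example configs from Python before launching. The field names model_name, chat_template, and engine are assumptions about the YAML schema; check the files in examples/ for the exact keys.

# Hypothetical sketch: tweak an example config before launching an evaluation.
# Field names (model_name, chat_template, engine) are assumptions; check the
# example YAML files in examples/ for the exact schema.
import yaml

with open("examples/lighteval_3lm.yaml") as f:
    config = yaml.safe_load(f)

config["model_name"] = "my-org/my-arabic-model"  # hypothetical model identifier
config["chat_template"] = True                   # assumed boolean toggle
config["engine"] = "vllm"                        # assumed backend selector

with open("examples/my_eval.yaml", "w") as f:
    yaml.safe_dump(config, f, allow_unicode=True)

# Then launch with the customized config:
# python launch_eval.py -c examples/my_eval.yaml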

All evaluation is built on:

  • lighteval for STEM QA
  • evalplus for code pass@1 metrics
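For reference, pass@1 is the unbiased pass@k estimator at k = 1: with n samples generated per problem and c of them passing the tests, it averages 1 - C(n-c, k)/C(n, k) over problems, which reduces to c/n for k = 1. A minimal sketch of the general estimator (not the evalplus implementation itself):

# Minimal sketch of the unbiased pass@k estimator from the HumanEval paper;
# evalplus reports pass@1, i.e. k = 1. Not the library's own code.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated per problem, c = samples that pass, k = budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples, 3 correct -> pass@1 = 0.3
print(pass_at_k(n=10, c=3, k=1))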

📁 The datasets are available on Hugging Face.

📁 Code datasets are in frameworks/evalplus-arabic/evalplus/data/data_files/ and include:

  • HumanEvalPlus.jsonl.gz
  • MBPPPlus.jsonl.gz
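
A small sketch for inspecting these files locally, assuming the usual evalplus JSONL record layout (e.g., task_id and prompt fields):

# Hypothetical sketch: peek at the Arabic HumanEval+ problems shipped with the repo.
# Field names (task_id, prompt) assume the usual evalplus record layout.
import gzip, json

path = "frameworks/evalplus-arabic/evalplus/data/data_files/HumanEvalPlus.jsonl.gz"

with gzip.open(path, "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(record["task_id"])
        print(record["prompt"][:200])  # first 200 characters of the Arabic prompt
        if i == 2:  # show only the first three problems
            break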

🪪 Licence

Falcon LLM Licence

📝 Citation

If you use 3LM in your research, please cite our paper:

@article{boussaha2025threeLM,
  title={3LM: Bridging Arabic, STEM, and Code through Benchmarking},
  author={Boussaha, Basma El Amel and AlQadi, Leen and Farooq, Mugariya and Alsuwaidi, Shaikha and Campesan, Giulia and Alzubaidi, Ahmed and Alyafeai, Mohammed and Hacid, Hakim},
  journal={arXiv preprint arXiv:2507.15850},
  year={2025}
}