🥇 Leaderboard | 💻 Evaluation Runner (Lighteval) | 📄 Paper
This repository contains all relevant tools and experiments for FilBench, an Open LLM Evaluation Suite and leaderboard for Filipino. We curated an evaluation benchmark covering four major capabilities: Cultural Knowledge, Generation, Reading Comprehension, and Classical NLP. We then evaluated over 20 models spanning different parameter sizes, model families, and levels of multilingual support.
- [2025-08-20] FilBench was accepted at EMNLP 2025 Main! See you in Suzhou!
- [2025-08-15] FilBench is now an official part of HuggingFace's Community Tasks in Lighteval! You can also find out more about the project in this HuggingFace blog post.
- [2025-08-01] We officially introduce FilBench! You can read more details in our paper.
The installation process assumes you have uv and Python 3.12 installed.
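If uv is not yet installed, you can get it via its standalone installer (a minimal sketch; check the uv documentation for the current command):
# Install uv via its standalone installer script
curl -LsSf https://astral.sh/uv/install.sh | sh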
First, clone this repository and install all dependencies and our fork of lighteval:
git clone git@github.com:filbench/filbench-eval.git
cd filbench-eval
git submodule update --init
uv sync
These steps will clone our lighteval fork as a submodule and install necessary dependencies.
In the end, you should have access to the following tools:
- lighteval: this is the main evaluation runner to use for launching evaluation jobs.
- filbench: a lightweight CLI for computing the FilBench score and reporting results to the leaderboard.
You can check if your installation works by running the following commands:
# Check if filbench installation works
filbench --help
# Check if lighteval installation works
cd lighteval
python3 -m lighteval tasks inspect "filbench|cebuaner_ceb_mcf|0|0" \
--num-samples 1 \
--custom-tasks community_tasks/filbench_evals.py
Important
You must run the lighteval command (1) within the lighteval submodule and (2) using the python -m ... prefix. If you encounter any installation issues, please open an Issue in this repository.
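To list every task identifier available in FilBench, you can use lighteval's tasks list subcommand (a sketch; the --custom-tasks flag here is assumed to work the same way as in tasks inspect):
# List all tasks registered by the FilBench community module
python3 -m lighteval tasks list --custom-tasks community_tasks/filbench_evals.py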
In order to run all evaluations on FilBench, we recommend running the following command:
cd lighteval
export HF_ORG=<...>
# For models in HuggingFace and accessible via vLLM
cat examples/tasks/all_filbench_tasks.txt | xargs -I {} \
python -m lighteval vllm "pretrained=<MODEL_NAME>" {} \
--push-to-hub \
--results-org $HF_ORG \
--custom-tasks community_tasks/filbench_evals.py
This command will then run all tasks in FilBench on MODEL_NAME, and upload the results to HF_ORG.
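If you want to sanity-check a single task before launching the full suite, you can reuse the task name from the inspection step above in the same vllm invocation (a minimal sketch):
# Run a single FilBench task on one model
python -m lighteval vllm "pretrained=<MODEL_NAME>" "filbench|cebuaner_ceb_mcf|0|0" \
    --custom-tasks community_tasks/filbench_evals.py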
When run in parallel, the shortest task can take around 5 minutes and the longest task can take around 2 hours.
Your results should be saved in HF_ORG/MODEL_NAME.
For example, the results for aisingapore/Llama-SEA-LION-v3.5-70B-R are stored in UD-Filipino/details_aisingapore__Llama-SEA-LION-v3.5-70B-R_private.
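As a rough sketch of running tasks in parallel (assuming GNU xargs and enough GPU memory for multiple vLLM instances), you can add a -P flag to the xargs call above:
# Hypothetical sketch: dispatch up to two tasks concurrently
cat examples/tasks/all_filbench_tasks.txt | xargs -P 2 -I {} \
    python -m lighteval vllm "pretrained=<MODEL_NAME>" {} \
    --push-to-hub \
    --results-org $HF_ORG \
    --custom-tasks community_tasks/filbench_evals.py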
To compute the FilBench score, run the following command:
filbench compute-score <HF_ORG>/<MODEL_NAME>
# For example:
filbench compute-score UD-Filipino/details_aisingapore__Llama-SEA-LION-v3.5-70B-R_private
By default, this command will output a JSON file called scores_<HF_ORG>___<MODEL_NAME>.json that contains the FilBench score and its breakdown across categories and tasks.
We also maintain a leaderboard to track progress in Filipino NLP.
You can then submit these results by running the command below and following the prompts:
filbench submit "scores_<HF_ORG>___<MODEL_NAME>.json"
# 🤗 Model Name or HuggingFace ID (e.g., Qwen/Qwen3-32B):
# ...
This will then make a PR to the UD-Filipino/filbench-results-submission dataset.
The approval process is done manually, and we might contact you to clarify a few things.
Note
If you want to update your scores for a specific model in the leaderboard, you just need to rerun the submit command and input the same organization and model name.
Internally, we hash these variables together and show the latest result.
Tip
You can set the --dry-run flag to double-check whether the details you entered are correct.
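For example, a minimal sketch (assuming the flag is passed directly to filbench submit):
# Double-check your submission details without creating a PR
filbench submit "scores_<HF_ORG>___<MODEL_NAME>.json" --dry-run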
This work was done by @ljvmiranda921, @elyanah-aco, @connermanuel, @jcblaisecruz02, and @imperialite. For any questions, please reach out to us via filbench-eval@googlegroups.com or through our GitHub Issues. To cite our work, please use the following BibTeX entry:
@article{filbench,
title={Fil{B}ench: {C}an {LLM}s {U}nderstand and {G}enerate {F}ilipino?},
author={Miranda, Lester James V and Aco, Elyanah and Manuel, Conner and Cruz, Jan Christian Blaise and Imperial, Joseph Marvin},
journal={arXiv preprint arXiv:2508.03523},
year={2025}
}
We would like to thank Cohere Labs for providing credits through the Cohere Research Grant to run the Aya model series, and Together AI for additional computational credits for running several open models. We also acknowledge the Hugging Face team, particularly the OpenEvals team (@clefourrier and @NathanHB) and @davanstrien, for their support in publishing the FilBench blog post.