🥇 Leaderboard | 💻 Evaluation Runner (Lighteval) | 📄 Paper
This repository contains all relevant tools and experiments for FilBench, an Open LLM Evaluation Suite and leaderboard for Filipino. We curated an evaluation benchmark covering four major capabilities: Cultural Knowledge, Generation, Reading Comprehension, and Classical NLP. We then evaluated over 20 models spanning different parameter sizes, model families, and levels of multilingual support.
- [2025-08-20] FilBench was accepted at EMNLP 2025 Main! See you in Suzhou!
- [2025-08-15] FilBench is now an official part of HuggingFace's Community Tasks in Lighteval! You can also find out more about the project in this HuggingFace blog post.
- [2025-08-01] We officially introduce FilBench! You can read more details in our paper.
The installation process assumes you have uv and Python 3.12 installed.
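If uv is not yet installed, you can get it via its standalone installer (a minimal sketch; check the uv documentation for the current command):
# Install uv via its standalone installer script
curl -LsSf https://astral.sh/uv/install.sh | sh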
First, clone this repository and install all dependencies and our fork of lighteval:
git clone git@github.com:filbench/filbench-eval.git
cd filbench-eval
git submodule update --init
uv sync
These steps will clone our lighteval fork as a submodule and install necessary dependencies.
In the end, you should have access to the following tools:
- lighteval: this is the main evaluation runner to use for launching evaluation jobs.
- filbench: a lightweight CLI for computing the FilBench score and reporting results to the leaderboard.
You can check if your installation works by running the following commands:
# Check if filbench installation works
filbench --help
# Check if lighteval installation works
cd lighteval
python3 -m lighteval tasks inspect "filbench|cebuaner_ceb_mcf|0|0" \
--num-samples 1 \
--custom-tasks community_tasks/filbench_evals.py
Important
You must run the lighteval command (1) within the lighteval submodule and (2) using the python -m ... prefix. If you encounter any installation issues, please open an Issue in this repository.
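To list every task identifier available in FilBench, you can use lighteval's tasks list subcommand (a sketch; the --custom-tasks flag here is assumed to work the same way as in tasks inspect):
# List all tasks registered by the FilBench community module
python3 -m lighteval tasks list --custom-tasks community_tasks/filbench_evals.py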
In order to run all evaluations on FilBench, we recommend running the following command:
cd lighteval
export HF_ORG=<...>
# For models in HuggingFace and accessible via vLLM
cat examples/tasks/all_filbench_tasks.txt | xargs -I {} \
python -m lighteval vllm "pretrained=<MODEL_NAME>" {} \
--push-to-hub \
--results-org $HF_ORG \
--custom-tasks community_tasks/filbench_evals.py
This command will then run all tasks in FilBench on MODEL_NAME, and upload the results to HF_ORG.
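If you want to sanity-check a single task before launching the full suite, you can reuse the task name from the inspection step above in the same vllm invocation (a minimal sketch):
# Run a single FilBench task on one model
python -m lighteval vllm "pretrained=<MODEL_NAME>" "filbench|cebuaner_ceb_mcf|0|0" \
    --custom-tasks community_tasks/filbench_evals.py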
When run in parallel, the shortest task can take around 5 minutes and the longest task can take around 2 hours.
Your results should be saved in HF_ORG/MODEL_NAME.
For example, the results for aisingapore/Llama-SEA-LION-v3.5-70B-R are stored in UD-Filipino/details_aisingapore__Llama-SEA-LION-v3.5-70B-R_private.
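As a rough sketch of running tasks in parallel (assuming GNU xargs and enough GPU memory for multiple vLLM instances), you can add a -P flag to the xargs call above:
# Hypothetical sketch: dispatch up to two tasks concurrently
cat examples/tasks/all_filbench_tasks.txt | xargs -P 2 -I {} \
    python -m lighteval vllm "pretrained=<MODEL_NAME>" {} \
    --push-to-hub \
    --results-org $HF_ORG \
    --custom-tasks community_tasks/filbench_evals.py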
To compute the FilBench score, run the following command:
filbench compute-score <HF_ORG>/<MODEL_NAME>
# For example:
filbench compute-score UD-Filipino/details_aisingapore__Llama-SEA-LION-v3.5-70B-R_private
By default, this command will output a JSON file called scores_<HF_ORG>___<MODEL_NAME>.json that contains the FilBench score and its breakdown across categories and tasks.
We also maintain a leaderboard to track progress in Filipino NLP.
You can then submit these results by running the command below and following the prompts:
filbench submit "scores_<HF_ORG>___<MODEL_NAME>.json"
# 🤗 Model Name or HuggingFace ID (e.g., Qwen/Qwen3-32B):
# ...
This will then make a PR to the UD-Filipino/filbench-results-submission dataset.
The approval process is done manually, and we might contact you to clarify a few things.
Note
If you want to update your scores for a specific model in the leaderboard, you just need to rerun the submit command and input the same organization and model name.
Internally, we hash these variables together and show the latest result.
Tip
You can set the --dry-run flag to double-check whether the details you entered are correct.
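For example, a minimal sketch (assuming the flag is passed directly to filbench submit):
# Double-check your submission details without creating a PR
filbench submit "scores_<HF_ORG>___<MODEL_NAME>.json" --dry-run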
This work was done by @ljvmiranda921, @elyanah-aco, @connermanuel, @jcblaisecruz02, and @imperialite. For any questions, please reach out to us via filbench-eval@googlegroups.com or through our GitHub Issues. To cite our work, please use the following BibTeX entry:
@article{filbench,
title={Fil{B}ench: {C}an {LLM}s {U}nderstand and {G}enerate {F}ilipino?},
author={Miranda, Lester James V and Aco, Elyanah and Manuel, Conner and Cruz, Jan Christian Blaise and Imperial, Joseph Marvin},
journal={arXiv preprint arXiv:2508.03523},
year={2025}
}
We would like to thank Cohere Labs for providing credits through the Cohere Research Grant to run the Aya model series, and Together AI for additional computational credits for running several open models. We also acknowledge the Hugging Face team, particularly the OpenEvals team (@clefourrier and @NathanHB) and @davanstrien, for their support in publishing the FilBench blog post.