An ever-evolving benchmark for LLMs and LMMs.
(Recommended) Create a new virtual environment and activate it. Some packages require Python>=3.11, so we suggest the following:
conda create -n onebench python=3.11 -y
conda activate onebench
Install the required packages:
python -m pip install -r requirements.txt
Install ONEBench in editable mode:
python -m pip install -e .
Test the installation:
python -c "import onebench"
[Optional] Upgrade the Google Cloud SDK:
brew install python@3.11
export CLOUDSDK_PYTHON=$(which python3.11)
gcloud components update
Authenticate to Google Cloud:
gcloud init
Download the HELM data:
python llm/download_helm.py
Download the Open LLM Leaderboard data:
python llm/download_open_llm_leaderboard.py
Download the LMSYS Chatbot Arena data:
python llm/download_chatbot_arena.py
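After running the three download scripts, a quick sanity check can confirm that files actually landed on disk. The sketch below is an assumption-laden example: the output root (data/llm) and the per-source subdirectory names are placeholders, not guaranteed paths; point them at wherever the scripts write on your machine.

```python
# Hypothetical sanity check for the downloaded leaderboard data.
# The data/llm root and the subdirectory names below are assumptions,
# not paths guaranteed by the download scripts.
from pathlib import Path

data_root = Path("data/llm")  # assumed output location; adjust to your setup
for source in ("helm", "open_llm_leaderboard", "chatbot_arena"):
    source_dir = data_root / source
    if not source_dir.exists():
        print(f"{source}: directory not found")
        continue
    files = [p for p in source_dir.rglob("*") if p.is_file()]
    print(f"{source}: {len(files)} files")
```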
The VLM results are in the data/vlm/{dataset} directory, where {dataset} is either vhelm or lmms-eval. The individual dataset a-matrices are located in data/vlm/{dataset}/binary and data/vlm/{dataset}/numeric. The results from Prometheus2 are located in data/vlm/{dataset}/pairwise_num.
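To browse these directories programmatically, a minimal sketch is shown below. It assumes the a-matrices are stored as CSV files with an index column; the file extension, reader, and row/column orientation are assumptions, so adjust them to the actual files in your checkout.

```python
# Minimal sketch for inspecting the VLM a-matrices.
# Assumptions: CSV files, first column is the index; adapt as needed.
from pathlib import Path
import pandas as pd

dataset = "vhelm"  # or "lmms-eval"
binary_dir = Path("data/vlm") / dataset / "binary"

for path in sorted(binary_dir.glob("*.csv")):  # assumed file extension
    matrix = pd.read_csv(path, index_col=0)
    print(path.name, matrix.shape)
```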
[TODO]: Add instructions for JSON downloads, a-matrix creation, Prometheus scripts, and capability querying.
@inproceedings{ghosh2025onebench,
title={ONEBench to test them all: Sample-level benchmarking over open-ended capabilities},
author={Ghosh, Adhiraj and Dziadzio, Sebastian and Prabhu, Ameya and Udandarao, Vishaal and Albanie, Samuel and Bethge, Matthias},
booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
year={2025}
}
Code: MIT. Check LICENSE.