
yunx-z/MLRC-Bench

 
 


MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?

MLRC-Bench (arxiv, leaderboard) is a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) Research Competition problems by proposing and implementing novel ideas into code.

The first release of our benchmark includes 7 tasks adapted from recent Machine Learning conference competitions.

Setup

First, install the MLAB agent environment.

Create a conda environment named mlab.

Then install the MLAgentBench package with

pip install -e .
pip install openai

Install the dependencies with Python 3.10 by running

bash install.sh

Then, install each task's environment by running the following commands:

cd MLAgentBench/benchmarks_base/${TASK_NAME}/scripts
conda env create -f environment.yml --name ${TASK_NAME}
conda activate ${TASK_NAME}
cd -
pip install -e .
pip install openai

(Optional) Some competitions require additional setup. For Kaggle datasets, you need to set up the Kaggle API and authentication (~/.kaggle/kaggle.json) as described here. You may also need to manually consent to the rules of specific competitions by following the prompts.

Tasks

Each task is a folder in MLAgentBench/benchmarks_base/. Within it, the env/ folder contains the files that the research agent sees at the beginning, and the scripts/ folder contains additional hidden files, such as prepare.py for downloading data.
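The layout described above can be checked with a few lines of Python. This is only a minimal sketch: check_task_layout is a hypothetical helper that is not part of the repo, and it assumes that every task ships a prepare.py under scripts/, which the text above gives only as an example of a hidden file.

```python
import tempfile
from pathlib import Path


def check_task_layout(task_dir: Path) -> list[str]:
    """Return a list of layout problems for one task folder (empty if none).

    Hypothetical helper: it only checks the folders and the file named in
    this README, not anything else a task may need.
    """
    problems = []
    for sub in ("env", "scripts"):
        if not (task_dir / sub).is_dir():
            problems.append(f"missing folder: {sub}/")
    # Assumption: each task provides scripts/prepare.py for downloading data.
    if not (task_dir / "scripts" / "prepare.py").is_file():
        problems.append("missing file: scripts/prepare.py")
    return problems


if __name__ == "__main__":
    # Demo on a throwaway directory so nothing in the real repo is touched.
    with tempfile.TemporaryDirectory() as tmp:
        task = Path(tmp) / "my-task"
        (task / "env").mkdir(parents=True)
        print(check_task_layout(task))
```

To check a real task, pass Path("MLAgentBench/benchmarks_base") / TASK_NAME from the repo root.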

To launch the MLAB agent, run bash launch.sh ${TASK_NAME} ${MODEL} ${GPU_ID}, where the supported values of MODEL are listed here. To use OpenAI models, set MY_OPENAI_API_KEY and MY_AZURE_OPENAI_ENDPOINT as environment variables.
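Before launching, it can help to confirm that both environment variables are set. The sketch below is one possible pre-flight check; missing_env_vars and build_launch_command are hypothetical helpers, and "my-model" is a placeholder rather than a real supported MODEL value. The script only prints the launch command and does not run it.

```python
import os
import shlex

# Variable names taken from this README; they are only needed for OpenAI models.
REQUIRED_ENV_VARS = ("MY_OPENAI_API_KEY", "MY_AZURE_OPENAI_ENDPOINT")


def missing_env_vars(environ=os.environ):
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_ENV_VARS if not environ.get(name)]


def build_launch_command(task_name, model, gpu_id):
    """Build the argument list for: bash launch.sh TASK_NAME MODEL GPU_ID."""
    return ["bash", "launch.sh", task_name, model, str(gpu_id)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Please set: " + ", ".join(missing))
    else:
        # "my-model" is a placeholder; substitute a supported MODEL value.
        print(shlex.join(build_launch_command("llm-merging", "my-model", 0)))
```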

Instructions for Adding New Tasks (PRs Welcome!):

Steps:

  • Fork this GitHub repo to your own GitHub account.
  • Complete the steps in the Setup section for the MLAgentBench package.
  • Create a new task folder under MLAgentBench/benchmarks_base, following the template.
  • Add the runtime and performance of your baseline method to MLAgentBench/constants.py. (Repeat your run multiple times to ensure consistency; the score should remain relatively stable across runs.)
  • Submit a pull request.
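One way to judge whether the baseline score is "relatively stable" before recording it is to summarize several runs. The sketch below is an illustration only: summarize_runs is a hypothetical helper, the 5% relative standard deviation threshold is an arbitrary assumption, and the numbers are made up. It does not write to MLAgentBench/constants.py, whose format is not described here.

```python
import statistics


def summarize_runs(scores, runtimes_sec, max_rel_std=0.05):
    """Summarize repeated baseline runs.

    scores and runtimes_sec hold one entry per run. max_rel_std is an
    assumed threshold on std/|mean| of the score, not a repo setting.
    """
    mean_score = statistics.mean(scores)
    # Sample standard deviation needs at least two runs.
    std_score = statistics.stdev(scores) if len(scores) > 1 else 0.0
    if mean_score != 0:
        rel_std = std_score / abs(mean_score)
    else:
        rel_std = 0.0 if std_score == 0 else float("inf")
    return {
        "mean_score": mean_score,
        "std_score": std_score,
        "mean_runtime_sec": statistics.mean(runtimes_sec),
        "stable": rel_std <= max_rel_std,
    }


if __name__ == "__main__":
    # Made-up numbers for three runs of a baseline.
    print(summarize_runs([0.50, 0.52, 0.48], [100.0, 110.0, 90.0]))
```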

Here are the commands to test your newly added tasks:

# prepare conda environment and data
cd MLAgentBench/benchmarks_base/${TASK_NAME}/scripts/
conda env create -f environment.yml --name ${TASK_NAME}
conda activate ${TASK_NAME}
# Install the MLAgentBench and openai packages in the newly created conda environment (see Setup) before running prepare.py
python prepare.py

# evaluate baseline method on validation set
cd ../env
python main.py -m my_method -p dev

# evaluate baseline method on test set
cp -r ../scripts/test_data/* data/ # prepare test data (updated)
cp ../scripts/test_constants.py constants.py # prepare test-time configuration
python main.py -m my_method -p test

Also, if possible, please include a background.txt file under the scripts folder with excerpts from relevant papers or technical reports written by competition participants (besides the baseline paper), containing descriptions and core code for relevant methods. See this for an example on the llm-merging task. This information will be used to inspire LLM agents toward better solutions.

The refactored code should meet the following requirements:

The command python main.py -m my_method -p dev/test stays the same for every task. Any code that deals with evaluation metrics should be read_only. Make sure the read_only files do not contain anything that is necessary for training or that the agent might need to modify for its implementation.

Others:

  • The LLM agent can “see” all files under the env/ folder, so do not put any test-time information there (including test data and the model name used in the test phase), to prevent the agent from “cheating”.
  • Put all test data under scripts/test_data.
  • Your code should not attempt to access the internet. Any pretrained models and datasets should be downloaded beforehand by prepare.py.

Acknowledgements

This repo is based on MLAgentBench, and we thank the authors for their foundational work.
