BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning

ArXiv | Hugging Face | License: CC BY 4.0 | PRs Welcome

BioProBench is the first large-scale, integrated multi-task benchmark for biological protocol understanding and reasoning, specifically designed for large language models (LLMs). It moves beyond simple QA to encompass a comprehensive suite of tasks critical for procedural text comprehension.


Biological protocols are the bedrock of reproducible and safe life science research. While LLMs have shown remarkable capabilities on general tasks, their systematic evaluation on highly specialized, accuracy-critical, and inherently procedural texts like biological protocols remains limited. BioProBench fills this gap by providing a robust framework to evaluate LLMs on diverse aspects of protocol understanding and reasoning.


BioProBench features:

  • 📚 Large-scale Data: Built upon 27K original biological protocols, yielding nearly 556K high-quality structured instances.

  • 🎯 Comprehensive Tasks: A suite of five core tasks challenging LLMs on different facets of procedural understanding and generation:

    • Protocol Question Answering (PQA)
    • Step Ordering (ORD)
    • Error Correction (ERR)
    • Protocol Generation (GEN)
    • Protocol Reasoning (REA)
  • 🧬 Broad Domain Coverage: Data sourced from 6 major repositories, covering 16 biological subdomains.

  • 🔬 Standardized Evaluation: A robust framework combining standard NLP metrics with novel domain-specific measures for accurate performance quantification.


🚀 Motivation

Biological protocols are the operational blueprint for experiments. As biological research increasingly leverages automation and AI, the ability of AI systems to understand and reason about these complex procedures is paramount. Current LLMs, while powerful, face significant challenges:

  • Limited Procedural Understanding: LLMs struggle with the temporal dependencies, conditional logic, and specific requirements embedded within protocols.
  • Lack of Systematic Evaluation: There has been a lack of large-scale, multi-task benchmarks specifically designed to diagnose LLMs' limitations on procedural biological texts.
  • Bridging the Gap: Developing AI systems capable of safely automating and even optimizing experiments requires models that can reliably interpret and generate protocols.

BioProBench addresses these challenges by providing the necessary data and tasks for comprehensive evaluation and driving the development of more capable models.


📊 Dataset Structure


BioProBench provides a layered data design to support various model development stages:

  • A raw corpus of 27K protocols for pretraining or RAG applications.
  • A substantial downstream training set of over 550K structured instances across the five fine-grained tasks for model adaptation.
  • A held-out test set of 1,000 examples per task for standardized benchmarking.

The dataset and code are publicly available.
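
As a quick way to get oriented, the snippet below loads one task's held-out test split and inspects it. This is a minimal sketch: the ../Data/{TASK}_test.json naming follows the convention used by the inference scripts (REA-ERR and REA-GEN reuse the ERR and GEN files), and each file is assumed to be a JSON array of instances.

import json

TASK_NAME = "PQA"                          # one of 'PQA', 'ORD', 'ERR', 'GEN'
test_file = f"Data/{TASK_NAME}_test.json"  # path relative to the repository root

with open(test_file, "r", encoding="utf-8") as f:
    test_items = json.load(f)

print(f"{TASK_NAME}: {len(test_items)} held-out test instances")  # 1,000 per task
print(sorted(test_items[0].keys()))        # inspect the fields of one instance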


✏️ Inference

For researchers who wish to reproduce our results or test models not covered in the paper, we provide an easy-to-use inference script in the Scripts/ directory.

Researchers can run inference either through an API or with a local model (e.g., via Hugging Face Transformers).

Use API:

cd Scripts
python generate_response.py

Before doing so, edit generate_response.py to specify your API key, model name, base URL, and other configuration details:

# ================================
# User Configuration
# ================================
        
API_KEY = 'YOUR API-KEY'       # Replace this with your API key
BASE_URL = 'https://api.openai.com/v1'  # Replace with your base URL if different
MODEL_NAME = 'o3-mini'              # Replace with your preferred model
TASK_NAME = 'PQA'                   # Task name used in file paths ('PQA', 'ORD', 'ERR', 'REA-ERR', 'GEN', 'REA-GEN')
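
As a rough mental model, an inference run of this kind amounts to a loop like the one below. This is a minimal sketch assuming an OpenAI-compatible endpoint and the openai>=1.0 Python client, not the actual contents of generate_response.py; the 'question' field and the output file name are illustrative.

import json
from openai import OpenAI

# Assumes the User Configuration variables above (API_KEY, BASE_URL, MODEL_NAME, TASK_NAME).
client = OpenAI(api_key=API_KEY, base_url=BASE_URL)

# Load the test split for the chosen task (REA-ERR / REA-GEN reuse the ERR / GEN files).
with open(f"../Data/{TASK_NAME.split('-')[-1]}_test.json", "r", encoding="utf-8") as f:
    items = json.load(f)

for item in items:
    # Prompt construction is simplified here; the real script builds a task-specific prompt.
    prompt = item.get("question", json.dumps(item))
    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[{"role": "user", "content": prompt}],
    )
    # The evaluation scripts expect the model output under 'generated_response'.
    item["generated_response"] = response.choices[0].message.content

# Save responses alongside the original items, e.g. ERR_test_o3-mini.json.
with open(f"{TASK_NAME}_test_{MODEL_NAME}.json", "w", encoding="utf-8") as f:
    json.dump(items, f, ensure_ascii=False, indent=2)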

Use Local Models:

cd Scripts
python generate_response_local.py

Before doing so, edit generate_response_local.py to specify your model name and other configuration details:

# ================================
# User Configuration
# ================================

MODEL_NAME = 'meta-llama/Meta-Llama-3-8B-Instruct'                 # or other models from huggingface or local path
TASK_NAME = 'PQA'                   # Task name used in file paths ('PQA', 'ORD', 'ERR', 'REA-ERR', 'GEN', 'REA-GEN')
TEST_FILE_PATH = f"../Data/{TASK_NAME.split('-')[-1]}_test.json"
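
For reference, local inference with Hugging Face Transformers boils down to a loop like the one below. This is a minimal sketch, not the actual contents of generate_response_local.py: it assumes the configuration variables above, an illustrative 'question' field, and greedy decoding.

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes the User Configuration variables above (MODEL_NAME, TASK_NAME, TEST_FILE_PATH).
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

with open(TEST_FILE_PATH, "r", encoding="utf-8") as f:
    items = json.load(f)

for item in items:
    # Prompt construction is simplified; the real script builds task-specific prompts.
    prompt = item.get("question", json.dumps(item))
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=512, do_sample=False)
    # Keep only the newly generated tokens and store them for evaluation.
    item["generated_response"] = tokenizer.decode(
        output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True
    )

with open(f"{TASK_NAME}_test_{MODEL_NAME.split('/')[-1]}.json", "w", encoding="utf-8") as f:
    json.dump(items, f, ensure_ascii=False, indent=2)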

🧪 Evaluation Metrics

We employ a hybrid evaluation framework that combines standard NLP metrics with novel domain-specific measures to accurately quantify model performance across all tasks.

Each task in BioProBench has a standalone evaluation script in the Metrics/ directory. To evaluate your model's outputs, make sure each item in your response JSON file stores the model's output under the 'generated_response' key, then follow the steps below.
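
Concretely, each item handed to a metric script is expected to look roughly like this; only the 'generated_response' key is prescribed here, and the other field names are illustrative rather than the dataset's actual schema:

# Illustrative shape of one item in the response JSON file.
example_item = {
    "question": "Which reagent is added in step 3?",  # task input (illustrative field name)
    "answer": "B",                                    # gold reference (illustrative field name)
    "generated_response": "B",                        # your model's output must be stored here
}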

✅ Step 1: Locate the evaluation file

Each task corresponds to one script:

Task                         | Script               | Metrics
Protocol Generation (GEN)    | ./Metrics/GEN.py     | BLEU, Keyword-based, Embedding-based, etc.
Protocol QA (PQA)            | ./Metrics/PQA.py     | Accuracy, Brier Score, etc.
Error Correction (ERR)       | ./Metrics/ERR.py     | Accuracy, Precision, Recall, F1, etc.
Step Ordering (ORD)          | ./Metrics/ORD.py     | Exact Match, Kendall's tau, etc.
Experimental Reasoning (REA) | ./Metrics/REA-ERR.py | Accuracy, Precision, Recall, F1, Consistency, etc.

✅ Step 2: Modify the script

Open the corresponding evaluation script (e.g., ERR.py) and manually set the file path to your model's response:

# Inside ERR.py

def main():
    output_file_path = "/absolute/path/to/model_response.json"  # ← Replace this!
    ...

Example:

output_file_path = "./ERR_test_o3-mini.json"

Then run the script:

cd Metrics
python ERR.py

Output Metrics

Each script prints evaluation results such as:

  • Accuracy, Precision, Recall, F1
  • Step-level metrics (e.g., Step Recall, Redundancy Penalty)
  • Ordering metrics (e.g., Kendall's Tau; see the sketch below)
  • Parsing failure rates
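
To make the ordering metrics concrete, here is a small, self-contained illustration of exact match and Kendall's tau between a gold and a predicted step order. The actual scripts also handle parsing of raw model outputs (hence the parsing failure rates above), so treat this only as a sketch of the underlying computation.

from scipy.stats import kendalltau

# Gold step order and a model's predicted order for the same protocol,
# expressed as sequences of step indices (illustrative data).
gold = [1, 2, 3, 4, 5]
pred = [1, 3, 2, 4, 5]

exact_match = int(pred == gold)

# Kendall's tau measures pairwise rank agreement between the two orderings:
# rank each step by its position in each sequence, then correlate the ranks.
gold_rank = {step: i for i, step in enumerate(gold)}
pred_rank = {step: i for i, step in enumerate(pred)}
steps = sorted(gold)
tau, _ = kendalltau([gold_rank[s] for s in steps], [pred_rank[s] for s in steps])

print(f"Exact match: {exact_match}, Kendall's tau: {tau:.3f}")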

🔬 Key Findings

We evaluated 12 mainstream open-source and closed-source LLMs on BioProBench. Our key findings reveal significant insights into the current capabilities and limitations of LLMs on biological protocols:

  • Surface vs. Deep Understanding: While top models perform reasonably well on tasks requiring surface understanding like Protocol Question Answering (e.g., ~70% PQA-Acc.) and Error Correction (e.g., >64% ERR F1), they struggle significantly with tasks demanding deeper procedural understanding and structured generation.
  • Challenges in Reasoning and Generation: Performance drops considerably on Step Ordering (e.g., ORD-EM ~50%) and Protocol Generation (e.g., GEN-BLEU <15%), highlighting the difficulty for LLMs in managing temporal dependencies and generating coherent, accurate procedures.
  • Model Variances: Comparisons show diverse performance across models. Certain open-source models approach the performance of closed-source models on some tasks, while smaller, bio-specific models often lag behind general LLMs on complex procedural content, suggesting limitations in capacity for capturing intricate dependencies.

Overall, our findings underscore that robust procedural reasoning within biological protocols represents a significant challenge for current LLMs.


🤝 Contributing

We welcome contributions to enhance BioProBench, including:

  • New protocol sources
  • 🧪 Additional biological domains
  • 🧠 Novel evaluation tasks
  • 📝 Annotation improvements

📜 Citation

@misc{bioprotocolbench2025,
  title={BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning},
  author={Yuyang Liu and Liuzhenghao Lv and Xiancheng Zhang and Li Yuan and Yonghong Tian},
  year={2025},
  url={https://arxiv.org/pdf/2505.07889}
}

📝 More Research

Two related works from our group:

  • ChemCoTBench: A step-by-step, application-oriented, high-quality benchmark for evaluating LLM reasoning in chemical applications; a public leaderboard is available.
  • ProLLaMA: A multitask protein language model enhanced by the Evolutionary Protein Generation Framework (EPGF); the project code is available on GitHub.

📧 Contact

For dataset access or collaboration inquiries: sunshineliuyuyang@gmail.com
