IDEA-XL/RAPM


Official implementation of the paper "Rethinking Text-based Protein Understanding: Retrieval or LLM?"


📖 Abstract:


In recent years, protein-text language models have been widely used to solve protein understanding tasks. Current approaches focus on integrating protein-related knowledge into LLMs through continued pretraining or multi-modal alignment, enabling LLMs to jointly understand protein sequences and textual descriptions.

However, by analysing existing model architectures and text-based protein understanding benchmarks, we find significant data leakage in current benchmarks and show that metrics such as ROUGE and BLEU are inadequate for evaluation in this domain.

To address these limitations, we reorganize existing datasets and introduce a novel OOD dataset, Prot-Inst-OOD, along with a new evaluation metric, Entity-BLEU. Furthermore, we propose a retrieval-enhanced method that significantly outperforms fine-tuned LLMs in protein-to-text generation, demonstrating both high accuracy and efficiency in training-free scenarios.


‼️ Data Leakage in Existing Protein-to-Text Benchmark


We evaluated four widely used benchmarks for text-based protein understanding: the protein comprehension tasks from Mol-Instructions [1], UniProtQA [2], Swiss-Prot Protein Caption [3], and ProteinKG25 [4].

References:

[1] Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models
[2] BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine
[3] ProtT3: Protein-to-Text Generation for Text-based Protein Understanding
[4] OntoProtein: Protein Pretraining With Ontology Embedding

For sequence retrieval, we used MMseqs2 with the following command:

mmseqs easy-search --max-accept 1 -e 1e5 -v 0 test_seqs.fasta train_seqs.fasta result.m8 tmp  

For each protein sequence in the test set, the label of its most similar counterpart in the training set was assigned as the predicted output. Note that we process different subtasks separately, instead of retrieving from mixed candidates.
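The label-assignment step described above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's code: it assumes `result.m8` uses the default MMseqs2 tab-separated layout (query ID in column 1, target ID in column 2, bit score in column 12), and the function name `predict_by_retrieval` and the `train_labels` dictionary are hypothetical.

```python
def predict_by_retrieval(m8_lines, train_labels):
    """Map each test sequence ID to the label of its best-scoring training hit.

    m8_lines: iterable of tab-separated alignment lines (assumed m8 layout).
    train_labels: hypothetical dict mapping training sequence ID -> label text.
    """
    best = {}  # query_id -> (bit_score, target_id)
    for line in m8_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query_id, target_id = fields[0], fields[1]
        bit_score = float(fields[11])
        if query_id not in best or bit_score > best[query_id][0]:
            best[query_id] = (bit_score, target_id)
    # Test sequences without any hit simply get no prediction.
    return {q: train_labels[t] for q, (_, t) in best.items() if t in train_labels}


# Toy data for illustration only.
toy_m8 = [
    "test_1\ttrain_7\t0.95\t120\t6\t0\t1\t120\t1\t120\t1e-60\t210",
    "test_1\ttrain_2\t0.40\t100\t60\t0\t1\t100\t1\t100\t1e-5\t50",
]
toy_labels = {"train_7": "ABC transporter domains", "train_2": "GGDEF domain"}
toy_predictions = predict_by_retrieval(toy_m8, toy_labels)
```

Because subtasks are processed separately, a function like this would be called once per subtask with that subtask's own training labels.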

The results, shown in the table below, demonstrate that all current LLM-based models perform worse than retrieval-based models.


We also analyzed data leakage rates, defined as the probability that the retrieval method returns a label identical to the ground truth. For Mol-Instructions, we only consider metadata matches, ignoring differences in response phrasing. As shown in the table below, data leakage is prevalent in almost all benchmarks, with UniProtQA-Protein Family being the most severe case: 97.7% of its test set can be predicted by retrieval.

Figure: (left) leakage rate of different datasets; (right) an example of data leakage.
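As a rough sketch of how a leakage rate of this kind could be computed, the snippet below counts the fraction of test samples whose retrieved label exactly matches the ground truth. The function name and the dictionary inputs are hypothetical; for Mol-Instructions, the compared values would be metadata rather than full responses.

```python
def leakage_rate(predictions, ground_truth):
    """Fraction of test samples whose retrieved label equals the true label.

    predictions: hypothetical dict, test ID -> label copied from the nearest
                 training sequence (missing IDs count as "not leaked").
    ground_truth: hypothetical dict, test ID -> true label.
    """
    if not ground_truth:
        return 0.0
    matches = sum(
        1 for test_id, label in ground_truth.items()
        if predictions.get(test_id) == label
    )
    return matches / len(ground_truth)


# Toy data for illustration only.
toy_truth = {"t1": "kinase", "t2": "ABC transporter", "t3": "protease", "t4": "lipase"}
toy_preds = {"t1": "kinase", "t2": "ABC transporter", "t3": "hydrolase"}
toy_rate = leakage_rate(toy_preds, toy_truth)  # 2 of 4 labels match
```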

Based on the above findings, we propose an Out-of-Distribution (OOD) split that is based on sequence similarity and removes samples in the training set that are highly similar to those in the test set. This split is designed to mitigate data leakage issues and provide a more accurate evaluation of model performance.

The OOD datasets can be downloaded from Hugging Face.
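The idea behind a similarity-based split can be illustrated with a short filter over alignment hits. This is only a sketch under stated assumptions: the identity threshold `max_identity` is a hypothetical parameter (the actual cutoff used for Prot-Inst-OOD is not stated here), and column 3 of each hit line is assumed to hold sequence identity as a fraction between 0 and 1.

```python
def ood_filter(train_ids, m8_lines, max_identity):
    """Drop training sequences that align too closely to any test sequence.

    train_ids: list of training sequence IDs.
    m8_lines: tab-separated hits with test sequences as queries and
              training sequences as targets (assumed layout).
    max_identity: hypothetical cutoff; hits above it mark a training
                  sequence as too similar to the test set.
    """
    too_similar = set()
    for line in m8_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        target_id, identity = fields[1], float(fields[2])
        if identity > max_identity:
            too_similar.add(target_id)
    return [t for t in train_ids if t not in too_similar]


# Toy data for illustration only.
toy_hits = ["q1\ta\t0.90", "q2\tb\t0.20"]
toy_kept = ood_filter(["a", "b", "c"], toy_hits, max_identity=0.5)
```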

📊 New Metrics


Problems in Existing Metrics

Here is an example of the evaluation results using ROUGE-L and BLEU metrics on a sample protein sequence:

Ground Truth: Upon evaluating your submitted sequence, our predictive algorithms suggest the presence of: ABC transporter domains

Prediction 1 (True Answer): The sequence you provided has been analyzed for potential protein domains or motifs. The results are: ABC transporter domains
ROUGE-L = 0.27; BLEU = 0.04

Prediction 2 (False Answer): Upon evaluating your submitted sequence, our predictive algorithms suggest the presence of: GGDEF, MHYT, EAL domains
ROUGE-L = 0.83; BLEU = 0.73


The first prediction, which is the true answer, receives low ROUGE-L and BLEU scores because its wording differs from the reference, even though the predicted domain is correct. In contrast, the second prediction, which is incorrect, achieves high scores because it reuses the reference phrasing. This highlights the inadequacy of these metrics for evaluating protein-to-text generation tasks.
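The effect can be reproduced with a simplified ROUGE-L (an F1 score over the longest common subsequence of tokens). The sketch below assumes lowercase whitespace tokenization, so its numbers will not match the reported scores exactly, but the wrong answer still scores far higher than the correct one.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """Simplified ROUGE-L F1 with whitespace tokenization (an assumption)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


ground_truth = ("Upon evaluating your submitted sequence, our predictive "
                "algorithms suggest the presence of: ABC transporter domains")
pred_true = ("The sequence you provided has been analyzed for potential protein "
             "domains or motifs. The results are: ABC transporter domains")
pred_false = ("Upon evaluating your submitted sequence, our predictive "
              "algorithms suggest the presence of: GGDEF, MHYT, EAL domains")

score_true = rouge_l_f1(pred_true, ground_truth)
score_false = rouge_l_f1(pred_false, ground_truth)
```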

Entity-BLEU

Entity-BLEU is a metric specifically designed for biological question answering, where standard NLP metrics like ROUGE and BLEU often fail to reflect the true quality of predictions. Unlike traditional metrics that treat all tokens equally, Entity-BLEU focuses on the correct identification of biological entities, such as protein domains or enzyme names, regardless of their order in the text.

The calculation process is as follows:

  1. Entity Extraction: Biological entities are extracted from both the predicted and reference answers using a curated knowledge base.
  2. BLEU Calculation: The standard BLEU score is then computed, but instead of using the raw text, it operates on the sequences of extracted entities. This makes the metric order-invariant and robust to variations in phrasing.
$$\text{Entity-BLEU} = \text{BP} \times \exp\left( \sum_{n=1}^{N} w_n \log p_n \right )$$

where BP is the brevity penalty, $w_n$ are the weights for the n-gram precision scores $p_n$, and all calculations are performed on the extracted entity sequences.

Entity-BLEU focuses on the correct identification of biological entities, providing a more accurate evaluation for protein-to-text generation tasks.

In the Prot-Inst-OOD dataset, we provide the Bio-Entity list for all answers in the "metadata". This enables direct and reliable evaluation using Entity-BLEU, as entity extraction is already performed and available for each sample.
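A minimal sketch of the calculation, assuming the entity lists are already available (for example, from the "metadata" field). Several choices here are assumptions rather than the reference implementation: entities are lowercased and sorted to obtain order-invariance, weights are uniform, and the maximum n-gram order is capped by the shorter entity list so that single-entity answers are still scored.

```python
import math
from collections import Counter


def ngram_counts(seq, n):
    """Count the n-grams (as tuples) in a sequence of entities."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def entity_bleu(pred_entities, ref_entities, max_n=4):
    """BLEU computed over entity sequences instead of raw text tokens."""
    # Assumption: sorting makes the score independent of entity order.
    pred = sorted(e.lower() for e in pred_entities)
    ref = sorted(e.lower() for e in ref_entities)
    if not pred or not ref:
        return 0.0
    n_max = min(max_n, len(pred), len(ref))
    log_precision = 0.0
    for n in range(1, n_max + 1):
        pred_counts = ngram_counts(pred, n)
        ref_counts = ngram_counts(ref, n)
        overlap = sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
        if overlap == 0:
            return 0.0
        # Uniform weights w_n = 1 / n_max.
        log_precision += math.log(overlap / sum(pred_counts.values())) / n_max
    # Brevity penalty: penalize predictions with fewer entities than the reference.
    bp = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return bp * math.exp(log_precision)


# Entities read off the ABC transporter example above (illustrative only).
score_correct = entity_bleu(["ABC transporter"], ["ABC transporter"])
score_wrong = entity_bleu(["GGDEF", "MHYT", "EAL"], ["ABC transporter"])
```

Under these assumptions, the correctly identified domain receives a full score and the fluent but wrong answer receives zero, which is the opposite of the ROUGE-L and BLEU ranking.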

🚀 Retrieval-Augmented Protein Modeling (RAPM)


Model Overview and Results


RAPM is a retrieval-augmented method that enhances protein understanding by integrating retrieval mechanisms with language models. It retrieves relevant protein sequences from a database and uses them to inform the generation of text-based answers.


We evaluate RAPM on the Prot-Inst-OOD dataset, comparing it with fine-tuned LLMs and retrieval-based methods. The results demonstrate that RAPM achieves superior performance in terms of both accuracy and efficiency.


Implementation

  1. Prepare environment:
  • python=3.12.7
  • torch=2.7.0+cu118
  • transformers=4.45.0
  • cuda=11.7
  • mmseqs2 (for sequence retrieval)

You can use the following commands to set up the environment:

conda create -n RAPM python=3.12.7
conda activate RAPM
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt

MMseqs2 can be installed from the official repository.

  2. Download the Prot-Inst-OOD dataset from Hugging Face.

Put the dataset in the dataset folder, or construct your own dataset in the same format. An example of the dataset structure is provided in dataset/example_task.json.

  3. Run the simple retrieval methods (MMseqs2 and ESM2 embedding retrieval):
python retrival_methods/simple_retrieval.py
  4. Run RAPM:

Build the bio-knowledge database and run RAPM with the following command:

python RAPM/RAG_prompt_cons.py dataset 10 

Run LLM inference on the generated RAG prompts:

python RAPM/LLM_inference.py <task_name> <k>
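To illustrate the general idea of retrieval-augmented prompting behind these two steps, the sketch below formats the top-k retrieved annotations into a prompt for an LLM. It is not the logic of `RAG_prompt_cons.py`: the function name, the prompt wording, and the `(similarity, annotation)` input format are all hypothetical.

```python
def build_rag_prompt(question, retrieved, k=10):
    """Assemble a prompt from the k most similar retrieved annotations.

    question: the task question about the query protein.
    retrieved: hypothetical list of (similarity, annotation) tuples for
               proteins found by sequence or embedding retrieval.
    k: number of retrieved examples to include (placeholder default).
    """
    top = sorted(retrieved, key=lambda item: item[0], reverse=True)[:k]
    lines = ["Annotations of similar proteins, most similar first:"]
    for rank, (similarity, annotation) in enumerate(top, 1):
        lines.append(f"{rank}. (similarity {similarity:.2f}) {annotation}")
    lines.append("")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


# Toy data for illustration only.
toy_retrieved = [(0.50, "kinase domain"), (0.90, "ABC transporter domains"), (0.10, "unknown")]
toy_prompt = build_rag_prompt("Which domains does this protein contain?", toy_retrieved, k=2)
```

The resulting string would then be passed to whichever LLM is used for inference; because no fine-tuning is involved, the quality of the answer depends largely on what the retrieval step returns.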

Citation

If you find our work useful for your research and applications, please cite using this BibTeX:

@misc{wu2025rethinkingtextbasedproteinunderstanding,
      title={Rethinking Text-based Protein Understanding: Retrieval or LLM?}, 
      author={Juntong Wu and Zijing Liu and He Cao and Hao Li and Bin Feng and Zishan Shu and Ke Yu and Li Yuan and Yu Li},
      year={2025},
      eprint={2505.20354},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.20354}, 
}
