Pep2Prob Benchmark provides code and data to reproduce the experiments from
“Pep2Prob Benchmark: Predicting Fragment Ion Probability for MS²-based Proteomics”.
Tandem mass spectrometry (MS²) identifies peptides by fragmenting precursor ions and measuring resulting spectra.
Accurate fragment ion probability models are crucial for downstream tasks, including database search, spectral library matching, de novo sequencing, and other tools for peptide identification and quantification from MS² data.
Our Pep2Prob Benchmark provides:
- The first curated dataset, Pep2Prob, containing peptide-specific fragment probabilities: each precursor (peptide sequence, charge state) is paired with a vector of probabilities for a fixed list of fragment ions that may appear for that precursor.
- A train-test split method to prevent data leakage.
- A standardized benchmark with five baseline methods of increasing capacity: a global model, a bag-of-fragment-ions model, a linear regression model, a ResNet, and a Transformer. We train these models on the Pep2Prob dataset to predict the fragment probability vector of a given precursor and evaluate the predictions.
- 610,117 unique precursors (peptide sequence + charge)
- Constructed from 183 million high-resolution HCD MS² spectra from 227 mass spectrometry datasets in the MassIVE repository.
- 235 possible fragment ions per precursor (b- and y-ions with charges up to 3, plus the a-ion at position 2 with charge 1)
- Probability vectors $p(f \mid p) \in [0,1]^{235}$, estimated by counting the presence of each fragment ion across the repeated spectra of a precursor (see the sketch after this list)
- A train/test split that avoids leakage by grouping similar sequences (identical, or sharing a 6-mer prefix/suffix) into disjoint folds
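The construction of these vectors can be sketched as follows: enumerate the fixed fragment-ion vocabulary, then count in how many of a precursor's repeated spectra each ion appears. This is a minimal sketch, assuming fragment positions 1–39 (maximal peptide length 40, as in the transformer command below); the label format and the `observed_fragments` column are illustrative assumptions, not the dataset's actual schema.

```python
import numpy as np
import pandas as pd

def fragment_vocabulary(max_len: int = 40) -> list[str]:
    """Enumerate the 235 fragment-ion labels: b- and y-ions at positions
    1..max_len-1 with charges 1-3, plus the a2 ion at charge 1."""
    vocab = []
    for pos in range(1, max_len):          # 39 cleavage positions
        for charge in (1, 2, 3):
            vocab.append(f"b{pos}^{charge}")
            vocab.append(f"y{pos}^{charge}")
    vocab.append("a2^1")                   # the single a-ion
    return vocab                           # 39 * 3 * 2 + 1 = 235 labels

def estimate_probability_vector(spectra: pd.DataFrame, vocab: list[str]) -> np.ndarray:
    """Estimate p(f | precursor) as the fraction of the precursor's repeated
    spectra in which each fragment ion f is observed. `spectra` is assumed
    to hold one row per spectrum, with a hypothetical 'observed_fragments'
    column containing the set of fragment labels detected in that spectrum."""
    index = {f: i for i, f in enumerate(vocab)}
    counts = np.zeros(len(vocab))
    for frags in spectra["observed_fragments"]:
        for f in frags:
            counts[index[f]] += 1
    return counts / len(spectra)           # probabilities in [0, 1]^235
```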
We evaluate five methods on Pep2Prob, measuring the test L₁ error, spectral angle (SA), and existence accuracy:
| Model | Capacity | Test L₁ ↓ | SA ↑ | Existence Accuracy ↑ |
|---|---|---|---|---|
| Global | global stats only | 0.244 | 0.558 | 0.699 |
| BoF | + fragment sequence | 0.179 | 0.509 | 0.803 |
| Linear Reg | one-hot features | 0.126 | 0.695 | 0.766 |
| ResNet | 4-layer MLP | 0.069 | 0.818 | 0.871 |
| Transformer | decoder-only | 0.056 | 0.845 | 0.953 |
Model capacity correlates with performance: higher-capacity models better capture the complex sequence-to-fragment relationships.
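For reference, a plausible reading of the three metrics is sketched below. The presence threshold in existence accuracy and the exact SA normalization are assumptions here; see the paper for the authoritative definitions.

```python
import numpy as np

def l1_error(pred: np.ndarray, true: np.ndarray) -> float:
    """Mean absolute difference between predicted and empirical vectors."""
    return float(np.mean(np.abs(pred - true)))

def spectral_angle(pred: np.ndarray, true: np.ndarray) -> float:
    """Normalized spectral contrast angle, as commonly used in spectrum
    prediction work; 1 = identical vectors, 0 = orthogonal vectors."""
    p = pred / (np.linalg.norm(pred) + 1e-12)
    t = true / (np.linalg.norm(true) + 1e-12)
    cos_sim = np.clip(np.dot(p, t), -1.0, 1.0)
    return float(1 - 2 * np.arccos(cos_sim) / np.pi)

def existence_accuracy(pred: np.ndarray, true: np.ndarray, thr: float = 0.5) -> float:
    """Fraction of fragment ions whose predicted presence (probability > thr)
    agrees with the empirical presence (threshold is an assumed example)."""
    return float(np.mean((pred > thr) == (true > thr)))
```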
- Clone the repo

  ```bash
  git clone https://github.com/Bandeira-Lab/pep2prob-benchmark.git
  cd pep2prob-benchmark
  ```
- Set up the environment

  You can install the PyTorch packages in versions that match your hardware, or reproduce our environment with the following commands:

  ```bash
  conda create -n pep2prob-env python=3.11
  conda activate pep2prob-env
  # for Linux and Windows (CUDA 12.4):
  pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
  pip install -r requirements.txt
  ```
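  After installation, a quick sanity check (assuming the CUDA 12.4 wheels above) confirms that PyTorch sees your hardware:

  ```python
  import torch

  print(torch.__version__)           # expected: 2.6.0
  print(torch.cuda.is_available())   # True if the CUDA build found a GPU
  ```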
- Download the dataset & train-test split files

  You can use the following command to download our dataset from Hugging Face (https://huggingface.co/datasets/bandeiralab/Pep2Prob). The dataset will be stored in data/pep2prob.

  ```bash
  python data/download_data.py
  ```
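  Once downloaded, you can take a first look at the files. The file names below are the ones passed to the transformer command later in this README; the split-file layout is an assumption, so check the repo for details.

  ```python
  import numpy as np
  import pandas as pd

  # File names as passed to the transformer command below.
  df = pd.read_csv("data/pep2prob/pep2prob_dataset.csv")
  split = np.load("data/pep2prob/train_test_split_set_1.npy", allow_pickle=True)

  print(df.shape)   # one row per precursor (peptide sequence + charge)
  print(split)      # train/test indices; see the repo for the exact layout
  ```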
- Run the baseline models

  You can run each of the following models separately. The outputs and final predictions of each model are saved in the predictions folder. For details on each baseline model, please see our paper.
- Global model

  ```bash
  python -u -m models.global.global_model
  ```
- Bag of Fragment ion model

  ```bash
  python -u -m models.bag_of_fragment_ion.bof_model
  ```
- Linear regression model

  ```bash
  python -u -m models.linear_regression.linear_regression_model
  ```
- ResNet model

  ```bash
  python -u -m models.resnet.resnet_model
  ```
- Transformer model

  For the transformer model, you can adjust the number of training epochs, the batch size, the learning rate, the weight decay, and the maximal peptide sequence length.

  ```bash
  python -u -m models.transformer.transformer_model \
      --precursor_info_path data/pep2prob/pep2prob_dataset.csv \
      --split_path data/pep2prob/train_test_split_set_1.npy \
      --epochs 2 \
      --batch_size 1024 \
      --lr 0.001 \
      --weight_decay 0.001 \
      --save_prefix predictions/transformer_model_run0 \
      --max_length_input 40
  ```
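  These flags also make it easy to script a small hyperparameter sweep. The following driver is only an example: the flag names are taken from the command above, while the value grid and run naming are assumptions of ours.

  ```python
  import subprocess

  # Sweep learning rate and weight decay over the documented CLI flags.
  for lr in (1e-3, 3e-4):
      for wd in (1e-3, 1e-4):
          save_prefix = f"predictions/transformer_lr{lr}_wd{wd}"  # example naming
          subprocess.run([
              "python", "-u", "-m", "models.transformer.transformer_model",
              "--precursor_info_path", "data/pep2prob/pep2prob_dataset.csv",
              "--split_path", "data/pep2prob/train_test_split_set_1.npy",
              "--epochs", "2",
              "--batch_size", "1024",
              "--lr", str(lr),
              "--weight_decay", str(wd),
              "--save_prefix", save_prefix,
              "--max_length_input", "40",
          ], check=True)
  ```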