Pep2Prob Benchmark provides code and data to reproduce the experiments from
“Pep2Prob Benchmark: Predicting Fragment Ion Probability for MS²-based Proteomics”.
Tandem mass spectrometry (MS²) identifies peptides by fragmenting precursor ions and measuring resulting spectra.
Accurate fragment ion probability models are crucial for downstream tasks, including database search, spectral library matching, de novo sequencing, and other tools for peptide identification and quantification from MS² data.
Our Pep2Prob Benchmark provides:
- The first curated dataset, Pep2Prob, containing peptide-specific fragment probabilities: each precursor (peptide sequence, charge state) is paired with a vector of probabilities for a fixed list of fragment ions that may appear for that precursor.
- A train-test split method to prevent data leakage.
- A standardized benchmark with five baseline methods of increasing capacity: a global model, a bag-of-fragment-ions model, a linear regression model, a ResNet, and a Transformer. We train these models on the Pep2Prob dataset to predict the fragment probability vector of a given precursor and evaluate the predictions.
- 610,117 unique precursors (peptide sequence + charge)
- Constructed from 183 million high-resolution HCD MS² spectra from 227 mass spectrometry datasets in the MassIVE repository.
- 235 possible fragment ions per precursor (b- and y-ions with charges up to 3, plus the a-ion at position 2 with charge 1)
- Probability vectors $p(f \mid p) \in [0,1]^{235}$, estimated by counting the presence of each fragment ion across the repeated spectra of a precursor (see the sketch after this list)
- A train/test split that avoids leakage by grouping similar sequences (identical, or sharing a 6-mer prefix/suffix) into disjoint folds
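The construction of these vectors can be sketched as follows: enumerate the fixed fragment-ion vocabulary, then count in how many of a precursor's repeated spectra each ion appears. This is a minimal sketch, assuming fragment positions 1–39 (maximal peptide length 40, as in the transformer command below); the label format and the `observed_fragments` column are illustrative assumptions, not the dataset's actual schema.

```python
import numpy as np
import pandas as pd

def fragment_vocabulary(max_len: int = 40) -> list[str]:
    """Enumerate the 235 fragment-ion labels: b- and y-ions at positions
    1..max_len-1 with charges 1-3, plus the a2 ion at charge 1."""
    vocab = []
    for pos in range(1, max_len):          # 39 cleavage positions
        for charge in (1, 2, 3):
            vocab.append(f"b{pos}^{charge}")
            vocab.append(f"y{pos}^{charge}")
    vocab.append("a2^1")                   # the single a-ion
    return vocab                           # 39 * 3 * 2 + 1 = 235 labels

def estimate_probability_vector(spectra: pd.DataFrame, vocab: list[str]) -> np.ndarray:
    """Estimate p(f | precursor) as the fraction of the precursor's repeated
    spectra in which each fragment ion f is observed. `spectra` is assumed
    to hold one row per spectrum, with a hypothetical 'observed_fragments'
    column containing the set of fragment labels detected in that spectrum."""
    index = {f: i for i, f in enumerate(vocab)}
    counts = np.zeros(len(vocab))
    for frags in spectra["observed_fragments"]:
        for f in frags:
            counts[index[f]] += 1
    return counts / len(spectra)           # probabilities in [0, 1]^235
```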
We evaluate five methods on Pep2Prob, measuring the test L₁ error, spectral angle (SA), and existence accuracy:
| Model | Capacity | Test L₁ ↓ | SA ↑ | Existence Accuracy ↑ |
|---|---|---|---|---|
| Global | global stats only | 0.244 | 0.558 | 0.699 |
| BoF | + fragment sequence | 0.179 | 0.509 | 0.803 |
| Linear Reg | one-hot features | 0.126 | 0.695 | 0.766 |
| ResNet | 4-layer MLP | 0.069 | 0.818 | 0.871 |
| Transformer | decoder-only | 0.056 | 0.845 | 0.953 |
Model capacity correlates with performance: higher-capacity models better capture the complex sequence-to-fragment relationships.
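For reference, a plausible reading of the three metrics is sketched below. The presence threshold in existence accuracy and the exact SA normalization are assumptions here; see the paper for the authoritative definitions.

```python
import numpy as np

def l1_error(pred: np.ndarray, true: np.ndarray) -> float:
    """Mean absolute difference between predicted and empirical vectors."""
    return float(np.mean(np.abs(pred - true)))

def spectral_angle(pred: np.ndarray, true: np.ndarray) -> float:
    """Normalized spectral contrast angle, as commonly used in spectrum
    prediction work; 1 = identical vectors, 0 = orthogonal vectors."""
    p = pred / (np.linalg.norm(pred) + 1e-12)
    t = true / (np.linalg.norm(true) + 1e-12)
    cos_sim = np.clip(np.dot(p, t), -1.0, 1.0)
    return float(1 - 2 * np.arccos(cos_sim) / np.pi)

def existence_accuracy(pred: np.ndarray, true: np.ndarray, thr: float = 0.5) -> float:
    """Fraction of fragment ions whose predicted presence (probability > thr)
    agrees with the empirical presence (threshold is an assumed example)."""
    return float(np.mean((pred > thr) == (true > thr)))
```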
- Clone the repo

  ```bash
  git clone https://github.com/Bandeira-Lab/pep2prob-benchmark.git
  cd pep2prob-benchmark
  ```
- Set up the environment

  You can install the PyTorch packages in versions that match your hardware, or reproduce our environment with the following commands:

  ```bash
  conda create -n pep2prob-env python=3.11
  conda activate pep2prob-env
  # for Linux and Windows (CUDA 12.4):
  pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
  pip install -r requirements.txt
  ```
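  After installation, a quick sanity check (assuming the CUDA 12.4 wheels above) confirms that PyTorch sees your hardware:

  ```python
  import torch

  print(torch.__version__)           # expected: 2.6.0
  print(torch.cuda.is_available())   # True if the CUDA build found a GPU
  ```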
- Download the dataset & train-test split files

  You can use the following command to download our dataset from Hugging Face (https://huggingface.co/datasets/bandeiralab/Pep2Prob). The dataset will be stored in data/pep2prob.

  ```bash
  python data/download_data.py
  ```
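  Once downloaded, you can take a first look at the files. The file names below are the ones passed to the transformer command later in this README; the split-file layout is an assumption, so check the repo for details.

  ```python
  import numpy as np
  import pandas as pd

  # File names as passed to the transformer command below.
  df = pd.read_csv("data/pep2prob/pep2prob_dataset.csv")
  split = np.load("data/pep2prob/train_test_split_set_1.npy", allow_pickle=True)

  print(df.shape)   # one row per precursor (peptide sequence + charge)
  print(split)      # train/test indices; see the repo for the exact layout
  ```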
- Run the baseline models

  You can run each of the following models separately. The outputs and final predictions of each model are saved in the predictions folder. For details on each baseline model, please see our paper.
- Global model

  ```bash
  python -u -m models.global.global_model
  ```
- Bag of Fragment ion model

  ```bash
  python -u -m models.bag_of_fragment_ion.bof_model
  ```
- Linear regression model

  ```bash
  python -u -m models.linear_regression.linear_regression_model
  ```
- ResNet model

  ```bash
  python -u -m models.resnet.resnet_model
  ```
- Transformer model

  For the transformer model, you can adjust the number of training epochs, the batch size, the learning rate, the weight decay, and the maximal peptide sequence length.

  ```bash
  python -u -m models.transformer.transformer_model \
      --precursor_info_path data/pep2prob/pep2prob_dataset.csv \
      --split_path data/pep2prob/train_test_split_set_1.npy \
      --epochs 2 \
      --batch_size 1024 \
      --lr 0.001 \
      --weight_decay 0.001 \
      --save_prefix predictions/transformer_model_run0 \
      --max_length_input 40
  ```
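  These flags also make it easy to script a small hyperparameter sweep. The following driver is only an example: the flag names are taken from the command above, while the value grid and run naming are assumptions of ours.

  ```python
  import subprocess

  # Sweep learning rate and weight decay over the documented CLI flags.
  for lr in (1e-3, 3e-4):
      for wd in (1e-3, 1e-4):
          save_prefix = f"predictions/transformer_lr{lr}_wd{wd}"  # example naming
          subprocess.run([
              "python", "-u", "-m", "models.transformer.transformer_model",
              "--precursor_info_path", "data/pep2prob/pep2prob_dataset.csv",
              "--split_path", "data/pep2prob/train_test_split_set_1.npy",
              "--epochs", "2",
              "--batch_size", "1024",
              "--lr", str(lr),
              "--weight_decay", str(wd),
              "--save_prefix", save_prefix,
              "--max_length_input", "40",
          ], check=True)
  ```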