A framework for deep learning transformer encoder/decoder models that predict peptide sequences from spectra. This repo is currently a work in progress.
Pretrained Spec2PSM models are available on the Hugging Face Hub: https://huggingface.co/hinklet
`spec2psm` is a command-line tool for managing and running various modes of the Spec2PSM pipeline. It supports parsing spectra, training models, fine-tuning, validating models, and running inference.
spec2psm [mode] [options]
Convert a spectrum file and/or a PSM file using the provided configuration.
spec2psm convert -c <config_path> -m <mzml_paths> [-s <search_paths>] [-f <fdr_paths>] [-u <mods_path>] [-o <output_directory>]
- `-c`, `--config_path` (required): Path to the configuration file for parsing.
- `-m`, `--mzml_paths` (required): One or more mzML file paths to convert. Provide as a space-separated list for multiple paths.
- `-s`, `--search_paths` (optional): One or more search result file paths (mzid or pepXML). Provide as a space-separated list for multiple paths. Default: None.
- `-f`, `--fdr_paths` (optional): One or more Percolator result file paths. Requires both mzML and pepXML paths to be provided. Default: None.
- `-u`, `--mods_path` (optional): Path to a file specifying modification settings. Default: None.
- `-o`, `--output_directory` (optional): Directory to save the Parquet files. If not specified, the files are saved in the same directory as the mzML files.
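When `convert` is called from a script, it can help to assemble the argument list programmatically and catch option conflicts early. The sketch below is a hypothetical helper, not part of Spec2PSM: the function name `build_convert_command` is an assumption, and it uses only the flags listed above.

```python
import shlex
from typing import List, Optional, Sequence


def build_convert_command(
    config_path: str,
    mzml_paths: Sequence[str],
    search_paths: Optional[Sequence[str]] = None,
    fdr_paths: Optional[Sequence[str]] = None,
    mods_path: Optional[str] = None,
    output_directory: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for `spec2psm convert` (hypothetical helper)."""
    if not mzml_paths:
        raise ValueError("at least one mzML path is required (-m)")
    if fdr_paths and not search_paths:
        # The option list states that -f needs both mzML and pepXML paths.
        raise ValueError("-f requires search result paths (-s) as well")

    cmd = ["spec2psm", "convert", "-c", str(config_path), "-m", *map(str, mzml_paths)]
    if search_paths:
        cmd += ["-s", *map(str, search_paths)]
    if fdr_paths:
        cmd += ["-f", *map(str, fdr_paths)]
    if mods_path:
        cmd += ["-u", str(mods_path)]
    if output_directory:
        cmd += ["-o", str(output_directory)]
    return cmd


cmd = build_convert_command(
    "config.yaml",
    ["/path/to/file1.mzML", "/path/to/file2.mzML"],
    search_paths=["/path/to/file1.pepXML", "/path/to/file2.pepXML"],
    output_directory="/path/to/output_directory",
)
# Print a copy-pasteable shell line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Returning a list rather than a string keeps paths with spaces intact when the command is later handed to `subprocess.run`.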
Train a model using the specified training and validation datasets.
spec2psm train -t <train_parquet_paths> [-v <val_parquet_paths>] [-o <output_model_name>] [-c <config_path>] [-d <device>]
- `-t`, `--train_parquet_paths` (required): Directories or Parquet file paths for training.
- `-v`, `--val_parquet_paths` (optional): Directories or Parquet file paths for validation.
- `-o`, `--output_model_name` (optional): Path to save the trained model weights.
- `-c`, `--config_path` (required): Path to the training configuration file. Default: None.
- `-d`, `--device` (required): Device for training (`mps`, `cpu`, or `gpu`). Default: None.
Fine-tune a pre-trained model.
spec2psm tune -m <model> -t <train_parquet_paths> [-v <val_parquet_paths>] [-o <output_model_name>] [-c <config_path>] [-d <device>]
- `-m`, `--model` (required): Path to the model weights or Hugging Face model name.
- `-t`, `--train_parquet_paths` (required): Directories or Parquet file paths for training.
- `-v`, `--val_parquet_paths` (optional): Directories or Parquet file paths for validation.
- `-o`, `--output_model_name` (optional): Path to save the fine-tuned model weights.
- `-c`, `--config_path` (required): Path to the fine-tuning configuration file. Default: None.
- `-d`, `--device` (required): Device for fine-tuning (`mps`, `cpu`, or `gpu`). Default: None.
Run validation on a model.
spec2psm validate -m <model> -p <parquet_paths> [-c <config_path>] [-d <device>]
- `-m`, `--model` (required): Path to the model weights or Hugging Face model name.
- `-p`, `--parquet_paths` (required): Directories or Parquet file paths for validation.
- `-c`, `--config_path` (required): Path to the inference configuration file. Default: None.
- `-d`, `--device` (required): Device for validation (`mps`, `cpu`, or `gpu`). Default: None.
Run inference using a model.
spec2psm infer -m <model> -p <parquet_paths> [-c <config_path>] [-d <device>]
- `-m`, `--model` (required): Path to the model weights or Hugging Face model name.
- `-p`, `--parquet_paths` (required): Directories or Parquet file paths for inference.
- `-c`, `--config_path` (required): Path to the inference configuration file. Default: None.
- `-d`, `--device` (optional): Device for inference (`mps`, `cpu`, or `gpu`). Default: None.
spec2psm convert -c config.yaml -m /path/to/file1.mzML
spec2psm convert -c config.yaml -m /path/to/file1.mzML /path/to/file2.mzML
spec2psm convert -c config.yaml -m /path/to/file1.mzML -s /path/to/file1.pepXML
spec2psm convert -c config.yaml -m /path/to/file1.mzML /path/to/file2.mzML \
-s /path/to/file1.pepXML /path/to/file2.pepXML \
-f /path/to/fdr_results1.tsv /path/to/fdr_results2.tsv \
-o /path/to/output_directory -u /path/to/mods_config.txt
spec2psm train -c config.yaml -t /path/to/train_data.parquet -v /path/to/val_data.parquet -o /path/to/model_output.pt -d mps
spec2psm train -c config.yaml -t /path/to/train_data1.parquet /path/to/train_data2.parquet /path/to/train_data3.parquet -v /path/to/val_data.parquet -o /path/to/model_output.pt -d mps
spec2psm tune -c config.yaml -m huggingface-model-name -t /path/to/train_data.parquet -v /path/to/val_data.parquet -o /path/to/model_output.pt -d mps
spec2psm validate -c config.yaml -m /path/to/model.pt -p /path/to/val_data.parquet -d cpu
spec2psm infer -c config.yaml -m /path/to/model.pt -p /path/to/data.parquet -d cpu
spec2psm infer -c config.yaml -m /path/to/model.pt -p /path/to/data1.parquet /path/to/data2.parquet -d cpu
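The example commands above can also be chained from a script. The following is a minimal sketch under two assumptions: the file written by `train -o` can be passed straight to `-m` in later modes, and stopping at the first failing step is the desired behavior. `run_steps` is a hypothetical helper, and `dry_run=True` only prints the commands.

```python
import shlex
import subprocess
from typing import List, Sequence


def run_steps(steps: Sequence[List[str]], dry_run: bool = True) -> List[str]:
    """Print each command and, unless dry_run is set, run it; stop on the first failure."""
    rendered = []
    for args in steps:
        line = shlex.join(args)
        rendered.append(line)
        print(line)
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(args, check=True)
    return rendered


model = "/path/to/model_output.pt"  # placeholder path, as in the examples above
steps = [
    ["spec2psm", "train", "-c", "config.yaml",
     "-t", "/path/to/train_data.parquet",
     "-v", "/path/to/val_data.parquet",
     "-o", model, "-d", "cpu"],
    ["spec2psm", "validate", "-c", "config.yaml",
     "-m", model, "-p", "/path/to/val_data.parquet", "-d", "cpu"],
    ["spec2psm", "infer", "-c", "config.yaml",
     "-m", model, "-p", "/path/to/data.parquet", "-d", "cpu"],
]
rendered = run_steps(steps, dry_run=True)
```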
- Ensure all paths are correct and accessible.
- Select the appropriate device (`mps`, `cpu`, `gpu`) based on your hardware.
- For more details or troubleshooting, refer to the tool's help command:
spec2psm --help
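For scripts that run on different machines, the `-d` value can be chosen at runtime. The sketch below is an illustration, not part of Spec2PSM. It assumes that the CLI's `gpu` option corresponds to a CUDA device and that PyTorch may or may not be installed; `pick_device` is a hypothetical name.

```python
def pick_device() -> str:
    """Return one of the device names the CLI lists: "gpu", "mps", or "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to probe; fall back to the CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "gpu"  # assumption: "gpu" in the CLI means CUDA
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```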
Default configuration files for small, medium, and large models are in spec2psm/config. An example modification configuration file is in the same directory; it adds more modification tokens for file parsing, model training, validation, and inference.
Configuration Sections
- Tokens
  - `max_peptide_length`: Maximum length of peptide sequences to be predicted. Default: 62
  - `spectra_length`: Maximum length of the input spectra. Default: 150
  - `row_group_size`: Number of rows in each group for creating or reading Parquet files, impacting file I/O performance. Default: 500
  - `token_map_size`: Specifies the granularity of token mapping. Options: small, medium, large. Default: medium
- Model
  - `batch_size`: Batch size for training and validation. Default: 32
  - `d_model`: Dimensionality of the model's embedding space. Default: 512
  - `ff_dim`: Dimensionality of the feed-forward layer in encoder and decoder layers. Default: 1024
  - `dropout`: Dropout rate for regularization. Default: 0.1
  - `nheads`: Number of attention heads in the multi-head attention mechanism. Default: 8
  - `decoder_layers`: Number of decoder layers in the transformer. Default: 4
  - `encoder_layers`: Number of encoder layers in the transformer. Default: 4
- Train
  - `lr`: Learning rate for optimizer. Default: 1e-4
  - `weight_decay`: Weight decay for regularization. Default: 1e-5
  - `warmup_steps`: Number of warmup steps for learning rate scheduling. Default: 1000
  - `total_model_steps`: Total number of steps for the learning rate decay schedule. Default: 30,000,000
  - `num_epochs`: Number of epochs to train. Default: 1
  - `metric_window`: Window size for calculating rolling averages of metrics during training. Default: 100
  - `plot_per_x_batches`: Frequency (in batches) of updating training metric plots. Default: 1000
  - `val_per_x_batches`: Frequency (in batches) of performing model validation. Default: 250,000
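To make the interaction of `lr`, `warmup_steps`, and `total_model_steps` concrete, here is a small sketch of one common schedule: linear warmup followed by cosine decay. The scheduler Spec2PSM actually uses is not stated here, so the shape of the curve is an assumption and `lr_at_step` is a hypothetical name.

```python
import math


def lr_at_step(
    step: int,
    base_lr: float = 1e-4,
    warmup_steps: int = 1000,
    total_model_steps: int = 30_000_000,
) -> float:
    """Learning rate at a given optimizer step (assumed warmup + cosine decay)."""
    if step < warmup_steps:
        # Linear ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_model_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Cosine curve from base_lr (progress = 0) down to 0 (progress = 1).
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 500, 1000, 15_000_000, 30_000_000):
    print(step, lr_at_step(step))
```

Under this assumed schedule, a `total_model_steps` far larger than the number of steps actually trained keeps the learning rate close to `lr` for the whole run.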
Example medium-sized configuration snippet:
tokens:
  max_peptide_length: 62 # Max length of peptides to be predicted
  spectra_length: 150 # Max length of an input spectrum
  row_group_size: 500 # Parameter used for creating / reading Parquet files
  token_map_size: "medium" # Set to small, medium, or large; each setting adds more or fewer modifications
model:
  batch_size: 32 # Batch size for training and validation
  d_model: 512 # Transformer model dimension
  ff_dim: 1024 # Feed-forward dimension for the encoder / decoder layers
  dropout: 0.1 # Dropout rate
  nheads: 8 # Number of attention heads
  decoder_layers: 4 # Number of decoder layers
  encoder_layers: 4 # Number of encoder layers
train:
  lr: 1e-4 # Learning rate
  weight_decay: 1e-5 # Weight decay
  warmup_steps: 1000 # Number of warmup steps
  total_model_steps: 30_000_000 # Total model steps (used for the learning rate decay)
  num_epochs: 1 # Number of epochs to train
  metric_window: 100 # Window used for calculating rolling averages of metrics
  plot_per_x_batches: 1000 # Update metric plots every this many batches
  val_per_x_batches: 250000 # Run model validation every this many batches
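Before starting a long run, a few sanity checks on an edited configuration can save time. The sketch below works on a plain dictionary that mirrors the snippet above. The checks are assumptions based on standard transformer practice, not rules taken from Spec2PSM: for example, multi-head attention usually requires `d_model` to be divisible by `nheads`. `check_config` is a hypothetical name.

```python
from typing import Any, Dict, List


def check_config(cfg: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    tokens, model, train = cfg["tokens"], cfg["model"], cfg["train"]

    if tokens["token_map_size"] not in {"small", "medium", "large"}:
        problems.append("token_map_size must be small, medium, or large")
    if model["d_model"] % model["nheads"] != 0:
        problems.append("d_model should be divisible by nheads")
    if not 0.0 <= float(model["dropout"]) < 1.0:
        problems.append("dropout should be in [0, 1)")
    # float() also covers loaders that read a value such as 1e-4 as a string.
    if float(train["lr"]) <= 0.0:
        problems.append("lr must be positive")
    if int(train["warmup_steps"]) >= int(train["total_model_steps"]):
        problems.append("warmup_steps should be smaller than total_model_steps")
    return problems


medium = {
    "tokens": {"max_peptide_length": 62, "spectra_length": 150,
               "row_group_size": 500, "token_map_size": "medium"},
    "model": {"batch_size": 32, "d_model": 512, "ff_dim": 1024, "dropout": 0.1,
              "nheads": 8, "decoder_layers": 4, "encoder_layers": 4},
    "train": {"lr": 1e-4, "weight_decay": 1e-5, "warmup_steps": 1000,
              "total_model_steps": 30_000_000, "num_epochs": 1},
}
problems = check_config(medium)
print(problems or "no problems found")
```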
TODO