SIMBA is a transformer-based neural network that accurately predicts chemical structural similarity from tandem mass spectrometry (MS/MS) spectra. Unlike traditional methods relying on heuristic metrics (e.g., modified cosine similarity), SIMBA directly models structural differences, enabling precise analog identification in metabolomics.
SIMBA predicts two interpretable metrics:
- Substructure Edit Distance: Number of molecular graph edits required to convert one molecule into another.
- Maximum Common Edge Substructure (MCES) Distance: Number of bond modifications required to achieve molecular equivalence.
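To make the MCES distance concrete, here is a minimal brute-force sketch for very small heavy-atom graphs. It assumes one common formulation (edges in both molecules minus twice the edges they share), ignores bond orders, and does not require the shared substructure to be connected. The function name `mces_distance` and the input encoding are illustrative and not part of SIMBA, which predicts this value from spectra rather than computing it from structures.

```python
from itertools import permutations


def mces_distance(atoms_a, bonds_a, atoms_b, bonds_b):
    """Brute-force MCES distance for tiny graphs (illustrative only).

    atoms_*: list of element symbols, one per heavy atom.
    bonds_*: list of (i, j) index pairs; bond orders are ignored here.
    """
    # Make A the smaller graph so each of its atoms can be mapped into B.
    if len(atoms_a) > len(atoms_b):
        atoms_a, bonds_a, atoms_b, bonds_b = atoms_b, bonds_b, atoms_a, bonds_a
    edges_b = {frozenset(bond) for bond in bonds_b}

    best = 0
    # Try every injective atom mapping A -> B (exponential; toy sizes only).
    for image in permutations(range(len(atoms_b)), len(atoms_a)):
        # A bond is shared if both endpoints keep their element
        # and the mapped pair is also bonded in B.
        common = sum(
            1
            for u, v in bonds_a
            if atoms_a[u] == atoms_b[image[u]]
            and atoms_a[v] == atoms_b[image[v]]
            and frozenset((image[u], image[v])) in edges_b
        )
        best = max(best, common)

    # Bonds that are not shared must be added or removed.
    return len(bonds_a) + len(bonds_b) - 2 * best


ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
propanol = (["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
print(mces_distance(*ethanol, *propanol))  # 1
```

Real molecules need a dedicated solver; the exhaustive search above is only meant to show what the predicted number represents.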
Requirements:
- Python 3.11.7
- Conda
Create and activate the environment:
conda env create -f environment.yml
conda activate simba
Install the module:
pip install -e .
Note for macOS users:
brew install xz
We provide a pretrained SIMBA model trained on spectra from MassSpecGym. The model operates in positive ionization mode for protonated adducts.
Follow the Run Inference Notebook for a comprehensive tutorial:
- Runtime: < 10 minutes (including model/data download)
- Example data: data folder.
- Supported format: .mgf
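For readers unfamiliar with .mgf files, the sketch below parses the general Mascot Generic Format layout: `BEGIN IONS` / `END IONS` blocks containing `KEY=VALUE` headers followed by m/z and intensity pairs. The function `parse_mgf` and the example spectrum are illustrative assumptions; this is not SIMBA's own loader, and the header fields SIMBA actually requires are not stated here.

```python
def parse_mgf(text):
    """Parse MGF-formatted text into a list of spectrum dicts (sketch)."""
    spectra = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Header line; split on the first '=' only, since values
                # such as SMILES strings can contain '=' themselves.
                key, value = line.split("=", 1)
                current["params"][key.strip().upper()] = value.strip()
            else:
                # Peak line: m/z and intensity (extra columns ignored).
                parts = line.split()
                current["peaks"].append((float(parts[0]), float(parts[1])))
    return spectra


# Illustrative spectrum; the values are made up for the example.
example = """BEGIN IONS
TITLE=example spectrum
PEPMASS=195.0877
CHARGE=1+
SMILES=CN1C=NC2=C1C(=O)N(C(=O)N2C)C
110.0713 25.0
138.0662 100.0
195.0877 60.0
END IONS
"""

spectra = parse_mgf(example)
print(len(spectra), spectra[0]["params"]["PEPMASS"], len(spectra[0]["peaks"]))
```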
Using an Apple M3 Pro (36 GB RAM):
- Embedding computation: ~100,000 spectra in ~1 minute
- Similarity computation: 1 query vs. 100,000 spectra in ~10 seconds
SIMBA caches computed embeddings, significantly speeding repeated library searches.
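The idea behind embedding caching can be sketched as follows. This is a generic pattern, not SIMBA's actual cache: `cached_embeddings`, the `embed_fn` callback, and the choice of a content hash as cache key are all assumptions made for illustration.

```python
import hashlib
import pickle
from pathlib import Path


def cached_embeddings(mgf_path, embed_fn, cache_dir):
    """Return embeddings for mgf_path, reusing a cached copy when present.

    embed_fn is a hypothetical callable that maps a file path to embeddings.
    """
    # Key the cache on file contents so an edited library is re-embedded.
    key = hashlib.sha256(Path(mgf_path).read_bytes()).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.pkl"

    if cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f)

    # Cache miss: compute once, then store for later searches.
    embeddings = embed_fn(mgf_path)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as f:
        pickle.dump(embeddings, f)
    return embeddings
```

With a pattern like this, only the first search against a given reference library pays the embedding cost; later queries load the stored vectors and go straight to the similarity step.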
Modern metabolomics relies on tandem mass spectrometry (MS/MS) to identify unknown compounds by comparing their spectra against large reference libraries. SIMBA enables analog discovery (finding structurally related molecules) by predicting two complementary, interpretable metrics directly from spectra.
The Run Analog Discovery Notebook presents an analog discovery task based on the MassSpecGym and CASMI2022 datasets.
The notebook shows how to:
- Load a pretrained SIMBA model and MS/MS data.
- Compute distance matrices between query and reference spectra.
- Extract top analogs for a given query.
- Compare predictions against ground truth and visualize the best match.
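The "extract top analogs" step can be sketched as below, assuming the predicted distances for one query are already available as two plain lists (one value per reference spectrum). The function `top_analogs` and the ranking rule (substructure edit distance first, MCES distance as tie-breaker) are illustrative assumptions; the notebook may rank candidates differently.

```python
def top_analogs(edit_distances, mces_distances, k=3):
    """Return indices of the k reference spectra closest to one query.

    Both inputs hold one predicted distance per reference spectrum.
    Lower is better; ties on edit distance are broken by MCES distance.
    """
    order = sorted(
        range(len(edit_distances)),
        key=lambda i: (edit_distances[i], mces_distances[i]),
    )
    return order[:k]


# Made-up predictions for a query against four reference spectra.
edit = [3, 0, 1, 1]
mces = [6, 0, 4, 2]
print(top_analogs(edit, mces))  # [1, 3, 2]
```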
SIMBA supports training custom models using your own MS/MS datasets in .mgf format.
Run the script below to generate training data:
python preprocessing_scripts/final_generation_data.py \
--spectra_path=/path/to/your/spectra.mgf \
--workspace=/path/to/output_dir/ \
--MAX_SPECTRA_TRAIN=100 \
--mapping_file_name=mapping_unique_smiles.pkl \
--PREPROCESSING_NUM_WORKERS=0
The script writes the following to the workspace directory:
- NumPy arrays with indices and structural similarity metrics
- Pickle file (mapping_unique_smiles.pkl) mapping spectrum indices to SMILES structures
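import pickle

# Load the mapping file written by the preprocessing script.
with open('/path/to/output_dir/mapping_unique_smiles.pkl', 'rb') as f:
    data = pickle.load(f)

# Training-set molecule pairs; df_smiles is expected to hold the SMILES mapping.
mol_train = data['molecule_pairs_train']
print(mol_train.df_smiles)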
Train your SIMBA model:
python training_scripts/final_training.py \
--CHECKPOINT_DIR=/path/to/checkpoints/ \
--PREPROCESSING_DIR_TRAIN=/path/to/preprocessed_data/ \
--TRAINING_NUM_WORKERS=0 \
--ACCELERATOR=cpu \
--EPOCHS=100
The best-performing model (lowest validation loss) is saved in CHECKPOINT_DIR.
- Code repository: SIMBA GitHub
- For questions, issues, or feature requests, please open an issue.
- SIMBA Paper: [INSERT PAPER LINK or DOI]
- Training and testing datasets are available at https://zenodo.org/records/15275257