bittremieux-lab/simba

SIMBA: Spectral Identification of Molecule Bio-Analogues

SIMBA is a transformer-based neural network that accurately predicts chemical structural similarity from tandem mass spectrometry (MS/MS) spectra. Unlike traditional methods relying on heuristic metrics (e.g., modified cosine similarity), SIMBA directly models structural differences, enabling precise analog identification in metabolomics.

SIMBA predicts two interpretable metrics:

  1. Substructure Edit Distance: Number of molecular graph edits required to convert one molecule into another.
  2. Maximum Common Edge Substructure (MCES) Distance: Number of bond modifications required to achieve molecular equivalence.

🚀 Quickstart

Requirements

Installation (10–20 minutes)

Create and activate the environment:

conda env create -f environment.yml
conda activate simba

Install the module:

pip install -e .

Note for macOS users: install the xz library with Homebrew:

brew install xz

🔎 Computing Structural Similarities

We provide a pretrained SIMBA model trained on spectra from MassSpecGym. The model is intended for spectra acquired in positive ionization mode from protonated adducts ([M+H]+).

Usage Example

Follow the Run Inference Notebook for a comprehensive tutorial:

  • Runtime: < 10 minutes (including model/data download)
  • Example data: data folder
  • Supported format: .mgf
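For readers unfamiliar with the .mgf (Mascot Generic Format) layout, the following is a minimal sketch of how such a file is structured and could be read with the standard library alone. It is not SIMBA's loader; the function name `parse_mgf` and the toy spectrum are hypothetical, and which metadata fields SIMBA actually requires (for example SMILES) is an assumption not stated here.

```python
from typing import Dict, Iterable, List


def parse_mgf(lines: Iterable[str]) -> List[Dict]:
    """Parse MGF text into a list of spectra (metadata plus peak list)."""
    spectra = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Metadata line such as PEPMASS=180.0634; split on the first "="
                key, value = line.split("=", 1)
                current["params"][key.strip().upper()] = value.strip()
            else:
                # Peak line: m/z and intensity separated by whitespace
                fields = line.split()
                current["peaks"].append((float(fields[0]), float(fields[1])))
    return spectra


# Toy spectrum with made-up values, for illustration only
example = """BEGIN IONS
TITLE=toy spectrum
PEPMASS=180.0634
CHARGE=1+
SMILES=CC(=O)O
89.0233 120.5
163.0601 1000.0
END IONS
"""
spectra = parse_mgf(example.splitlines())
print(spectra[0]["params"]["PEPMASS"], len(spectra[0]["peaks"]))
```

Each spectrum sits between BEGIN IONS and END IONS, with KEY=VALUE metadata lines followed by one peak per line.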

Performance

Using an Apple M3 Pro (36 GB RAM):

  • Embedding computation: ~100,000 spectra in ~1 minute
  • Similarity computation: 1 query vs. 100,000 spectra in ~10 seconds

SIMBA caches computed embeddings, which significantly speeds up repeated library searches.
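How SIMBA implements its cache is not described here. As a rough illustration of why caching helps, the sketch below shows one generic pattern: key the stored embeddings on a hash of the library file so they are recomputed only when the file changes. The names `cached_embeddings` and `fake_embed` are hypothetical, and `fake_embed` stands in for the real (expensive) embedding step.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def cached_embeddings(library_path, compute_fn, cache_dir):
    """Return embeddings for a library file, reusing a cached copy if the file is unchanged."""
    # The cache key is the SHA-256 digest of the file contents
    digest = hashlib.sha256(Path(library_path).read_bytes()).hexdigest()
    cache_file = Path(cache_dir) / f"{digest}.pkl"
    if cache_file.exists():
        with open(cache_file, "rb") as f:
            return pickle.load(f)
    embeddings = compute_fn(library_path)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with open(cache_file, "wb") as f:
        pickle.dump(embeddings, f)
    return embeddings


calls = []


def fake_embed(path):
    """Hypothetical stand-in for the model's embedding computation."""
    calls.append(path)
    return [[0.1, 0.2], [0.3, 0.4]]


with tempfile.TemporaryDirectory() as tmp:
    lib = Path(tmp) / "library.mgf"
    lib.write_text("BEGIN IONS\nEND IONS\n")
    first = cached_embeddings(lib, fake_embed, Path(tmp) / "cache")
    second = cached_embeddings(lib, fake_embed, Path(tmp) / "cache")

print(len(calls))  # the embedding function runs only on the first call
```

With this pattern, a second search against the same library skips the embedding step and pays only for the similarity computation.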


Analog discovery using SIMBA

Modern metabolomics relies on tandem mass spectrometry (MS/MS) to identify unknown compounds by comparing their spectra against large reference libraries. SIMBA enables analog discovery, that is, finding structurally related molecules, by predicting two complementary, interpretable metrics (substructure edit distance and MCES distance) directly from spectra.

The Run Analog Discovery Notebook presents an analog discovery task based on the MassSpecGym and CASMI2022 datasets.

The notebook shows how to:

  • Load a pretrained SIMBA model and MS/MS data.

  • Compute distance matrices between query and reference spectra.

  • Extract top analogs for a given query.

  • Compare predictions against ground truth and visualize the best match.
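The "extract top analogs" step can be sketched as follows, assuming the predicted distances for one query are available as two plain lists (one row of each distance matrix). The function `top_analogs` and the ranking rule (lowest edit distance first, ties broken by MCES distance) are illustrative assumptions, not necessarily what the notebook does, and the numbers are made up.

```python
from typing import List, Sequence, Tuple


def top_analogs(
    edit_row: Sequence[float],
    mces_row: Sequence[float],
    k: int = 3,
) -> List[Tuple[int, float, float]]:
    """Rank reference indices by edit distance, breaking ties by MCES distance."""
    ranked = sorted(
        range(len(edit_row)),
        key=lambda i: (edit_row[i], mces_row[i]),
    )
    return [(i, edit_row[i], mces_row[i]) for i in ranked[:k]]


# One query against five reference spectra (toy predicted distances)
edit_distances = [3, 1, 0, 1, 5]
mces_distances = [6.0, 2.5, 0.0, 1.5, 9.0]

best = top_analogs(edit_distances, mces_distances, k=3)
for index, edit, mces in best:
    print(f"reference {index}: edit distance {edit}, MCES distance {mces}")
```

Using both metrics in the sort key lets the second metric separate candidates that the first one ranks equally.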

📚 Training Your Custom SIMBA Model

SIMBA supports training custom models using your own MS/MS datasets in .mgf format.

Step 1: Generate Training Data

Run the script below to generate training data:

python preprocessing_scripts/final_generation_data.py \
  --spectra_path=/path/to/your/spectra.mgf \
  --workspace=/path/to/output_dir/ \
  --MAX_SPECTRA_TRAIN=100 \
  --mapping_file_name=mapping_unique_smiles.pkl \
  --PREPROCESSING_NUM_WORKERS=0

Output

  • NumPy arrays with indices and structural similarity metrics
  • Pickle file (mapping_unique_smiles.pkl) mapping spectrum indices to SMILES structures

Accessing Data Mapping

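import pickle

# Load the mapping file written to the --workspace directory in Step 1
with open('/path/to/output_dir/mapping_unique_smiles.pkl', 'rb') as f:
    data = pickle.load(f)

# Inspect the SMILES table for the training molecule pairs
mol_train = data['molecule_pairs_train']
print(mol_train.df_smiles)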

Step 2: Model Training

Train your SIMBA model:

python training_scripts/final_training.py \
  --CHECKPOINT_DIR=/path/to/checkpoints/ \
  --PREPROCESSING_DIR_TRAIN=/path/to/preprocessed_data/ \
  --TRAINING_NUM_WORKERS=0 \
  --ACCELERATOR=cpu \
  --EPOCHS=100

The best-performing model (lowest validation loss) is saved in CHECKPOINT_DIR.


📬 Contact & Support


📖 References

  • SIMBA Paper: [INSERT PAPER LINK or DOI]

📦 Data Availability

