gabrielbianchin/SUPERMAGO

SUPERMAGO: Protein Function Prediction based on Transformers Embeddings

The paper is available here.

Description

SUPERMAGO is a machine learning-based approach designed for protein function prediction using embeddings generated by Transformer-based models, multilayer perceptrons trained on these embeddings, and a stacking classifier. SUPERMAGO+ is an ensemble method that combines predictions from SUPERMAGO and DIAMOND, a local alignment tool. Both approaches predict protein function for the Biological Process Ontology (BPO), Cellular Component Ontology (CCO), and Molecular Function Ontology (MFO).

Installation

To install and set up SUPERMAGO and SUPERMAGO+, follow the steps below:

  1. Clone the repository:
git clone https://github.com/gabrielbianchin/SUPERMAGO.git
cd SUPERMAGO
  2. Install the dependencies:
pip install -r requirements.txt

Dataset

The dataset for this work is available here. The IC values used in evaluation are available here.

Models

Our layer-based models for each ontology are available here.

Predictions

The predictions of SUPERMAGO and SUPERMAGO+ are available here.

Reproducibility

  • Navigate to the src folder and run setup.py.
  • Download the dataset and IC values, and place them in the base folder.
  • In the src folder, run the following command:
python main.py --ont ontology

where ontology can be bp, cc, or mf for Biological Process, Cellular Component and Molecular Function, respectively.
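
To run the three ontologies one after another, the command above can be wrapped in a short script. This is a minimal sketch, assuming it is started from the src folder; the helper names build_command and run_all are illustrative and not part of the repository.

```python
import subprocess
import sys
from pathlib import Path

ONTOLOGIES = ("bp", "cc", "mf")  # values accepted by --ont, per this README


def build_command(ontology):
    """Return the argv list for one pipeline run (hypothetical helper)."""
    if ontology not in ONTOLOGIES:
        raise ValueError(f"unknown ontology: {ontology!r}")
    return [sys.executable, "main.py", "--ont", ontology]


def run_all(src_dir="."):
    """Run main.py once per ontology, stopping at the first failure."""
    for ontology in ONTOLOGIES:
        subprocess.run(build_command(ontology), cwd=src_dir, check=True)


if __name__ == "__main__":
    # Only launch the runs when main.py is actually present.
    if Path("main.py").exists():
        run_all()
```

Because check=True is set, a failing run raises an exception instead of silently continuing to the next ontology.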

main.py executes the pipeline of SUPERMAGO and SUPERMAGO+ as follows:

  • extract.py extracts the embeddings given the model name (esm or t5) and ontology (bp, cc or mf).
  • layer_classificaton.py runs the neural network for a specific layer (36, 35, 34, 33, 32 for ESM2 T36; 24, 23, 22, 21, 20 for ProtT5) and ontology (bp, cc or mf).
  • stacking.py runs the stacking model for a specific ontology (bp, cc or mf) and generates the prediction of SUPERMAGO.
  • diamond.py runs DIAMOND for a specific ontology (bp, cc or mf).
  • ensemble.py generates the final prediction of SUPERMAGO+ for a specific ontology (bp, cc or mf).
  • evaluate.py evaluates the predictions.
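
The steps above can be enumerated for one ontology as follows. This is only a sketch for orientation: pipeline_plan is a hypothetical helper that lists the steps without running them, and the assumption that extraction runs once per model is ours, not something stated by main.py.

```python
# Layers listed in this README for each embedding model.
LAYERS = {
    "esm": [36, 35, 34, 33, 32],  # ESM2 T36
    "t5": [24, 23, 22, 21, 20],   # ProtT5
}


def pipeline_plan(ontology):
    """List (ontology, script, model, layer) steps in pipeline order.

    Hypothetical helper: it enumerates the steps and does not execute them.
    """
    # Assumption: one embedding extraction per model.
    plan = [("extract.py", model, None) for model in LAYERS]
    # One layer classifier per (model, layer) pair.
    plan += [
        ("layer_classificaton.py", model, layer)
        for model, layers in LAYERS.items()
        for layer in layers
    ]
    # The remaining scripts depend only on the ontology.
    plan += [
        (script, None, None)
        for script in ("stacking.py", "diamond.py", "ensemble.py", "evaluate.py")
    ]
    return [(ontology,) + step for step in plan]


if __name__ == "__main__":
    for step in pipeline_plan("bp"):
        print(step)
```

Under these assumptions each ontology involves ten layer classifiers (five per model) before the stacking step.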

Dataset Adaptation

If you need to run SUPERMAGO and SUPERMAGO+ on your own dataset, you must create a dataset with the same structure as ours. This includes a CSV file for each ontology, with the first column containing the protein ID, the second column containing the protein sequence, and the remaining columns containing the terms in one-hot encoding format. You should also calculate the IC values for evaluation and save them in a CSV file.
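
The layout described above could be produced along these lines. This is a minimal sketch on toy data: the header names protein_id and sequence, the helper names, and the term labels are assumptions. The IC function uses a plain annotation-frequency definition, log2(N / n_t), which may differ from the definition behind the provided IC values, so check it against the published files before relying on it.

```python
import csv
import io
import math


def build_rows(sequences, annotations):
    """Build header + one row per protein: ID, sequence, one-hot terms."""
    # Only keep terms annotated to at least one protein in `sequences`.
    terms = sorted({t for pid in sequences for t in annotations.get(pid, ())})
    rows = [["protein_id", "sequence"] + terms]  # header names are assumed
    for pid, seq in sequences.items():
        annotated = annotations.get(pid, ())
        rows.append([pid, seq] + [1 if t in annotated else 0 for t in terms])
    return rows


def frequency_ic(rows):
    """Assumed IC definition: log2(number of proteins / proteins with term)."""
    header, body = rows[0], rows[1:]
    ic = {}
    for col in range(2, len(header)):
        count = sum(row[col] for row in body)
        ic[header[col]] = math.log2(len(body) / count)
    return ic


def to_csv(rows):
    """Serialize rows to CSV text."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Toy data for illustration only.
    sequences = {"P1": "MKT", "P2": "MAV"}
    annotations = {"P1": {"term_a", "term_b"}, "P2": {"term_a"}}
    rows = build_rows(sequences, annotations)
    print(to_csv(rows))
    print(frequency_ic(rows))
```

To write a file instead of a string, pass an open file handle (opened with newline="") to csv.writer in place of the StringIO buffer.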
