The paper is available here.
SUPERMAGO is a machine learning-based approach designed for protein function prediction using embeddings generated by Transformer-based models, multilayer perceptrons trained on these embeddings, and a stacking classifier. SUPERMAGO+ is an ensemble method that combines predictions from SUPERMAGO and DIAMOND, a local alignment tool. Both approaches predict protein function for the Biological Process Ontology (BPO), Cellular Component Ontology (CCO), and Molecular Function Ontology (MFO).
To install and set up SUPERMAGO and SUPERMAGO+, follow the steps below:
- Clone the repository:
git clone https://github.com/your-username/supermago.git
cd supermago
- Install the dependencies:
pip install -r requirements.txt
The dataset for this work is available here. The IC values used in evaluation are available here.
Our layer-based models for each ontology are available here.
The predictions of SUPERMAGO and SUPERMAGO+ are available here.
- Navigate to the `src` folder and run `setup.py`.
- Download the dataset and IC values, and place them into the `base` folder.
- In the `src` folder, run the following command:
python main.py --ont ontology
where `ontology` can be `bp`, `cc`, or `mf` for Biological Process, Cellular Component, and Molecular Function, respectively.
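To run the command above for all three ontologies in sequence, a small wrapper can help. The following is a minimal sketch, not part of the repository: the helper names `build_command` and `run_all` are hypothetical, and it assumes the script is started from the `src` folder so that `main.py` is found in the working directory.

```python
import subprocess
import sys

# Ontology codes accepted by --ont, as listed above.
ONTOLOGIES = {
    "bp": "Biological Process",
    "cc": "Cellular Component",
    "mf": "Molecular Function",
}


def build_command(ont):
    """Return the argument list for one pipeline run (hypothetical helper)."""
    if ont not in ONTOLOGIES:
        raise ValueError(
            f"unknown ontology {ont!r}; expected one of {sorted(ONTOLOGIES)}"
        )
    # sys.executable reuses the interpreter that is running this wrapper.
    return [sys.executable, "main.py", "--ont", ont]


def run_all(dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for ont in ONTOLOGIES:
        cmd = build_command(ont)
        if dry_run:
            print(" ".join(cmd))
        else:
            # Assumes the current working directory is the src folder.
            # check=True stops at the first run that exits with an error.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

With `dry_run=True` the wrapper only prints the three commands, which is a cheap way to check the arguments before starting a long run.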
`main.py` executes the pipeline of SUPERMAGO and SUPERMAGO+ as follows:
- `extract.py` extracts the embeddings given the model name (`esm` or `t5`) and ontology (`bp`, `cc` or `mf`).
- `layer_classificaton.py` runs the neural network for a specific layer (`36`, `35`, `34`, `33`, `32` for ESM2 T36; `24`, `23`, `22`, `21`, `20` for ProtT5) and ontology (`bp`, `cc` or `mf`).
- `stacking.py` runs the stacking model for a specific ontology (`bp`, `cc` or `mf`) and generates the prediction of SUPERMAGO.
- `diamond.py` runs DIAMOND for a specific ontology (`bp`, `cc` or `mf`).
- `ensemble.py` generates the final prediction of SUPERMAGO+ for a specific ontology (`bp`, `cc` or `mf`).
- `evaluate.py` evaluates the predictions.
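The layer and ontology choices listed above amount to ten (model, layer) combinations per ontology. The sketch below only enumerates those combinations and does not call any of the repository scripts. The `LAYERS` dictionary and the `layer_jobs` helper are hypothetical names introduced for illustration. The layer numbers and the model names `esm` and `t5` are taken from the list above.

```python
# Layers named in the pipeline description, keyed by the model names
# that extract.py accepts (esm for ESM2 T36, t5 for ProtT5).
LAYERS = {
    "esm": [36, 35, 34, 33, 32],
    "t5": [24, 23, 22, 21, 20],
}
ONTOLOGIES = ["bp", "cc", "mf"]


def layer_jobs(ontology):
    """List every (model, layer, ontology) combination for one ontology."""
    if ontology not in ONTOLOGIES:
        raise ValueError(f"unknown ontology {ontology!r}")
    return [
        (model, layer, ontology)
        for model, layers in LAYERS.items()
        for layer in layers
    ]


if __name__ == "__main__":
    for job in layer_jobs("bp"):
        print(job)
```

A list like this can be handy for checking that a layer-level result exists for each combination before the stacking step is run.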
If you need to run SUPERMAGO and SUPERMAGO+ on your own dataset, you must create a dataset with the same structure as ours. This includes a CSV file for each ontology, with the first column containing the protein ID, the second column containing the protein sequence, and the remaining columns containing the terms in one-hot encoding format. You should also calculate the IC values for evaluation and save them in a CSV file.
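As an illustration of this layout, the sketch below builds the rows of one such CSV and derives IC values from them. Several parts are assumptions rather than the repository's actual format: the column names `protein_id` and `sequence`, the helper names, and the IC definition. IC is computed here as the negative base-2 logarithm of a term's annotation frequency, which is one simple and common choice. The IC values shipped with the dataset may have been computed differently (for example, conditioned on parent terms), so check the provided files before relying on it. The protein IDs and sequences are toy placeholders.

```python
import csv
import io
import math


def build_rows(proteins, terms):
    """Build header and rows: ID, sequence, then one 0/1 column per term."""
    header = ["protein_id", "sequence"] + list(terms)  # hypothetical names
    rows = []
    for protein_id, sequence, annotated in proteins:
        one_hot = [1 if term in annotated else 0 for term in terms]
        rows.append([protein_id, sequence] + one_hot)
    return header, rows


def write_csv(handle, header, rows):
    """Write the table to an open text handle (file or StringIO)."""
    writer = csv.writer(handle)
    writer.writerow(header)
    writer.writerows(rows)


def information_content(rows, terms):
    """Assumed definition: IC(t) = -log2(fraction of proteins with t)."""
    total = len(rows)
    ic = {}
    for j, term in enumerate(terms):
        count = sum(row[2 + j] for row in rows)  # term columns start at 2
        if count:  # terms with no annotations get no IC value
            ic[term] = -math.log2(count / total)
    return ic


if __name__ == "__main__":
    terms = ["GO:0000001", "GO:0000002"]  # placeholder term IDs
    proteins = [
        ("P1", "MKTAYIAK", {"GO:0000001", "GO:0000002"}),
        ("P2", "MALWMRLL", {"GO:0000001"}),
        ("P3", "MSSHEGGK", {"GO:0000001"}),
        ("P4", "MQQLTPEE", {"GO:0000002"}),
    ]
    header, rows = build_rows(proteins, terms)
    buffer = io.StringIO()
    write_csv(buffer, header, rows)
    print(buffer.getvalue())
    print(information_content(rows, terms))
```

Writing to a `StringIO` buffer keeps the example self-contained. To produce a file, pass a handle opened with `open(path, "w", newline="")` instead.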