
faezesarlakifar/AllerTrans



AllerTrans

A Deep Learning Method for Predicting the Allergenicity of Protein Sequences

Overview

Allergens are a major concern in protein safety, especially with the growing use of recombinant proteins in medical products. Traditional allergenicity tests are costly and time-consuming, prompting the need for efficient bioinformatics solutions. In this study, we developed an enhanced deep learning model that classifies proteins as allergenic or non-allergenic based on their sequences. Our method extracts features using two protein language models and combines them in a deep neural network, followed by ensemble modeling to improve performance. The proposed model achieved strong results: 97.91% sensitivity, 97.69% specificity, 97.80% accuracy, and a 99% AUC using five-fold cross-validation.
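The ensemble step and the reported metrics can be illustrated with a minimal sketch. It assumes soft voting (averaging each model's predicted probabilities), which is only one common way to ensemble and is not confirmed as the method used in this study. The function names `soft_vote` and `binary_metrics` and the 0.5 threshold are hypothetical.

```python
from statistics import mean


def soft_vote(prob_lists, threshold=0.5):
    """Average per-model probabilities and threshold them.

    Assumption: soft voting; the study's actual ensemble rule may differ.
    prob_lists holds one list of allergen probabilities per model.
    """
    averaged = [mean(column) for column in zip(*prob_lists)]
    return [1 if p >= threshold else 0 for p in averaged]


def binary_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, and accuracy (1 = allergen)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        # Fraction of true allergens that were detected.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Fraction of non-allergens that were correctly rejected.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "accuracy": (tp + tn) / len(pairs) if pairs else float("nan"),
    }


# Toy probabilities from two hypothetical models for four proteins.
predictions = soft_vote([[0.9, 0.2, 0.6, 0.4], [0.7, 0.4, 0.3, 0.1]])
scores = binary_metrics([1, 0, 1, 0], predictions)
```

In five-fold cross-validation, metrics like these would be computed on each held-out fold and then averaged.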

DOI: https://doi.org/10.1093/biomethods/bpaf040

Online Prediction Tool

You can try the AllerTrans model directly on Hugging Face Spaces: https://huggingface.co/spaces/sfaezella/AllerTrans

A comprehensive flowchart that includes all of our experiments

[Figure: Experiments' Flowchart]

Repository Structure

  • feature-extraction

  • modeling

    • classic-machine-learning.ipynb: Classic machine learning models' training and evaluation, including SVM, RF, XGBoost, and KNN. This notebook also tests the effect of hyperparameter tuning and the autoencoder.
    • nonlinear-DNN.ipynb: Training and evaluation of our top-performing deep neural network models, using ESM-v2 and ProtT5 embeddings and AAC feature vectors.
    • single-layer-LSTM.ipynb: Training and evaluation of a single-layer LSTM (Long Short-Term Memory) model.
    • 1D-CNN.ipynb: Training and evaluation of a 1-dimensional CNN (Convolutional Neural Network) model.
  • model-checkpoints

    • Contains saved checkpoints of the trained models required for the nonlinear-DNN notebook.
  • additional-experiments

    • Includes supplementary experiments and analyses beyond the core modeling workflows.
  • inference-app

    • Contains code for the web-based prediction tool hosted on Hugging Face Spaces.
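The AAC feature vectors mentioned for nonlinear-DNN.ipynb can be illustrated with a short sketch. It assumes the usual definition of amino acid composition, the fraction of each of the 20 standard amino acids in a sequence. It is not taken from the repository's notebooks, and the name `aac_vector` is hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed (alphabetical) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac_vector(sequence):
    """Return a 20-dimensional amino acid composition vector.

    Assumption: non-standard residues (e.g. X, B, Z) are ignored
    rather than counted; the notebooks may treat them differently.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


features = aac_vector("AACD")
```

Unlike the language-model embeddings, a vector of this kind has a fixed length regardless of sequence length and needs no pretrained model.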

General AllerTrans Model Architecture

[Figure: Model Architecture]

Dataset

This study uses the public AlgPred 2.0 training and validation sets, which are available here.

Usage

  1. Feature Extraction:

    cd feature-extraction
    • Run the notebooks in the feature-extraction folder to extract the necessary feature vectors from protein sequences.
    • Input protein sequences must be in FASTA format.
  2. Model Training and Evaluation:

    cd modeling
    • Open and run the nonlinear-DNN.ipynb notebook to train and evaluate the deep neural network model. Ensure the required model checkpoints are available in the model-checkpoints folder.
    • For other models, run the respective notebooks (classic-machine-learning.ipynb, single-layer-LSTM.ipynb, 1D-CNN.ipynb).
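Since input sequences must be in FASTA format, a small reader can help check a file before running the feature-extraction notebooks. This is a minimal sketch using only the standard library. The function name `parse_fasta` and the file name in the comment are hypothetical, and the notebooks may load sequences differently.

```python
def parse_fasta(lines):
    """Parse FASTA text into a list of (header, sequence) tuples.

    `lines` is any iterable of strings, such as an open file handle.
    Sequences that span several lines are joined into one string.
    """
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# With a real file (hypothetical name):
#     with open("proteins.fasta") as handle:
#         records = parse_fasta(handle)
example = ">seq1 demo\nMKT\nAYI\n\n>seq2\nGGS\n"
records = parse_fasta(example.splitlines())
```

Each record pairs the header text (without the leading `>`) with the full sequence, which can then be passed to whichever feature extractor is in use.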
