Allergens are a major concern in protein safety, especially with the growing use of recombinant proteins in medical products. Traditional allergenicity tests are costly and time-consuming, prompting the need for efficient bioinformatics solutions. In this study, we developed an enhanced deep learning model that classifies proteins as allergenic or non-allergenic based on their sequences. Our method extracts features using two protein language models and combines them in a deep neural network, followed by ensemble modeling to improve performance. The proposed model achieved strong results: 97.91% sensitivity, 97.69% specificity, 97.80% accuracy, and a 99% AUC using five-fold cross-validation.
DOI: https://doi.org/10.1093/biomethods/bpaf040
You can try the AllerTrans model directly on Hugging Face Spaces: https://huggingface.co/spaces/sfaezella/AllerTrans
- feature-extraction
  - 1. ESM-v2-embeddings.ipynb: Extracts embeddings using the ESM-v2 model. Input: protein sequences in FASTA format.
  - 2. ProtT5-embeddings.ipynb: Extracts embeddings using the ProtT5 model. Input: protein sequences in FASTA format.
  - 3. AAC-feature-vectors.ipynb: Generates amino acid composition (AAC) feature vectors. Input: protein sequences in FASTA format.
- modeling
  - classic-machine-learning.ipynb: Training and evaluation of classic machine learning models, including SVM, RF, XGBoost, and KNN. This notebook also tests the effect of hyperparameter tuning and of the autoencoder.
  - nonlinear-DNN.ipynb: Training and evaluation of our top-performing deep neural network models, using ESM-v2 and ProtT5 embeddings and AAC feature vectors.
  - single-layer-LSTM.ipynb: Training and evaluation of a single-layer LSTM (Long Short-Term Memory) model.
  - 1D-CNN.ipynb: Training and evaluation of a 1-dimensional CNN (Convolutional Neural Network) model.
- model-checkpoints
  - Contains saved checkpoints of the trained models required for the nonlinear-DNN notebook.
- additional-experiments
  - Includes supplementary experiments and analyses beyond the core modeling workflows.
- inference-app
  - Contains code for the web-based prediction tool hosted on Hugging Face Spaces.
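As a rough illustration of what the AAC step in feature-extraction computes, the following is a minimal sketch of reading FASTA input and turning each sequence into a 20-dimensional amino acid composition vector. It is not taken from AAC-feature-vectors.ipynb: the helper names `read_fasta` and `aac_vector`, the toy sequences, and the choice to ignore non-standard residues are all assumptions made for this example.

```python
from io import StringIO

# The 20 standard amino acids in a fixed order (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def aac_vector(sequence):
    """Return the fraction of each standard amino acid in the sequence.

    Assumption: non-standard residues (e.g. X, B, Z) are ignored, so the
    fractions are taken over standard residues only.
    """
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [c / total for c in counts]


# Toy input; these sequences are invented for illustration only.
example = StringIO(">seq1 toy\nACAC\nGG\n>seq2 toy\nMKXK\n")
features = {name: aac_vector(seq) for name, seq in read_fasta(example)}
```

Each value in `features` is a list of 20 fractions that sum to 1 for any sequence containing at least one standard residue. A real run would pass an open FASTA file to `read_fasta` instead of the in-memory string.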
The dataset used in this study consists of the public AlgPred 2.0 train and validation sets, which are available here.
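The reported results come from five-fold cross-validation. The sketch below shows one way such a split can be built so that every fold keeps the same allergen to non-allergen ratio. Whether the study stratified its folds is not stated here, so treat stratification, the function name `stratified_kfold`, the seed, and the toy labels as assumptions. In practice, scikit-learn's `StratifiedKFold` provides the same idea.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Return k (train_idx, test_idx) pairs with classes balanced across folds."""
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    splits = []
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for f, fold in enumerate(folds) if f != i for idx in fold
        )
        splits.append((train_idx, test_idx))
    return splits


# Toy labels: 1 = allergen, 0 = non-allergen (invented for the example).
labels = [1] * 10 + [0] * 10
splits = stratified_kfold(labels, k=5)
```

Each sample appears in exactly one test fold, and the model for a given fold is trained only on the remaining four folds.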
- Feature Extraction:
  cd feature-extraction
  - Run the notebooks in the feature-extraction folder to extract the necessary feature vectors from protein sequences.
  - Input protein sequences must be in FASTA format.
- Model Training and Evaluation:
  cd modeling
  - Open and run the nonlinear-DNN.ipynb notebook to train and evaluate the deep neural network model. Ensure the required model checkpoints are available in the model-checkpoints folder.
  - For other models, run the respective notebooks (classic-machine-learning.ipynb, single-layer-LSTM.ipynb, 1D-CNN.ipynb).
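To make the evaluation step more concrete, here is a minimal sketch of combining several models' predicted probabilities and computing sensitivity, specificity, and accuracy. The notebooks may combine models differently: simple probability averaging (soft voting), the 0.5 threshold, the function names, and the toy numbers are assumptions for illustration, not the repository's implementation.

```python
def ensemble_average(prob_lists):
    """Average per-sample allergen probabilities across several models."""
    n_models = len(prob_lists)
    return [sum(sample) / n_models for sample in zip(*prob_lists)]


def binary_metrics(y_true, y_prob, threshold=0.5):
    """Compute sensitivity, specificity, and accuracy at a given threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        # Sensitivity: fraction of true allergens predicted as allergens.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Specificity: fraction of non-allergens predicted as non-allergens.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "accuracy": (tp + tn) / len(y_true) if y_true else 0.0,
    }


# Toy data: 1 = allergen, 0 = non-allergen; probabilities are invented.
y_true = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.6, 0.3, 0.2, 0.7, 0.1]
model_b = [0.8, 0.7, 0.4, 0.1, 0.4, 0.3]

y_prob = ensemble_average([model_a, model_b])
metrics = binary_metrics(y_true, y_prob)
```

With cross-validation, these metrics would typically be computed on each held-out fold and then averaged.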