VenusVaccine is a deep learning-based immunogenicity prediction tool that classifies antigens as protective or non-protective. The project uses pre-trained protein language models and adapter architectures to predict immunogenicity from a multimodal encoding of antigens, including their sequences, structures, and physicochemical properties.
- Versatile Data Processing
  - Support for multiple protein database formats
  - Efficient data preprocessing and feature extraction
  - Flexible data augmentation strategies
- Protein Feature Extraction
  - E-descriptor and Z-descriptor physicochemical features
  - Foldseek secondary structure prediction
  - ESM3 structure sequence encoding
- Advanced Model Architecture
  - Integration with pre-trained protein language models
  - Innovative adapter design
  - Support for multiple PLM types (ESM, BERT, Ankh, etc.)
- Comprehensive Training Framework
  - Cross-validation support
  - Early stopping strategy
  - Wandb experiment tracking
  - Automated model evaluation
- High-Performance Computing
  - GPU acceleration support
  - Distributed training
  - Gradient accumulation optimization
Requirements:
- Python 3.7+
- CUDA 11.0+ (for GPU training)
- 8GB+ RAM
Installation:
- Clone the repository:
git clone https://github.com/songleee/VenusVaccine.git
cd VenusVaccine
- Create a virtual environment:
conda env create -f environment.yaml
- Download data and checkpoints: download the pre-trained model files, training data, and model evaluation results from Google Drive. Place the pre-trained model files in the ckpt directory:
  - ckpt/Bacteria.pt: Model for bacterial protective antigens
  - ckpt/Virus.pt: Model for viral protective antigens
  - ckpt/Tumor.pt: Model for tumor protective antigens
- Download the ESM3 structure encoder weights:
wget https://huggingface.co/EvolutionaryScale/esm3-sm-open-v1/resolve/main/data/weights/esm3_structure_encoder_v0.pth
mkdir -p ./src/data/weights
mv esm3_structure_encoder_v0.pth ./src/data/weights
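Before running inference, it can help to confirm that the files from the steps above ended up where they are expected. The following is a minimal sketch, not part of the repository: the helper name `missing_files` is hypothetical, and the paths are taken directly from the installation steps above.

```python
from pathlib import Path

# Paths taken from the installation steps above; adjust if your layout differs.
EXPECTED_FILES = [
    "ckpt/Bacteria.pt",
    "ckpt/Virus.pt",
    "ckpt/Tumor.pt",
    "src/data/weights/esm3_structure_encoder_v0.pth",
]


def missing_files(repo_root=".", expected=EXPECTED_FILES):
    """Return the expected files that are absent or empty under repo_root."""
    root = Path(repo_root)
    missing = []
    for rel in expected:
        path = root / rel
        # An empty file usually means an interrupted or failed download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    absent = missing_files()
    if absent:
        print("Missing or empty files:")
        for rel in absent:
            print("  " + rel)
    else:
        print("All expected checkpoint and weight files are present.")
```

Only the checkpoint for the pathogen type you intend to use is needed at inference time, so a partial list of missing checkpoints is not necessarily a problem.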
# Predict single protein sequence
python src/esmfold.py --sequence "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG" --out_file output.pdb
# Predict multiple proteins from FASTA file
python src/esmfold.py --fasta_file proteins.fasta --out_dir pdb_structures --fold_chunk_size 128
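If your sequences are held in Python rather than in a file, the FASTA input for the second command can be produced with a few lines. This is a minimal sketch under stated assumptions: `write_fasta` is a hypothetical helper, `protein1` is a placeholder name, and each sequence is written on a single line because the README does not say how `src/esmfold.py` parses wrapped records.

```python
def write_fasta(records, path):
    """Write {name: sequence} pairs to a FASTA file, one sequence per line."""
    with open(path, "w") as handle:
        for name, seq in records.items():
            handle.write(">" + name + "\n")
            handle.write(seq + "\n")


if __name__ == "__main__":
    # "protein1" is a placeholder; the sequence is the example used above.
    write_fasta(
        {
            "protein1": "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG",
        },
        "proteins.fasta",
    )
```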
Make sure you have a PDB file of the protein of interest first (an experimental structure such as cryo-EM, or one predicted by AlphaFold2 or ESMFold), then use pdb2json.py to convert PDB files to a feature-rich JSON format:
python pdb2json.py <pdb_dir> <output_json_file>
This tool automatically extracts:
- Amino acid sequence
- ESM3 structure sequence
- Foldseek secondary structure prediction
- E-descriptor (5-dimensional) features
- Z-descriptor (3-dimensional) features
python infer.py -i input.json -t Bacteria
python infer.py [-h] -i INPUT -t {Bacteria,Virus,Tumor} [--structure_seqs STRUCTURE_SEQS]
[--max_seq_len MAX_SEQ_LEN] [--max_batch_token MAX_BATCH_TOKEN]
[--num_workers NUM_WORKERS] [-o OUTPUT]
Arguments:
- -i, --input: Path to input JSON file (required)
- -t, --type: Pathogen type, choose from: Bacteria, Virus, Tumor (required)
- --structure_seqs: Types of structure sequences, comma-separated (default: e_descriptor,z_descriptor,foldseek_seq,esm3_structure_seq)
- --max_seq_len: Maximum sequence length (default: 1024)
- --max_batch_token: Maximum tokens per batch (default: 10000)
- --num_workers: Number of data loading workers (default: 4)
- -o, --output: Path to output CSV file (default: results_{type}.csv)
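To make the defaults concrete, the interface above can be mirrored with `argparse`. This is a sketch reconstructed from the usage string and argument list only, not the actual contents of infer.py. In particular, substituting the pathogen type into `results_{type}.csv` and splitting `--structure_seqs` on commas are assumptions about how the script interprets those values.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented infer.py options."""
    parser = argparse.ArgumentParser(prog="infer.py")
    parser.add_argument("-i", "--input", required=True,
                        help="Path to input JSON file")
    parser.add_argument("-t", "--type", required=True,
                        choices=["Bacteria", "Virus", "Tumor"],
                        help="Pathogen type")
    parser.add_argument("--structure_seqs",
                        default="e_descriptor,z_descriptor,foldseek_seq,esm3_structure_seq",
                        help="Comma-separated structure sequence types")
    parser.add_argument("--max_seq_len", type=int, default=1024)
    parser.add_argument("--max_batch_token", type=int, default=10000)
    parser.add_argument("--num_workers", type=int, default=4)
    parser.add_argument("-o", "--output", default=None,
                        help="Path to output CSV file")
    return parser


def parse_args(argv=None):
    """Parse arguments and fill in the derived defaults."""
    args = build_parser().parse_args(argv)
    if args.output is None:
        # Assumption: {type} in the documented default is the pathogen type.
        args.output = "results_{}.csv".format(args.type)
    # Assumption: the comma-separated string is split into a list of names.
    args.structure_seqs = [s for s in args.structure_seqs.split(",") if s]
    return args


if __name__ == "__main__":
    demo = parse_args(["-i", "proteins.json", "-t", "Virus",
                       "--structure_seqs", "e_descriptor,z_descriptor"])
    print(demo)
```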
The input is a JSON file with one sample (one JSON object) per line. The required fields depend on the --structure_seqs option. A sample looks like this (wrapped here for readability):
{
  "name": "protein1",
  "aa_seq": "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG",
  "foldseek_seq": "HHHEEELLCCHHHHHHHHHHHHSTTHHHHHHHHHHHHHHHHHHHHHHHHEETTEEHHHHHH",
  "esm3_structure_seq": [1, 2, 3, ...],
  "e_descriptor": [[0.1, 0.2, 0.3, 0.4, 0.5], ...],
  "z_descriptor": [[0.1, 0.2, 0.3], ...]
}
Required fields:
- name: Protein sequence identifier
- aa_seq: Amino acid sequence
Optional fields (depending on the structure_seqs parameter):
- foldseek_seq: Secondary structure sequence predicted by Foldseek
- esm3_structure_seq: Structure sequence predicted by ESM3
- e_descriptor: E-descriptor features (5-dimensional)
- z_descriptor: Z-descriptor features (3-dimensional)
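A quick pre-flight check of the input file can catch missing fields before a long inference run. The sketch below is an illustration rather than part of the tool: `check_record` and `check_file` are hypothetical helpers, and they only verify what is stated above, namely that the required and requested fields are present and that the descriptor vectors have 5 and 3 values respectively.

```python
import json

REQUIRED_FIELDS = ("name", "aa_seq")
# Width of each per-residue descriptor vector, as stated in the field list.
DESCRIPTOR_DIMS = {"e_descriptor": 5, "z_descriptor": 3}
DEFAULT_STRUCTURE_SEQS = (
    "e_descriptor", "z_descriptor", "foldseek_seq", "esm3_structure_seq",
)


def check_record(record, structure_seqs=DEFAULT_STRUCTURE_SEQS):
    """Return a list of problems found in one input record (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS + tuple(structure_seqs):
        if field not in record:
            problems.append("missing field: " + field)
    for field in structure_seqs:
        dim = DESCRIPTOR_DIMS.get(field)
        if dim is None or field not in record:
            continue
        if any(len(row) != dim for row in record[field]):
            problems.append(
                "{}: expected {} values per residue".format(field, dim)
            )
    return problems


def check_file(path, structure_seqs=DEFAULT_STRUCTURE_SEQS):
    """Check a file with one JSON object per line.

    Returns a dict mapping 1-based line numbers to their problem lists.
    """
    report = {}
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            problems = check_record(json.loads(line), structure_seqs)
            if problems:
                report[line_number] = problems
    return report
```

Pass the same list of names you plan to give to --structure_seqs, for example `check_file("proteins.json", ("e_descriptor", "z_descriptor"))`, so that fields you do not use are not reported as missing.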
The output is a CSV file containing:
- name: Protein sequence identifier
- aa_seq: Amino acid sequence
- pred_label: Prediction label (0: non-protective antigen, 1: protective antigen)
- pred_proba: Prediction probability of being a protective antigen
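For vaccine target selection, the predictions are typically ranked by probability. The following sketch reads the output CSV with the standard library and returns the top-scoring predicted protective antigens. It assumes that the CSV header uses exactly the column names listed above and that pred_label is written as 0 or 1; `load_predictions` and `top_candidates` are hypothetical helper names.

```python
import csv


def load_predictions(path):
    """Read the prediction CSV into a list of dicts with typed values."""
    rows = []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            # int(float(...)) also accepts labels written as "1.0".
            row["pred_label"] = int(float(row["pred_label"]))
            row["pred_proba"] = float(row["pred_proba"])
            rows.append(row)
    return rows


def top_candidates(rows, k=10):
    """Return up to k predicted protective antigens, highest probability first."""
    positives = [r for r in rows if r["pred_label"] == 1]
    return sorted(positives, key=lambda r: r["pred_proba"], reverse=True)[:k]
```

For example, `top_candidates(load_predictions("results_Bacteria.csv"), k=5)` would list the five highest-scoring candidates from a default Bacteria run, assuming the default output file name.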
- Predict using all structural features:
python infer.py -i proteins.json -t Bacteria
- Use only specific structural features:
python infer.py -i proteins.json -t Virus --structure_seqs "e_descriptor,z_descriptor"
- Specify output file:
python infer.py -i proteins.json -t Tumor -o predictions.csv
- Adjust sequence length and batch size:
python infer.py -i proteins.json -t Bacteria --max_seq_len 512 --max_batch_token 5000
Notes:
- Ensure all required dependencies are installed
- Make sure the corresponding model file exists in the ckpt directory (Bacteria.pt, Virus.pt, or Tumor.pt)
- If the automatic download fails because of network problems, make sure the PLM checkpoints from Hugging Face are downloaded and set up correctly
- A GPU is recommended for better inference performance
If you find this tool helpful, please cite our work:
@inproceedings{
li2025immunogenicity,
title={Immunogenicity Prediction with Dual Attention Enables Vaccine Target Selection},
author={Song Li and Yang Tan and Song Ke and Liang Hong and Bingxin Zhou},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=hWmwL9gizZ}
}
This project is licensed under the terms of the CC-BY-NC-ND-4.0 license.
- Project Maintainers: Song Li, Yang Tan
- Email: songlee@sjtu.edu.cn
- Issue Tracking: Issue Page