ai4protein/VenusVaccine


VenusVaccine

Python PyTorch License: CC-BY-NC-ND-4.0

📋 Overview

VenusVaccine is a deep learning-based immunogenicity prediction tool that classifies antigens as protective or non-protective. The project uses pre-trained protein language models and adapter architectures to interpret immunogenicity from a multimodal encoding of antigens, covering their sequences, structures, and physicochemical properties.

[Figure: VenusVaccine architecture]

🌟 Key Features

  • 🔬 Versatile Data Processing

    • Support for multiple protein database formats
    • Efficient data preprocessing and feature extraction
    • Flexible data augmentation strategies
  • 🧬 Protein Feature Extraction

    • E-descriptor and Z-descriptor physicochemical features
    • Foldseek secondary structure prediction
    • ESM3 structure sequence encoding
  • 🤖 Advanced Model Architecture

    • Integration with pre-trained protein language models
    • Innovative adapter design
    • Support for multiple PLM types (ESM, BERT, Ankh, etc.)
  • 📊 Comprehensive Training Framework

    • Cross-validation support
    • Early stopping strategy
    • Weights & Biases (wandb) experiment tracking
    • Automated model evaluation
  • 🚀 High-Performance Computing

    • GPU acceleration support
    • Distributed training
    • Gradient accumulation optimization

🛠️ Installation Guide

Requirements

  • Python 3.7+
  • CUDA 11.0+ (for GPU training)
  • 8GB+ RAM

Setup Steps

  1. Clone the repository:
git clone https://github.com/songleee/VenusVaccine.git
cd VenusVaccine
  2. Create the conda environment:
conda env create -f environment.yaml
  3. Download data and checkpoints: download the pre-trained model files, training data, and model evaluation results from Google Drive.

Pre-trained model files should be placed in the ckpt directory:

  • ckpt/Bacteria.pt: Model for bacterial protective antigens
  • ckpt/Virus.pt: Model for viral protective antigens
  • ckpt/Tumor.pt: Model for tumor protective antigens
  4. Download the ESM3 structure encoder weights and move them into place:
wget https://huggingface.co/EvolutionaryScale/esm3-sm-open-v1/resolve/main/data/weights/esm3_structure_encoder_v0.pth
mkdir -p ./src/data/weights
mv esm3_structure_encoder_v0.pth ./src/data/weights
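
Before running anything, it can help to confirm that the downloaded files ended up where the setup steps expect them. The following is a minimal sketch, not part of the repository: the helper name `missing_files` is hypothetical, and the file list is taken only from the paths named above. Inference for a single pathogen type presumably needs only the matching checkpoint, so a missing entry is not necessarily an error.

```python
from pathlib import Path

# Paths named in the setup steps, relative to the repository root.
EXPECTED_FILES = [
    "ckpt/Bacteria.pt",
    "ckpt/Virus.pt",
    "ckpt/Tumor.pt",
    "src/data/weights/esm3_structure_encoder_v0.pth",
]


def missing_files(repo_root="."):
    """Return the expected files that are absent or empty under repo_root.

    Hypothetical helper for illustration; it only inspects the file system.
    """
    root = Path(repo_root)
    missing = []
    for rel in EXPECTED_FILES:
        path = root / rel
        # An empty file usually points to an interrupted download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing
```

Calling `missing_files()` from the repository root returns an empty list when every file is present and non-empty.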

📊 Data Processing

Structure Prediction with ESMFold

# Predict single protein sequence
python src/esmfold.py --sequence "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG" --out_file output.pdb

# Predict multiple proteins from FASTA file
python src/esmfold.py --fasta_file proteins.fasta --out_dir pdb_structures --fold_chunk_size 128
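
If the sequences live in a Python script or notebook rather than in a file, a FASTA file for the batch command can be produced with a few lines of code. This is a small sketch under assumptions: `write_fasta` is a hypothetical helper, and it writes each sequence on a single line, which standard FASTA readers accept.

```python
def write_fasta(records, path):
    """Write a {name: sequence} mapping to a FASTA file.

    Hypothetical helper for illustration. Each record becomes a header
    line (">name") followed by the sequence on one line.
    """
    with open(path, "w") as handle:
        for name, seq in records.items():
            if not seq:
                raise ValueError(f"empty sequence for {name!r}")
            handle.write(f">{name}\n{seq}\n")
```

For example, `write_fasta({"protein1": "MKTV"}, "proteins.fasta")` creates a file that can be passed to `--fasta_file`.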

PDB to JSON Conversion

First obtain a PDB file for the protein of interest (an experimental structure such as cryo-EM, or one predicted by AlphaFold2 or ESMFold). Then use pdb2json.py to convert a directory of PDB files into a feature-rich JSON format:

python pdb2json.py <pdb_dir> <output_json_file>

This tool automatically extracts:

  • Amino acid sequence
  • ESM3 structure sequence
  • Foldseek secondary structure prediction
  • E-descriptor (5-dimensional) features
  • Z-descriptor (3-dimensional) features

🚀 Quick Start

Basic Usage

python infer.py -i input.json -t Bacteria

Command Line Arguments

python infer.py [-h] -i INPUT -t {Bacteria,Virus,Tumor} [--structure_seqs STRUCTURE_SEQS] 
                [--max_seq_len MAX_SEQ_LEN] [--max_batch_token MAX_BATCH_TOKEN] 
                [--num_workers NUM_WORKERS] [-o OUTPUT]

Arguments:

  • -i, --input: Path to input JSON file (required)
  • -t, --type: Pathogen type, choose from: Bacteria, Virus, Tumor (required)
  • --structure_seqs: Types of structure sequences, comma-separated (default: e_descriptor,z_descriptor,foldseek_seq,esm3_structure_seq)
  • --max_seq_len: Maximum sequence length (default: 1024)
  • --max_batch_token: Maximum tokens per batch (default: 10000)
  • --num_workers: Number of data loading workers (default: 4)
  • -o, --output: Path to output CSV file (default: results_{type}.csv)
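
To make the defaults concrete, the documented interface can be mirrored with `argparse`. This is a sketch of how such a parser could be declared, not the actual code in infer.py: the function names `build_parser` and `parse_args` are hypothetical, and only the flags, choices, and defaults listed above come from this README.

```python
import argparse

DEFAULT_STRUCTURE_SEQS = "e_descriptor,z_descriptor,foldseek_seq,esm3_structure_seq"


def build_parser():
    """Declare the flags documented above (illustrative, not the repo's code)."""
    parser = argparse.ArgumentParser(prog="infer.py")
    parser.add_argument("-i", "--input", required=True, help="Path to input JSON file")
    parser.add_argument("-t", "--type", required=True,
                        choices=["Bacteria", "Virus", "Tumor"], help="Pathogen type")
    parser.add_argument("--structure_seqs", default=DEFAULT_STRUCTURE_SEQS,
                        help="Comma-separated structure sequence types")
    parser.add_argument("--max_seq_len", type=int, default=1024)
    parser.add_argument("--max_batch_token", type=int, default=10000)
    parser.add_argument("--num_workers", type=int, default=4)
    parser.add_argument("-o", "--output", default=None, help="Path to output CSV file")
    return parser


def parse_args(argv=None):
    """Parse arguments and resolve the derived defaults."""
    args = build_parser().parse_args(argv)
    # Split the comma-separated list into individual field names.
    args.structure_seqs = [s.strip() for s in args.structure_seqs.split(",") if s.strip()]
    # The documented default output name depends on the pathogen type.
    if args.output is None:
        args.output = f"results_{args.type}.csv"
    return args
```

With this sketch, `parse_args(["-i", "input.json", "-t", "Bacteria"])` resolves the output path to `results_Bacteria.csv` and keeps all four structure sequence types.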

Input Format

The input should be a JSON file with one sample (one JSON object) per line; the example below is pretty-printed across several lines for readability. Which fields are required depends on the structure_seqs parameter:

{
    "name": "protein1",
    "aa_seq": "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG",
    "foldseek_seq": "HHHEEELLCCHHHHHHHHHHHHSTTHHHHHHHHHHHHHHHHHHHHHHHHEETTEEHHHHHH",
    "esm3_structure_seq": [1, 2, 3, ...],
    "e_descriptor": [[0.1, 0.2, 0.3, 0.4, 0.5], ...],
    "z_descriptor": [[0.1, 0.2, 0.3], ...]
}

Required fields:

  • name: Protein sequence identifier
  • aa_seq: Amino acid sequence

Optional fields (depending on structure_seqs parameter):

  • foldseek_seq: Secondary structure sequence predicted by Foldseek
  • esm3_structure_seq: Structure sequence predicted by ESM3
  • e_descriptor: E-descriptor features (5-dimensional)
  • z_descriptor: Z-descriptor features (3-dimensional)
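
The field rules above can be checked before inference. The following sketch is an illustration only: `check_record` and `write_jsonl` are hypothetical helpers, and they assume nothing beyond what this section states (two required fields, optional fields selected through structure_seqs, one JSON object per line).

```python
import json

REQUIRED_FIELDS = ("name", "aa_seq")


def check_record(record, structure_seqs):
    """Return the field names that one input record is missing.

    `structure_seqs` is the list of structure fields selected for inference.
    """
    needed = list(REQUIRED_FIELDS) + list(structure_seqs)
    return [field for field in needed if field not in record]


def write_jsonl(records, path):
    """Write records to `path`, one JSON object per line."""
    with open(path, "w") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
```

A record that contains only name and aa_seq passes `check_record(record, [])` but is reported as incomplete when, for example, `["e_descriptor"]` is requested.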

Output Format

The output is a CSV file containing:

  • name: Protein sequence identifier
  • aa_seq: Amino acid sequence
  • pred_label: Prediction label (0: non-protective antigen, 1: protective antigen)
  • pred_proba: Prediction probability of being a protective antigen
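
Because the output is a plain CSV with the four columns listed above, it can be post-processed with the standard library. The sketch below ranks predictions by pred_proba; the helper name `top_candidates` and the 0.5 cut-off are assumptions chosen for illustration, not values taken from the project.

```python
import csv


def top_candidates(csv_path, threshold=0.5):
    """Return (name, pred_proba) pairs at or above `threshold`, highest first.

    Hypothetical helper; the threshold is an illustrative choice.
    """
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    hits = []
    for row in rows:
        proba = float(row["pred_proba"])
        if proba >= threshold:
            hits.append((row["name"], proba))
    # Sort by probability, most likely protective antigens first.
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

For example, `top_candidates("results_Bacteria.csv", threshold=0.8)` keeps only the high-confidence predictions.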

Examples

  1. Predict using all structural features:
python infer.py -i proteins.json -t Bacteria
  2. Use only specific structural features:
python infer.py -i proteins.json -t Virus --structure_seqs "e_descriptor,z_descriptor"
  3. Specify output file:
python infer.py -i proteins.json -t Tumor -o predictions.csv
  4. Adjust sequence length and batch size:
python infer.py -i proteins.json -t Bacteria --max_seq_len 512 --max_batch_token 5000

⚠️ Important Notes

  1. Ensure all required dependencies are installed
  2. Make sure corresponding model files exist in the ckpt directory (Bacteria.pt, Virus.pt, or Tumor.pt)
  3. If the PLM checkpoints cannot be fetched automatically because of network problems, download them from Hugging Face manually and make sure they are set up correctly
  4. GPU is recommended for better inference performance

📝 Citation

If you find this tool helpful, please cite our work:

@inproceedings{
li2025immunogenicity,
title={Immunogenicity Prediction with Dual Attention Enables Vaccine Target Selection},
author={Song Li and Yang Tan and Song Ke and Liang Hong and Bingxin Zhou},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=hWmwL9gizZ}
}

📝 License

This project is licensed under the terms of the CC-BY-NC-ND-4.0 license.

📮 Contact


⭐️ If you find this project helpful, please give it a star!
