This repository contains the code and models for "CD-GPT: A Biological Foundation Model Bridging the Gap between Molecular Sequences Through Central Dogma".
CD-GPT is a generative biological foundation model that aims to capture the intricate, system-wide molecular interactions in biological systems. Pretrained on molecular data spanning DNA, RNA, and protein sequences, CD-GPT efficiently handles a range of downstream prediction and generation tasks, in both mono-molecular and multi-molecular settings.
We have released the following checkpoints:
Checkpoint | Description |
---|---|
CD-GPT-1b | Model pretrained through Stage 1 (Mono-sequence Pretrain) and Stage 2 (Central Dogma Pretrain). |
CD-GPT-1b-s | Model pretrained through Stage 1 (Mono-sequence Pretrain), Stage 2 (Central Dogma Pretrain) and Stage 3 (Protein Structure Pretrain). |
CD-GPT-1b-reverse-translation | Model finetuned on translation-related sequence pairs. Can be used to generate codon sequences from protein sequences (reverse translation). |
You can download the weights from:
CD-GPT achieves state-of-the-art (SOTA) performance on a series of downstream tasks.
Model | CD-GPT-1b | NT | DNABERT-2 | HyenaDNA | Evo |
---|---|---|---|---|---|
MCC | 🥇0.905 | 0.8771 | 🥈0.8831 | 0.4738 | 0.835 |
Model | CD-GPT-1b | NT | DNABERT-2 | HyenaDNA |
---|---|---|---|---|
MCC | 🥇0.894 | 0.7991 | 🥈0.8593 | 0.7267 |
Model | CD-GPT-1b | CD-GPT-1b-s | Transformer | LSTM | CNN | ResNet | ProtBERT | ESM |
---|---|---|---|---|---|---|---|---|
Acc | 🥈72.48 | 🥇75.8 | 70.12 | 70.18 | 64.43 | 67.33 | 68.15 | 70.23 |
Model | CD-GPT-1b-s | Transformer | LSTM | CNN | ResNet | ProtBERT | ESM |
---|---|---|---|---|---|---|---|
Acc | 🥇90.83 | 59.62 | 68.99 | 66.07 | 69.56 | 82.18 | 🥈82.73 |
Model | CD-GPT-1b-s | Transformer | LSTM | CNN | ResNet | ProtBERT | ESM |
---|---|---|---|---|---|---|---|
P@L/5 | 🥇57.29 | 17.5 | 26.34 | 10 | 20.43 | 39.66 | 🥈45.78 |
For RPI-369:
Model | CD-GPT-1b | lncPro | RPISeq | IPMiner | RPITER |
---|---|---|---|---|---|
MCC | 🥇0.5224 | 0.009 | 0.426 | 0.428 | 🥈0.461 |
Acc | 🥇76.05 | 50.2 | 71.3 | 70 | 🥈72.8 |
Pre | 🥈77.98 | 51.2 | 72.4 | 🥇84 | 70.1 |
For RPI-488:
Model | CD-GPT-1b | lncPro | RPISeq | IPMiner | RPITER |
---|---|---|---|---|---|
MCC | 🥇0.8204 | 0.725 | 0.771 | 🥈0.793 | 🥈0.793 |
Acc | 🥇90.8 | 85.6 | 88.3 | 🥈89.3 | 🥈89.3 |
Pre | 🥇95.54 | 94 | 93.5 | 🥈95.1 | 94.3 |
```shell
# create a virtual environment (the Python version here is a suggestion;
# use any version compatible with requirements.txt)
conda create -n cdgpt python=3.10
conda activate cdgpt

# clone the repo and install requirements
git clone https://github.com/TencentAI4S/CD-GPT.git
cd CD-GPT/
pip install -r requirements.txt

# download the checkpoints and tokenizer and put them under this directory
mkdir checkpoints
```
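Once the files are in place, a quick sanity check that they load can look like the sketch below. We assume the tokenizer is a SentencePiece model (suggested by the `.model` suffix) and that the checkpoint is a standard PyTorch state dict; prefer the repo's own loading utilities for real use.

```python
# Minimal sanity check that the downloaded files load (a sketch; the
# repo's own loading code should be preferred in practice).
import torch
import sentencepiece as spm

# assumed: tokenizer.model is a SentencePiece model
sp = spm.SentencePieceProcessor(model_file="checkpoints/tokenizer.model")
print("vocab size:", sp.vocab_size())

# assumed: the .pth file is a PyTorch checkpoint loadable on CPU
state = torch.load("checkpoints/CD-GPT-1b.pth", map_location="cpu")
print("checkpoint keys:", list(state.keys())[:5])
```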
For generation tasks such as translation or reverse translation, refer to generate_example.ipynb for guidance.
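As a conceptual reference for what the notebook does, here is a minimal greedy-decoding loop over a stand-in autoregressive model. The tiny model, the token ids, and the `greedy_generate` helper are all hypothetical placeholders, not CD-GPT's actual interface; see the notebook for the real API.

```python
# Conceptual greedy-decoding loop for an autoregressive model.
# `model` is a stand-in (a tiny random LM), NOT CD-GPT's API.
import torch

vocab_size, eos_id = 128, 2
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 64),
    torch.nn.Linear(64, vocab_size),
)  # stand-in: maps token ids to next-token logits

@torch.no_grad()
def greedy_generate(prompt_ids, max_new_tokens=32):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        x = torch.tensor(ids).unsqueeze(0)   # (1, T)
        logits = model(x)[0, -1]             # logits for the next token
        next_id = int(logits.argmax())
        if next_id == eos_id:                # stop at end-of-sequence
            break
        ids.append(next_id)
    return ids

print(greedy_generate([5, 17, 42]))
```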
Equipped with task-specific output heads, CD-GPT can be applied to different types of downstream tasks. The currently released checkpoints do not include the output-head weights, so head outputs will be random guesses until the heads are trained. A shape-level sketch of the three head types follows the commands below.
Sequence Prediction Task
```shell
python predict_example.py \
    --model checkpoints/CD-GPT-1b.pth \
    --tokenizer checkpoints/tokenizer.model \
    --head sequence \
    --num_classes 2
```
Token Prediction Task
```shell
python predict_example.py \
    --model checkpoints/CD-GPT-1b.pth \
    --tokenizer checkpoints/tokenizer.model \
    --head token \
    --num_classes 2
```
Residue-Pair Prediction Task
```shell
python predict_example.py \
    --model checkpoints/CD-GPT-1b.pth \
    --tokenizer checkpoints/tokenizer.model \
    --head residuepair \
    --num_classes 2
```
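To make the three head types concrete, the sketch below shows how their output shapes differ given per-token hidden states from the backbone. The head modules are simplified stand-ins (plain linear layers), not the repo's implementations, and the mean-pooling choice for the sequence head is our assumption.

```python
# Shape-level illustration of the three head types, using simplified
# stand-in modules rather than the repo's actual head implementations.
import torch

B, T, D, C = 2, 16, 1024, 2    # batch, tokens, hidden dim, classes
hidden = torch.randn(B, T, D)  # stand-in for backbone output

sequence_head = torch.nn.Linear(D, C)
token_head = torch.nn.Linear(D, C)
pair_head = torch.nn.Linear(2 * D, C)

# sequence head: pool over tokens -> one label per sequence
seq_logits = sequence_head(hidden.mean(dim=1))  # (B, C)

# token head: one label per token
tok_logits = token_head(hidden)                 # (B, T, C)

# residue-pair head: one label per token pair (e.g. contacts)
pairs = torch.cat([
    hidden.unsqueeze(2).expand(B, T, T, D),
    hidden.unsqueeze(1).expand(B, T, T, D),
], dim=-1)                                      # (B, T, T, 2D)
pair_logits = pair_head(pairs)                  # (B, T, T, C)

print(seq_logits.shape, tok_logits.shape, pair_logits.shape)
```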
We will provide a tutorial on finetuning CD-GPT on your own datasets in the future.
If you use CD-GPT in your research, please cite our paper:
```bibtex
@article{zhu2024cd,
  title={CD-GPT: A Biological Foundation Model Bridging the Gap between Molecular Sequences Through Central Dogma},
  author={Zhu, Xiao and Qin, Chenchen and Wang, Fang and Yang, Fan and He, Bing and Zhao, Yu and Yao, Jianhua},
  journal={bioRxiv},
  pages={2024--06},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
```