- [2025/04] Accepted by IEEE Transactions on Artificial Intelligence. Paper and EPGF (a test-time computation framework that ensures generated proteins are not only statistically coherent but also biologically viable) are available.
- [2025/01/07] Update some training code for easier usage. Details in Logs.
- [2025/01/01] We propose HME, a multimodal, multitask Chemical LLM.
- [2024/07/17] Release an updated version of the paper.
- [2024/06/27] Release the code for pretraining (Stage 1) and instruction tuning (Stage 2). See Quick Train.
- [2024/06/08] Open-source the instruction dataset on Hugging Face.
- [2024/04/25] Upload ProLLaMA_Stage_1 to HuggingFace. More information is in Others.
- [2024/04/10] Add a script (in /scripts/mutation.py) to measure mutation effects.
- [2024/02/29] Update /scripts/infer.py to fix bugs.
Recent advances in Protein Language Models (PLMs) have transformed protein engineering, yet unlike their counterparts in Natural Language Processing (NLP), current PLMs exhibit a fundamental limitation: they excel in either Protein Language Understanding (PLU) or Protein Language Generation (PLG), but rarely both. This fragmentation hinders progress in protein engineering. To bridge this gap, we introduce ProLLaMA, a multitask protein language model enhanced by the Evolutionary Protein Generation Framework (EPGF). We construct **a comprehensive instruction dataset containing approximately 13 million samples with over 11,000 superfamily annotations** to facilitate better modeling of sequence-function landscapes. We leverage a two-stage training approach to develop ProLLaMA, a multitask LLM with protein domain expertise. Our EPGF addresses the mismatch between statistical language modeling and biological constraints through three innovations: a multi-dimensional interpretable scorer, hierarchical efficient decoding, and a probabilistic-biophysical joint selection mechanism. Extensive experiments demonstrate that ProLLaMA excels in both unconditional and controllable protein generation tasks, achieving superior structural quality metrics compared to existing PLMs. Additionally, ProLLaMA demonstrates strong understanding capabilities, with a 67.1% exact-match rate in superfamily prediction. EPGF significantly enhances the biological viability of generated sequences, as evidenced by improved biophysical scores (+4.3%) and structural metrics (+14.5%).
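To make the probabilistic-biophysical joint selection idea concrete, below is a minimal sketch of reranking candidate sequences by a weighted combination of model log-likelihood and an external biophysical score. The `Candidate` fields, the weighting scheme, and `alpha` are illustrative assumptions, not the EPGF implementation described in the paper.

```python
# Illustrative sketch only: rerank candidates by a joint probabilistic-biophysical score.
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    sequence: str
    log_likelihood: float     # e.g., mean token log-probability from the language model
    biophysical_score: float  # e.g., an external structure/stability score scaled to [0, 1]

def joint_select(candidates: List[Candidate], alpha: float = 0.5, top_k: int = 1) -> List[Candidate]:
    """Return the top_k candidates under a weighted sum of both scores."""
    def joint_score(c: Candidate) -> float:
        return alpha * c.log_likelihood + (1.0 - alpha) * c.biophysical_score
    return sorted(candidates, key=joint_score, reverse=True)[:top_k]

# Usage: pick the best of three hypothetical candidates.
pool = [
    Candidate("MKRVL...", log_likelihood=-0.8, biophysical_score=0.72),
    Candidate("MAPGG...", log_likelihood=-0.6, biophysical_score=0.55),
    Candidate("MSTLQ...", log_likelihood=-1.1, biophysical_score=0.90),
]
print(joint_select(pool, alpha=0.5, top_k=1)[0].sequence)
```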
**I also have other AI for Science projects that may interest you.**
TaxDiff: Taxonomic-Guided Diffusion Model for Protein Sequence Generation
Navigating Chemical-Linguistic Sharing Space with Heterogeneous Molecular Encoding
DM-Assembler: Leveraging Domain Motif Assembler for Multi-objective, Multi-domain and Explainable Molecular Design
- Our ProLLaMA is, to our knowledge, the first model capable of simultaneously handling multiple Protein Language Processing (PLP) tasks, including generating proteins with specified functions based on the user's intent.
- We construct a comprehensive instruction dataset containing approximately 13 million samples with superfamily annotations.
- We propose a scalable and efficient training framework that enables any general LLM to be trained into a proficient model for multiple PLP tasks.
- EPGF is a test-time computation framework that ensures generated protein sequences are not only statistically coherent but also biologically viable, addressing a critical limitation in current PLMs.
- Overview of the Evolutionary Protein Generation Framework (EPGF) (code).
- ProLLaMA generates better protein sequences with EPGF. "Natural" denotes natural proteins.
- Performance of ProLLaMA in conditional protein generation (controlled by the given superfamily descriptions).
- Other results in the paper (protein superfamily prediction, protein solubility prediction, ...).
The training framework we propose is as follows (a minimal LoRA sketch follows this list):
- (A) Continual learning on protein language.
- (B) Instruction tuning on multiple tasks.
- (C) Expanding to more tasks via instruction tuning in the future.
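Both stages (A) and (B) fine-tune a LLaMA-2-style causal LM with LoRA adapters (the Logs entry below mentions merging the adapters after training). The snippet is a minimal sketch of that setup using transformers and peft; the base model path, LoRA hyperparameters, and target modules are illustrative assumptions rather than the values used by run_pt.sh/run_it.sh.

```python
# Minimal sketch of the LoRA fine-tuning setup; hyperparameters are illustrative, not the repo's.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # placeholder base model
    torch_dtype=torch.bfloat16,
)
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections only, for brevity
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()         # only the LoRA adapters are trainable

# Stage 1 (A): train on raw protein sequences ("Seq=<...>" lines).
# Stage 2 (B): continue with instruction-response pairs covering multiple PLP tasks.
# After training, the adapters can be merged into the base model (see Logs below).
```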
As ProLLaMA's architecture is the same as LLaMA2's, you can run inference with ProLLaMA just as you would with LLaMA2.
Follow the steps below to use our ProLLaMA for inference.
- torch==2.0.1
- transformers==4.35.0
- cuda==11.7
git clone https://github.com/Lyu6PosHao/ProLLaMA.git
cd ProLLaMA
pip install -r requirements.txt
Download from Hugging Face
As with LLaMA2, three ways to run inference are provided here:
- Commandline
CUDA_VISIBLE_DEVICES=0 python ./scripts/infer.py --model "GreatCaptainNemo/ProLLaMA" --interactive
#You can replace the model_path with your local path
#Make sure you use only one GPU for inference
#Use "python ./scripts/infer.py -h" for more details
- Python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

device = torch.device('cuda:0')
# You can replace the model path with your local path
tokenizer = AutoTokenizer.from_pretrained("GreatCaptainNemo/ProLLaMA", use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GreatCaptainNemo/ProLLaMA", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
generation_config = GenerationConfig(temperature=0.2, top_k=40, top_p=0.9, do_sample=True, num_beams=1, repetition_penalty=1.2, max_new_tokens=400)
model.eval()
print("####Enter 'exit' to exit.")
with torch.no_grad():
    while True:
        user = str(input("Input:"))
        if user.strip() == "exit":
            break
        inputs = tokenizer(user, return_tensors="pt").to(device)
        generate_ids = model.generate(inputs.input_ids, generation_config=generation_config)
        response = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        print("Output:", response)
- LLaMA-Factory
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
python ./src/cli_demo.py \
--model_name_or_path /path_to_your_model \
--template llama2
The instructions you input to the model should follow this format:
[Generate by superfamily] Superfamily=<xxx>
or
[Determine superfamily] Seq=<yyy>
Here are some examples of the input:
[Generate by superfamily] Superfamily=<Ankyrin repeat-containing domain superfamily>
#You can also specify the first few amino acids of the protein sequence:
[Generate by superfamily] Superfamily=<Ankyrin repeat-containing domain superfamily> Seq=<MKRVL
[Determine superfamily] Seq=<MAPGGMPREFPSFVRTLPEADLGYPALRGWVLQGERGCVLYWEAVTEVALPEHCHAECWGVVVDGRMELMVDGYTRVYTRGDLYVVPPQARHRARVFPGFRGVEHLSDPDLLPVRKR>
See this for all the available superfamilies.
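The helper below is a hypothetical convenience for assembling these prompts programmatically; `build_prompt` is not part of the repository.

```python
# Hypothetical helper (not in the repo) for building ProLLaMA prompts in the two formats above.
from typing import Optional

def build_prompt(task: str, superfamily: Optional[str] = None, seq: Optional[str] = None) -> str:
    if task == "generate":
        prompt = f"[Generate by superfamily] Superfamily=<{superfamily}>"
        if seq:
            prompt += f" Seq=<{seq}"  # optionally fix the first few amino acids
        return prompt
    if task == "determine":
        return f"[Determine superfamily] Seq=<{seq}>"
    raise ValueError(f"unknown task: {task}")

print(build_prompt("generate", superfamily="Ankyrin repeat-containing domain superfamily"))
print(build_prompt("generate", superfamily="Ankyrin repeat-containing domain superfamily", seq="MKRVL"))
print(build_prompt("determine", seq="MAPGGMPREFPSFVRTLPEADLGYPALRGWVLQGERGCVLYWEAVTEVALPEHCHAECWGVVVDGRMELMVDGYTRVYTRGDLYVVPPQARHRARVFPGFRGVEHLSDPDLLPVRKR"))
```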
- Prepare the dataset: put your dataset under ./scripts/pretraining_dataset. Your dataset should be one or several txt files, with each line containing one protein sequence in the "Seq=" format. We provide ./scripts/pretraining_dataset/example.txt as an example (a conversion sketch follows these steps).
- Run ./scripts/run_pt.sh
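If your proteins start out in FASTA, the sketch below is one way to convert them into one-sequence-per-line txt files. The paths are placeholders, and the exact line format should be verified against ./scripts/pretraining_dataset/example.txt.

```python
# Sketch: convert a FASTA file into "Seq=<...>" lines for the pretraining dataset.
# Verify the exact line format against ./scripts/pretraining_dataset/example.txt.
def fasta_to_pretraining_txt(fasta_path: str, out_path: str) -> None:
    sequences, current = [], []
    with open(fasta_path) as fin:
        for line in fin:
            line = line.strip()
            if line.startswith(">"):          # header line: flush the previous record
                if current:
                    sequences.append("".join(current))
                    current = []
            elif line:
                current.append(line)
        if current:
            sequences.append("".join(current))
    with open(out_path, "w") as fout:
        for seq in sequences:
            fout.write(f"Seq=<{seq}>\n")

fasta_to_pretraining_txt("my_proteins.fasta", "./scripts/pretraining_dataset/my_proteins.txt")
```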
- Prepare the dataset: download our instruction_dataset from Hugging Face and put the train_split under ./scripts/instruction_tuning_dataset. We provide ./scripts/instruction_tuning_dataset/example.json as an example.
- Run ./scripts/run_it.sh
- If you want to fine-tune our ProLLaMA on your own dataset instead of our instruction_dataset, you should process your data into a format similar to our instruction_dataset (or example.json); a hypothetical sketch follows this list.
- It may be better to fine-tune ProLLaMA_Stage_1 instead of ProLLaMA if your dataset is relatively small and not relevant to superfamily tasks.
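As a rough illustration of what a custom file might look like, the sketch below writes instruction-response records in the prompt format shown above. The field names ("instruction", "output") are assumptions; copy the real schema from ./scripts/instruction_tuning_dataset/example.json.

```python
# Hypothetical custom instruction-tuning file; mirror the real schema of example.json.
import json

records = [
    {
        "instruction": "[Generate by superfamily] Superfamily=<Ankyrin repeat-containing domain superfamily>",
        "output": "Seq=<MKRVL...>",  # illustrative, truncated target sequence
    },
    {
        "instruction": "[Determine superfamily] Seq=<MAPGG...>",  # illustrative, truncated input sequence
        "output": "Ankyrin repeat-containing domain superfamily",
    },
]
with open("./scripts/instruction_tuning_dataset/my_dataset.json", "w") as f:
    json.dump(records, f, indent=2)
```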
ProLLaMA_Stage_1 refers to the model obtained by continual pre-training LLaMA2 on the UniRef50 dataset, as shown in the pipeline. Model Weights
You can use ProLLaMA_Stage_1 in the same way as ProLLaMA. For example:
CUDA_VISIBLE_DEVICES=0 python ./scripts/infer.py --model "GreatCaptainNemo/ProLLaMA_Stage_1" --interactive
#You can replace the model_path with your local path
#Make sure you use only one GPU for inference
#Use "python ./scripts/infer.py -h" for more details
However, ProLLaMA_Stage_1's input format is slightly different from ProLLaMA's, since the former is only trained on pure protein sequences without natural language instructions.
The input format:
Seq=
#You can also specify the first few amino acids of the protein sequence:
Seq=<MAPGGMPRE
You can perform instruction tuning on ProLLaMA_Stage_1 (or ProLLaMA) with your custom datasets, in order to make the model capable of the PLP tasks you are interested in.
We plan to build a more powerful ProLLaMA_Stage_1.
[2025-01-07]
- The PEFT code in src/peft is not used; the directory has been renamed to src/peft(deprecated).
- The checkpoints during training will be saved in ${output_dir}. When "merge_when_finished" is True, the LoRA adapters will be merged into the base model and the merged model will be saved in ${output_dir}_merged, so you can load it directly with transformers.AutoModelForCausalLM.from_pretrained(), as in the sketch below.
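For example, a merged checkpoint can be loaded like any Hugging Face model (the path below is a placeholder):

```python
# Load a merged checkpoint saved in ${output_dir}_merged (placeholder path).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

merged_dir = "/path/to/output_dir_merged"
tokenizer = AutoTokenizer.from_pretrained(merged_dir, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    merged_dir, torch_dtype=torch.bfloat16, device_map="auto"
)
```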
If you find our repo helpful, please consider citing us.
@article{lv2024prollama,
  title={ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing},
  author={Lv, Liuzhenghao and Lin, Zongying and Li, Hao and Liu, Yuyang and Cui, Jiaxi and Chen, Calvin Yu-Chian and Yuan, Li and Tian, Yonghong},
  journal={arXiv preprint arXiv:2402.16445},
  year={2024}
}