Enhancing Safe and Controllable Protein Generation via Knowledge Preference Optimization

This repository is the official implementation of the ACL 2025 paper Enhancing Safe and Controllable Protein Generation via Knowledge Preference Optimization.

Brief Introduction

Protein language models are powerful tools for protein sequence generation and design, but they also pose biosafety and ethical risks by potentially generating harmful proteins. This project introduces a Knowledge-guided Preference Optimization (KPO) framework that leverages a Protein Safety Knowledge Graph and reinforcement learning to guide sequence generation towards safer outcomes.

Environments

To set up the environment for running KPO, install the dependencies with pip install -r requirement.txt (the file in this repository is named requirement.txt, not requirements.txt).

Getting started

This project consists of five main steps: data preparation (including knowledge graph construction), node embedding, graph pruning with preference pair construction, preference optimization, and testing. Please follow the instructions below to get started.

Step 1: Data Preparation

Download Raw Data

Following OntoProtein, download the following files:
- go.obo (Gene Ontology ontology file)
- uniprot_sprot.dat (UniProt Swiss-Prot protein database)
- goa_uniprot_all.gaf (GOA annotation file; please use the latest version)

Generate the Protein Safety Knowledge Graph

Run the following command to parse the above files and generate the updated knowledge graph:
```
python Gen_PSKG.py
```
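As a rough illustration of the kind of parsing this step involves, the sketch below extracts (head, relation, tail) triples from the [Term] stanzas of an OBO file. It is not taken from Gen_PSKG.py; the function name iter_obo_triples and the toy stanza are assumptions made for illustration, and the real script may handle more fields as well as the UniProt and GOA files.

```python
from typing import Iterator, List, Tuple


def iter_obo_triples(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (head, relation, tail) triples from the [Term] stanzas of an OBO file.

    Hypothetical helper for illustration; not part of the KPO code base.
    """
    term_id = None
    in_term = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are used here.
            in_term = line == "[Term]"
            term_id = None
        elif in_term and line.startswith("id: "):
            term_id = line[len("id: "):]
        elif in_term and term_id and line.startswith("is_a: "):
            # "is_a: GO:0000002 ! some name" -> keep only the identifier.
            tail = line[len("is_a: "):].split(" ! ")[0].strip()
            yield (term_id, "is_a", tail)
        elif in_term and term_id and line.startswith("relationship: "):
            # "relationship: part_of GO:0000003 ! some name"
            parts = line[len("relationship: "):].split()
            if len(parts) >= 2:
                yield (term_id, parts[0], parts[1])


# Toy input with made-up names, only to show the stanza layout.
SAMPLE = """\
[Term]
id: GO:0000001
name: toy child term
is_a: GO:0000002 ! toy parent term
relationship: part_of GO:0000003 ! toy whole term

[Typedef]
id: part_of
""".splitlines()

triples = list(iter_obo_triples(SAMPLE))
for triple in triples:
    print(triple)
```

For the real go.obo file, the same function could be called on the lines of the opened file instead of SAMPLE.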

Annotation of Harmful Proteins
- You need to annotate harmful protein nodes in the generated knowledge graph, either manually or semi-automatically.
- Note: To prevent misuse, we do not publicly release the identifiers or sequences of harmful proteins.

Step 2: Obtain Protein Node Embeddings

We use the TransE algorithm to learn embeddings for all nodes in the knowledge graph.
Run the following command to train and generate embeddings:
```
python TransE.py
```
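For readers unfamiliar with TransE, the sketch below shows its core idea in plain Python: a triple (h, r, t) is scored by the distance between h + r and t, and training pushes true triples closer than corrupted ones by a margin. The function names, the toy vectors, and the margin values are assumptions for illustration; TransE.py may differ in norm, sampling, and optimizer.

```python
import math
from typing import Sequence


def transe_distance(h: Sequence[float], r: Sequence[float], t: Sequence[float]) -> float:
    """L2 distance between (h + r) and t; lower means a more plausible triple."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def margin_ranking_loss(pos_dist: float, neg_dist: float, margin: float = 1.0) -> float:
    """Hinge loss that is zero once the corrupted triple is at least `margin` farther away."""
    return max(0.0, margin + pos_dist - neg_dist)


# Toy 2-dimensional embeddings (made-up values).
h = [0.0, 1.0]
r = [1.0, 0.0]
t_true = [1.0, 1.0]      # exactly h + r, so the distance is 0
t_corrupt = [4.0, 5.0]   # corrupted tail, far from h + r

pos = transe_distance(h, r, t_true)
neg = transe_distance(h, r, t_corrupt)
print(pos, neg, margin_ranking_loss(pos, neg))
```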

Step 3: Knowledge Graph Pruning & Preference Pair Construction

Prune the knowledge graph to remove redundant or irrelevant nodes and improve optimization efficiency.
For detailed pruning methods, refer to Section 4.2 "Node Pruning with Weighted Metrics" in the paper.
Run the following command to perform pruning and construct preference pairs:
```
python Construct_Data_Prune.py
```
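The following is a minimal sketch of what weighted-score pruning followed by pair construction could look like. The specific metrics and weights of Section 4.2 are not reproduced here: the two metric values per node, the weights, the identifiers, and the rule that pairs each harmful protein with its retained benign neighbors are all assumptions for illustration, not the logic of Construct_Data_Prune.py.

```python
from typing import Dict, List, Tuple


def prune_nodes(
    metrics: Dict[str, Tuple[float, float]],
    weights: Tuple[float, float],
    keep: int,
) -> List[str]:
    """Rank nodes by a weighted sum of two (hypothetical) metrics and keep the top `keep`."""
    scored = {
        node: weights[0] * m[0] + weights[1] * m[1] for node, m in metrics.items()
    }
    # Sort by descending score; ties are broken by node name for determinism.
    ranked = sorted(scored, key=lambda node: (-scored[node], node))
    return ranked[:keep]


def build_pairs(
    benign_neighbors: Dict[str, List[str]], kept: List[str]
) -> List[Dict[str, str]]:
    """Pair each harmful node (rejected) with its benign neighbors that survived pruning (chosen)."""
    kept_set = set(kept)
    pairs = []
    for harmful, neighbors in benign_neighbors.items():
        for benign in neighbors:
            if benign in kept_set:
                pairs.append({"chosen": benign, "rejected": harmful})
    return pairs


# Made-up node identifiers and metric values.
metrics = {"P_a": (0.9, 0.2), "P_b": (0.1, 0.1), "P_c": (0.5, 0.8)}
kept = prune_nodes(metrics, weights=(0.5, 0.5), keep=2)
pairs = build_pairs({"H_x": ["P_a", "P_b"]}, kept)
print(kept)
print(pairs)
```

In practice the chosen and rejected entries would hold protein sequences rather than identifiers, so that they can be fed to the preference optimization step.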

Step 4: Knowledge-guided Preference Optimization of PLM

The example below uses ProtGPT2 for knowledge-guided preference optimization.
Run the following command to start the optimization process:
```
python KPO_Protgpt2.py
```
For other models such as InstructProtein or ProGen2, use the corresponding scripts (KPO_InstructProtein.py, KPO_Progen2.py) in the same way.
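To make the idea of optimizing on preference pairs concrete, the sketch below computes the Direct Preference Optimization (DPO) loss for a single pair, one widely used preference objective. Whether the KPO scripts use exactly this form is an assumption; the function name, the beta value, and the toy log-probabilities are illustrative only.

```python
import math


def dpo_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one pair, given sequence log-probabilities.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy log-probabilities (made-up values).
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)  # policy equals reference
better = dpo_loss(-5.0, -15.0, -10.0, -10.0)    # policy favors the chosen (safe) sequence
worse = dpo_loss(-15.0, -5.0, -10.0, -10.0)     # policy favors the rejected (harmful) sequence
print(neutral, better, worse)
```

The loss falls below log 2 when the policy raises the likelihood of the chosen sequence relative to the reference model, and rises above it in the opposite case.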

Step 5: Testing

Generate Protein Sequences

Use both the original and optimized PLMs (e.g., ProtGPT2) to generate 1000 protein sequences:
```
python test_Protgpt2.py
```
Note: During generation, the model may output non-amino-acid tokens (possibly due to model size or other limitations). We recommend ignoring these tokens and keeping only valid amino acid sequences.
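One possible way to apply this filtering is sketched below. It first removes special tokens written in angle brackets, then keeps only the 20 standard amino acid letters. The function name clean_sequence and the example string are assumptions for illustration; dropping a whole sequence that contains invalid tokens would be an equally reasonable reading of the note.

```python
import re

# The 20 standard amino acids in one-letter code.
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

# Special tokens such as "<|endoftext|>" must be removed before letter filtering,
# otherwise their letters could be mistaken for residues.
SPECIAL_TOKEN = re.compile(r"<[^>]*>")


def clean_sequence(raw: str) -> str:
    """Return `raw` with special tokens and all non-standard characters removed."""
    without_tokens = SPECIAL_TOKEN.sub("", raw)
    return "".join(ch for ch in without_tokens if ch in VALID_AMINO_ACIDS)


# Made-up model output containing a special token, whitespace, an X, and a digit.
example = "<|endoftext|>MKT\nAXL LV1<|endoftext|>"
cleaned = clean_sequence(example)
print(cleaned)  # MKTALLV
```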

Functional and Safety Evaluation

The generated protein sequences can be evaluated using various tools:
- BLAST: Download
- MMseqs2: GitHub
- ToxinPred3: Online Tool
- HMMER: Download
- Pfam Database: Download
The data for functional evaluation can be found in the data directory.
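Tools such as BLAST, MMseqs2, and HMMER take FASTA input, so the generated sequences usually need to be written in that format first. The sketch below is one simple way to do this; the function name to_fasta, the record prefix, and the line width are assumptions, not part of this repository.

```python
from typing import Iterable


def to_fasta(sequences: Iterable[str], prefix: str = "gen", width: int = 60) -> str:
    """Format sequences as FASTA text with headers >prefix_1, >prefix_2, ..."""
    lines = []
    for index, seq in enumerate(sequences, start=1):
        lines.append(f">{prefix}_{index}")
        # Wrap each sequence at `width` residues per line.
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    return "".join(line + "\n" for line in lines)


# Made-up sequences; a small width is used only to show the wrapping.
fasta_text = to_fasta(["MKTALLV", "ACD"], width=4)
print(fasta_text, end="")
```

The resulting text can be written to a file and passed to the evaluation tool of choice.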

Reference

If you use our repository, please cite the following related paper:

@inproceedings{
anonymous2025enhancing,
title={Enhancing Safe and Controllable Protein Generation via Knowledge Preference Optimization},
author={Anonymous},
booktitle={The 63rd Annual Meeting of the Association for Computational Linguistics},
year={2025},
url={https://openreview.net/forum?id=gydjrQqIue}
}
