InstructNA

Functional nucleic acids (FNAs) are essential for designing advanced molecular tools across multiple fields, yet their de novo design faces challenges due to the vast sequence space and inefficiency of experimental screening methods. Nucleic acid large language models (NA-LLMs) offer new opportunities for FNA design, but their generative capabilities remain underexplored. Here, we introduce InstructNA, a novel framework that leverages NA-LLMs augmented with high-throughput SELEX (HT-SELEX) to guide de novo design of FNAs without relying on structural information.

Environment Setup

OS: Ubuntu 20.04
Python: 3.9
CUDA: 12.6
Main Dependencies:
- torch==2.7.1
- transformers==4.53.2
- HEBO==0.3.6

Create Environment with Conda

conda create -n InstructNA python=3.9
conda activate InstructNA
pip install -r requirements.txt

Training

InstructNA is trained in two stages: the encoder is trained first, and the decoder is then trained on top of the saved encoder weights.

1. Training encoder

python examples/DNABERT_3mers/Train.py \
    --dataset_dir /path/to/unique.csv \
    --batchsize 512 \
    --train_val_split_ratio 0.9 \
    --train_epoches 50 \
    --dataloader_num_worker 64 \
    --dataset_preprocess_num_worker 8 \
    --evaluate_per_step 10 \
    --random_seeds 42 \
    --checkpoint_save_path /path/to/save_model \
    --max_checkpoint_save_num 3 \
    --wandb_exp_name None \
    --warmup_steps 100 \
    --gradient_accumulation_steps 1 \
    --lr 1e-4 \
    --device cuda \
    --train_encoder_or_decoder train_encoder \
    --fintune_encoder_weight_path None

2. Training decoder

python examples/DNABERT_3mers/Train.py \
    --dataset_dir /path/to/unique.csv \
    --batchsize 512 \
    --train_val_split_ratio 0.9 \
    --train_epoches 50 \
    --dataloader_num_worker 64 \
    --dataset_preprocess_num_worker 8 \
    --evaluate_per_step 10 \
    --random_seeds 42 \
    --checkpoint_save_path /path/to/save_model \
    --max_checkpoint_save_num 3 \
    --wandb_exp_name None \
    --warmup_steps 100 \
    --gradient_accumulation_steps 1 \
    --lr 1e-4 \
    --device cuda \
    --train_encoder_or_decoder train_decoder \
    --fintune_encoder_weight_path /path/to/encoder_model_weight
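
The two training commands differ only in the values of --train_encoder_or_decoder and --fintune_encoder_weight_path. As a minimal sketch (not part of the repository), the shared flags can be kept in one place and each stage's command assembled in Python. The helper name build_train_command is hypothetical; the flags and values are copied from the commands above.

```python
import shlex

# Flags shared by both stages, copied verbatim from the commands above.
SHARED_ARGS = {
    "--dataset_dir": "/path/to/unique.csv",
    "--batchsize": "512",
    "--train_val_split_ratio": "0.9",
    "--train_epoches": "50",
    "--dataloader_num_worker": "64",
    "--dataset_preprocess_num_worker": "8",
    "--evaluate_per_step": "10",
    "--random_seeds": "42",
    "--checkpoint_save_path": "/path/to/save_model",
    "--max_checkpoint_save_num": "3",
    "--wandb_exp_name": "None",
    "--warmup_steps": "100",
    "--gradient_accumulation_steps": "1",
    "--lr": "1e-4",
    "--device": "cuda",
}


def build_train_command(stage, encoder_weights="None"):
    """Return the argument list for one training stage (hypothetical helper)."""
    if stage not in ("train_encoder", "train_decoder"):
        raise ValueError(f"unknown stage: {stage}")
    args = dict(SHARED_ARGS)
    # The only two flags whose values change between the stages.
    args["--train_encoder_or_decoder"] = stage
    args["--fintune_encoder_weight_path"] = encoder_weights
    cmd = ["python", "examples/DNABERT_3mers/Train.py"]
    for flag, value in args.items():
        cmd += [flag, value]
    return cmd


encoder_cmd = build_train_command("train_encoder")
decoder_cmd = build_train_command("train_decoder", "/path/to/encoder_model_weight")
print(shlex.join(encoder_cmd))
print(shlex.join(decoder_cmd))
```

Either list can be passed to subprocess.run(cmd, check=True) to launch the stage, so the decoder run only starts if the encoder run succeeded.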

Inference

1. Construct seeding sequences

The seeding sequences are drawn from several sources:

1.1 The top 10 most frequent sequences from SELEX data.

python examples/DNABERT_3mers/seeds_construct/get_top10_from_fastq.py \
    --fastq_file /path/to/input.fastq \
    --output_csv /path/to/output_top10.csv
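
For readers who want to see what this step amounts to, the following is a minimal sketch of frequency ranking over a FASTQ file. It is not the repository's script: it assumes plain four-line FASTQ records and does no primer trimming or quality filtering, and the function names are hypothetical.

```python
from collections import Counter
from itertools import islice


def iter_fastq_sequences(lines):
    """Yield the sequence line of each four-line FASTQ record."""
    # Lines 1, 5, 9, ... (0-based) hold the sequences.
    for line in islice(lines, 1, None, 4):
        seq = line.strip().upper()
        if seq:
            yield seq


def top_sequences(lines, n=10):
    """Return the n most frequent sequences as (sequence, count) pairs."""
    return Counter(iter_fastq_sequences(lines)).most_common(n)


# Tiny in-memory example; for a real file use: with open(path) as fh: top_sequences(fh)
demo = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "ACGT", "+", "IIII",
    "@r3", "TTGA", "+", "IIII",
]
top = top_sequences(demo, n=2)
print(top)
```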

1.2 Sequences generated based on the Gaussian Mixture Model (GMM) cluster centers derived from the SELEX dataset.

python examples/DNABERT_3mers/seeds_construct/get_GMM_center_seqs.py \
  --sequences_dir /path/to/unique_seqs.csv \
  --tSNE_visual False \
  --encoder_model_path /path/to/encoder_model \
  --decoder_model_path /path/to/decoder_model \
  --output_dir /path/to/output

1.3 Sequences from the SELEX dataset whose embeddings are proximal to the high-functional sequences identified in 1.1 and 1.2.

python examples/DNABERT_3mers/seeds_construct/get_near_seqs.py \
  --seqs_fre_dir /path/to/seqs_fre.csv \
  --seq_acts /path/to/seqs_act.csv \
  --encoder_model_path /path/to/encoder_model \
  --decoder_model_path /path/to/decoder_model \
  --output_dir /path/to/output
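
The idea of "proximal embeddings" can be illustrated with a small nearest-neighbor sketch. This is an illustration only: in the real workflow the embeddings would come from the trained encoder, whereas here they are toy 2-D vectors, and the use of Euclidean distance and the function name nearest_sequences are assumptions rather than details taken from the repository.

```python
import math


def nearest_sequences(references, candidates, k=3):
    """Rank candidates by distance to their closest reference embedding.

    references: list of embedding vectors of high-functional sequences.
    candidates: dict mapping sequence -> embedding vector.
    """
    scored = []
    for seq, emb in candidates.items():
        closest = min(math.dist(emb, ref) for ref in references)
        scored.append((closest, seq))
    scored.sort()
    return [seq for _, seq in scored[:k]]


# Toy embeddings standing in for encoder outputs.
references = [(0.0, 0.0), (10.0, 10.0)]
candidates = {
    "ACGT": (0.5, 0.0),
    "TTGA": (5.0, 5.0),
    "GGCC": (10.0, 9.0),
    "CATG": (3.0, 0.0),
}
near = nearest_sequences(references, candidates, k=2)
print(near)
```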

1.4 After obtaining the sequences from steps 1.1, 1.2, and 1.3, cluster them to get 10 representative sequences using the following script:

python examples/DNABERT_3mers/seeds_construct/get_final_seeds.py \
  --seq_acts /path/to/Seeds_act.csv \
  --encoder_model_path /path/to/encoder_model \
  --decoder_model_path /path/to/decoder_model \
  --output_dir /path/to/output_dir
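
The clustering method used by the script is not stated here. As a stand-in, the sketch below uses greedy farthest-point selection, a different and simpler technique, to show what reducing a pool of embeddings to a few mutually distant representatives looks like. The function name and the toy 2-D points are hypothetical.

```python
import math


def pick_representatives(embeddings, k):
    """Greedy farthest-point selection: return indices of k spread-out points."""
    if not 0 < k <= len(embeddings):
        raise ValueError("k must be between 1 and the number of embeddings")
    chosen = [0]  # start from the first point
    # Distance from every point to its nearest already-chosen point.
    min_dist = [math.dist(e, embeddings[0]) for e in embeddings]
    while len(chosen) < k:
        # Pick the point that is farthest from everything chosen so far.
        nxt = max(range(len(embeddings)), key=min_dist.__getitem__)
        chosen.append(nxt)
        min_dist = [
            min(d, math.dist(e, embeddings[nxt]))
            for d, e in zip(min_dist, embeddings)
        ]
    return chosen


points = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (10, 0)]
reps = pick_representatives(points, 3)
print(reps)
```

Near-duplicate points such as (0, 0) and (0.1, 0) never both get selected while more distant candidates remain, which is the property one wants from a set of representative seeds.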

2. Optimize the seeding sequences as follows

python examples/DNABERT_3mers/single_BO_inference.py \
  --SELEX_path /path/to/SELEX_unique_seqs.csv \
  --seq_act_path /path/to/Seeds_act.csv \
  --encoder_model_path /path/to/encoder_checkpoint \
  --decoder_model_path /path/to/decoder_checkpoint \
  --BO_output_dir /path/to/output_dir \
  --search_r 5.0 \
  --max False \
  --use_filter False \
  --HC_HEBO_batchsize 20 \
  --f_primer 5_primer \
  --r_primer 3_primer
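
One plausible reading of --search_r is that it bounds a neighborhood around each seed embedding in which new candidates are proposed; this interpretation is an assumption and is not confirmed by the text above. Under that assumption, the sketch below draws candidate points uniformly from a ball around a seed. The real script presumably lets the optimizer choose candidates rather than sampling at random, and the 8-dimensional zero vector is a placeholder for an encoder embedding.

```python
import math
import random


def sample_in_ball(center, radius, n, rng):
    """Draw n points uniformly from the ball of the given radius around center."""
    dim = len(center)
    points = []
    for _ in range(n):
        # A normalized Gaussian vector gives a uniformly random direction.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in direction))
        # Scaling the radius by u**(1/dim) makes the points uniform in volume.
        scale = radius * rng.random() ** (1.0 / dim) / norm
        points.append([c + scale * x for c, x in zip(center, direction)])
    return points


rng = random.Random(42)
seed_embedding = [0.0] * 8  # placeholder for an encoder embedding
candidates = sample_in_ball(seed_embedding, radius=5.0, n=20, rng=rng)
print(max(math.dist(p, seed_embedding) for p in candidates))
```

The values radius=5.0 and n=20 mirror --search_r and --HC_HEBO_batchsize from the command above purely for illustration.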

A pipeline for optimizing public transcription factor binding specificity using InstructNA

To validate the performance of InstructNA, we use the public SELEX datasets from "DNA-Binding Specificities of Human Transcription Factors" and the PBM data from "Evaluation of methods for modeling transcription factor sequence specificity". The pipeline script is as follows:

python examples/DNABERT_3mers/TFs_InstrcutNA_pipeline.py \
  --fastq_dir /path/to/SELEX_fastq \
  --BO_type HC-HEBO \
  --label_dir /path/to/PBM_data \
  --encoder_model_path /path/to/encoder_checkpoint \
  --decoder_model_path /path/to/decoder_checkpoint \
  --BO_output_dir /path/to/output_dir \
  --init_search_r 5.0 \
  --min_r 1.25 \
  --BO_cycle_nums 10 \
  --bo_batchsize 10

