THREADER: Enhanced Protein Threading via Double Dynamic Programming

A high-performance implementation of the THREADER algorithm for protein structure prediction through threading, featuring both the original 1998 algorithm and modern 2025 optimizations.

Overview

THREADER is a computational method for protein structure prediction that "threads" a target sequence through known 3D structures (templates) to find compatible folds. This implementation provides:

Faithful 1998 implementation: Strict reproduction of the original Jones et al. algorithm
Enhanced 2025 version: Advanced optimizations including frozen approximation, SS-specific scoring, and iterative refinement
Comprehensive analysis: Z-score validation, detailed reports, and 3D model generation
High-performance computing: GPU acceleration and optimized batch processing

Features

Core Algorithms

Double Dynamic Programming: Low-level and high-level matrix computations
DOPE Statistical Potentials: Distance-dependent pairwise and solvation energies
Secondary Structure Integration: DSSP-based or geometric SS assignment
Z-score Statistical Validation: Empirical significance testing
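
As a rough illustration of the double dynamic programming idea listed above, the sketch below builds a high-level matrix by running a low-level alignment for every (residue, template position) pair, then aligns over that matrix. It is a toy version written for clarity, not the code in this repository: the names `global_align_score`, `double_dp_score`, and `pair_score` are hypothetical, scores are maximised (negate energies to use it with a potential), gaps are linear, and no candidate selection is applied.

```python
import numpy as np

def global_align_score(score, gap):
    """Best global alignment score (maximised) over a score matrix, linear gap cost."""
    n, m = score.shape
    dp = np.zeros((n + 1, m + 1))
    dp[1:, 0] = -gap * np.arange(1, n + 1)
    dp[0, 1:] = -gap * np.arange(1, m + 1)
    for a in range(1, n + 1):
        for b in range(1, m + 1):
            dp[a, b] = max(dp[a - 1, b - 1] + score[a - 1, b - 1],
                           dp[a - 1, b] - gap,
                           dp[a, b - 1] - gap)
    return float(dp[n, m])

def double_dp_score(seq, dist, pair_score, gap=1.0):
    """Toy double DP.

    seq        : target sequence (string)
    dist       : (m, m) array of template inter-residue distances
    pair_score : hypothetical callable (res_a, res_b, distance) -> float, higher is better
    """
    n, m = len(seq), dist.shape[0]
    high = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            # Low-level matrix: score of residue k at position l, given i is fixed at j.
            low = np.array([[pair_score(seq[i], seq[k], dist[j, l])
                             for l in range(m)] for k in range(n)], dtype=float)
            # Force the path through (i, j): align what comes before and after it.
            high[i, j] = (global_align_score(low[:i, :j], gap)
                          + global_align_score(low[i + 1:, j + 1:], gap))
    # High-level DP over the accumulated low-level results.
    return high, global_align_score(high, gap)
```

This version costs O(n²m²) time, which is why practical implementations restrict the low-level step to a subset of candidate pairs.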

2025 Optimizations

Frozen Approximation: Selective detailed calculation for performance
SS-specific Matrices: Differentiated scoring for helices, sheets, and coils
Iterative Refinement: Multi-pass alignment optimization
Enhanced Selection: Improved candidate pair identification

Analysis & Output

Comprehensive Reports: Detailed text analysis with timing metrics
Statistical Plots: Energy breakdowns and Z-score distributions
3D Model Generation: PDB format threaded structures
Batch Processing: Multiple template comparative analysis

Performance Gains (1998 → 2025)

Metric	1998 Mode	2025 Mode	Improvement
Threading Time	1m 17s	160 ms	~480x faster
Final Score	-12,984.875	22.471	Not directly comparable (the two modes use different energy functions)
Sequence Coverage	73.2%	100%	+26.8 percentage points
Z-score	5.0	5.548	+11.0%

Installation

Prerequisites

Python 3.9+
Conda/Mamba or Pixi package manager

Using Pixi (Recommended)

# Install pixi if not available
curl -fsSL https://pixi.sh/install.sh | bash

# Clone the repository
git clone <repository-url>
cd DIABIRA_code

# Install all dependencies from pixi.toml
pixi install

# Activate environment
pixi shell

The pixi.toml file contains all necessary dependencies including:

Python scientific stack (NumPy, SciPy, Pandas)
Bioinformatics tools (BioPython)
Visualization libraries (Matplotlib, Seaborn)
Optional GPU acceleration (CuPy)
Development tools and linters

Using Conda

# Create environment
conda create -n threader python=3.9
conda activate threader

# Install dependencies
pip install numpy scipy pandas matplotlib seaborn biopython

Optional: GPU Support

# For CUDA support
pip install cupy-cuda11x  # or cupy-cuda12x

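# macOS: CuPy targets CUDA, so there is no extra GPU package to install here.
# NumPy runs on the CPU and may link against Apple's Accelerate framework.
pip install numpy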

Quick Start

Single Template Threading

2025 Mode (Optimized):

python main.py target.fasta template.pdb dope.par --out-dir results --mode 2025

1998 Mode (Historical):

python main.py target.fasta template.pdb dope.par --out-dir results --mode 1998

Batch Processing

python main.py target.fasta templates_dir/ dope.par --out-dir batch_results --mode 2025

Usage

Command Line Interface

python main.py <target_sequence> <template_structure> <dope_potentials> --out-dir <output_dir> [OPTIONS]

Required Arguments

target_sequence: Target protein sequence (FASTA format)
template_structure: Template structure (PDB file) or directory for batch mode
dope_potentials: DOPE statistical potentials file

Algorithm Modes

--mode {1998,2025}: Algorithm version (default: 2025)
- 1998: Strict historical implementation
- 2025: All optimizations enabled by default

Optimization Control (2025 Mode)

--no-frozen: Disable frozen approximation
--no-ss-specific: Disable SS-specific matrices
--no-refinement: Disable iterative refinement
--refine-iters N: Refinement iterations (default: 10)
--refine-window N: Refinement window size (default: 7)

Analysis Options

--zscore-analysis: Enable Z-score computation (default: on)
--n-shuffles N: Number of random shuffles for Z-score (default: 100)
--no-zscore: Disable statistical analysis

Output Control

--output-format {txt,plot,pdb}: Output formats (default: txt,plot)
--generate-model: Generate 3D PDB model
--save-plots: Save analysis plots
--model-gaps: Include gaps in 3D model

Computing Options

--gpu: Force GPU usage
--compute-mode {auto,cpu,gpu}: Computing backend
--jobs N: Number of parallel jobs
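
For readers who want to script around the interface, the options above could be mirrored with `argparse` roughly as follows. This is a sketch reconstructed from the documented flags and defaults only, not the parser in `main.py`; `build_parser` is a hypothetical name, only a subset of flags is shown, and the `auto` default for `--compute-mode` is an assumption.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring a subset of the documented options."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("target_sequence", help="target FASTA file")
    parser.add_argument("template_structure", help="template PDB file or directory")
    parser.add_argument("dope_potentials", help="DOPE potentials file")
    parser.add_argument("--out-dir", required=True)
    parser.add_argument("--mode", choices=["1998", "2025"], default="2025")
    # Optimization control (2025 mode)
    parser.add_argument("--no-frozen", action="store_true")
    parser.add_argument("--no-ss-specific", action="store_true")
    parser.add_argument("--no-refinement", action="store_true")
    parser.add_argument("--refine-iters", type=int, default=10)
    parser.add_argument("--refine-window", type=int, default=7)
    # Analysis options
    parser.add_argument("--n-shuffles", type=int, default=100)
    parser.add_argument("--no-zscore", action="store_true")
    # Computing options; the "auto" default is an assumption
    parser.add_argument("--compute-mode", choices=["auto", "cpu", "gpu"], default="auto")
    return parser

args = build_parser().parse_args(
    ["target.fasta", "template.pdb", "dope.par", "--out-dir", "results"]
)
```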

Input File Formats

Target Sequence (FASTA)

>Myoglobin_Human
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASED
LKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPG
DFGADAQGAMNKALELFRKDMASNYKELGFQG
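
A multi-line record like the one above has to be joined into a single sequence before threading. The following minimal parser shows one way to do that with the standard library only; `parse_fasta` is a hypothetical helper, and the project itself may rely on BioPython instead.

```python
def parse_fasta(text):
    """Return {header: sequence} for FASTA text, joining wrapped sequence lines."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}
```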

Template Structure (PDB)

Standard Protein Data Bank format with coordinates and secondary structure information.

DOPE Potentials

Whitespace-separated format, one residue/atom pair per line, followed by the energies for successive distance bins:

ALA CA ALA CA    -0.123 -0.098 -0.076 ...
ALA CA CYS CA    -0.089 -0.067 -0.054 ...
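
To make the layout concrete, here is a small sketch that loads such lines into a lookup table and reads off the energy for a given distance. The helper names are hypothetical, the 0.5 Å bin width is an assumption about the potentials file, and the trailing `...` in the example above is an elision that real lines do not contain.

```python
def load_dope(lines):
    """Map (res1, atom1, res2, atom2) -> list of energies, one per distance bin."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 5 or line.lstrip().startswith("#"):
            continue  # skip blank, comment, or malformed lines
        table[tuple(fields[:4])] = [float(x) for x in fields[4:]]
    return table

def pair_energy(table, res_a, atom_a, res_b, atom_b, distance, bin_width=0.5):
    """Energy for a pair at `distance` (Å); bin_width=0.5 is an assumption."""
    energies = table.get((res_a, atom_a, res_b, atom_b))
    if energies is None:
        # Pairs are treated as symmetric, so try the reversed key.
        energies = table.get((res_b, atom_b, res_a, atom_a))
    if energies is None:
        raise KeyError(f"no potential for {res_a}:{atom_a} / {res_b}:{atom_b}")
    index = int(distance / bin_width)
    # Beyond the last tabulated bin the interaction is taken as zero.
    return energies[index] if index < len(energies) else 0.0
```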

Output Structure

Single Template Mode

results/
├── report_template.txt              # Detailed analysis report
├── alignment_template_2025.tsv      # Sequence-structure alignment
├── threading_plot_template.png      # Energy and coverage analysis
├── zscore_distribution.png          # Statistical significance plot
└── threaded_model_template_2025.pdb # 3D structure (if enabled)

Batch Mode

batch_results/
├── target_vs_template1/             # Individual results
├── target_vs_template2/
├── ...
└── summary_batch_results.csv        # Comparative analysis

Algorithm Details

1998 Mode: Historical Implementation

Gap Penalties: High penalties (7.0 for H/E, 1.5 for coils)
Distance Cutoff: 7.5 Å contact definition
Selection Method: Top-K per template residue (K=30)
No Identity Bonus: Pure energy minimization
SS Assignment: Geometric prediction only
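
Two of the 1998-mode settings above lend themselves to a short sketch: the secondary-structure-dependent gap penalty and the Top-K candidate selection. The code assumes a preliminary energy matrix in which lower values are better; `gap_penalty` and `top_k_candidates` are hypothetical names, not functions from this repository.

```python
import numpy as np

# Penalties quoted above: 7.0 inside helices (H) and strands (E), 1.5 in coil.
GAP_PENALTY = {"H": 7.0, "E": 7.0, "C": 1.5}

def gap_penalty(ss):
    """Gap penalty for a template position; unknown SS codes are treated as coil."""
    return GAP_PENALTY.get(ss, GAP_PENALTY["C"])

def top_k_candidates(energies, k=30):
    """For each template position (column), return the k sequence residues
    (row indices) with the lowest preliminary energy.

    energies[i, j] is assumed to be the cost of placing residue i at position j.
    """
    n_seq, n_tmpl = energies.shape
    k = min(k, n_seq)
    order = np.argsort(energies, axis=0, kind="stable")  # ascending per column
    return [order[:k, j].tolist() for j in range(n_tmpl)]
```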

2025 Mode: Enhanced Algorithm

Frozen Approximation: Selective detailed calculation (10% initially, 30% refinement)
SS-Specific Matrices: Differentiated substitution matrices for H/E/C
Iterative Refinement: Multi-pass optimization with early stopping
Identity Bonus: Enhanced scoring for identical residues
Extended Cutoffs: 10 Å interaction radius
DSSP Integration: Accurate secondary structure assignment
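
The frozen approximation, as described here, spends the expensive calculation on only a fraction of the candidate cells. One way this could look is sketched below: a cheap energy is kept for most cells and replaced by a detailed value for the best-ranked fraction. The function name, the `detailed_fn` callable, and the ranking by cheap energy are illustrative assumptions.

```python
import numpy as np

def frozen_approximation(cheap, detailed_fn, fraction=0.10):
    """Refine only the best `fraction` of cells with a detailed calculation.

    cheap       : (n, m) array of inexpensive energy estimates (lower is better)
    detailed_fn : hypothetical callable (i, j) -> float, the expensive evaluation
    fraction    : share of cells to refine (0.10 initially, 0.30 during refinement)
    """
    n_refine = max(1, int(round(fraction * cheap.size)))
    # Flat indices of the lowest cheap energies, best first.
    best = np.argsort(cheap, axis=None, kind="stable")[:n_refine]
    refined = cheap.astype(float).copy()
    for flat_index in best:
        i, j = np.unravel_index(flat_index, cheap.shape)
        refined[i, j] = detailed_fn(int(i), int(j))
    return refined, n_refine
```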

Energy Function

1998 Mode:

Z(m,n,p,q) = Σ E_pair(r_i, r_j, d_pq) + E_solv(r_i, acc_p)
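
As a sketch of how an energy of this form can be evaluated for a complete alignment, the function below sums a solvation term per aligned residue and a pairwise term for each aligned pair within the contact cutoff. It assumes each pair is counted once and uses the 7.5 Å cutoff quoted for 1998 mode; `e_pair` and `e_solv` are hypothetical callables standing in for the DOPE lookups.

```python
import numpy as np

def threading_energy(alignment, seq, coords, accessibility, e_pair, e_solv, cutoff=7.5):
    """Total energy of an alignment (lower is better).

    alignment     : list of (sequence_index, template_index) pairs
    coords        : (m, 3) array of template coordinates, e.g. C-alpha atoms
    accessibility : per-position solvent accessibility of the template
    e_pair        : hypothetical callable (res_a, res_b, distance) -> float
    e_solv        : hypothetical callable (res, accessibility) -> float
    """
    total = 0.0
    for a, (i, p) in enumerate(alignment):
        total += e_solv(seq[i], accessibility[p])
        for j, q in alignment[a + 1:]:  # each pair counted once (assumption)
            d = float(np.linalg.norm(coords[p] - coords[q]))
            if d <= cutoff:
                total += e_pair(seq[i], seq[j], d)
    return total
```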

2025 Mode:

Z_approx(m,n,p,q) = α·E_local(r_m, r_p) + β·Σ E_simplified(r_k, d_k) + γ·E_ss_specific(r_i, ss_j)

Performance Optimization

GPU Acceleration

Automatic detection and utilization of:

CUDA (NVIDIA)
Metal (Apple Silicon)
OpenCL (Cross-platform)

Memory Management

Adaptive batch sizing
Memory-mapped operations
Efficient matrix operations

Batch Optimization

Reduced Z-score shuffles (50 vs 100)
Disabled 3D model generation
Parallel template processing

Validation and Quality Control

Z-score Analysis

Statistical significance testing using:

Random sequence shuffling
Empirical p-value calculation
Normal distribution fitting
One-sided and two-sided tests

Interpretation Guidelines

Z < -4.0: Highly significant threading
Z < -3.0: Significant threading
Z < -2.0: Moderately significant
Z < -1.0: Weakly significant
Z ≥ -1.0: Not significant
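
The shuffle-based Z-score and the thresholds above can be illustrated with a few lines of standard-library Python. This is a minimal sketch, assuming that lower scores are better and that `score_fn` (a hypothetical callable) returns the threading energy of a sequence against a fixed template; it does not include the normal-distribution fit or the two-sided test.

```python
import random
import statistics

def shuffle_zscore(sequence, score_fn, n_shuffles=100, seed=0):
    """Z-score and empirical one-sided p-value of the native score
    against scores of randomly shuffled sequences."""
    rng = random.Random(seed)
    native = score_fn(sequence)
    residues = list(sequence)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # same composition, random order
        background.append(score_fn("".join(residues)))
    sd = statistics.stdev(background)
    if sd == 0:
        raise ValueError("shuffled scores have zero variance")
    z = (native - statistics.fmean(background)) / sd
    # Add-one correction keeps the empirical p-value strictly positive.
    p = (sum(s <= native for s in background) + 1) / (n_shuffles + 1)
    return z, p

def interpret(z):
    """Map a Z-score to the categories listed above."""
    if z < -4.0:
        return "highly significant"
    if z < -3.0:
        return "significant"
    if z < -2.0:
        return "moderately significant"
    if z < -1.0:
        return "weakly significant"
    return "not significant"
```

With `n_shuffles=100` the smallest attainable empirical p-value is 1/101, which is one reason to raise the shuffle count when a finer p-value is needed.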

Quality Metrics

Sequence Coverage: Fraction of target residues aligned
Structure Coverage: Fraction of template positions used
Energy Decomposition: Pairwise, solvation, and SS-specific contributions
Secondary Structure Conservation: SS-specific alignment quality
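
The two coverage metrics follow directly from the alignment. A minimal sketch, assuming the alignment is a list of (sequence index, template index) pairs in which `None` marks a gap; this representation and the name `coverage` are assumptions, not the project's data model.

```python
def coverage(alignment, target_length, template_length):
    """Return (sequence_coverage, structure_coverage) as fractions in [0, 1]."""
    aligned = [(i, j) for i, j in alignment if i is not None and j is not None]
    sequence_coverage = len({i for i, _ in aligned}) / target_length
    structure_coverage = len({j for _, j in aligned}) / template_length
    return sequence_coverage, structure_coverage
```

For example, three aligned pairs between a 4-residue target and a 5-position template give a sequence coverage of 0.75 and a structure coverage of 0.6.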

Examples

Basic Threading Analysis

# Human myoglobin vs sperm whale myoglobin template
python main.py data/test_myoglobin.fasta data/1MBN.pdb data/dope.par \
  --out-dir myoglobin_analysis --mode 2025 --use-dssp

Comparative Study (1998 vs 2025)

# 1998 mode
python main.py data/target.fasta data/template.pdb data/dope.par \
  --out-dir results_1998 --mode 1998

# 2025 mode  
python main.py data/target.fasta data/template.pdb data/dope.par \
  --out-dir results_2025 --mode 2025

High-Throughput Batch Analysis

# Process multiple templates
python main.py data/target.fasta data/template_library/ data/dope.par \
  --out-dir batch_screening --mode 2025 --n-shuffles 30

Custom Optimization Settings

# Fine-tuned 2025 mode
python main.py data/target.fasta data/template.pdb data/dope.par \
  --out-dir custom_analysis --mode 2025 \
  --refine-iters 15 --refine-window 9 --n-shuffles 200

Troubleshooting

Common Issues

DSSP Not Found:

# Install DSSP
conda install -c conda-forge dssp
# or use geometric fallback
python main.py ... --mode 1998  # No DSSP by default

GPU Memory Issues:

# Force CPU mode
python main.py ... --compute-mode cpu

Large Batch Processing:

# Reduce memory usage
python main.py ... --n-shuffles 30 --no-refinement

Performance Tips

Use GPU for large sequences (>200 residues)
Enable frozen approximation for speed
Reduce Z-score shuffles for batch mode
Disable 3D models for high-throughput screening

Algorithm References

Original THREADER: Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992). A new approach to protein fold recognition. Nature 358, 86-89.
Double Dynamic Programming: Jones, D.T. (1998). THREADER: protein sequence threading by double dynamic programming. In Computational Methods in Molecular Biology (Salzberg, S., Searls, D., and Kasif, S., eds.), Elsevier, 285-311.
DOPE Potentials: Shen, M.Y. and Sali, A. (2006). Statistical potential for assessment and prediction of protein structures. Protein Science 15, 2507-2524.

License

This implementation is provided for academic and research purposes. Please cite appropriate references when using this software in published work.

Contact & Support

For questions, bug reports, or feature requests, please refer to the project documentation or contact the development team.

Note: This is a research implementation designed for academic use. For production applications, consider additional validation and optimization based on your specific requirements.
