
THREADER: Enhanced Protein Threading via Double Dynamic Programming

A high-performance implementation of the THREADER algorithm for protein structure prediction through threading, featuring both the original 1998 algorithm and modern 2025 optimizations.

Overview

THREADER is a computational method for protein structure prediction that "threads" a target sequence through known 3D structures (templates) to find compatible folds. This implementation provides:

  • Faithful 1998 implementation: Strict reproduction of the original Jones et al. algorithm
  • Enhanced 2025 version: Advanced optimizations including frozen approximation, SS-specific scoring, and iterative refinement
  • Comprehensive analysis: Z-score validation, detailed reports, and 3D model generation
  • High-performance computing: GPU acceleration and optimized batch processing

Features

Core Algorithms

  • Double Dynamic Programming: Low-level and high-level matrix computations
  • DOPE Statistical Potentials: Distance-dependent pairwise and solvation energies
  • Secondary Structure Integration: DSSP-based or geometric SS assignment
  • Z-score Statistical Validation: Empirical significance testing
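The two-level idea behind double dynamic programming can be sketched as follows. This is a minimal illustration, not the code in this repository: the function names (`global_dp`, `low_level`, `double_dp`), the toy scoring function, and the linear gap penalty are all assumptions chosen for brevity. For every candidate pair (target residue i, template position j), a low-level alignment is forced through that pair and its score fills one cell of the high-level matrix; a second alignment over that matrix gives the final score.

```python
from typing import Callable, List, Sequence, Tuple

Score = Callable[[int, int], float]


def global_dp(n: int, m: int, score: Score, gap: float) -> float:
    """Best global alignment score of n rows vs m columns (higher is better)."""
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for a in range(1, n + 1):
        D[a][0] = -gap * a
    for b in range(1, m + 1):
        D[0][b] = -gap * b
    for a in range(1, n + 1):
        for b in range(1, m + 1):
            D[a][b] = max(
                D[a - 1][b - 1] + score(a - 1, b - 1),  # align a-1 with b-1
                D[a - 1][b] - gap,                      # gap in the template
                D[a][b - 1] - gap,                      # gap in the target
            )
    return D[n][m]


def low_level(i: int, j: int, seq: str, dist: Sequence[Sequence[float]],
              pair_score: Callable[[str, str, float], float], gap: float) -> float:
    """Score of the best alignment forced to place target i on template j."""
    n, m = len(seq), len(dist)

    def s(k: int, l: int) -> float:
        # Interaction of residue k (placed on position l) with the fixed pair.
        return pair_score(seq[i], seq[k], dist[j][l])

    before = global_dp(i, j, s, gap)
    after = global_dp(n - i - 1, m - j - 1,
                      lambda a, b: s(i + 1 + a, j + 1 + b), gap)
    return before + after


def double_dp(seq: str, dist: Sequence[Sequence[float]],
              pair_score: Callable[[str, str, float], float],
              gap: float = 1.0) -> Tuple[float, List[List[float]]]:
    """Fill the high-level matrix from low-level runs, then align over it."""
    n, m = len(seq), len(dist)
    H = [[low_level(i, j, seq, dist, pair_score, gap) for j in range(m)]
         for i in range(n)]
    return global_dp(n, m, lambda i, j: H[i][j], gap), H


def toy_pair_score(a: str, b: str, d: float) -> float:
    """Hypothetical score: reward two hydrophobic residues in contact."""
    hydrophobic = set("AVLIMFW")
    return 1.0 if a in hydrophobic and b in hydrophobic and d < 7.5 else 0.0


toy_seq = "LAVK"
toy_dist = [[0.0, 5.0, 9.0], [5.0, 0.0, 5.0], [9.0, 5.0, 0.0]]
total, H = double_dp(toy_seq, toy_dist, toy_pair_score)
```

Evaluating every pair this way costs O(n²m²) time, which is why candidate selection matters in practice.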

2025 Optimizations

  • Frozen Approximation: Selective detailed calculation for performance
  • SS-specific Matrices: Differentiated scoring for helices, sheets, and coils
  • Iterative Refinement: Multi-pass alignment optimization
  • Enhanced Selection: Improved candidate pair identification

Analysis & Output

  • Comprehensive Reports: Detailed text analysis with timing metrics
  • Statistical Plots: Energy breakdowns and Z-score distributions
  • 3D Model Generation: PDB format threaded structures
  • Batch Processing: Multiple template comparative analysis

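Performance Gains (1998 → 2025)

| Metric            | 1998 Mode   | 2025 Mode | Improvement      |
|-------------------|-------------|-----------|------------------|
| Threading Time    | 1m 17s      | 160 ms    | 483x faster      |
| Final Score       | -12,984.875 | 22.471    | Favorable energy |
| Sequence Coverage | 73.2%       | 100%      | +26.8%           |
| Z-score           | 5.0         | 5.548     | +10.9%           |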

Installation

Prerequisites

  • Python 3.9+
  • Conda/Mamba or Pixi package manager

Using Pixi (Recommended)

# Install pixi if not available
curl -fsSL https://pixi.sh/install.sh | bash

# Clone the repository
git clone <repository-url>
cd DIABIRA_code

# Install all dependencies from pixi.toml
pixi install

# Activate environment
pixi shell

The pixi.toml file contains all necessary dependencies including:

  • Python scientific stack (NumPy, SciPy, Pandas)
  • Bioinformatics tools (BioPython)
  • Visualization libraries (Matplotlib, Seaborn)
  • Optional GPU acceleration (CuPy)
  • Development tools and linters

Using Conda

# Create environment
conda create -n threader python=3.9
conda activate threader

# Install dependencies
pip install numpy scipy pandas matplotlib seaborn biopython

Optional: GPU Support

# For CUDA support
pip install cupy-cuda11x  # or cupy-cuda12x

# For macOS (Apple Silicon)
pip install numpy  # CPU path; NumPy can use Apple's Accelerate framework, which is not a Metal GPU backend

Quick Start

Single Template Threading

2025 Mode (Optimized):

python main.py target.fasta template.pdb dope.par --out-dir results --mode 2025

1998 Mode (Historical):

python main.py target.fasta template.pdb dope.par --out-dir results --mode 1998

Batch Processing

python main.py target.fasta templates_dir/ dope.par --out-dir batch_results --mode 2025

Usage

Command Line Interface

python main.py <target_sequence> <template_structure> <dope_potentials> --out-dir <output_dir> [OPTIONS]

Required Arguments

  • target_sequence: Target protein sequence (FASTA format)
  • template_structure: Template structure (PDB file) or directory for batch mode
  • dope_potentials: DOPE statistical potentials file

Algorithm Modes

  • --mode {1998,2025}: Algorithm version (default: 2025)
    • 1998: Strict historical implementation
    • 2025: All optimizations enabled by default

Optimization Control (2025 Mode)

  • --no-frozen: Disable frozen approximation
  • --no-ss-specific: Disable SS-specific matrices
  • --no-refinement: Disable iterative refinement
  • --refine-iters N: Refinement iterations (default: 10)
  • --refine-window N: Refinement window size (default: 7)

Analysis Options

  • --zscore-analysis: Enable Z-score computation (default: on)
  • --n-shuffles N: Number of random shuffles for Z-score (default: 100)
  • --no-zscore: Disable statistical analysis

Output Control

  • --output-format {txt,plot,pdb}: Output formats (default: txt,plot)
  • --generate-model: Generate 3D PDB model
  • --save-plots: Save analysis plots
  • --model-gaps: Include gaps in 3D model

Computing Options

  • --gpu: Force GPU usage
  • --compute-mode {auto,cpu,gpu}: Computing backend
  • --jobs N: Number of parallel jobs
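As a rough guide to how these options fit together, the sketch below rebuilds part of the documented interface with `argparse`. It is an assumption about how such a parser could look, not a copy of `main.py`: only flags listed above are used, and the defaults for `--compute-mode` and `--jobs` are guesses because this README does not state them.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring a subset of the documented options."""
    p = argparse.ArgumentParser(prog="main.py",
                                description="Protein threading (sketch)")
    # Required positional arguments
    p.add_argument("target_sequence", help="Target sequence (FASTA)")
    p.add_argument("template_structure", help="Template PDB file or directory")
    p.add_argument("dope_potentials", help="DOPE potentials file")
    p.add_argument("--out-dir", required=True, help="Output directory")
    # Algorithm mode and 2025 optimization controls
    p.add_argument("--mode", choices=["1998", "2025"], default="2025")
    p.add_argument("--no-frozen", action="store_true")
    p.add_argument("--no-ss-specific", action="store_true")
    p.add_argument("--no-refinement", action="store_true")
    p.add_argument("--refine-iters", type=int, default=10)
    p.add_argument("--refine-window", type=int, default=7)
    # Analysis options
    p.add_argument("--n-shuffles", type=int, default=100)
    p.add_argument("--no-zscore", action="store_true")
    # Computing options (defaults below are assumptions)
    p.add_argument("--compute-mode", choices=["auto", "cpu", "gpu"],
                   default="auto")
    p.add_argument("--jobs", type=int, default=1)
    return p


args = build_parser().parse_args(
    ["target.fasta", "template.pdb", "dope.par",
     "--out-dir", "results", "--mode", "1998", "--n-shuffles", "30"]
)
```

Note that `argparse` exposes `--out-dir` as `args.out_dir` and `--n-shuffles` as `args.n_shuffles`.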

Input File Formats

Target Sequence (FASTA)

>Myoglobin_Human
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASED
LKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPG
DFGADAQGAMNKALELFRKDMASNYKELGFQG
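For reference, a multi-line FASTA record like the one above can be read with a few lines of standard-library Python. This is a minimal sketch; `read_fasta` is a hypothetical helper, and the project itself may rely on BioPython instead.

```python
from typing import Dict, List


def read_fasta(text: str) -> Dict[str, str]:
    """Map each record ID (first word after '>') to its joined sequence."""
    records: Dict[str, List[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record ID.
            current = (line[1:].split() or [""])[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


example = """>Myoglobin_Human
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASED
LKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPG
DFGADAQGAMNKALELFRKDMASNYKELGFQG
"""
sequences = read_fasta(example)
```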

Template Structure (PDB)

Standard Protein Data Bank format with coordinates and secondary structure information.

DOPE Potentials

Tab-separated format with residue pairs, atom types, and distance-dependent energies:

ALA CA ALA CA    -0.123 -0.098 -0.076 ...
ALA CA CYS CA    -0.089 -0.067 -0.054 ...
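A line in this layout can be split into a four-part key and a list of energies, as in the sketch below. The helper names are hypothetical, the parser assumes whitespace-separated fields, and the 0.5 Å bin width used for the lookup is an assumption that should be checked against the actual potentials file.

```python
from typing import List, Tuple

DopeKey = Tuple[str, str, str, str]


def parse_dope_line(line: str) -> Tuple[DopeKey, List[float]]:
    """Split 'RES1 ATOM1 RES2 ATOM2 e0 e1 ...' into a key and energies."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"expected 4 labels and at least 1 energy: {line!r}")
    key = (fields[0], fields[1], fields[2], fields[3])
    energies = [float(x) for x in fields[4:]]
    return key, energies


def energy_at(energies: List[float], distance: float,
              bin_width: float = 0.5) -> float:
    """Look up the energy bin for a distance; 0.0 outside the tabulated range.

    bin_width is an assumed value, not taken from this README.
    """
    idx = int(distance // bin_width)
    return energies[idx] if 0 <= idx < len(energies) else 0.0


key, energies = parse_dope_line("ALA CA CYS CA    -0.089 -0.067 -0.054")
```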

Output Structure

Single Template Mode

results/
├── report_template.txt              # Detailed analysis report
├── alignment_template_2025.tsv      # Sequence-structure alignment
├── threading_plot_template.png      # Energy and coverage analysis
├── zscore_distribution.png          # Statistical significance plot
└── threaded_model_template_2025.pdb # 3D structure (if enabled)

Batch Mode

batch_results/
├── target_vs_template1/             # Individual results
├── target_vs_template2/
├── ...
└── summary_batch_results.csv        # Comparative analysis

Algorithm Details

1998 Mode: Historical Implementation

  • Gap Penalties: High penalties (7.0 for H/E, 1.5 for coils)
  • Distance Cutoff: 7.5 Å contact definition
  • Selection Method: Top-K per template residue (K=30)
  • No Identity Bonus: Pure energy minimization
  • SS Assignment: Geometric prediction only
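Two of the settings above, the 7.5 Å contact cutoff and the Top-K selection with K=30, can be illustrated as follows. This is a sketch under stated assumptions: contacts are measured between one representative atom per residue, "top" means lowest energy, and both function names are hypothetical.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float, float]


def contact_pairs(coords: Sequence[Coord],
                  cutoff: float = 7.5) -> List[Tuple[int, int]]:
    """Residue index pairs whose representative atoms lie within the cutoff."""
    n = len(coords)
    return [(a, b) for a in range(n) for b in range(a + 1, n)
            if math.dist(coords[a], coords[b]) < cutoff]


def top_k_per_template(energy: Sequence[Sequence[float]],
                       k: int = 30) -> Dict[int, List[int]]:
    """For each template position j, keep the k target residues of lowest energy.

    energy[i][j] is the energy of placing target residue i on template position j.
    """
    n_target = len(energy)
    n_template = len(energy[0]) if n_target else 0
    return {
        j: heapq.nsmallest(k, range(n_target), key=lambda i: energy[i][j])
        for j in range(n_template)
    }


toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (10.0, 0.0, 0.0)]
toy_energy = [[3.0, 1.0], [2.0, 5.0], [1.0, 4.0]]
```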

2025 Mode: Enhanced Algorithm

  • Frozen Approximation: Selective detailed calculation (10% initially, 30% refinement)
  • SS-Specific Matrices: Differentiated substitution matrices for H/E/C
  • Iterative Refinement: Multi-pass optimization with early stopping
  • Identity Bonus: Enhanced scoring for identical residues
  • Extended Cutoffs: 10 Å interaction radius
  • DSSP Integration: Accurate secondary structure assignment
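One possible reading of "selective detailed calculation" is that a cheap approximate score ranks all candidate pairs and only the best fraction receives the expensive evaluation. The sketch below illustrates that reading with the 10% and 30% fractions quoted above; `select_for_detail` is a hypothetical helper, and the repository's actual selection rule may differ.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def select_for_detail(approx: Dict[Pair, float], fraction: float) -> List[Pair]:
    """Return the lowest-energy fraction of candidate pairs for detailed scoring."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_keep = max(1, round(fraction * len(approx)))
    ranked = sorted(approx, key=approx.get)  # lowest approximate energy first
    return ranked[:n_keep]


# Ten candidate (target, template) pairs with toy approximate energies.
toy_approx = {(i, 0): float(i) for i in range(10)}
initial_pass = select_for_detail(toy_approx, 0.10)
refinement_pass = select_for_detail(toy_approx, 0.30)
```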

Energy Function

1998 Mode:

Z(m,n,p,q) = Σ E_pair(r_i, r_j, d_pq) + E_solv(r_i, acc_p)

2025 Mode:

Z_approx(m,n,p,q) = α·E_local(r_m, r_p) + β·Σ E_simplified(r_k, d_k) + γ·E_ss_specific(r_i, ss_j)
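Read as code, the two expressions above are a plain sum and a weighted sum of precomputed terms. The sketch below shows only that arithmetic; the function names are hypothetical and the default weights of 1.0 for α, β, and γ are placeholders, since this README does not give their values.

```python
from typing import Iterable


def energy_1998(pair_terms: Iterable[float], solvation_term: float) -> float:
    """Sum of pairwise energies plus one solvation term (1998 form)."""
    return sum(pair_terms) + solvation_term


def energy_2025(e_local: float, simplified_terms: Iterable[float], e_ss: float,
                alpha: float = 1.0, beta: float = 1.0,
                gamma: float = 1.0) -> float:
    """Weighted sum of local, simplified pairwise, and SS-specific terms.

    The weights are placeholders; the actual values are not documented here.
    """
    return alpha * e_local + beta * sum(simplified_terms) + gamma * e_ss


e_old = energy_1998([-1.0, -0.5], 0.25)
e_new = energy_2025(1.0, [2.0, 3.0], 4.0, alpha=1.0, beta=0.5, gamma=2.0)
```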

Performance Optimization

GPU Acceleration

Automatic detection and utilization of:

  • CUDA (NVIDIA)
  • Metal (Apple Silicon)
  • OpenCL (Cross-platform)

Memory Management

  • Adaptive batch sizing
  • Memory-mapped operations
  • Efficient matrix operations

Batch Optimization

  • Reduced Z-score shuffles (50 vs 100)
  • Disabled 3D model generation
  • Parallel template processing

Validation and Quality Control

Z-score Analysis

Statistical significance testing using:

  • Random sequence shuffling
  • Empirical p-value calculation
  • Normal distribution fitting
  • One-sided and two-sided tests
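The shuffling procedure can be sketched as below, assuming that lower scores are better and that `score_fn` stands in for a full threading run on a shuffled sequence. Both function names are hypothetical, and the empirical p-value uses the common (count + 1) / (N + 1) estimator, which may not match the repository's choice.

```python
import random
import statistics
from typing import Callable, List, Tuple


def zscore_from_scores(native: float,
                       shuffled: List[float]) -> Tuple[float, float]:
    """Z-score of the native score and a one-sided empirical p-value."""
    mu = statistics.mean(shuffled)
    sd = statistics.stdev(shuffled)
    if sd == 0:
        raise ValueError("shuffled scores have zero variance")
    z = (native - mu) / sd
    # Fraction of shuffles scoring at least as well (as low) as the native.
    p = (sum(s <= native for s in shuffled) + 1) / (len(shuffled) + 1)
    return z, p


def shuffle_zscore(seq: str, score_fn: Callable[[str], float],
                   n_shuffles: int = 100, seed: int = 0) -> Tuple[float, float]:
    """Score the sequence, then score n_shuffles random permutations of it."""
    rng = random.Random(seed)
    native = score_fn(seq)
    shuffled = []
    for _ in range(n_shuffles):
        residues = list(seq)
        rng.shuffle(residues)  # preserves composition, destroys order
        shuffled.append(score_fn("".join(residues)))
    return zscore_from_scores(native, shuffled)


z, p = zscore_from_scores(-10.0, [-2.0, -4.0, -6.0])
```

Shuffling keeps the amino acid composition fixed, so the background distribution reflects residue order rather than composition.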

Interpretation Guidelines

  • Z < -4.0: Highly significant threading
  • -4.0 ≤ Z < -3.0: Significant threading
  • -3.0 ≤ Z < -2.0: Moderately significant
  • -2.0 ≤ Z < -1.0: Weakly significant
  • Z ≥ -1.0: Not significant
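These thresholds translate directly into a small lookup, shown here as a sketch with a hypothetical function name. Each test is checked from the strictest threshold down, so a value exactly on a boundary falls into the weaker category.

```python
def interpret_zscore(z: float) -> str:
    """Map a Z-score to the significance labels listed above."""
    if z < -4.0:
        return "highly significant"
    if z < -3.0:
        return "significant"
    if z < -2.0:
        return "moderately significant"
    if z < -1.0:
        return "weakly significant"
    return "not significant"


labels = [interpret_zscore(z) for z in (-4.5, -3.5, -2.5, -1.5, 0.0)]
```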

Quality Metrics

  • Sequence Coverage: Fraction of target residues aligned
  • Structure Coverage: Fraction of template positions used
  • Energy Decomposition: Pairwise, solvation, and SS-specific contributions
  • Secondary Structure Conservation: SS-specific alignment quality

Examples

Basic Threading Analysis

# Human myoglobin vs sperm whale myoglobin template
python main.py data/test_myoglobin.fasta data/1MBN.pdb data/dope.par \
  --out-dir myoglobin_analysis --mode 2025 --use-dssp

Comparative Study (1998 vs 2025)

# 1998 mode
python main.py data/target.fasta data/template.pdb data/dope.par \
  --out-dir results_1998 --mode 1998

# 2025 mode  
python main.py data/target.fasta data/template.pdb data/dope.par \
  --out-dir results_2025 --mode 2025

High-Throughput Batch Analysis

# Process multiple templates
python main.py data/target.fasta data/template_library/ data/dope.par \
  --out-dir batch_screening --mode 2025 --n-shuffles 30

Custom Optimization Settings

# Fine-tuned 2025 mode
python main.py data/target.fasta data/template.pdb data/dope.par \
  --out-dir custom_analysis --mode 2025 \
  --refine-iters 15 --refine-window 9 --n-shuffles 200

Troubleshooting

Common Issues

DSSP Not Found:

# Install DSSP
conda install -c conda-forge dssp
# or use geometric fallback
python main.py ... --mode 1998  # No DSSP by default

GPU Memory Issues:

# Force CPU mode
python main.py ... --compute-mode cpu

Large Batch Processing:

# Reduce memory usage
python main.py ... --n-shuffles 30 --no-refinement

Performance Tips

  • Use GPU for large sequences (>200 residues)
  • Enable frozen approximation for speed
  • Reduce Z-score shuffles for batch mode
  • Disable 3D models for high-throughput screening

Algorithm References

  • Original THREADER: Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992). A new approach to protein fold recognition. Nature 358, 86-89.
  • Secondary Structure Prediction (PSIPRED): Jones, D.T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology 292, 195-202.
  • DOPE Potentials: Shen, M.Y. and Sali, A. (2006). Statistical potential for assessment and prediction of protein structures. Protein Science 15, 2507-2524.

License

This implementation is provided for academic and research purposes. Please cite appropriate references when using this software in published work.

Contact & Support

For questions, bug reports, or feature requests, please refer to the project documentation or contact the development team.


Note: This is a research implementation designed for academic use. For production applications, consider additional validation and optimization based on your specific requirements.
