A high-performance implementation of the THREADER algorithm for protein structure prediction through threading, featuring both the original 1998 algorithm and modern 2025 optimizations.
THREADER is a computational method for protein structure prediction that "threads" a target sequence through known 3D structures (templates) to find compatible folds. This implementation provides:
- Faithful 1998 implementation: Strict reproduction of the original Jones et al. algorithm
- Enhanced 2025 version: Advanced optimizations including frozen approximation, SS-specific scoring, and iterative refinement
- Comprehensive analysis: Z-score validation, detailed reports, and 3D model generation
- High-performance computing: GPU acceleration and optimized batch processing
- Double Dynamic Programming: Low-level and high-level matrix computations
- DOPE Statistical Potentials: Distance-dependent pairwise and solvation energies
- Secondary Structure Integration: DSSP-based or geometric SS assignment
- Z-score Statistical Validation: Empirical significance testing
- Frozen Approximation: Selective detailed calculation for performance
- SS-specific Matrices: Differentiated scoring for helices, sheets, and coils
- Iterative Refinement: Multi-pass alignment optimization
- Enhanced Selection: Improved candidate pair identification
- Comprehensive Reports: Detailed text analysis with timing metrics
- Statistical Plots: Energy breakdowns and Z-score distributions
- 3D Model Generation: PDB format threaded structures
- Batch Processing: Multiple template comparative analysis
Metric | 1998 Mode | 2025 Mode | Improvement |
---|---|---|---|
Threading Time | 1m 17s | 160 ms | 483x faster |
Final Score | -12,984.875 | 22.471 | Favorable energy |
Sequence Coverage | 73.2% | 100% | +26.8% |
Z-score | 5.0 | 5.548 | +10.9% |
- Python 3.9+
- Conda/Mamba or Pixi package manager
# Install pixi if not available
curl -fsSL https://pixi.sh/install.sh | bash
# Clone the repository
git clone <repository-url>
cd DIABIRA_code
# Install all dependencies from pixi.toml
pixi install
# Activate environment
pixi shell
The pixi.toml file contains all necessary dependencies, including:
- Python scientific stack (NumPy, SciPy, Pandas)
- Bioinformatics tools (BioPython)
- Visualization libraries (Matplotlib, Seaborn)
- Optional GPU acceleration (CuPy)
- Development tools and linters
# Create environment
conda create -n threader python=3.9
conda activate threader
# Install dependencies
pip install numpy scipy pandas matplotlib seaborn biopython
# For CUDA support
pip install cupy-cuda11x # or cupy-cuda12x
# For Metal (macOS)
pip install numpy # Uses Accelerate framework
2025 Mode (Optimized):
python main.py target.fasta template.pdb dope.par --out-dir results --mode 2025
1998 Mode (Historical):
python main.py target.fasta template.pdb dope.par --out-dir results --mode 1998
python main.py target.fasta templates_dir/ dope.par --out-dir batch_results --mode 2025
python main.py <target_sequence> <template_structure> <dope_potentials> --out-dir <output_dir> [OPTIONS]
- `target_sequence`: Target protein sequence (FASTA format)
- `template_structure`: Template structure (PDB file) or directory for batch mode
- `dope_potentials`: DOPE statistical potentials file
- `--mode {1998,2025}`: Algorithm version (default: 2025)
  - 1998: Strict historical implementation
  - 2025: All optimizations enabled by default
- `--no-frozen`: Disable frozen approximation
- `--no-ss-specific`: Disable SS-specific matrices
- `--no-refinement`: Disable iterative refinement
- `--refine-iters N`: Refinement iterations (default: 10)
- `--refine-window N`: Refinement window size (default: 7)
- `--zscore-analysis`: Enable Z-score computation (default: on)
- `--n-shuffles N`: Number of random shuffles for Z-score (default: 100)
- `--no-zscore`: Disable statistical analysis
- `--output-format {txt,plot,pdb}`: Output formats (default: txt,plot)
- `--generate-model`: Generate 3D PDB model
- `--save-plots`: Save analysis plots
- `--model-gaps`: Include gaps in 3D model
- `--gpu`: Force GPU usage
- `--compute-mode {auto,cpu,gpu}`: Computing backend
- `--jobs N`: Number of parallel jobs
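To make the option layout concrete, here is a minimal `argparse` sketch that mirrors a subset of the flags documented above. It is an illustration only and not the parser shipped in `main.py`; the function name `build_parser` and the `auto` default for `--compute-mode` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering a subset of the documented options."""
    p = argparse.ArgumentParser(prog="main.py")
    # Positional inputs
    p.add_argument("target_sequence")
    p.add_argument("template_structure")
    p.add_argument("dope_potentials")
    p.add_argument("--out-dir", required=True)
    # Algorithm options (defaults taken from the option list above)
    p.add_argument("--mode", choices=["1998", "2025"], default="2025")
    p.add_argument("--no-frozen", action="store_true")
    p.add_argument("--no-refinement", action="store_true")
    p.add_argument("--refine-iters", type=int, default=10)
    p.add_argument("--refine-window", type=int, default=7)
    p.add_argument("--n-shuffles", type=int, default=100)
    # The "auto" default is an assumption; the README does not state one.
    p.add_argument("--compute-mode", choices=["auto", "cpu", "gpu"], default="auto")
    return p


args = build_parser().parse_args(
    ["target.fasta", "template.pdb", "dope.par",
     "--out-dir", "results", "--refine-iters", "15"]
)
print(args.mode, args.refine_iters, args.out_dir)
```

Dashes in flag names become underscores on the parsed namespace, so `--refine-iters` is read back as `args.refine_iters`.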
>Myoglobin_Human
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASED
LKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPG
DFGADAQGAMNKALELFRKDMASNYKELGFQG
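A FASTA record like the one above can wrap its sequence over several lines, so a reader has to join them. The sketch below does this with the standard library only; `read_fasta` is a hypothetical helper, not a function from this project (which lists BioPython among its dependencies and may well use it instead).

```python
def read_fasta(text: str) -> dict:
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]          # drop the leading ">"
            records[header] = ""
        elif header is not None:
            records[header] += line    # concatenate wrapped sequence lines
    return records


example = """>Myoglobin_Human
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASED
LKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPG
DFGADAQGAMNKALELFRKDMASNYKELGFQG
"""
records = read_fasta(example)
print(list(records), len(records["Myoglobin_Human"]))
```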
Standard Protein Data Bank format with coordinates and secondary structure information.
Tab-separated format with residue pairs, atom types, and distance-dependent energies:
ALA CA ALA CA -0.123 -0.098 -0.076 ...
ALA CA CYS CA -0.089 -0.067 -0.054 ...
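As a rough illustration of how records in this layout could be loaded, the sketch below splits each line into a four-field key (residue, atom, residue, atom) and a list of per-bin energies. The helper names are hypothetical, skipping `#` comment lines is an assumption, and the distance bin width is not stated here, so no distance lookup is attempted. The sample rows reuse only the three values visible in the example above.

```python
def parse_dope_line(line: str):
    """Split one record into a (res1, atom1, res2, atom2) key and its energies."""
    fields = line.split()              # works for tabs or runs of spaces
    key = tuple(fields[:4])
    energies = [float(x) for x in fields[4:]]
    return key, energies


def parse_dope(lines):
    """Build {(res1, atom1, res2, atom2): [energy per distance bin]}."""
    table = {}
    for line in lines:
        stripped = line.strip()
        # Skipping blank and "#" lines is an assumption about the file format.
        if not stripped or stripped.startswith("#"):
            continue
        key, energies = parse_dope_line(stripped)
        table[key] = energies
    return table


sample = [
    "ALA\tCA\tALA\tCA\t-0.123\t-0.098\t-0.076",
    "ALA\tCA\tCYS\tCA\t-0.089\t-0.067\t-0.054",
]
table = parse_dope(sample)
print(table[("ALA", "CA", "CYS", "CA")])
```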
results/
├── report_template.txt # Detailed analysis report
├── alignment_template_2025.tsv # Sequence-structure alignment
├── threading_plot_template.png # Energy and coverage analysis
├── zscore_distribution.png # Statistical significance plot
└── threaded_model_template_2025.pdb # 3D structure (if enabled)
batch_results/
├── target_vs_template1/ # Individual results
├── target_vs_template2/
├── ...
└── summary_batch_results.csv # Comparative analysis
- Gap Penalties: High penalties (7.0 for H/E, 1.5 for coils)
- Distance Cutoff: 7.5 Å contact definition
- Selection Method: Top-K per template residue (K=30)
- No Identity Bonus: Pure energy minimization
- SS Assignment: Geometric prediction only
- Frozen Approximation: Selective detailed calculation (10% initially, 30% refinement)
- SS-Specific Matrices: Differentiated substitution matrices for H/E/C
- Iterative Refinement: Multi-pass optimization with early stopping
- Identity Bonus: Enhanced scoring for identical residues
- Extended Cutoffs: 10 Å interaction radius
- DSSP Integration: Accurate secondary structure assignment
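Three of the parameters listed above lend themselves to a short sketch: the secondary-structure-dependent gap penalties, the Top-K selection per template residue, and the fraction-based cell selection behind the frozen approximation. The function names, the treatment of unknown SS states as coil, and the "lowest energy is best" ordering are assumptions made for illustration. This is not the project's actual code.

```python
import heapq

# Gap penalties quoted for 1998 mode: 7.0 inside helices (H) and strands (E),
# 1.5 in coils (C).
GAP_PENALTY = {"H": 7.0, "E": 7.0, "C": 1.5}


def gap_penalty(ss_state: str) -> float:
    """Penalty for opening a gap at a template position with the given SS state."""
    # Falling back to the coil penalty for unknown states is an assumption.
    return GAP_PENALTY.get(ss_state, GAP_PENALTY["C"])


def top_k_candidates(energies, k=30):
    """Indices of the k lowest energies for one template residue (K=30 above)."""
    return heapq.nsmallest(k, range(len(energies)), key=energies.__getitem__)


def select_for_detail(approx_scores: dict, fraction: float):
    """Pick the given fraction of cells with the lowest approximate energy.

    With fraction=0.10 this mimics the initial pass and with 0.30 the
    refinement pass; the remaining cells keep their cheap approximate score.
    """
    n_keep = max(1, round(fraction * len(approx_scores)))
    return sorted(approx_scores, key=approx_scores.get)[:n_keep]


cells = {(i, 0): float(i) for i in range(10)}
print(select_for_detail(cells, 0.10), len(select_for_detail(cells, 0.30)))
```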
1998 Mode:
Z(m,n,p,q) = Σ Epair(ri,rj,dpq) + Esolv(ri,accp)
2025 Mode:
Zapprox(m,n,p,q) = α·Elocal(rm,rp) + β·Σ Esimplified(rk,dk) + γ·Ess_specific(ri,ssj)
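The two scoring expressions above can be read as a sum of pairwise and solvation terms (1998 mode) and a weighted combination of three cheaper terms (2025 mode). The sketch below is one possible reading, not the project's implementation. The summation range, the use of the 7.5 Å cutoff inside the pair sum, the callable signatures, and the default weights of 1.0 are all assumptions.

```python
def energy_1998(residues, distances, accessibility,
                pair_energy, solv_energy, cutoff=7.5):
    """Sum solvation terms per residue plus pair terms for contacts within cutoff."""
    total = 0.0
    n = len(residues)
    for i in range(n):
        total += solv_energy(residues[i], accessibility[i])
        for j in range(i + 1, n):
            if distances[i][j] <= cutoff:
                total += pair_energy(residues[i], residues[j], distances[i][j])
    return total


def energy_2025_approx(e_local, e_simplified_terms, e_ss,
                       alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum: alpha * local + beta * sum(simplified) + gamma * SS-specific."""
    return alpha * e_local + beta * sum(e_simplified_terms) + gamma * e_ss


# Toy inputs: three residues, two contacts within 7.5 Å, constant pair energy.
dist = [[0.0, 5.0, 9.0],
        [5.0, 0.0, 6.0],
        [9.0, 6.0, 0.0]]
e98 = energy_1998("AGC", dist, [1.0, 0.0, 2.0],
                  pair_energy=lambda a, b, d: -1.0,
                  solv_energy=lambda r, acc: 0.5 * acc)
e25 = energy_2025_approx(-1.0, [-0.5, -0.25], -2.0, alpha=1.0, beta=2.0, gamma=0.5)
print(e98, e25)
```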
Automatic detection and utilization of:
- CUDA (NVIDIA)
- Metal (Apple Silicon)
- OpenCL (Cross-platform)
- Adaptive batch sizing
- Memory-mapped operations
- Efficient matrix operations
- Reduced Z-score shuffles (50 vs 100)
- Disabled 3D model generation
- Parallel template processing
Statistical significance testing using:
- Random sequence shuffling
- Empirical p-value calculation
- Normal distribution fitting
- One-sided and two-sided tests
- Z < -4.0: Highly significant threading
- Z < -3.0: Significant threading
- Z < -2.0: Moderately significant
- Z < -1.0: Weakly significant
- Z ≥ -1.0: Not significant
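A shuffle-based Z-score and the thresholds above can be sketched as follows. The target sequence is shuffled, each shuffle is rescored, and the native score is compared with the resulting null distribution. Lower energy is taken to be better, so a strongly negative Z is significant. The function names, the one-sided empirical p-value with a +1 correction, and the toy scoring function are assumptions for illustration only.

```python
import random
import statistics


def shuffle_zscore(sequence, score_fn, n_shuffles=100, seed=0):
    """Return (z, empirical one-sided p) of the native score vs. shuffled sequences."""
    rng = random.Random(seed)
    native = score_fn(sequence)
    residues = list(sequence)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)                      # same composition, random order
        null_scores.append(score_fn("".join(residues)))
    mean = statistics.fmean(null_scores)
    sd = statistics.stdev(null_scores)
    z = (native - mean) / sd if sd > 0 else 0.0
    # Fraction of shuffles scoring at least as low as the native sequence.
    p_emp = (1 + sum(s <= native for s in null_scores)) / (n_shuffles + 1)
    return z, p_emp


def interpret_zscore(z: float) -> str:
    """Map a Z-score onto the significance bands listed above."""
    if z < -4.0:
        return "highly significant"
    if z < -3.0:
        return "significant"
    if z < -2.0:
        return "moderately significant"
    if z < -1.0:
        return "weakly significant"
    return "not significant"


# Toy score: rewards alanines in the first five positions (purely illustrative).
toy_score = lambda seq: -float(sum(c == "A" for c in seq[:5]))
z, p = shuffle_zscore("AAAAA" + "G" * 15, toy_score, n_shuffles=100)
print(round(z, 2), round(p, 3), interpret_zscore(z))
```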
- Sequence Coverage: Fraction of target residues aligned
- Structure Coverage: Fraction of template positions used
- Energy Decomposition: Pairwise, solvation, and SS-specific contributions
- Secondary Structure Conservation: SS-specific alignment quality
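The two coverage metrics can be computed directly from an alignment. The sketch below assumes the alignment is available as a list of `(target_index, template_index)` pairs; that representation and the function name are hypothetical, and the project's TSV output may be organized differently.

```python
def coverage(aligned_pairs, target_len, template_len):
    """Return (sequence coverage, structure coverage) as fractions in [0, 1]."""
    target_hit = {t for t, _ in aligned_pairs}      # target residues that are aligned
    template_hit = {s for _, s in aligned_pairs}    # template positions that are used
    return len(target_hit) / target_len, len(template_hit) / template_len


# Three aligned pairs, a 4-residue target, and a 6-position template.
seq_cov, struct_cov = coverage([(0, 0), (1, 1), (2, 3)], target_len=4, template_len=6)
print(seq_cov, struct_cov)
```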
# Human myoglobin vs sperm whale myoglobin template
python main.py data/test_myoglobin.fasta data/1MBN.pdb data/dope.par \
--out-dir myoglobin_analysis --mode 2025 --use-dssp
# 1998 mode
python main.py data/target.fasta data/template.pdb data/dope.par \
--out-dir results_1998 --mode 1998
# 2025 mode
python main.py data/target.fasta data/template.pdb data/dope.par \
--out-dir results_2025 --mode 2025
# Process multiple templates
python main.py data/target.fasta data/template_library/ data/dope.par \
--out-dir batch_screening --mode 2025 --n-shuffles 30
# Fine-tuned 2025 mode
python main.py data/target.fasta data/template.pdb data/dope.par \
--out-dir custom_analysis --mode 2025 \
--refine-iters 15 --refine-window 9 --n-shuffles 200
DSSP Not Found:
# Install DSSP
conda install -c conda-forge dssp
# or use geometric fallback
python main.py ... --mode 1998 # No DSSP by default
GPU Memory Issues:
# Force CPU mode
python main.py ... --compute-mode cpu
Large Batch Processing:
# Reduce memory usage
python main.py ... --n-shuffles 30 --no-refinement
- Use GPU for large sequences (>200 residues)
- Enable frozen approximation for speed
- Reduce Z-score shuffles for batch mode
- Disable 3D models for high-throughput screening
- Original THREADER: Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992). A new approach to protein fold recognition. Nature 358, 86-89.
- Double Dynamic Programming: Jones, D.T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology 292, 195-202.
- DOPE Potentials: Shen, M.Y. and Sali, A. (2006). Statistical potential for assessment and prediction of protein structures. Protein Science 15, 2507-2524.
This implementation is provided for academic and research purposes. Please cite appropriate references when using this software in published work.
For questions, bug reports, or feature requests, please refer to the project documentation or contact the development team.
Note: This is a research implementation designed for academic use. For production applications, consider additional validation and optimization based on your specific requirements.