Kalign

Kalign is a fast multiple sequence alignment program for biological sequences written in C with Python bindings.

🚀 Key Features

  • 🔥 High Performance: Fast multiple sequence alignment with multi-threading support
  • ⚡ Smart Threading: Auto-detects CPU cores and uses N-1 threads by default (max 16) for optimal performance
  • 🔧 Cross-Platform: Works on Linux and macOS with multiple build systems (CMake, Zig)
  • 📊 Multiple Formats: FASTA, MSF, Clustal, Stockholm, PHYLIP support
  • 🧬 Sequence Types: Optimized for protein, DNA, RNA, and divergent sequences
  • ⚡ SIMD Optimizations: Vectorized code for x86_64 systems (SSE4.1, AVX, AVX2)
  • 🐍 Python Integration: Modern Python package with comprehensive bioinformatics ecosystem support

Installation

From Source (Primary)

Prerequisites

  • C compiler (GCC, Clang, or MSVC)
  • CMake (3.18 or higher)
  • OpenMP (optional, for parallelization)

Basic Build

# Download and extract latest release
tar -zxvf kalign-<version>.tar.gz
cd kalign-<version>

# Build
mkdir build && cd build
cmake ..
make
make test
make install

macOS with Homebrew

On macOS, install dependencies first:

# Install dependencies
brew install cmake

# For OpenMP support (recommended)
brew install libomp

# Clone and build
git clone https://github.com/TimoLassmann/kalign.git
cd kalign
mkdir build && cd build
cmake ..
make
make test
make install

Note: On macOS, Kalign automatically configures OpenMP with Homebrew's libomp installation at /opt/homebrew/opt/libomp/.

Alternative Build Systems

Zig Build (for cross-compilation):

zig build

Debug Build:

cmake -DCMAKE_BUILD_TYPE=Debug ..
make

Without OpenMP:

cmake -DUSE_OPENMP=OFF ..
make

Python Package

For development or latest features, install from source:

pip install git+https://github.com/TimoLassmann/kalign.git

For enhanced bioinformatics ecosystem integration:

pip install "kalign[biopython] @ git+https://github.com/TimoLassmann/kalign.git"    # + Biopython integration
pip install "kalign[skbio] @ git+https://github.com/TimoLassmann/kalign.git"        # + scikit-bio integration  
pip install "kalign[all] @ git+https://github.com/TimoLassmann/kalign.git"          # Full ecosystem support

Usage

Command Line Interface

Usage: kalign  -i <seq file> -o <out aln> 

Options:

   --format           : Output format. [Fasta]
   --type             : Alignment type (rna, dna, internal). [rna]
                        Options: protein, divergent (protein) 
                                 rna, dna, internal (nuc). 
   --gpo              : Gap open penalty. []
   --gpe              : Gap extension penalty. []
   --tgpe             : Terminal gap extension penalty. []
   -n/--nthreads      : Number of threads. [auto: N-1, max 16]
   --version (-V/-v)  : Prints version. [NA]

Threading Behavior

New in this version: Kalign automatically detects your system's CPU cores and uses N-1 threads by default (leaving one core free), with a maximum of 16 threads. This provides good performance out-of-the-box while maintaining system responsiveness.

  • Auto-detection: Uses CPU cores - 1 (e.g., 15 threads on a 16-core system)
  • Maximum cap: Never uses more than 16 threads
  • Manual override: Use -n/--nthreads to specify a custom thread count
  • Single-threaded: Use -n 1 to disable parallelization
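The default described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (cores minus one, capped at 16), not Kalign's actual C code; the function name `default_threads` and the floor of one thread on single-core machines are assumptions.

```python
import os


def default_threads(cores=None, cap=16):
    """Return the thread count implied by the N-1 rule with a cap.

    `cores` defaults to the number of CPUs reported by the OS.
    The floor of 1 (for single-core machines) is an assumption.
    """
    if cores is None:
        cores = os.cpu_count() or 1
    return max(1, min(cores - 1, cap))


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        print(f"{n} cores -> {default_threads(n)} threads")
```

On a 16-core machine this gives 15 threads, and on a 64-core machine the cap limits it to 16.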

Input Formats

Kalign accepts:

  • Unaligned sequences: FASTA format
  • Pre-aligned sequences: FASTA, MSF, or Clustal format (gaps will be removed and sequences re-aligned)
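To make the re-alignment behaviour concrete, the sketch below shows what "gaps will be removed" amounts to for an aligned FASTA file. It is a stand-alone illustration, not Kalign's parser; the helper names `parse_fasta` and `strip_gaps`, and the choice of `-` and `.` as gap characters, are assumptions.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def strip_gaps(records, gap_chars="-."):
    """Remove gap characters so the sequences are unaligned again."""
    table = str.maketrans("", "", gap_chars)
    return [(name, seq.translate(table)) for name, seq in records]


if __name__ == "__main__":
    aligned = ">s1\nATCG-ATCG\n>s2\nAT--GATCG\n"
    for name, seq in strip_gaps(parse_fasta(aligned)):
        print(name, seq)
```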

Sequence Types

Kalign automatically detects sequence types but offers manual control via --type:

  • protein: Uses CorBLOSUM66_13plus substitution matrix (default for protein)
  • divergent: Uses Gonnet 250 substitution matrix for highly divergent proteins
  • dna: DNA parameters (match: +5, mismatch: -4, gap open: -8, gap ext: -6)
  • rna: Optimized parameters for RNA alignments
  • internal: Like DNA but encourages internal gaps (terminal gap penalty: 8)

Fine-tune with --gpo (gap open), --gpe (gap extension), and --tgpe (terminal gap extension).
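As a rough illustration of how the DNA parameters listed above combine, the following sketch scores two already-aligned rows. It is not Kalign's scoring code. The function name `score_aligned_pair` is hypothetical, and the gap convention is an assumption: the first column of a gap run is charged the gap open penalty and each further column the gap extension penalty, with no special treatment of terminal gaps.

```python
def score_aligned_pair(row_a, row_b, match=5, mismatch=-4,
                       gap_open=-8, gap_ext=-6):
    """Score two aligned rows of equal length (assumed gap convention)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    prev_gap = None  # which row carried a gap in the previous scored column
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # a column of two gaps carries no information
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            # Continuing a gap in the same row costs gap_ext, otherwise gap_open.
            score += gap_ext if prev_gap == side else gap_open
            prev_gap = side
        else:
            score += match if x == y else mismatch
            prev_gap = None
    return score


if __name__ == "__main__":
    # 11 matches (+55) and one single-column gap (-8)
    print(score_aligned_pair("ATCGATCGATCG", "ATCG-TCGATCG"))
```

Passing different values for `gap_open` and `gap_ext` mimics the effect of tuning `--gpo` and `--gpe` under this assumed convention.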

Python API

import kalign

# Align DNA sequences
sequences = [
    "ATCGATCGATCG",
    "ATCGTCGATCG", 
    "ATCGATCATCG"
]

aligned = kalign.align(sequences, seq_type="dna")
for seq in aligned:
    print(seq)

For comprehensive Python documentation, see README-python.md and the python-docs directory.
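A common next step is writing the aligned rows to a FASTA file. The sketch below is a minimal helper for that, assuming that `kalign.align` returns one gapped string per input sequence in input order (an assumption based on the loop in the example above). The helper `to_fasta` and the sequence names are hypothetical, and the aligned rows are hard-coded so the snippet runs without Kalign installed.

```python
def to_fasta(names, aligned, width=60):
    """Format aligned rows as FASTA text, wrapping lines at `width` columns."""
    if len(names) != len(aligned):
        raise ValueError("need exactly one name per aligned sequence")
    if len({len(row) for row in aligned}) > 1:
        raise ValueError("aligned rows must all have the same length")
    lines = []
    for name, row in zip(names, aligned):
        lines.append(f">{name}")
        # Slice manually: textwrap would treat '-' gap characters as hyphens.
        lines.extend(row[i:i + width] for i in range(0, len(row), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Stand-in for the output of kalign.align(sequences, seq_type="dna")
    aligned = ["ATCGATCGATCG", "ATCG-TCGATCG", "ATCGATC-ATCG"]
    print(to_fasta(["seq1", "seq2", "seq3"], aligned), end="")
```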

Examples

Basic Usage

Pass sequences via stdin:

cat input.fa | kalign -f fasta > out.afa

Combine multiple input files:

kalign seqsA.fa seqsB.fa seqsC.fa -f fasta > combined.afa

Use optimal threading (auto-detected):

kalign -i sequences.fa -o aligned.afa  # Uses N-1 threads automatically

Custom threading:

kalign -i sequences.fa -o aligned.afa -n 8  # Use exactly 8 threads
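When calling the command-line tool from a script, it can help to build the argument list in one place. The sketch below assembles a command from the options listed in this README (`-i`, `-o`, `--format`, `--type`, `-n`). The function name `kalign_command` is hypothetical, and the snippet assumes a `kalign` executable is on your `PATH` if you actually run the command.

```python
import shlex


def kalign_command(infile, outfile, fmt=None, seq_type=None, threads=None):
    """Build an argument list for the kalign CLI (not executed here)."""
    cmd = ["kalign", "-i", str(infile), "-o", str(outfile)]
    if fmt is not None:
        cmd += ["--format", fmt]
    if seq_type is not None:
        cmd += ["--type", seq_type]
    if threads is not None:
        if threads < 1:
            raise ValueError("threads must be at least 1")
        cmd += ["-n", str(threads)]
    return cmd


if __name__ == "__main__":
    cmd = kalign_command("sequences.fa", "aligned.afa", threads=8)
    print(shlex.join(cmd))
    # To run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` rather than a shell string avoids quoting problems with file names that contain spaces.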

Format Conversion

MSF format:

kalign -i BB11001.tfa -f msf -o out.msf

Clustal format:

kalign -i BB11001.tfa -f clu -o out.clu

Re-align existing alignment:

kalign -i BB11001.msf -o out.afa

Library Integration

CMake Integration

Link Kalign into your C/C++ projects:

find_package(kalign)
target_link_libraries(<target> kalign::kalign)

Direct inclusion:

if (NOT TARGET kalign)
  add_subdirectory(<path_to_kalign>/kalign EXCLUDE_FROM_ALL)
endif ()
target_link_libraries(<target> kalign::kalign)

Python Module Development

Local development:

uv pip install -e .

Build Python module with CMake:

mkdir build && cd build
cmake -DBUILD_PYTHON_MODULE=ON ..
make

Performance

Benchmark Results

Kalign performs well in both speed and accuracy on the Balibase and Bralibase benchmarks:

Balibase

[Figure: Balibase scores]

Bralibase

[Figure: Bralibase scores]

Performance Features

  • Multi-threading: Automatic CPU core detection with OpenMP parallelization
  • SIMD optimizations: Vectorized algorithms on x86_64 systems (SSE4.1, AVX, AVX2)
  • Bit-parallel algorithms: Myers' algorithm for efficient alignment
  • Memory optimization: Custom allocation strategies for large datasets
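For readers unfamiliar with bit-parallel methods, here is a compact Python sketch of Myers' bit-vector algorithm for edit distance, in the global-distance formulation commonly attributed to Hyyrö. It is an illustration of the general technique only; how Kalign uses the algorithm internally is not shown here, and the function name `myers_edit_distance` is hypothetical. Python's arbitrary-precision integers stand in for machine words.

```python
def myers_edit_distance(pattern, text):
    """Levenshtein distance via Myers' bit-vector algorithm."""
    m = len(pattern)
    if m == 0:
        return len(text)
    # peq[c] has bit i set when pattern[i] == c.
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)
    mask = (1 << m) - 1
    high = 1 << (m - 1)
    pv, mv = mask, 0  # vertical +1 / -1 deltas of the current column
    score = m         # distance between pattern and the empty prefix
    for c in text:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = (mv | ~(xh | pv)) & mask  # horizontal +1 deltas
        mh = pv & xh                   # horizontal -1 deltas
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        # Shift in +1 for row 0 (global alignment boundary).
        ph = ((ph << 1) | 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
    return score


if __name__ == "__main__":
    print(myers_edit_distance("ATCGATCGATCG", "ATCGTCGATCG"))
```

Each text character is processed with a constant number of word operations, which is why the approach is fast when the pattern fits in one or a few machine words.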

Performance Tips

  • Let auto-threading work: The default N-1 threading usually provides good performance
  • Gap-rich nucleotide data: Consider using --type internal, which encourages internal gaps, for sequences with many gaps
  • Memory: For very large alignments, monitor memory usage and consider reducing thread count
  • x86_64 systems: SIMD optimizations provide additional speedup on Intel/AMD processors

Contributing

We welcome contributions! See our Contributing Guide for details on:

  • Reporting bugs and requesting features
  • Development environment setup
  • Code style guidelines
  • Pull request process

Community Standards

This project follows the Contributor Covenant Code of Conduct. By participating, you agree to uphold this code.

System Requirements

  • Linux: GCC 4.8+ or Clang 3.4+
  • macOS: Xcode 8+ or Homebrew GCC/Clang
  • Memory: ~1GB RAM per 10,000 sequences (typical)
  • CPU: Any modern processor (additional optimizations on x86_64)

Troubleshooting

Common Issues

macOS OpenMP: If you see OpenMP-related errors on macOS:

brew install libomp
# Kalign automatically finds Homebrew's OpenMP installation

Python module: For Python installation issues:

pip install --upgrade pip setuptools wheel
pip install git+https://github.com/TimoLassmann/kalign.git

Threading: If performance seems slow, check thread detection:

kalign --help  # Shows current thread default
kalign -i test.fa -n 1 -o out.fa  # Force single-threaded for testing

For more troubleshooting, see python-docs/python-troubleshooting.md.

Citation

Please cite Kalign in your publications:

  1. Lassmann, Timo. Kalign 3: multiple sequence alignment of large data sets. Bioinformatics (2019).

Previous Versions

  1. Lassmann, Timo, Oliver Frings, and Erik LL Sonnhammer. Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features. Nucleic acids research 37.3 (2008): 858-865.

  2. Lassmann, Timo, and Erik LL Sonnhammer. Kalign: an accurate and fast multiple sequence alignment algorithm. BMC bioinformatics 6.1 (2005): 298.

License

Kalign is licensed under the GNU General Public License v3.0. See COPYING for details.
