Skip to content
baku edited this page Jul 29, 2024 · 3 revisions

Requirements to Build

  • rustc >= 1.65.0
  • cmake: (for all build targets except WASM) to boost the indexing of Reference.

Supported Alignment Algorithms

Default Algorithms

SigAlign offers two primary alignment algorithms: "semi-global" and "local".

In the semi-global algorithm, at least one of the sequences (either the query or the target) is completely consumed at each alignment end. Examples include:

  • Case 1
    QUERY : -------------
                |||||||||
    TARGET:     -------------
    
  • Case 2
    QUERY :     -------------
                |||||||||
    TARGET: -------------
    
  • Case 3
    QUERY : -------------
              |||||||
    TARGET:   -------
    
  • Case 4
    QUERY :   -------
              |||||||
    TARGET: -------------
    

In the local algorithm, the alignment may include only parts of the query and target sequence. For example:

  • Case 1
    QUERY : ----------------
                  ||||||
    TARGET:    ----------------
    

Variations of Default Algorithms

The sigalign crate also provides two additional variants of the alignment algorithms:

  • Limit: The algorithm stops once a specified number of alignments meeting the cutoff are produced. This process is greedy and does not guarantee that the alignments included are more optimal (in terms of smaller penalties or longer lengths) than those not included.
  • Chunk: The query is divided into multiple chunks for alignment. You can specify the size of the chunks and the sliding size.

Your revised version is clear and mostly free of grammatical errors. Here are a few minor corrections to improve clarity and correctness:

Definition of Alignment Results in SigAlign

SigAlign aims to provide non-heuristic results based on two cutoffs (MinL, MaxP), reproducing the results of the Dynamic Programming (DP) method, considered the computational gold standard in alignment algorithms.

Understanding SigAlign's Results Using DP Method

Semi-global Algorithm

SigAlign's semi-global algorithm results include the results generated by the following DP method:

  1. Initiate the DP matrix.
  2. Fill the DP matrix.
  3. Extract all semi-global alignments from the matrix.
  4. Output all alignments that do not overlap with more optimal alignments and meet the cutoffs.

The DP matrix specification is as follows:

  • All margin (clipping) penalties are set to 0 during initialization (similar to the Smith-Waterman algorithm).
  • When filling the matrix, operations with the same penalty are prioritized in the following order: 1) Match, 2) Insertion, 3) Deletion, 4) Substitution.

The more "optimal" alignment is defined as:

  • Lesser penalty.
  • If penalties are equal, the larger the last query index of alignment (reaching further in the matrix).

"Overlap" is determined by any connected (match or substitution) position pairs between the target and query sequences. For example, in alignment result A, if the i-th base pair of the query and the j-th base pair of the target are connected, another alignment B sharing the same connected position pair (i, j) is considered overlapping with A.

Local Algorithm

SigAlign's local algorithm includes results of the following DP method:

  1. For all substrings of the query, perform the semi-global algorithm as described above.
  2. Output alignments that meet the cutoffs and do not overlap with more optimal alignments.

The "optimal" alignment in the local algorithm is defined as:

  • Longer substring.
  • If substrings are of the same length, the smaller starting index in the query.
  • Subsequent criteria are the same as in the semi-global algorithm.

The definition of overlap is the same as in the semi-global algorithm.

Notification on Rare Cases

Note that SigAlign guarantees to include all alignments generated by the DP method. This means that, in most cases, SigAlign's results are identical to those of the DP method. However, in scenarios with numerous alignment results, such as repetitive sequences, SigAlign may produce more results than the DP method.

Additionally, since certain operations must be prioritized over others (in SigAlign, the order is Match > Insertion > Deletion > Substitution), SigAlign's results can be asymmetric with respect to the direction of the query and target. For example, the alignment results of A as the query and B as the target may differ from B as the query and A as the target.

(Additional specific examples will be included)

Multi-crates Structure

SigAlign's algorithm is designed to be flexible and agnostic to sequence storage and k-mer indexing methods. Consequently, SigAlign's functionality is distributed across several crates to ensure maximum adaptability and usability based on specific needs.

The core crates are:

  • sigalign-core: Provides core algorithms.
  • sigalign-impl: Offers implementations for core traits (PatternIndex and SequenceStorage).
  • sigalign-utils: Provides utilities (e.g., FASTA/FASTQ readers, decompressing tools).

sigalign acts as a wrapper for these crates, defining basic structs and providing a convenient interface. For example, sigalign-core's Reference is defined as struct Reference<I: PatternIndex, S: SequenceStorage>, whereas sigalign defines it as struct Reference without generics, using Reference<DynamicLfi, InMemoryStorage> internally but hiding this complexity.

Generally, sigalign suffices for most uses, but custom implementations (e.g., network-based SequenceStorage) may require using sigalign-core.

Dependency structure among crates:

graph LR
    A[sigalign-core]
    B[sigalign-impl]
    C[sigalign-utils]
    D[sigalign]

    B --> A
    B --> C
    D --> A
    D --> B
    D --> C
Loading

Version Comparability

  • sigalign: 0.4

    • sigalign-core: 0.3
    • sigalign-impl: 0.3
    • sigalign-utils: 0.3
    • sigalign-py: 0.3
  • sigalign: 0.3

    • sigalign-core: 0.2
    • sigalign-impl: 0.2
    • sigalign-utils: 0.2
    • sigalign-py: 0.2