5 Optimizing sensitivity and resource usage

Andrzej Zielezinski edited this page Oct 9, 2024 · 20 revisions

To reduce memory consumption or improve processing speed, you can use the following three methods, either individually or in combination:

2.1.1. Process genomes in smaller batches

Vclust can lower memory usage by processing the genome dataset in smaller, equally sized batches. This approach may slightly increase runtime, but it significantly reduces memory consumption:

# Process genomes in batches of 5 million sequences.
./vclust.py prefilter -i genomes.fna -o fltr.txt --batch-size 5000000

2.1.2. Analyze only a fraction of k-mers

To speed up filtering, Vclust can analyze only a fraction of the k-mers in each genome. The --kmers-fraction option sets the proportion (between 0 and 1) of k-mers used in comparisons:

# Process genomes in batches and analyze 10% of k-mers in each genome sequence.
./vclust.py prefilter -i genomes.fna -o fltr.txt --batch-size 5000000 --kmers-fraction 0.1
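The idea of comparing genomes on a fraction of their k-mers can be illustrated with a small sketch. This is not Vclust's implementation: the function names and the hashing scheme (BLAKE2b, keeping k-mers whose hash falls in the lowest part of the hash space) are assumptions chosen for illustration. The point it shows is that a hash-based rule makes the same decision for a given k-mer in every genome, so shared k-mers stay comparable after sampling.

```python
import hashlib


def kmers(seq, k):
    """Yield every k-mer of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def keep_kmer(kmer, fraction):
    """Return True if the k-mer's 64-bit hash lies in the lowest `fraction` of the hash space.

    Hypothetical selection rule, used here only to illustrate sampling.
    """
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") < fraction * 2**64


def sampled_kmers(seq, k, fraction):
    """Return the set of k-mers in seq that pass the sampling rule."""
    return {km for km in kmers(seq, k) if keep_kmer(km, fraction)}


genome_a = "ACGTACGTTGCAAGCTTAGGCTA"
genome_b = "TTGCAAGCTTAGGCTAACGTACG"

# Each genome is reduced independently, yet any k-mer present in both
# genomes is either kept in both or dropped in both.
a_sample = sampled_kmers(genome_a, 5, 0.5)
b_sample = sampled_kmers(genome_b, 5, 0.5)
shared = a_sample & b_sample
```

With a fraction of 1.0 every k-mer is kept, and with 0.0 none are. Lower values reduce the number of k-mers that have to be stored and compared.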

2.1.3. Limit the number of target sequences per query

For highly redundant datasets (e.g., hundreds of thousands of nearly identical genomes), the prefilter step may still pass a large number of genome pairs, increasing both memory usage and runtime. The --max-seqs option limits the number of target sequences reported for each query genome, reducing the overall number of genome pairs passed to alignment. For each query, --max-seqs returns up to n target sequences that have passed the --min-kmers and --min-ident filters and have the highest sequence identity to the query sequence. For example, in a dataset containing 1 million nearly identical genomes, the total number of possible genome pairs is almost 500 billion. Setting --max-seqs 1000 reduces this to at most 1 billion genome pairs, significantly decreasing memory usage and runtime.

# Limit the number of target sequences to 1000 per query genome.
./vclust.py prefilter -i genomes.fna -o fltr.txt --batch-size 100000 --max-seqs 1000
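The pair counts quoted for the 1-million-genome example can be checked with a few lines of arithmetic. The function names below are hypothetical and are not part of Vclust. The capped figure is an upper bound, since it assumes every query reaches the --max-seqs limit.

```python
def all_pairs(n_genomes):
    """Number of unordered genome pairs among n_genomes: n * (n - 1) / 2."""
    return n_genomes * (n_genomes - 1) // 2


def capped_pairs(n_genomes, max_seqs):
    """Upper bound on reported pairs when each query keeps at most max_seqs targets."""
    return n_genomes * min(max_seqs, n_genomes - 1)


n = 1_000_000
print(f"{all_pairs(n):,}")           # 499,999,500,000 (almost 500 billion)
print(f"{capped_pairs(n, 1000):,}")  # 1,000,000,000 (1 billion)
```

In this example the cap reduces the number of genome pairs by a factor of about 500.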