5 Optimizing sensitivity and resource usage

Andrzej Zielezinski edited this page Oct 9, 2024 · 20 revisions

This section outlines how to manage Vclust's run time, memory, and disk space usage while maintaining high sensitivity. These considerations are essential if your dataset contains millions of sequences or has a high level of redundancy.

5.1. Prefilter

The prefilter command can consume substantial memory, runtime, and disk space if its parameters are not set appropriately.

5.1.1. Process genomes in smaller batches

Vclust can reduce memory usage by processing genome datasets in smaller, equally sized batches. Although this may slightly increase runtime, it significantly reduces memory requirements without impacting sensitivity. For instance, processing 15.5 million IMG/VR contigs in batches of 2 million sequences requires 246 GB of RAM, compared to 1 TB of RAM when the same dataset is processed in a single batch. Batching adds only 30 minutes to the runtime.

# Process genomes in batches of 2 million sequences.
./vclust.py prefilter -i genomes.fna -o fltr.txt --batch-size 2000000
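
To pick a batch size it helps to know how many batches it produces. The sketch below is plain arithmetic on the figures quoted above; `num_batches` is a hypothetical helper written for this illustration and is not part of Vclust, and it does not model how Vclust divides sequences between batches internally.

```python
import math


def num_batches(n_sequences: int, batch_size: int) -> int:
    """Return how many batches are needed to cover n_sequences.

    Hypothetical helper for illustration only; not a Vclust function.
    """
    if n_sequences < 0 or batch_size <= 0:
        raise ValueError("n_sequences must be >= 0 and batch_size must be > 0")
    return math.ceil(n_sequences / batch_size)


# Figures from the IMG/VR example: 15.5 million contigs, --batch-size 2000000.
print(num_batches(15_500_000, 2_000_000))  # 8
```

A smaller `--batch-size` means more batches, so memory use should fall while runtime may rise slightly, as described above.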

5.1.2. Analyze only a fraction of k-mers

By default, the prefilter command analyzes all k-mers for each genome, but you can limit this to a fraction to significantly reduce memory usage and runtime. Reducing the k-mer fraction has minimal impact on sensitivity, as results from a subset of k-mers are generally comparable to those from the full set. For example, analyzing 20% of the k-mers in a large dataset recalled nearly all genome pairs, with fewer than 100 missed pairs and false positives, while reducing memory and runtime nearly five-fold. This option does not affect alignment-based ANI calculations, as alignments are performed on full genome sequences.

The --kmers-fraction option controls the proportion [0-1] of k-mers used in comparisons:

# Process genomes in batches and analyze 10% of k-mers in each genome sequence.
./vclust.py prefilter -i genomes.fna -o fltr.txt --batch-size 2000000 --kmers-fraction 0.1
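
One way to see why a k-mer subset can give comparable results is deterministic, hash-based subsampling: each k-mer is kept or dropped based only on its own hash, so a k-mer shared by two genomes is kept in both or in neither. The sketch below illustrates that general idea only. It is an assumption made for illustration, not a description of how Vclust implements --kmers-fraction, and the names `keep_kmer` and `sampled_kmers` are hypothetical.

```python
import hashlib


def keep_kmer(kmer: str, fraction: float) -> bool:
    """Keep a k-mer if its 64-bit hash falls in the lowest `fraction` of hash space.

    Illustrative assumption only; not Vclust's implementation.
    """
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big") < fraction * 2**64


def sampled_kmers(seq: str, k: int, fraction: float) -> set[str]:
    """Return the set of distinct k-mers of seq that survive subsampling."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return {kmer for kmer in kmers if keep_kmer(kmer, fraction)}


a = "ACGTACGTTGCAACGTTAGC"
b = "TTACGTACGTTGCAAGGTTA"

# The decision depends only on the k-mer itself, so the shared sampled
# k-mers are exactly the sampled subset of the k-mers shared in full.
shared_full = sampled_kmers(a, 5, 1.0) & sampled_kmers(b, 5, 1.0)
shared_sampled = sampled_kmers(a, 5, 0.5) & sampled_kmers(b, 5, 0.5)
print(shared_sampled <= shared_full)  # True
```

Under this scheme, the number of shared k-mers drops roughly in proportion to the fraction for both the shared and the total counts, which is why similarity estimates from the subset tend to track those from the full set.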

5.1.3. Limit the number of target sequences per query

For highly redundant datasets (e.g., hundreds of thousands of nearly identical genomes), the prefilter step may still pass a large number of genome pairs, increasing both memory usage and runtime. The --max-seqs option limits the number of target sequences reported for each query genome, reducing the overall number of genome pairs passed to alignment. For each query, --max-seqs returns up to n target sequences that pass the --min-kmers and --min-ident filters and have the highest sequence identity to the query sequence. For example, in a dataset containing 1 million nearly identical genomes, the total number of possible genome pairs is almost 500 billion. Setting --max-seqs 1000 reduces this to at most 1 billion genome pairs, significantly decreasing memory usage and runtime.

# Limit the number of target sequences to 1000 per query genome.
./vclust.py prefilter -i genomes.fna -o fltr.txt --batch-size 100000 --max-seqs 1000
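
The pair counts quoted above follow from simple arithmetic, worked through in the sketch below. The helper names `all_pairs` and `capped_pairs` are hypothetical and exist only for this illustration; the capped figure is an upper bound that assumes every query reaches the --max-seqs limit.

```python
def all_pairs(n_genomes: int) -> int:
    """Number of unordered genome pairs among n_genomes: n * (n - 1) / 2."""
    return n_genomes * (n_genomes - 1) // 2


def capped_pairs(n_genomes: int, max_seqs: int) -> int:
    """Upper bound on reported pairs when each query keeps at most max_seqs targets."""
    return n_genomes * min(max_seqs, max(n_genomes - 1, 0))


n = 1_000_000
print(all_pairs(n))           # 499999500000 (almost 500 billion)
print(capped_pairs(n, 1000))  # 1000000000 (1 billion)
```

In this worst case the cap cuts the number of pairs by a factor of about 500.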