5 Optimizing sensitivity and resource usage
This section explains how to maintain reasonable run time, memory usage, and disk space while achieving the highest possible sensitivity with Vclust. These considerations are essential if your dataset contains millions of sequences or has a high level of redundancy.
The prefilter command can consume substantial memory, runtime, and disk space if its parameters are not set appropriately.
Vclust can lower memory usage by processing the genome dataset in smaller, equally sized batches. Batching may slightly increase runtime, but it significantly reduces memory consumption without affecting sensitivity. For example, processing 15.5 million IMG/VR contigs in batches of 2 million sequences requires 246 GB of RAM, whereas processing them in a single batch would require 1 TB of RAM, and the batched run was only 30 minutes longer.
# Process genomes in batches of 2 million sequences.
./vclust.py prefilter -i genomes.fna -o fltr.txt --batch-size 2000000
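When choosing a value for --batch-size, it can help to estimate how many batches a given setting implies. The sketch below is illustrative only: the helper name batch_count is hypothetical, and it simply applies ceiling division. How Vclust actually partitions the input is internal to the tool.

```python
def batch_count(n_sequences: int, batch_size: int) -> int:
    """Estimate how many batches are needed to cover n_sequences.

    Hypothetical helper for planning purposes; not part of Vclust.
    """
    if n_sequences < 0 or batch_size <= 0:
        raise ValueError("n_sequences must be >= 0 and batch_size must be > 0")
    # Ceiling division using integers only (avoids floating-point rounding).
    return -(-n_sequences // batch_size)


# The IMG/VR example above: 15.5 million contigs, 2 million per batch.
print(batch_count(15_500_000, 2_000_000))  # 8
```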
By default, the prefilter command analyzes all k-mers of each genome, but it can also be limited to a fraction of them, which significantly reduces memory usage and runtime. Reducing the k-mer fraction has only a minor effect on sensitivity, as comparisons on a subset of k-mers are generally sufficient for sequence identity estimation.
The --kmers-fraction option controls the proportion (between 0 and 1) of k-mers used in comparisons:
# Process genomes in batches and analyze 10% of k-mers in each genome sequence.
./vclust.py prefilter -i genomes.fna -o fltr.txt --batch-size 5000000 --kmers-fraction 0.1
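To see why a subset of k-mers is usually enough, the following minimal sketch compares the fraction of shared k-mers computed from all k-mers with the same quantity computed from roughly 10% of them. Everything here is an assumption made for illustration: the function names, the synthetic sequences, and the hash-based selection rule are hypothetical and do not describe how Vclust selects k-mers internally.

```python
import hashlib
import random


def kmers(seq: str, k: int) -> set:
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def keep(kmer: str, fraction: float) -> bool:
    """Keep a k-mer if its stable 64-bit hash falls in the lowest `fraction`
    of the hash range. The decision depends only on the k-mer itself, so the
    same k-mer is kept (or dropped) in every genome. Illustrative rule only."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") < fraction * 2**64


def shared_fraction(a: set, b: set) -> float:
    """Fraction of the k-mers in a that also occur in b."""
    return len(a & b) / len(a) if a else 0.0


# Synthetic data: a random sequence and a copy with about 1% of positions redrawn.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(20_000))
mutant = "".join(rng.choice("ACGT") if rng.random() < 0.01 else base
                 for base in genome)

full_a, full_b = kmers(genome, 15), kmers(mutant, 15)
sub_a = {x for x in full_a if keep(x, 0.1)}
sub_b = {x for x in full_b if keep(x, 0.1)}

print(f"all k-mers:     {shared_fraction(full_a, full_b):.3f}")
print(f"10% of k-mers:  {shared_fraction(sub_a, sub_b):.3f}")
```

The key design point in this sketch is that the selection is a deterministic function of the k-mer rather than an independent random draw per genome. If each genome sampled its k-mers independently, two identical genomes would share only a small part of their samples.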
For highly redundant datasets (e.g., hundreds of thousands of nearly identical genomes), the prefilter step may still pass a large number of genome pairs, increasing both memory usage and runtime. The --max-seqs option limits the number of target sequences reported for each query genome, reducing the overall number of genome pairs passed to alignment. For each query, --max-seqs returns up to n sequences that have passed the --min-kmers and --min-ident filters and have the highest sequence identity to the query sequence. For example, in a dataset containing 1 million nearly identical genomes, the total number of possible genome pairs is almost 500 billion. Setting --max-seqs 1000 reduces this to at most 1 billion genome pairs, significantly decreasing memory usage and runtime.
# Limit the number of target sequences to 1000 per query genome.
./vclust.py prefilter -i genomes.fna -o fltr.txt --batch-size 100000 --max-seqs 1000
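The pair counts quoted above follow from simple arithmetic, reproduced in the short sketch below. The function names are hypothetical. The capped figure is an upper bound that counts each query-target pair separately; whether reciprocal pairs are later merged is not assumed here.

```python
def all_vs_all_pairs(n_genomes: int) -> int:
    """Number of unordered genome pairs in an all-versus-all comparison."""
    return n_genomes * (n_genomes - 1) // 2


def max_pairs_with_cap(n_genomes: int, max_seqs: int) -> int:
    """Upper bound on reported pairs when each query keeps at most max_seqs
    targets (a query cannot have more than n_genomes - 1 targets)."""
    return n_genomes * min(max_seqs, max(n_genomes - 1, 0))


n = 1_000_000
print(f"{all_vs_all_pairs(n):,}")          # 499,999,500,000
print(f"{max_pairs_with_cap(n, 1000):,}")  # 1,000,000,000
```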