5 Optimizing sensitivity and resource usage
To reduce memory consumption or improve processing speed, you can use the following three methods, either individually or in combination:
Vclust can lower memory usage by processing the genome dataset in smaller, equally sized batches. This approach may slightly increase runtime, but it significantly reduces memory consumption:
# Process genomes in batches of 5 million sequences.
./vclust.py prefilter -i genomes.fna -o fltr.txt --batch-size 5000000
To speed up filtering, Vclust can analyze only a fraction of the k-mers for each genome. The --kmers-fraction option controls the proportion [0-1] of k-mers used in comparisons:
# Process genomes in batches and analyze 10% of k-mers in each genome sequence.
./vclust.py prefilter -i genomes.fna -o fltr.txt --batch-size 5000000 --kmers-fraction 0.1
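To illustrate the idea of comparing genomes on a fraction of their k-mers, here is a minimal Python sketch. It is not Vclust's implementation: the hash-threshold sampling scheme, the function names `kmer_set` and `sample_kmers`, and the toy sequences are all assumptions made for illustration. The point it shows is that when selection depends only on the k-mer itself, the same k-mers are kept or dropped in every genome, so shared k-mers remain comparable after sampling.

```python
import zlib


def kmer_set(seq: str, k: int) -> set[str]:
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sample_kmers(kmers: set[str], fraction: float) -> set[str]:
    """Keep k-mers whose 32-bit hash falls below fraction * 2**32.

    Hypothetical sampling scheme, used here only to illustrate the concept.
    """
    threshold = int(fraction * 2**32)
    return {km for km in kmers if zlib.crc32(km.encode()) < threshold}


# Toy sequences (made up for this example).
genome_a = "ACGTACGTTGCAACGTTAGC"
genome_b = "ACGTACGTTGCAACGATAGC"

sampled_a = sample_kmers(kmer_set(genome_a, 4), 0.5)
sampled_b = sample_kmers(kmer_set(genome_b, 4), 0.5)

# Fewer k-mers are compared, and shared k-mers are sampled consistently.
print(len(sampled_a), len(sampled_b), len(sampled_a & sampled_b))
```

A lower fraction means fewer k-mers to store and compare, at the cost of a coarser estimate of how many k-mers two genomes share.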
For highly redundant datasets (e.g., hundreds of thousands of nearly identical genomes), the prefilter step may still pass a large number of genome pairs, increasing both memory usage and runtime. The --max-seqs option limits the number of target sequences reported for each query genome, reducing the overall number of genome pairs passed to alignment. For each query, --max-seqs returns up to n sequences that have passed the --min-kmers and --min-ident filters and have the highest sequence identity to the query sequence. For example, in a dataset containing 1 million nearly identical genomes, the total number of possible genome pairs is almost 500 billion. Setting --max-seqs 1000 reduces this to at most 1 billion genome pairs (1 million queries, each with up to 1,000 targets), significantly decreasing memory usage and runtime.
# Limit the number of target sequences to 1000 per query genome.
./vclust.py prefilter -i genomes.fna -o fltr.txt --batch-size 100000 --max-seqs 1000
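The pair counts quoted for the 1-million-genome example can be checked with a few lines of Python. This is only a worked arithmetic sketch: the function names `all_pairs` and `capped_pairs` are made up for illustration, and the capped figure is an upper bound that assumes every query reports the full 1,000 targets.

```python
def all_pairs(n: int) -> int:
    """Number of unordered genome pairs in an all-versus-all comparison."""
    return n * (n - 1) // 2


def capped_pairs(n: int, max_seqs: int) -> int:
    """Upper bound on reported pairs when each query keeps at most max_seqs targets."""
    return n * min(max_seqs, n - 1)


n_genomes = 1_000_000
print(all_pairs(n_genomes))           # 499999500000 (almost 500 billion)
print(capped_pairs(n_genomes, 1000))  # 1000000000 (1 billion)
```

In this example the cap cuts the number of pairs roughly 500-fold.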