5 Optimizing sensitivity and resource usage

To reduce memory consumption or improve processing speed, you can use the following three methods, either individually or in combination:

2.1.1. Process genomes in smaller batches

Vclust can lower memory usage by processing the genome dataset in smaller, equally sized batches. This approach may slightly increase runtime, but it significantly reduces memory consumption:

# Process genomes in batches of 5 million sequences.
./vclust.py prefilter -i genomes.fna -o fltr.txt --batch-size 5000000
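
When choosing a batch size, it can help to know how many sequences the input file holds. The sketch below is not part of Vclust: `count_batches` is a hypothetical helper that counts FASTA header lines and estimates how many batches a given size would produce, assuming --batch-size is the number of sequences per batch (as the comment in the example above suggests) and that every record starts with a `>` line.

```shell
# Hypothetical helper (not a Vclust command): estimate the number of batches.
# Assumes each FASTA record starts with a '>' header line.
count_batches() {
    local fasta="$1" batch_size="$2"
    local n_seqs
    n_seqs=$(grep -c '^>' "$fasta")
    # Ceiling division, so a remainder still counts as one batch.
    echo $(( (n_seqs + batch_size - 1) / batch_size ))
}

# Tiny demo file with three records, split into batches of two.
demo_fasta=$(mktemp)
printf '>g1\nACGT\n>g2\nGGCC\n>g3\nTTAA\n' > "$demo_fasta"
n_batches=$(count_batches "$demo_fasta" 2)
echo "estimated batches: $n_batches"
rm -f "$demo_fasta"
```

For a real dataset, the same function could be called as `count_batches genomes.fna 5000000`.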

2.1.2. Analyze only a fraction of k-mers

To speed up filtering, Vclust can analyze only a fraction of the k-mers in each genome. The --kmers-fraction option sets the proportion of k-mers (a value between 0 and 1) used in comparisons:

# Process genomes in batches and analyze 10% of k-mers in each genome sequence.
./vclust.py prefilter -i genomes.fna -o fltr.txt --batch-size 5000000 --kmers-fraction 0.1
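
As a rough illustration of what the fraction means, a sequence of length L contains at most L - k + 1 overlapping k-mers, so a fraction of 0.1 leaves roughly a tenth of them. The numbers below are assumptions chosen for the example: a 50,000 bp genome and k = 25. Neither value comes from this page, and the actual count Vclust analyzes may differ.

```shell
# Illustrative arithmetic only; genome_len and k are assumed values.
genome_len=50000        # assumed genome length in bp
k=25                    # assumed k-mer length
fraction_percent=10     # corresponds to --kmers-fraction 0.1

# Upper bound on the number of overlapping k-mers in one sequence.
total_kmers=$(( genome_len - k + 1 ))
# Approximate number of k-mers kept at the given fraction.
kept_kmers=$(( total_kmers * fraction_percent / 100 ))

echo "k-mers in genome:    $total_kmers"
echo "k-mers analyzed (~): $kept_kmers"
```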

2.1.3. Limit the number of target sequences per query

For highly redundant datasets (e.g., hundreds of thousands of nearly identical genomes), the prefilter step may still pass a large number of genome pairs, increasing both memory usage and runtime. The --max-seqs option limits the number of target sequences reported for each query genome, reducing the overall number of genome pairs passed to alignment. For each query, --max-seqs returns up to n sequences that pass the --min-kmers and --min-ident filters and have the highest sequence identity to the query sequence. For example, in a dataset containing 1 million nearly identical genomes, the total number of possible genome pairs is almost 500 billion. Setting --max-seqs 1000 reduces this to at most 1 billion genome pairs, significantly decreasing memory usage and runtime.

# Limit the number of target sequences to 1000 per query genome.
./vclust.py prefilter -i genomes.fna -o fltr.txt --batch-size 100000 --max-seqs 1000
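
The pair counts quoted above can be checked with a few lines of shell arithmetic. This is only a worked calculation, assuming 64-bit integer arithmetic in the shell. The variable names are illustrative and it does not call Vclust.

```shell
n=1000000          # genomes in the dataset
max_seqs=1000      # value passed to --max-seqs

# All unordered genome pairs: n * (n - 1) / 2
all_pairs=$(( n * (n - 1) / 2 ))
# Upper bound with --max-seqs: at most max_seqs targets per query genome
capped_pairs=$(( n * max_seqs ))

echo "all pairs:    $all_pairs"      # 499999500000, almost 500 billion
echo "capped pairs: $capped_pairs"   # 1000000000, 1 billion
```

The cap grows linearly with the number of genomes, whereas the uncapped count grows quadratically, which is why the reduction matters most for large, redundant datasets.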
