KmerCamel🐫

Introduction
Prerequisites
Getting started
Detailed instructions
How it works
How to test
Issues
Changelog
Licence
Contact

Introduction

KmerCamel🐫 is a tool for efficiently representing a set of k-mers by a masked superstring.

It is based on the following paper:

Ondřej Sladký, Pavel Veselý, and Karel Břinda: Masked superstrings as a unified framework for textual k-mer set representations. bioRxiv 2023.02.01.526717, 2023. https://doi.org/10.1101/2023.02.01.526717

See supplementary materials of the aforementioned paper for experimental results with KmerCamel🐫.

The computation of masked superstring using KmerCamel🐫 is done in two steps - first a superstring is computed with its default mask and then its mask can be optimized.

The computation of the masked superstring works as follows. KmerCamel🐫 reads an input FASTA file (optionally gziped), retrieves the associated k-mers (with supported $k$ up to 127), and outputs a fasta file with a single record - a masked-cased superstring, which is in the nucleotide alphabet with case of the letters determining the mask symbols. KmerCamel🐫 implements two different algorithms for computing the superstring: global greedy and local greedy. Global greedy produces more compact superstrings and therefore is the default option, but local greedy requires less memory and hence can be more suitable in use cases where memory is the main limitation.

To compute masked superstrings takes about 4-6s / 1M k-mers, which means about 3h to compute masked superstrings for the human genome. The memory consumption on human genome is about 115 GB.

All algorithms can be used to either work in the unidirectional model or in the bidirectional model (i.e. treat $k$-mer and its reverse complement as the same; in this case either of them appears in the result).

Additionally, KmerCamel🐫 can optimize the mask of the superstring via the maskopt subcommand. The implemented mask optimization algorithms are the following:

Minimize the number of 1s in the mask.
Maximize the number of 1s in the mask.
Minimize the number of runs of 1s in the mask.

Prerequisites

GCC
Zlib
GLPK (can be installed via apt-get install libglpk-dev on Ubuntu or brew install glpk on macOS)

Getting started

Installation

Download and compile KmerCamel🐫 by running the following commands:

git clone --recursive https://github.com/OndrejSladky/kmercamel
cd kmercamel && make

Alternatively, you can install KmerCamel from bioconda:

   conda install bioconda::kmercamel

Compression for k-mer set storage

kmercamel compute -k 31 -o ms.msfa yourfile.fa         # Compute MS with the default mask
kmercamel ms2mssep -m mask.m -s superstring.s ms.msfa  # Extract superstring and mask
bzip2 --best mask.m
xz -T1 -9 superstring.s

For a super efficient compression of the superstring (often <2 bits / bp), you use some of the specialized tools based on statistical compression such as GeCo3 or Jarvis3.

k-mer set indexing

Example with FMSI:

kmercamel compute -k 31 -o /dev/null -M mas-opt.msfa yourfile.fa   # Compute MS and the maxone mask
fmsi index -p ms-opt.msfa                                          # Create a k-mer index

Detailed instructions

Examples of computing masked superstrings (compute subcommand):

kmercamel compute -k 31 -o ms.msfa yourfile[.fa|.fa.gz]            # From a (gziped) fasta file, use "-" for stdin
kmercamel compute -k 31 -o ms.msfa -z 2 yourfile.fa                # Represent only k-mers appearing at least z=2 times
kmercamel compute -k 31 -o ms.msfa -u yourfile.fa                  # Treat k-mer and its reverse complement as distinct
kmercamel compute -k 31 -o ms.msfa -M ms-maxone.msfa yourfile.fa   # Also store MS with maximum ones
kmercamel compute -k 31 -o ms.msfa -a streaming yourfile.fa        # Use streaming instead of global for lower memory footprint (likely worse result)

If the input file are simplitigs (or eulertigs), the execution can be significantly speeded up by adding the -S flag. However, note that if -S is used with matchtigs, it may result it unnecessarily long outputs, while using it for unitigs may lead to slow down for certain inputs.

Examples of optimizing masks:

kmercamel maskopt -t maxone -o ms-opt.msfa -k 31 ms.msfa    # Maximize the number of 1s in the mask
kmercamel maskopt -t minone -o ms-opt.msfa -k 31 ms.msfa    # Minimize the number of 1s in the mask
kmercamel maskopt -t minrun -o ms-opt.msfa -k 31 ms.msfa    # Minimize the number of runs of consecutive 1s in the mask.

Format conversions:

kmercamel mssep2ms -m dataset.m -s dataset.s -o dataset.msfa  # M and S -> mask-cased MS in msfa
kmercamel ms2mssep -m dataset.m -s dataset.s dataset.msfa     # Mask-cased MS -> M and S
kmercamel spss2ms -k 31 -o dataset.msfa dataset.rspss         # rSPSS/general fasta to its corresponding MS
kmercamel ms2spss -k 31 -o dataset.rspss dataset.msfa              # Splitting MS in msfa into rSPSS in fa

Compute lower bound on the minimum possible superstring length of a k-mer set:

./kmercamel lowerbound -k 31 yourfile.fa

To view all options for a particular subcommand, run kmercamel <subcommand> -h.

Additionally, KmerCamel🐫 experimentally implements both algorithms in their Aho-Corasick automaton versions. To use them, add AC to the algorithm name. Note that they are much slower than the original versions, but they can handle arbitrarily large ks.

How it works

For details about the algorithms and their implementation, see the Code README.

How to test

To ensure correctness of the results, KmerCamel🐫 has two levels of tests - unit tests and file-specific integration tests.

For integration tests install jellyfish (v2) and add it to PATH.

You can verify all the algorithms for 1 < k < 128 on a S. pneumoniae by running make verify. To run it on another dataset, see the verification script.

You can run the C++ unittests by make cpptest.

To run all the test, simply run make test.

Issues

Please use Github issues.

Changelog

See Releases.

Licence

MIT

Contact

Ondrej Sladky <ondra.sladky@gmail.com>

Name		Name	Last commit message	Last commit date
Latest commit History 381 Commits
.github/workflows		.github/workflows
data		data
src		src
tests		tests
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE.txt		LICENSE.txt
Makefile		Makefile
README.md		README.md
create-version.sh		create-version.sh
verify.py		verify.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

KmerCamel🐫

Introduction

Prerequisites

Getting started

Installation

Compression for k-mer set storage

k-mer set indexing

Detailed instructions

How it works

How to test

Issues

Changelog

Licence

Contact

About

Uh oh!

Releases 8

Packages

Uh oh!

Contributors 4

Uh oh!

Languages

License

OndrejSladky/kmercamel

Folders and files

Latest commit

History

Repository files navigation

KmerCamel🐫

Introduction

Prerequisites

Getting started

Installation

Compression for k-mer set storage

k-mer set indexing

Detailed instructions

How it works

How to test

Issues

Changelog

Licence

Contact

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 8

Packages 0

Uh oh!

Contributors 4

Uh oh!

Languages

Packages