sparrowhawk

Short-read assembler for bacterial genomics based on a de Bruijn graph written in Rust.

Disclaimer ⚠️ 🚧

This is a work in progress project. This in particular implies:

Not all the main features we want are yet implemented.
Code might be messy, and not even documented.
General documentation on how to install and use the tool might be short or even missing.
Finding unexpected errors/behaviour or bugs should not be a surprise.
Some features might be partially hardcoded.

These (and potentially other) items will be progressively fixed before version 1.0.

sparrowhawk?

Sparrowhawk was at one time the Archmage of Earthsea. Also, the sparrowhawk (Accipiter nisus) is a bird of prey native to Europe (and the island of Gont).

Description

Note: this repository is for the Rust-based genomic assembler. If you are looking for its web implementation, see sparrowhawk-web.

sparrowhawk aims to be a fast short-read assembler for bacterial genomics. It has been developed taking advantage/inspiration of other Rust-based tools developed by our group (such as ska.rust), as well as others (such as Katome or SKESA).

Current main features:

Currently only support for Illumina paired short reads.
Single k-value, that must be odd and between 3 (not recommended to go below 19/21, due to memory requirements) and 256.
Designed thinking on bacterial genomes (i.e. "small").
Uses a node-based de Bruijn graph built upon petgraph.
Partially parallelised with rayon.
Compilation to WebAssembly targets (currently only wasm32-unknown-unknown) is possible to run the assembler in web projects. For more info, see sparrowhawk-web.

🚧 In-progress future main (not all) features: 🚧

Partial GPU acceleration.
Multi-k support.
Improved error correction and graph collapse logics.

Installation

Currently the only option is to compile from source.

Compilation from source

Development has been done only on x86_64 GNU/Linux-based systems, and most surely will probably stay that way (i.e. no other systems have been tested). To compile our project from source as we did, you will need the rust toolchain installed in your system. Then, you can download the code of the e.g. version v0.1.1 using

git clone --branch v0.1.1 https://github.com/bacpop/sparrowhawk.git

sparrowhawk is designed to compile to run natively as a x86_64 binary, but you can also compile it to the WebAssembly target wasm32-unknown-unknown. You can see below how to do it manually (as we did for development). Check out sparrowhawk-web for an integrated project with Javascript and wasm-pack.

Compilation to x86_64 (default)

Move into the downloaded repository and use cargo to build the project. You can add the --release argument to include some compiler optimisations.

cd sparrowhawk
cargo build --release

If using the --release flag, this should place your compiled binary inside target/release.

Compilation to wasm32-unknown-unknown

For this, you will need to activate the feature wasm with the -F argument and manually set the target. You can add the --release argument to include some compiler optimisations.

cd sparrowhawk
cargo build --release -F wasm --target wasm32-unknown-unknown

If using the --release flag, this should place your compiled binary inside target/wasm32-unknown-unknown/release as sparrowhawk.wasm.

Usage

Here we will only consider the binary compiled for x86_64, refer to sparrowhawk-web for an example of usage of the wasm32-unknown-unknown compilation target.

sparrowhawk can be called later to see the basic options and arguments with

./sparrowhawk

Currently, only the build option is present (apart from help), that allows assemblying the genomes. You can check its arguments running

./sparrowhawk build --help

An example execution could be the following:

./sparrowhawk -f ./reads.tsv -k 31 --threads 1 -v --min-count 5 --output-dir ./ --output-prefix prefix

This will assemble your reads, with k=31 and using only one thread. The minimum repeats of one particular k-mer to be considered are 5 (which is also the default). The output contigs will be written in the current directory as a fasta file called prefix_contigs.fasta, given that we have indicated, using the --output-prefix argument the word "prefix" as prefix. The input files in this case are provided through a reads.tsv tab-separated file, that contains an identifier for your reads and the two file paths separated by a space, i.e. a file that contains this line

IDENTIFIER     /path/to/the/read_1.fastq /path/to/the/read_2.fastq

Alternatively, you could have run:

./sparrowhawk /path/to/the/read_1.fastq /path/to/the/read_2.fastq -k 31 --threads 1 -v --min-count 5 --output-dir ./ --output-prefix prefix

In the same folder as the output FASTA file, the graph before collapsing will be exported in DOT, and GFA versions 1.1 and 2 as prefix_graph.dot, prefix_graph.gfa, and prefix_graph.gfa2 respectively. A histogram of the k-mer frequency spectrum will be saved in the same directory as prefix_kmerspectrum.png. These optional files can be avoided with the corresponding arguments.

Name		Name	Last commit message	Last commit date
Latest commit History 82 Commits
src		src
.gitignore		.gitignore
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
sparrowhawk_logo.png		sparrowhawk_logo.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

sparrowhawk

Disclaimer ⚠️ 🚧

sparrowhawk?

Description

Installation

Compilation from source

Compilation to x86_64 (default)

Compilation to wasm32-unknown-unknown

Usage

About

Uh oh!

Releases

Contributors 2

Languages

License

bacpop/sparrowhawk

Folders and files

Latest commit

History

Repository files navigation

sparrowhawk

Disclaimer ⚠️ 🚧

sparrowhawk?

Description

Installation

Compilation from source

Compilation to x86_64 (default)

Compilation to wasm32-unknown-unknown

Usage

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Contributors 2

Languages