fusemblr is a pipeline wrapper designed for the assembly of complex genomes using Nanopore reads and paired-end Illumina reads

SAMtoBAM/fusemblr

fusemblr was designed for the Fusarium oxysporum assembly project (hence the name).
The pipeline uses Nanopore reads (the longer the reads and the higher the coverage, the better) and paired-end Illumina reads; PacBio HiFi reads are optional.

Note: with our Fusarium oxysporum datasets, providing PacBio HiFi reads had very little impact on the resulting assemblies, because our ONT data were recently basecalled, had high coverage and included a good subset of long reads.

Easy installation

conda install samtobam::fusemblr

How to run

fusemblr.sh -n nanopore.fq.gz -1 illumina.R1.fq.gz -2 illumina.R2.fq.gz -g 70000000

Required inputs:
-n | --nanopore		Nanopore long reads used for assembly, in fastq or fasta format (*.fastq / *.fq); can be gzipped (*.gz)
-1 | --pair1		Paired-end Illumina reads in fastq format; first pair. Used for Ratatosk polishing and PAQman evaluation. Can be gzipped (*.gz)
-2 | --pair2		Paired-end Illumina reads in fastq format; second pair. Used for Ratatosk polishing and PAQman evaluation. Can be gzipped (*.gz)
-g | --genomesize	Estimated genome size in base pairs (e.g. 70000000), required for downsampling and assembly

Recommended inputs:
-h | --hifi		PacBio HiFi reads, used for the Hifiasm assembly and for assembly polishing with NextPolish2 (recommended if available)
-t | --threads		Number of threads for tools that accept this option (default: 1)

Optional parameters:
-m | --minsize		Minimum size of reads to keep during downsampling (Default: 5000)
-x | --coverage		Target coverage (X) for downsampling; the number of bases kept is coverage*genomesize (Default: 100)
-v | --minovl		Minimum overlap for Flye assembly (Default: calculated during the run as the N95 of the reads used for assembly)
-w | --weight		The weighting used by Filtlong for selecting reads; balancing the length vs the quality (Default: 5)
-p | --prefix		Prefix for output (default: name of assembly file (-a) before the fasta suffix)
-o | --output		Name of output folder for all results (default: fusemblr_output)
-c | --cleanup		Remove a large number of files produced by each of the tools that can take up a lot of space. Choose between 'yes' or 'no' (default: 'yes')
-h | --help		Print this help message
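
As an illustration, a fuller invocation that combines the required inputs with some of the parameters listed above might look like the sketch below. The file names and the thread count are placeholders (assumptions, not fusemblr defaults), and the command is only printed so that it can be checked before it is run.

```shell
# Hypothetical input files and settings; replace with your own.
nanopore="nanopore.fq.gz"
r1="illumina.R1.fq.gz"
r2="illumina.R2.fq.gz"
genome_size=70000000   # estimated genome size in bp (70 Mb)
threads=16             # placeholder value

# Assemble the command using only the long options documented above.
cmd="fusemblr.sh --nanopore $nanopore --pair1 $r1 --pair2 $r2 --genomesize $genome_size --threads $threads --coverage 100 --minsize 5000 --output fusemblr_output --cleanup yes"

# Print the command for inspection; once it looks right, run it with: eval "$cmd"
echo "$cmd"
```

Printing first makes it easy to confirm that the genome size and read files are the intended ones before starting a long run.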

Pipeline in 6 steps:

1. Downsampling of reads to a designated coverage using Filtlong

    -default is set to 100X (-x), which provided better assemblies than the typical 30-50X
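
The downsampling target follows from two of the parameters above: the number of bases to keep is the coverage multiplied by the genome size. The sketch below works through that arithmetic for the example genome size and shows how the result could be passed to Filtlong. The exact Filtlong call made inside fusemblr is not shown in this README, so the mapping of -m to --min_length and of -w to --length_weight is an assumption, and the input file name is a placeholder.

```shell
genome_size=70000000   # -g, estimated genome size in bp
coverage=100           # -x, default coverage
min_size=5000          # -m, default minimum read length
weight=5               # -w, default weight

# Bases to keep = coverage * genome size
target_bases=$((coverage * genome_size))
echo "target bases: $target_bases"

# Assumed Filtlong call (printed, not run); Filtlong writes the kept reads to stdout.
filtlong_cmd="filtlong --min_length $min_size --length_weight $weight --target_bases $target_bases nanopore.fq.gz"
echo "$filtlong_cmd"
```

For a 70 Mb genome at 100X this gives a target of 7000000000 bases.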

2. Polishing of downsampled reads with the paired-end illumina reads using Meryl and Ratatosk correct

    -uses a baseline quality score (-Q) of 90 and therefore assumes fairly recent ONT data (e.g. R10 or high-accuracy basecalling)

3. Genome Assembly

3.a. Assembly with Flye
    -removed the hard-coded maximum value for the minimum overlap threshold (previously 10kb)
    -by default the minimum overlap value is automatically set to the read N95 after polishing
3.b. Assembly with Hifiasm
    -if HiFi reads are provided: uses the --ul option, with both polished ONT and HiFi reads
    -without HiFi: uses the --ont option, with only the polished ONT reads
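
The default minimum overlap in 3.a relies on the read N95: the read length at which reads of that length or longer contain at least 95% of all bases. A minimal sketch of that calculation with standard tools is given below, assuming four-line FASTQ records. The function names are hypothetical, and this is not necessarily how fusemblr computes the value internally.

```shell
# Print one read length per line from an uncompressed four-line FASTQ on stdin.
fastq_lengths() {
    awk 'NR % 4 == 2 { print length($0) }'
}

# Read lengths on stdin (one per line); print the N95.
n95() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            target = 0.95 * total
            # Walk from the longest read down until 95% of bases are covered.
            for (i = 1; i <= NR; i++) {
                cum += len[i]
                if (cum >= target) { print len[i]; exit }
            }
        }'
}

# Usage with a hypothetical file name:
#   gzip -dc polished_reads.fq.gz | fastq_lengths | n95
```

Because N95 sits near the short end of the length distribution, using it as the minimum overlap means almost all bases are in reads long enough to satisfy the threshold.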

4. 'Patch' the Flye assembly (target) using the Hifiasm assembly (query) with RagTag patch

    -uses a minimum unique alignment length (-f) of 25000 to be conservative during patching
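
A patching step of this kind could be run by hand roughly as follows. This is a sketch only: the file names, output directory and thread count are placeholders, and any options fusemblr passes to RagTag beyond -f are not stated here. The command is printed rather than executed.

```shell
target="flye_assembly.fa"        # hypothetical Flye assembly (target)
query="hifiasm_assembly.fa"      # hypothetical Hifiasm assembly (query)
min_unique_aln=25000             # -f value described above

# ragtag.py patch takes the target first and the query second.
ragtag_cmd="ragtag.py patch -f $min_unique_aln -t 16 -o ragtag_patch_out $target $query"
echo "$ragtag_cmd"
```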

5. Optional: Polishing of assembly with PacBio HiFi and paired-end Illumina reads using NextPolish2

6. Filtering (minimum length 10kb), reordering and renaming using SeqKit and awk
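
The final step can be approximated with the sketch below, which filters and sorts with SeqKit and then renames the sequences with awk. The function names and the prefix_contigN naming scheme are hypothetical, sorting by descending length is an assumption about what 'reordering' means here, and the commands fusemblr actually runs may differ.

```shell
# Keep sequences of at least 10 kb and order them from longest to shortest.
# Requires SeqKit; reads FASTA on stdin and writes FASTA to stdout.
filter_and_sort() {
    seqkit seq -m 10000 | seqkit sort -l -r
}

# Rename FASTA headers sequentially as <prefix>_contig1, <prefix>_contig2, ...
# (hypothetical naming scheme); sequence lines are passed through unchanged.
rename_contigs() {
    awk -v prefix="$1" '/^>/ { n++; print ">" prefix "_contig" n; next } { print }'
}

# Usage with hypothetical file names:
#   filter_and_sort < patched.fa | rename_contigs sample > sample.final.fa
```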

Schematic

Following assembly, it is recommended that you run PAQman on the resulting assembly to comprehensively check its quality.
This can also help you compare several assemblies and identify the best one.
