GitHub - SAMtoBAM/PAQman: PAQman is a tool for a reference-free and comprehensive evaluation of genome assemblies

PAQman combines a set of excellent tools (and PAQman scripts) for a reference-free and comprehensive evaluation of genome assemblies

PAQman evaluates 7 important features:
Contiguity (QUAST)
Gene Content (BUSCO)
Completeness (Merqury)
Error rate (Merqury)
Correctness (CRAQ)
Coverage (PAQman (with use of bwa/minimap2 + samtools + bedtools))
Telomerality* (PAQman (with use of seqtk + bedtools))
PAQman is clearly built upon the back of the tools in brackets so please cite them

Apptainer usage

docker pull ghcr.io/samtobam/paqman:latest

Conda installation

conda config --append channels pwwang
conda install samtobam::paqman

Quick run

paqman.sh -a path/to/assembly.fa -l path/to/long-reads.fq.gz

paqman.sh -g assembly.fa -l long-reads.fq.gz -x ont -1 illumina.R1.fq.gz -2 illumina.R2.fq.gz -b eukaryota -w 30000 -s 10000 -r TTAGGG -p assembly -o paqman_output -c yes

Required inputs:
-a | --assembly     Genome assemly in fasta format (*.fa / *.fasta / *.fna) and can be gzipped (*.gz)
-l | --longreads    Long reads used for assembly in fastq or fasta format  (*.fa / *.fasta / *.fna / *.fastq / *.fq) and can be gzipped (*.gz)

Recommended inputs:
-x | --platform     Long-read technology to determine mapping mapping parameters. Choose between 'ont' or 'pacbio-hifi' or 'pacbio-clr' (default: ont)
-b | --buscodb      Name of BUSCO database to be used (default: eukaryota)
-t | --threads      Number of threads for tools that accept this option (default: 1)
-r | --repeat       Telomeric repeat pattern (default: TTAGGG)

Optional parameters:
-1 | --pair1        Paired end illumina reads in fastq format; first pair. Used by Merqury, CRAQ and coverage analysis (Recommended). Can be gzipped (*.gz)
-2 | --pair2        Paired end illumina reads in fastq format; second pair. Used by Merqury, CRAQ and coverage analysis (Recommended). Can be gzipped (*.gz)
-w | --window       Number of basepairs for window averaging for coverage (default: 30000)
-s | --slide        Number of basepairs for the window to slide for coverage (default: 10000)
-p | --prefix       Prefix for output (default: name of assembly file (-a) before the fasta suffix)
-o | --output       Name of output folder for all results (default: paqman_output)
-seq | --sequences	Whether or not to use scaffolds or contigs; provide 'scaffolds' to not break the assembly at N's (default: contigs)
-c | --cleanup      Remove a large number of files produced by each of the tools that can take up a lot of space. Choose between 'yes' or 'no' (default: 'yes')
-h | --help         Print this help message

The Pipeline

The summary output metrics:

Although PAQman looks at 7 features, within these features are many important metrics for complete assembly evaluation
PAQman extracts some of the most informative/important and places them into a summary file
All metrics are detailed below

Summary stats: 'summary_stats.tsv':

Column	Header	Description
01	prefix	prefix given to the output files (-p)
02	assembly	prefix from the fasta file without suffix (-a)
03	quast_#contigs	Number of total contigs
04	quast_#contigs>10kb	Number of contigs >10 kb
05	quast_assembly_size	Total number of basepairs in assembly
06	quast_assembly_N50	Assembly N50
07	quast_assembly_N90	Assembly N90
08	quast_largest_contig	Largest contig in the assembly
09	BUSCO_db	The BUSCO database used to evaluate the assembly
10	BUSCO_total	Total number of BUSCOs in the database
11	BUSCO_complete	BUSCOs identified as complete
12	BUSCO_complete_single	BUSCOs identified as complete and as a single copy
13	BUSCO_fragmented	BUSCOs identified as fragmented
14	BUSCO_missing	BUSCOs not identified
15	merqury_kmer_completeness(%)	A k-mer estimation of completeness
16	merqury_qv(phred)	A k-mer estimation of the genome wide error rate
17	CRAQ_R-AQI(%)	Quality measure from 0-100 based on small regional errors
18	CRAQ_S-AQI(%)	Quality measure from 0-100 based on large structural errors
19	coverage_normal(%)	Percentage of the genome within 2*stdev of the genome wide median
20	telomeric_ends	Number of contig ends with telomeric repeats
21	telomeric_ends(%)	Percentage of contig ends with telomeric repeats
22	t2t_contigs	Number of contigs with telomeric repeats at both ends

*Telomerality:

*I am using the term to describe stats about how many of the assembled contig have reached telomere sequences giving confidence of structural completeness at contig ends in repeat regions
Telomerality is calculated using a few simple steps specific to PAQman

All exact single repeats are identified (seqkit locate --ignore-case -p ${telomererepeat})
Coordinates of single repeats are merged if withint 7 bp (allowing for 1 repeat to deviate mildly)
Only keep regions where at least two consecutive repeats were found (i.e. only keep region > 2*repeat length)
Can find all the coordinates for telomeric regions (including interstitial) in the bed file with explanatory header: 'telomerality/telomeres.bed'
Contig ends are labelled in 3 ways
telomeric: Coordinates for a telomeric repeat are at least within 0.75*length from the end (e.g. a 100 bp long telomeric repeat region with within 75bp of a contig end)
distant: >0.75*length bp away but within 5kb
absent: >5kb from the end or no repeats identified in contig
Can find these classifications (and coordinates/distance from edge etc) for each contig end in the tsv file with explanatory header: 'telomerality/telomeres.classification.tsv'

Note: For the option -r (--repeat); although some repeats are not exact this can still work as the detection scheme allows for inexact repeats. For example 'GGTGTG' works very well for S. cerevisiae, which usually is represented as T(G)*1-3.

Comparing PAQman output across multiple assemblies

PAQman also has a tool (paqplots) to compare and analyse summary files from multiple assemblies
This makes it easier to benchmark tools and parameters using all the variables analysed
Simply provide paqplots with a combined summary file (with the same header) or a list of paths to the summary files

paqplots.sh -s summary_file.tsv -p prefix -o paqplot_output
OR
paqplots.sh -l list_of_summary_files.txt -p prefix -o paqplot_output
	
Required inputs:
-s | --summary     A PAQman summary file with multiple assemblies combined and the same header
-l | --list		   A list of paths to multiple summary files

Optional parameters:
-p | --prefix       Prefix for output (default: paqplot)
-o | --output       Name of output folder for all results (default: paqplot_output)
-h | --help         Print this help message

paqplot will provide two images (alongside the R scripts used to generate them)
First: Figures containing all of the raw variables compared
A radar plot (thanks ggradar!) for stats of percentages and a lollipop plot for all others (due to the wildly different scales)

Second: Another radar plot (thanks ggradar again!) containing a subset of stats and their relative values (i.e. all stats divided by the maximum value for that stat)
In this example, all stats should be maximised except for contig count hence the PACman like shape in blue below

Name		Name	Last commit message	Last commit date
Latest commit History 244 Commits
.github/workflows		.github/workflows
apptainer		apptainer
bin		bin
figures		figures
logo		logo
pub_examples		pub_examples
LICENSE		LICENSE
README.md		README.md
build.sh		build.sh
meta.yaml		meta.yaml
post-link.sh		post-link.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Apptainer usage

Conda installation

Quick run

The Pipeline

The summary output metrics:

Summary stats: 'summary_stats.tsv':

*Telomerality:

Comparing PAQman output across multiple assemblies

About

Uh oh!

Releases 3

Packages

Uh oh!

Uh oh!

Languages

License

SAMtoBAM/PAQman

Folders and files

Latest commit

History

Repository files navigation

Apptainer usage

Conda installation

Quick run

The Pipeline

The summary output metrics:

Summary stats: 'summary_stats.tsv':

*Telomerality:

Comparing PAQman output across multiple assemblies

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Uh oh!

Uh oh!

Languages

Packages