
scaffold

Michael Alonge edited this page May 19, 2021 · 9 revisions

RagTag Version: v1.1.1

Scaffolding is the process of ordering and orienting draft assembly (query) sequences into longer sequences. Gaps (stretches of "N" characters) are placed between adjacent query sequences to indicate the presence of unknown sequence. RagTag uses synteny with a more contiguous reference genome assembly to scaffold query sequences. This synteny is measured by pairwise mapping of the query and reference assemblies with either Minimap2 (default) or Nucmer. RagTag does not alter input query sequence in any way and only orders and orients sequences, joining them with gaps.

Usage

Reference-guided scaffolding

positional arguments:
  <reference.fa>       reference fasta file (uncompressed or bgzipped)
  <query.fa>           query fasta file (uncompressed or bgzipped)

optional arguments:
  -h, --help           show this help message and exit

scaffolding options:
  -e <exclude.txt>     list of reference headers to ignore [null]
  -j <skip.txt>        list of query headers to leave unplaced [null]
  -f INT               minimum unique alignment length [1000]
  --remove-small       remove unique alignments shorter than -f
  -q INT               minimum mapq (NA for Nucmer alignments) [10]
  -d INT               alignment merge distance [100000]
  -i FLOAT             minimum grouping confidence score [0.2]
  -a FLOAT             minimum location confidence score [0.0]
  -s FLOAT             minimum orientation confidence score [0.0]
  -C                   concatenate unplaced contigs and make 'chr0'
  -r                   infer gap sizes. if not, all gaps are 100 bp
  -g INT               minimum inferred gap size [100]
  -m INT               maximum inferred gap size [100000]

input/output options:
  -o PATH              output directory [./ragtag_output]
  -w                   overwrite intermediate files
  -u                   add suffix to unplaced sequence headers

mapping options:
  -t INT               number of minimap2 threads [1]
  --aligner PATH       aligner executable ('nucmer' or 'minimap2') [minimap2]
  --mm2-params STR     space delimited minimap2 parameters ['-x asm5']
  --nucmer-params STR  space delimited nucmer parameters ['-l 100 -c 500']

scaffolding options

RagTag orders and orients sequences in <query.fa> according to their mappings to <reference.fa>. These files can be uncompressed or bgzipped. Use -e to provide a single-column file listing any reference.fa headers that should be ignored during scaffolding (e.g. chr0/chrUn or alt contigs). Similarly, use -j to provide a single-column file listing any query.fa headers that should automatically be left unplaced. If an alignment is not entirely unique, at least -f bp of the alignment must be unique for it to be considered for scaffolding. By default, entirely unique alignments are considered regardless of their length, but this can be disabled with --remove-small. Doing so ensures that only alignments at least -f bp in length are considered for scaffolding. -q sets the minimum Minimap2 mapq score for alignments and does not apply to Nucmer alignments. For each query sequence, syntenic alignments within -d bp (with respect to the query) of each other are merged into longer alignments.
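The filtering rules above can be summarized in a few lines of Python. This is only a sketch of the rules as described on this page, not RagTag's actual code; the function name keep_alignment and its arguments are hypothetical.

```python
def keep_alignment(aln_len, unique_len, mapq=None,
                   min_unique=1000, min_mapq=10, remove_small=False):
    """Decide whether one alignment is considered for scaffolding.

    aln_len      total alignment length in bp
    unique_len   number of unique bp in the alignment
    mapq         Minimap2 mapping quality, or None for Nucmer alignments
    min_unique   corresponds to -f
    min_mapq     corresponds to -q
    remove_small corresponds to --remove-small
    """
    # -q only applies when a mapq value is available (Minimap2).
    if mapq is not None and mapq < min_mapq:
        return False

    # By default, entirely unique alignments pass regardless of length.
    entirely_unique = unique_len == aln_len
    if entirely_unique and not remove_small:
        return True

    # Otherwise at least -f bp of the alignment must be unique.
    return unique_len >= min_unique


# A short but entirely unique alignment passes unless --remove-small is set.
print(keep_alignment(500, 500))
print(keep_alignment(500, 500, remove_small=True))
```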

-i, -a, and -s specify the minimum grouping, location, and orientation confidence scores, respectively. These scores are described in the original publication. Briefly, these scores, between 0 and 1, provide an indication of how ambiguous scaffolding was for each contig given the reference genome alignments. For example, a query sequence that aligns equally well to two distinct reference sequences will receive a grouping confidence score of 0.5. If every alignment for this query sequence is on the reverse strand, it will receive an orientation confidence score of 1.
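To make the two examples above concrete, here is a simplified sketch that treats each score as a fraction of aligned bp. This is an assumption chosen to reproduce the examples on this page; the exact definitions are in the original publication and may differ. The function names and the (reference, aligned_bp, strand) tuple layout are hypothetical.

```python
from collections import defaultdict


def grouping_confidence(alignments):
    """Return (best_reference, score) for one query sequence.

    alignments is a list of (reference_name, aligned_bp, strand) tuples.
    The score is the fraction of aligned bp on the best reference.
    """
    per_ref = defaultdict(int)
    for ref, bp, _strand in alignments:
        per_ref[ref] += bp
    best = max(per_ref, key=per_ref.get)
    return best, per_ref[best] / sum(per_ref.values())


def orientation_confidence(alignments, ref):
    """Return (strand, score) using only alignments to the chosen reference.

    The score is the fraction of aligned bp that agree with the majority strand.
    """
    fwd = sum(bp for r, bp, s in alignments if r == ref and s == "+")
    rev = sum(bp for r, bp, s in alignments if r == ref and s == "-")
    strand = "+" if fwd >= rev else "-"
    return strand, max(fwd, rev) / (fwd + rev)


# Equal support for two references: grouping confidence is 0.5.
ambiguous = [("chr1", 10000, "-"), ("chr2", 10000, "-")]
ref, g_score = grouping_confidence(ambiguous)

# All alignments on the reverse strand: orientation confidence is 1.
strand, o_score = orientation_confidence(ambiguous, ref)
print(g_score, strand, o_score)
```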

By default, RagTag appends unplaced query sequences as-is to the end of the output AGP and FASTA files. Use -C to concatenate all unplaced sequences (with gaps for padding) into a single scaffold called chr0. For gap padding generally, RagTag places 100 bp gaps between adjacent query sequences by default. Invoke -r to infer gap sizes from the alignments. The minimum and maximum inferred gap sizes can be adjusted with -g and -m.
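As an illustration of fixed-size gap padding, the following sketch joins sequences with runs of "N" and records where each sequence lands. It only illustrates the idea behind -C with the default 100 bp gaps; the function name build_chr0 is hypothetical and this is not RagTag's implementation.

```python
def build_chr0(unplaced, gap_size=100):
    """Concatenate (name, sequence) pairs with fixed-size gaps of 'N'.

    Returns the concatenated sequence and a layout list of
    (name, start, end) with 1-based inclusive coordinates.
    """
    parts = []
    layout = []
    pos = 0  # number of bases written so far
    for i, (name, seq) in enumerate(unplaced):
        if i > 0:
            # Pad between adjacent sequences, never before the first one.
            parts.append("N" * gap_size)
            pos += gap_size
        layout.append((name, pos + 1, pos + len(seq)))
        parts.append(seq)
        pos += len(seq)
    return "".join(parts), layout


seq, layout = build_chr0([("ctg1", "ACGT"), ("ctg2", "GG")], gap_size=3)
print(seq)
print(layout)
```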

input/output options

By default, RagTag places all of the scaffolding output and intermediate files in a directory named ragtag_output, but this can be changed with -o. RagTag will not overwrite intermediate files that already exist in the output directory. This is to save time producing expensive alignment files. It also allows users to manually edit files or replace them with custom files. Users can set -w to overwrite any preexisting files.

Use the -u option to add the "_RagTag" suffix to each sequence in the scaffold output, even unplaced query sequences that have not changed. This ensures AGP compatibility with some external programs/databases. If one wants unplaced query sequences to retain their original header, do not use -u.

mapping options

Use -t to set the number of threads Minimap2 uses for mapping. This option does not apply to Nucmer alignments. If the aligner executable is not in one's PATH, or one would like to use Nucmer instead of Minimap2, use the --aligner option to specify the PATH of the appropriate aligner executable. The --mm2-params and --nucmer-params options allow one to specify custom alignment parameters for Minimap2 and Nucmer, respectively.
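For scripted runs, the options above can be assembled into an argument list. The sketch below only builds the command and does not run it. It assumes the entry point is invoked as "ragtag.py scaffold", which is not shown on this page, so adjust it to match your installation; the helper name scaffold_command is hypothetical. Only flags listed in the usage text are used.

```python
import shlex


def scaffold_command(reference, query, out_dir="ragtag_output", threads=1,
                     aligner="minimap2", mm2_params=None, nucmer_params=None):
    """Build an argument list for a scaffolding run.

    The 'ragtag.py scaffold' prefix is an assumption about how the
    tool is invoked; every flag comes from the usage text.
    """
    cmd = ["ragtag.py", "scaffold", "-o", out_dir, "--aligner", aligner]
    if mm2_params is not None:
        # The whole parameter string is passed as one argument.
        cmd += ["--mm2-params", mm2_params]
    if nucmer_params is not None:
        cmd += ["--nucmer-params", nucmer_params]
    cmd += ["-t", str(threads)]
    # Positional arguments: reference first, then query.
    cmd += [reference, query]
    return cmd


cmd = scaffold_command("reference.fa", "query.fa", threads=8,
                       mm2_params="-x asm5")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True). To use Nucmer, pass aligner="nucmer" (or the path to the executable) and keep in mind that -t does not apply to Nucmer alignments.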

Output

All output is in ragtag_output, or whichever directory -o specifies.

ragtag.scaffolds.agp

The ordering and orientations of query sequences in AGP format.

ragtag.scaffolds.fasta

The scaffolds in FASTA format, defined by the ordering and orientations of ragtag.scaffolds.agp. This file is always overwritten with each new ragtag run, regardless of the -w parameter, to ensure that it reflects the latest version of ragtag.scaffolds.agp.
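The relationship between the AGP file and the FASTA file can be sketched as follows. This is an illustrative reading of the standard AGP column layout, not RagTag's own code: it assumes tab-separated lines listed in part order for each object, sequence components of type W, and gap components of type N or U. The function names and the example records are hypothetical.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string, leaving N as N."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def scaffolds_from_agp(agp_lines, query_seqs):
    """Build scaffold sequences from AGP lines and a dict of query sequences."""
    pieces = {}
    for line in agp_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        f = line.rstrip("\n").split("\t")
        obj, comp_type = f[0], f[4]
        if comp_type in ("N", "U"):
            # Gap line: column 6 holds the gap length.
            piece = "N" * int(f[5])
        else:
            # Sequence line: component id, 1-based begin/end, orientation.
            seq = query_seqs[f[5]][int(f[6]) - 1:int(f[7])]
            piece = reverse_complement(seq) if f[8] == "-" else seq
        pieces.setdefault(obj, []).append(piece)
    return {name: "".join(parts) for name, parts in pieces.items()}


agp = [
    "## agp-version 2.1",
    "chr1_RagTag\t1\t4\t1\tW\tctg1\t1\t4\t+",
    "chr1_RagTag\t5\t104\t2\tU\t100\tscaffold\tyes\talign_genus",
    "chr1_RagTag\t105\t107\t3\tW\tctg2\t1\t3\t-",
]
queries = {"ctg1": "ACGT", "ctg2": "AAC"}
scaffolds = scaffolds_from_agp(agp, queries)
print(len(scaffolds["chr1_RagTag"]))
```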

ragtag.scaffolds.stats

Summary statistics for the scaffolding process. "placed_sequences" and "placed_bp" provide the number of query sequences and total query bp localized to one of the reference sequences. "unplaced_sequences" and "unplaced_bp" provide the number of query sequences and total query bp that were left unplaced. "gap_sequences" and "gap_bp" provide the number of gap sequences and total gap bp.

Glossary

AGP

A standard file format defining the order and orientation of query sequences.

alignment/mapping

Here "mapping" refers to obtaining the coordinates of homologous regions between the query and reference sequences. On top of this, "alignment" provides base-level edit information, often encoded in a CIGAR string. RagTag uses Minimap2 mapping by default. Nucmer always produces alignments, and Minimap2 can be directed to produce alignments with the --mm2-params parameters. In the docs, I sometimes use these terms interchangeably.

scaffold

Query sequences, ordered and oriented with gaps between them.

unplaced

This describes query sequences that were not assigned to a reference sequence homolog. These sequences were not "placed" onto any reference sequence. Unplaced query sequences are the same as the original query sequence.

unique (alignments)

There are many ways to define if alignments are "unique". RagTag uses the concept of "unique anchor filtering" first introduced by Nattestad and Schatz, 2016. Each bp in an alignment is unique if it does not overlap any other alignments with respect to the query sequence. Alignments are either entirely composed of unique or non-unique bp, or they have both unique (anchor) and non-unique bp.
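The definition above can be illustrated with a small sweep over query coordinates. This is a sketch of the definition as worded here, not RagTag's implementation; it assumes half-open (start, end) query intervals for the alignments of a single query sequence, and the function name unique_lengths is hypothetical.

```python
def unique_lengths(intervals):
    """Return the number of unique bp for each alignment.

    intervals is a list of half-open (start, end) query coordinates.
    A bp is unique if exactly one alignment covers it.
    """
    # Sweep over interval boundaries, tracking coverage depth.
    # Sorting puts an end (-1) before a start (+1) at the same position,
    # so intervals that merely touch are not treated as overlapping.
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    events.sort()

    single = []  # segments covered by exactly one alignment
    depth = 0
    prev = None
    for pos, delta in events:
        if prev is not None and depth == 1 and pos > prev:
            single.append((prev, pos))
        depth += delta
        prev = pos

    # Unique bp of an alignment = its overlap with the depth-1 segments.
    result = []
    for start, end in intervals:
        total = sum(max(0, min(end, b) - max(start, a)) for a, b in single)
        result.append(total)
    return result


# The first two alignments share 50 bp; the third is entirely unique.
print(unique_lengths([(0, 100), (50, 150), (200, 300)]))
```

An alignment whose unique length equals its total length is entirely unique; one with a unique length of 0 is entirely non-unique; anything in between has both unique (anchor) and non-unique bp.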
