Output
Most commands offer options to specify the output file paths and names, namely --out-dir, --file-prefix, --file-suffix, and --compress. Generally, the command name (such as fst) is used as the base name for the output files, which are then stored at
<out-dir>/[file-prefix]<base-name>[file-suffix].<ext>[.gz]
using the correct extension in each case, as well as .gz if the output is compressed. The prefix and suffix hence make it possible to distinguish the output files for different input files, if given. The output directory defaults to the current directory from where the command is being run.
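The naming pattern above can be sketched in a few lines of Python, for instance to locate output files from a downstream script. This is only an illustration of the pattern, not the tool's own code: the function name `output_path` and its parameters are hypothetical, and the `csv` extension used in the usage example is an assumption, since the actual extension depends on the command.

```python
from pathlib import Path


def output_path(base_name, ext, out_dir=".", file_prefix="", file_suffix="", compress=False):
    """Compose <out-dir>/[file-prefix]<base-name>[file-suffix].<ext>[.gz]."""
    name = f"{file_prefix}{base_name}{file_suffix}.{ext}"
    if compress:
        # Compressed output gets an additional .gz extension.
        name += ".gz"
    return Path(out_dir) / name


# Hypothetical usage: the extension "csv" is assumed here for illustration.
print(output_path("fst", "csv", out_dir="results", file_prefix="pop1-", compress=True).as_posix())
```

With a distinct prefix or suffix per input, several runs can write into the same output directory without overwriting each other.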
At a minimum, we need to write the per-window results of the metric being computed. In addition, by default, we also output statistics (counts) of the results of each filtering step, as this might be useful for downstream normalization steps.
The currently implemented commands write their results in a table format with the following columns:
- chrom, start, end: The chromosome name, and the start and end position of the window for each computed value. Both positions are measured in base pairs, one-based and inclusive.
- total.*: These columns contain counts of the effects of the filtering that is applied to the total of all samples at each position in the window:
  - total.masked: How many loci were masked out by the mask file (if provided)?
  - total.missing: How many loci were missing in the input data?
  - total.empty: How many loci were there where the per-sample filters (see below) caused all samples to be filtered out, so that only an "empty" locus remained?
  - total.numeric: How many loci were filtered out due to any of the numerical filters that are applied to the total (to all samples at the locus)?
  - total.invariant: How many loci (after passing all the above filters) are found to be invariant, i.e., only contain non-zero counts at one nucleotide? This is the number of valid high-quality positions that are not SNPs.
  - total.passed: How many loci passed all filters, and are hence SNPs considered in the computation?
- sample.*: For each sample (or sample pair in case of FST), we output a set of columns with counts of the effects of the filtering that is applied to each of them:
  - sample.missing: How many samples were missing in the input data?
  - sample.numeric: How many samples were filtered out due to any of the numerical per-sample filters?
  - sample.passed: How many samples passed all filters and were hence used in the computation?
  - sample.value: Finally, the actual value that we are interested in, where value is replaced by the name of the specific metric.
The sample part of the per-sample columns is replaced by the sample name, or for FST, by the two sample names of each pair of samples in the format sample_1:sample_2.*. In case of FST, the per-sample missing and numeric columns are incremented for each pair where at least one of the two samples was missing or failed a numeric filter.
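To make the naming scheme concrete, the following sketch builds the header one would expect from the description above. It is an assumption-laden illustration: the helper `expected_columns` is hypothetical, the metric name passed in is whatever the command uses, and the choice and order of sample pairs (here all unordered pairs, in input order) may differ from what a given command actually writes.

```python
from itertools import combinations

# Count columns as listed in this section, in the order they are described.
TOTAL_COUNTS = ["masked", "missing", "empty", "numeric", "invariant", "passed"]
SAMPLE_COUNTS = ["missing", "numeric", "passed"]


def expected_columns(sample_names, metric, pairwise=False):
    """Return the column names implied by the description above (a sketch)."""
    columns = ["chrom", "start", "end"]
    columns += [f"total.{count}" for count in TOTAL_COUNTS]
    if pairwise:
        # For FST-like metrics, each unit is a pair named sample_1:sample_2.
        units = [f"{a}:{b}" for a, b in combinations(sample_names, 2)]
    else:
        units = list(sample_names)
    for unit in units:
        # Per unit: the three count columns, followed by the metric value.
        columns += [f"{unit}.{field}" for field in SAMPLE_COUNTS + [metric]]
    return columns


print(expected_columns(["A", "B"], "fst", pairwise=True))
```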
Details on the column order and counts.
In each case, per sample and for the total, the sum of the respective columns adds up to the number of samples or positions in the window that were processed.
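As a quick sanity check on a result table, the count columns of each group can be summed per window. The sketch below assumes one row of the table has already been parsed into a dictionary of column name to string value; the function name `count_sums`, the example sample name `S1`, the metric name `theta_pi`, and all numbers are invented for illustration.

```python
# Field names of the count columns described in this section.
COUNT_FIELDS = {"masked", "missing", "empty", "numeric", "invariant", "passed"}


def count_sums(row):
    """Sum the count columns per group ("total" or a sample name) for one window."""
    sums = {}
    for column, value in row.items():
        # Split at the last dot, so that sample names containing dots still work.
        group, _, field = column.rpartition(".")
        if group and field in COUNT_FIELDS:
            sums[group] = sums.get(group, 0) + int(value)
    return sums


# Invented example row; real tables contain one such row per window.
row = {
    "chrom": "chr1", "start": "1", "end": "100",
    "total.masked": "10", "total.missing": "5", "total.empty": "0",
    "total.numeric": "5", "total.invariant": "70", "total.passed": "10",
    "S1.missing": "5", "S1.numeric": "15", "S1.passed": "70",
    "S1.theta_pi": "0.01",
}
print(count_sums(row))
```

The metric column itself is skipped, as its field name is not one of the count fields.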
The total columns are in front of the sample columns for convenience, so that the columns that describe the whole position come first. This however differs from the order in which they are applied, as we first filter each sample individually, before then running the total filters.
For each sample, and for the total, filters are applied in the order in which their respective options are listed in the commands. This order, as well as all additional filter steps that do not correspond to user-provided options (such as total.empty), is reflected in the column order here as well. For instance, we first apply the mask, then check for missing data, then for empty samples, then apply the numeric filters. If a position fails one of those steps, the respective counter in the column is incremented, and the position does not go through any additional filter steps. Hence, per position and per sample, exactly one total or sample column value is incremented.
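The "first failing step wins" logic described above can be sketched as follows. This is a conceptual illustration, not the actual implementation: positions are modeled as plain dictionaries with hypothetical boolean flags, and `tally_position` is an invented helper.

```python
from collections import Counter


def tally_position(position, filters, counts, passed_label="passed"):
    """Increment exactly one counter: the first step that applies, or 'passed'."""
    for label, applies in filters:
        if applies(position):
            counts[label] += 1
            return label  # Stop here; later steps are not evaluated.
    counts[passed_label] += 1
    return passed_label


# Steps in the order of the total.* columns. The dictionary keys are hypothetical.
total_filters = [
    ("masked", lambda p: p.get("masked", False)),
    ("missing", lambda p: p.get("missing", False)),
    ("empty", lambda p: p.get("empty", False)),
    ("numeric", lambda p: p.get("fails_numeric", False)),
    ("invariant", lambda p: p.get("invariant", False)),
]

positions = [
    {"masked": True, "missing": True},  # Counted as masked only.
    {"missing": True},
    {"invariant": True},
    {},  # Passes everything, i.e., a SNP used in the computation.
]
counts = Counter()
for position in positions:
    tally_position(position, total_filters, counts)
print(dict(counts))
```

Because each position increments exactly one counter, the counters always add up to the number of positions that were processed.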
This is a lot of output, but it is necessary to capture the full detail of the computation, and allows, for instance, custom downstream normalization steps as needed. The count columns can however be deactivated with --no-extra-columns if they are not needed for downstream steps. In that case, only the chrom, start, and end columns, as well as the per-sample (or per-sample-pair) value columns, are written.
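If a full table has already been written, a similar reduction can be approximated after the fact. The sketch below drops every column whose name ends in one of the count fields described above; it assumes the table has been parsed into a header list and row lists, and the helper names are hypothetical. It is a downstream convenience, not a description of how the option itself works.

```python
# Field names of the count columns described in this section.
COUNT_FIELDS = {"masked", "missing", "empty", "numeric", "invariant", "passed"}


def is_count_column(column):
    """True for names such as total.masked or S1.passed, False for chrom or S1.fst."""
    group, _, field = column.rpartition(".")
    return bool(group) and field in COUNT_FIELDS


def drop_count_columns(header, rows):
    """Keep chrom, start, end, and the value columns; drop all count columns."""
    keep = [i for i, column in enumerate(header) if not is_count_column(column)]
    return [header[i] for i in keep], [[row[i] for i in keep] for row in rows]


# Invented example with a single sample pair and an assumed metric name "fst".
header = ["chrom", "start", "end", "total.passed", "A:B.missing", "A:B.fst"]
rows = [["chr1", "1", "100", "10", "0", "0.12"]]
print(drop_count_columns(header, rows))
```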