Lucas Czech edited this page Jul 16, 2024 · 2 revisions

Output files

Most commands offer options to specify the output file paths and names, namely --out-dir, --file-prefix, --file-suffix, and --compress. Generally, the command name (such as fst) is used as the base name for the output files, which are then stored at

<out-dir>/[file-prefix]<base-name>[file-suffix].<ext>[.gz]

using the correct extension in each case, as well as .gz if the output is compressed. The prefix and suffix hence make it possible to distinguish the output files for different input files, if given. The output directory defaults to the current directory from which the command is being run.
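As an illustration of the naming scheme above, the following Python sketch assembles an output path from its parts. The function `output_path` and its parameter names are hypothetical and only mirror the options described here; the extension `csv` used in the example is an assumption, since the actual extension depends on the command.

```python
from pathlib import Path


def output_path(base_name, ext, out_dir=".", file_prefix="", file_suffix="", compress=False):
    """Build <out-dir>/[file-prefix]<base-name>[file-suffix].<ext>[.gz].

    Hypothetical helper that mirrors the pattern in the text; it is not
    part of the tool itself.
    """
    name = f"{file_prefix}{base_name}{file_suffix}.{ext}"
    if compress:
        # Compressed output gets an additional .gz extension.
        name += ".gz"
    return Path(out_dir) / name


# Example with made-up values; "csv" is an assumed extension.
print(output_path("fst", "csv", out_dir="results", file_prefix="pop1-", compress=True))
```

With the default arguments, the file is simply placed in the current directory under the base name of the command.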

Output table columns

At the minimum, we need to write the per-window results of the metric being computed. In addition, by default, we also output statistics (counts) of the results of each filtering step, as this might be useful for downstream normalization steps.

The currently implemented commands write their results in a table format with the following columns:

  • chrom, start, end: The chromosome name, and the start and end position of the window for each computed value. Both loci are measured in base pairs, one-based and inclusive.
  • total.*: These columns contain counts of the effects of the filtering that is applied to the total of all samples at each position in the window:
    • total.masked: How many loci were masked out by the mask file (if provided)?
    • total.missing: How many loci were marked as missing in the input data?
    • total.empty: How many loci were there where the per-sample filters (see below) caused all samples to be filtered out, so that only an "empty" locus remained?
    • total.numeric: How many loci were filtered out due to any of the numerical filters that are applied to the total (to all samples at the locus)?
    • total.invariant: How many loci (after passing all the above filters) are found to be invariant, i.e., only contain non-zero counts at one nucleotide? This is the number of valid high-quality positions that are not SNPs.
    • total.passed: How many loci passed all filters, and are hence SNPs considered in the computation?
  • sample.*: For each sample (or sample pair in case of FST), we output a set of columns with counts of the effects of the filtering that is applied to each of them:
    • sample.missing: How many loci across the sample were marked as missing in the input data?
    • sample.numeric: How many samples were filtered out due to any of the numerical per-sample filters?
    • sample.passed: How many samples passed all filters and were hence used in the computation?
    • sample.value: Finally, the actual value that we are interested in, where value is replaced by the name of the specific metric.

The sample part of the per-sample columns is replaced by the sample name, or for FST, by the two sample names of each pair of samples in the format sample_1:sample_2.*. In case of FST, the per-sample missing and numeric columns are incremented for each pair where at least one of the two samples was missing or failed a numeric filter.
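The resulting header can be sketched as follows. This is a rough illustration under several assumptions: the helper `expected_columns` is hypothetical, the lowercase metric name `fst` is a guess at how the value column is labeled, and enumerating all sample pairs is assumed rather than confirmed by the text above.

```python
from itertools import combinations

# Count columns as listed above, in the documented order.
TOTAL_FIELDS = ["masked", "missing", "empty", "numeric", "invariant", "passed"]
SAMPLE_FIELDS = ["missing", "numeric", "passed"]


def expected_columns(sample_names, metric, pairwise=False):
    """Return the column names described above (hypothetical helper).

    `metric` stands in for the name of the computed value; its exact
    spelling in real output is an assumption here.
    """
    columns = ["chrom", "start", "end"]
    columns += [f"total.{field}" for field in TOTAL_FIELDS]
    if pairwise:
        # For FST, each unit is a pair in the format sample_1:sample_2.
        units = [f"{a}:{b}" for a, b in combinations(sample_names, 2)]
    else:
        units = list(sample_names)
    for unit in units:
        columns += [f"{unit}.{field}" for field in SAMPLE_FIELDS + [metric]]
    return columns


# Made-up sample names for illustration.
print(expected_columns(["S1", "S2"], "fst", pairwise=True))
```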

Details on the column order and counts

For the total and for each sample, the respective count columns sum to the number of positions or samples in the window that were processed. The total columns are placed before the sample columns for convenience, so that the columns describing the whole position come first. This differs, however, from the order in which the filters are applied: each sample is filtered individually first, and the total filters are run afterwards.
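The sum over the total columns can be computed per window with a few lines of Python. This is a minimal sketch: the comma delimiter, the helper names, and the example table are assumptions made up for illustration and do not come from real output.

```python
import csv
import io

TOTAL_COLUMNS = [
    "total.masked", "total.missing", "total.empty",
    "total.numeric", "total.invariant", "total.passed",
]


def processed_positions(row):
    """Sum the total.* counts of one table row (hypothetical helper)."""
    return sum(int(row[column]) for column in TOTAL_COLUMNS)


def window_summaries(table_text, delimiter=","):
    """Return (chrom, start, end, processed positions) for each window."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter=delimiter)
    return [
        (row["chrom"], int(row["start"]), int(row["end"]), processed_positions(row))
        for row in reader
    ]


# Made-up example table, not real output of the tool.
example = (
    "chrom,start,end,total.masked,total.missing,total.empty,"
    "total.numeric,total.invariant,total.passed\n"
    "chr1,1,100,5,10,0,3,80,2\n"
)
print(window_summaries(example))
```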

For each sample, and for the total, filters are applied in the order in which their respective options are listed in the commands. This order, together with all additional filter steps that do not correspond to user-provided options (such as total.empty), is reflected in the column order here as well. For instance, we first apply the mask, then check for missing data, then for empty samples, then apply the numeric filters. If a position fails one of those steps, the respective counter in the column is incremented, and the position does not go through any additional filter steps. Hence, per position and per sample, exactly one total or sample column value is incremented.
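The "first failing step wins" counting described above can be sketched in Python. The representation is hypothetical: each position is given as the set of total filter steps it would fail, and `tally_positions` is an illustrative helper rather than the actual implementation.

```python
from collections import Counter

# Order of the total filter steps, as reflected in the column order.
TOTAL_FILTER_ORDER = ["masked", "missing", "empty", "numeric", "invariant"]


def tally_positions(positions):
    """Count each position once, under the first step it fails.

    `positions` is an iterable of sets naming the steps a position would
    fail (an assumed representation for this sketch). Positions that fail
    no step are counted as passed.
    """
    counts = Counter()
    for failed_steps in positions:
        for step in TOTAL_FILTER_ORDER:
            if step in failed_steps:
                counts[step] += 1
                break  # No further filter steps are applied.
        else:
            counts["passed"] += 1
    return counts


# Made-up positions for illustration.
positions = [{"masked", "missing"}, set(), {"numeric"}, {"invariant"}, {"empty", "numeric"}]
counts = tally_positions(positions)
print(dict(counts))
```

Because every position increments exactly one counter, the counts always sum to the number of positions that were processed.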

Note that the missing column by default counts the number of loci that are explicitly marked as missing in the file formats that support this. By default, it does not count the number of positions for which there is no data at all (simply absent in the input), as we do not know if those are truly missing, or just omitted. If you want those to be counted as well in the missing columns, use the option --make-gapless, which fills in all positions for which there is no data with a "missing" dummy.

This is a lot of output, but it is necessary to capture the full detail of the computation. It allows, for instance, custom downstream normalization steps as needed. The count columns can however be deactivated with --no-extra-columns if they are not needed for downstream steps. In that case, only the chrom, start, and end columns, as well as the per-sample (or per-sample-pair) value columns, are written.
