-
Notifications
You must be signed in to change notification settings - Fork 4
Subcommand: afs heatmap
Lucas Czech edited this page Apr 10, 2022
·
22 revisions
Create a per-window heatmap of the allele frequency spectrum along the genome.
Usage: grenedalf afs-heatmap [options]
Input SAM/BAM/CRAM | |
---|---|
--sam-file |
TEXT:FILE Excludes: --pileup-file --sync-file --vcf-file Path to a sam/bam/cram file. |
--min-map-qual |
UINT:UINT in [0 - 90]=0 Needs: --sam-file Minimum phred-scaled mapping quality score [0-90] for a read in sam/bam/cram files to be considered. Any read that is below the given value of mapping quality will be completely discarded, and its bases not taken into account. Default is 0, meaning no filtering by base quality qual. |
--min-base-qual |
UINT:UINT in [0 - 90]=0 Needs: --sam-file Minimum phred-scaled quality score [0-90] for a base in sam/bam/cram files to be considered. Bases below this are ignored when computing allele frequencies. Default is 0, meaning no filtering by base quality qual. |
--split-by-rg |
Needs: --sam-file Instead of considering the whole sam/bam/cram file as one large colletion of reads, use the @RG read group tag to split reads. Each read group is then considered a sample. Reads with an invalid (not in the header) read group tag or without a tag are ignored. |
Input (m)pileup | |
--pileup-file |
TEXT:FILE Excludes: --sam-file --sync-file --vcf-file Path to an (m)pileup file. |
--quality-encoding |
TEXT:{sanger,illumina-1.3,illumina-1.5,illumina-1.8,solexa}=sanger Needs: --pileup-file Encoding of the quality scores of the bases in (m)pileup files. Default is "sanger" , which seems to be the most common these days. Both "sanger" and "illumina-1.8" are identical and use an ASCII offset of 33, while "illumina-1.3" and "illumina-1.5" are identical with an ASCII offset of 64 (we provide different names for completeness). Lastly, "solexa" has an offset of 64, but uses a different equation (not phred score) for the encoding. |
--min-phred-score |
UINT:UINT in [0 - 90]=0 Needs: --pileup-file Minimum phred quality score [0-90] for a base in (m)pileup files to be considered. Bases below this are ignored when computing allele frequencies. Default is 0, meaning no filtering by phred quality score. |
Input sync | |
--sync-file |
TEXT:FILE Excludes: --sam-file --pileup-file --vcf-file Path to a sync file, as specified by PoPoolation2. |
Input VCF | |
--vcf-file |
TEXT:FILE Excludes: --sam-file --pileup-file --sync-file Path to a VCF file with per-sample AD (alleleic depth) fields. This assumes that the data that was used to create the VCF file was actually a pool of individuals (e.g., from pool sequencing) for each sample (column) of the VCF file. This expects that the VCF FORMAT field AD is given and contains the counts of the reference and alternative base, which in this context can be interpreted as describing the allele frequencines of each pool of individuals. |
Sample Names | |
--sample-name-list |
TEXT Excludes: --sample-name-prefix Some file types do not contain sample names, such as (m)pileup or sync files. For such file types, sample names can here be provided as either (1) a comma- or tab-separated list, or (2) as a file with one sample name per line, in the same order as samples are in the actual input file. We then use these names in the output and the --filter-samples-include and --filter-samples-exclude options. If not provided, we simply use numbers 1..n as sample names for these files types. Alternatively, use --sample-name-prefix to provide a prefix for this sample numbering. |
--sample-name-prefix |
TEXT Excludes: --sample-name-list Some file types do not contain sample names, such as (m)pileup or sync files. For such file types, this prefix followed by indices 1..n can be used instead to provide unique names per sample that we use in the output and the --filter-samples-include and --filter-samples-exclude options. For example, use "Sample_" as a prefix. If not provided, we simply use numbers 1..n as sample names for these files types. Alternatively, use --sample-name-list to directly provide a list of sample names. |
Filtering | |
--filter-region |
TEXT Genomic region to filter for, in the format "chr", "chr:position", "chr:start-end", or "chr:start..end". |
--filter-region-bed |
TEXT:FILE Genomic regions to filter for, as a BED file. In its simplest form, this file may contain a list of regions, one per line, with tab-separated chromosome and start and end positions. Note that BED uses 0-based positions, and a half-open [) interval for the end position. |
--filter-samples-include |
TEXT Excludes: --filter-samples-exclude Sample names to include (all other samples are excluded); either (1) a comma- or tab-separated list, or (2) a file with one sample name per line. If no sample filter is provided, all samples in the input file are used. The option considers --sample-name-list or --sample-name-prefix for file types that do not contain sample names. |
--filter-samples-exclude |
TEXT Excludes: --filter-samples-include Sample names to exclude (all other samples are included); either (1) a comma- or tab-separated list, or (2) a file with one sample name per line. If no sample filter is provided, all samples in the input file are used. The option considers --sample-name-list or --sample-name-prefix for file types that do not contain sample names. |
Window | |
--window-width |
Required. UINT=0 Width of each window along the chromosome. |
--window-stride |
UINT=0 Stride between windows along the chromosome, that is how far to move to get to the next window. If set to 0 (default), this is set to the same value as the --window-width . |
Settings | |
--resolution |
UINT:POSITIVE=100 Resolution of the spectrum histogram, that is, the number of bins of frequencies. This is hence also the vertical resolution (in pixels) of the resulting images. |
--max-frequency |
FLOAT:(FLOAT in [0 - 1]) AND (POSITIVE)=1 Maximum frequency, that is, the y-axis cutoff; default is 1.0. Frequencies above the maximum will be assinged to the highest bin. When using --spectrum-type folded , consider setting this option to 0.5, as that is the maximum of the folded spectrum. |
--spectrum-type |
TEXT:{folded,unfolded}=unfolded Type of spectrum to compute. That is, which frequencies do the values in the heat map represent: Either the frequency of the alternative allele (unfolded, default; uses the highest count/frequency allele that is not the reference allele as given in the input file, with frequencies in range [0, 1]), or the frequency of the minor allele (folded; uses the allele with the second highest count/frequency after the major (most common) allele, with frequencies in range [0, 0.5]). |
--average-method |
TEXT:{arithmetic,geometric,harmonic,counts}=harmonic If multiple samples are used (present in the file, and not filtered out), either compute the average of the frequencies per sample (using either arithmetic, geometric, or harmonic mean), or sum up the allele counts per sample, and then compute the frequency from that. The former gives each sample the same weight, while in the latter case, samples with more counts (more sequence reads) have a higher influence. If harmonic mean is used, we apply a correction for frequencies of zero, which happens in samples that do not have any alternative/minor allele counts. |
--fold-undetermined-positions |
Only relevant if --spectrum-type unfolded is set. By default, positions with undetermined reference alleles ('N') are skipped in the unfolded spectrum. If however this option is set, these positions are instead folded; that is, we then assume the major allele with the highest count/frequency to be the reference allele. |
--write-individual-bmps |
If set, write individual heatmap files for each chromosome, in bitmap format, in addition to the svg heatmap that contains all (not filtered) chromosomes. |
Output | |
--out-dir |
TEXT=. Directory to write files to |
--file-prefix |
TEXT File prefix for output files. Most grenedalf commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
--file-suffix |
TEXT File suffix for output files. Most grenedalf commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
--compress |
If set, compress the output files using gzip. Output file extensions are automatically extended by .gz . |
Global Options | |
--allow-file-overwriting |
Allow to overwrite existing output files instead of aborting the command. |
--verbose |
Produce more verbose output. |
--threads |
UINT Number of threads to use for calculations. |
--log-file |
TEXT Write all output to a log file, in addition to standard output to the terminal. |
When using this method, please do not forget to cite
Lucas Czech, Moises Exposito-Alonso. Grenedalf: Genome Analyses of Differential Allele Frequencies. Manuscript in preparation, 2021. doi: