To run genotyping, we recommend creating a conda environment using the following commands:
mamba env create -f env.yaml
conda activate natera_genotyping
This environment should contain all of the software needed to run both genotyping and imputation of these samples.
Several values in config.yaml are important for the initial genotyping routines. We detail some of them below:
- parents: boolean indicating whether you are genotyping only parents or children separately
- opticall_exec: location of the optiCall executable
- metadata: merged metadata file (must have columns named "array" and "family_position")
- snp_map: SNP map from Natera
- dbsnp_af_tsv: allele frequency filter from dbSNP for genotyping (to avoid rare PGT-M variants that are difficult to genotype)
- chunks_file: CSV file giving the number of chunks to split each chromosome into (can be left as "")
- nchunks: default number of chunks to split a chromosome into (default: 20)
- alleles_file: allele update file from https://www.well.ox.ac.uk/~wrayner/strand/ABtoTOPstrand.html (note: to preserve continuity, we do not recommend altering these files)
- strand_file: file indicating the appropriate strand of alleles relative to the reference (TOP) strand
- strand_refalt: mapping of REF/ALT to strand for allele flips
- hg38_fasta: reference FASTA from UCSC "/data/rmccoy22/resources/references/GRCh38/primary_assembly/hg38.fa.gz"
- hg37_hg38_chain: chain file for liftover from hg37 -> hg38
- hg37_to_hg38_chroms: simple list going from numeric chromosomes to "chr{i}" notation
- opticall_csvs:
  - 2012: file that lists all of the possible CSV files (CSVs that contain x,y,b) for each POC array sample
To generate the CSV files, we used the scripts in utils/ to combine .xy and .b files in a coherent way (note that you may have to change the --snp argument location).
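Before launching the workflow, it can help to sanity-check two of the inputs described above. The following is a minimal Python sketch, not part of the repository: the helper names are hypothetical, the metadata file is assumed to be tab-delimited with a header row, and the chromosome map is assumed to be a two-column, tab-separated list (old name, new name) covering chromosomes 1-22, X, and Y. Adjust these assumptions to match your actual files.

```python
import csv
from pathlib import Path

# Columns the metadata file must contain, per the config description.
REQUIRED_COLUMNS = {"array", "family_position"}


def missing_metadata_columns(path, delimiter="\t"):
    """Return the required columns absent from the metadata header.

    Assumption: the first row of the file is a header and fields are
    separated by `delimiter` (tab by default).
    """
    with open(path, newline="") as handle:
        header = next(csv.reader(handle, delimiter=delimiter), [])
    return sorted(REQUIRED_COLUMNS - set(header))


def write_chrom_map(path, chroms=None):
    """Write one 'old<TAB>new' line per chromosome, e.g. '1<TAB>chr1'.

    Assumption: the map covers chromosomes 1-22, X, and Y unless a
    different list is passed in `chroms`.
    """
    if chroms is None:
        chroms = [str(i) for i in range(1, 23)] + ["X", "Y"]
    lines = [f"{c}\tchr{c}" for c in chroms]
    Path(path).write_text("\n".join(lines) + "\n")
    return lines
```

For example, missing_metadata_columns("metadata.tsv") returns an empty list when both required columns are present (the file name here is illustrative).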
Following appropriate configuration, we can run the following rule to fully genotype all samples:
snakemake run_genotyping -j24 --cores 48 -p -n
Remove the -n (dry-run) flag to actually run this rule. This will generate a set of 25 VCF files (all autosomes + chrX + chrY polymorphisms) from the HumanCytoSNP v12 array.
After all the appropriate samples have been properly genotyped, you can run imputation (on the autosomes + chromosome X) using:
snakemake run_imputation -j24 --cores 48 -p -n
This will use the autosomal and chrX reference panels specified via a manifest in the config.yaml file. We recommend using the combined HGDP+1KG reference panel.
For reference panel construction we used the following commands:
gsutil -m cp gs://gcp-public-data--gnomad/resources/hgdp_1kg/phased_haplotypes_v2/*.bcf* .
gdown 1RJ-HsjiWC-qX8oJkxkoBxEOW_2XtzfXa
bcftools view <bcf_file> | bgzip -@10 > <vcf_file>; tabix -f <vcf_file>
To generate the full chrX reference panel, we used:
bcftools concat hgdp1kgp_chrX_par1.shapeit5_common.bcf hgdp1kgp_chrX_non_par.full.shapeit5_rare.bcf hgdp1kgp_chrX_par2.shapeit5_common.bcf | bgzip -@12 > hgdp1kgp_chrX.filtered.SNV_INDEL.phased.shapeit5.vcf.gz; tabix -f hgdp1kgp_chrX.filtered.SNV_INDEL.phased.shapeit5.vcf.gz
This keeps the chrX file name consistent with the nomenclature of the rest of the reference panel.
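The BCF-to-VCF conversion command above uses the placeholders <bcf_file> and <vcf_file>, so it has to be repeated for every downloaded file. A minimal Python sketch of one way to generate those commands follows. The function name is hypothetical, and the output naming (replacing the .bcf suffix with .vcf.gz) is an assumption rather than something the workflow is known to require; the command string itself mirrors the one shown above.

```python
import shlex
from pathlib import Path


def bcf_to_vcf_command(bcf_path, threads=10):
    """Build the shell pipeline that converts one BCF to an indexed .vcf.gz."""
    bcf = Path(bcf_path)
    # Assumed naming convention: panel.bcf -> panel.vcf.gz
    vcf = bcf.with_suffix(".vcf.gz")
    # Quote paths so unusual characters do not break the shell command.
    b, v = shlex.quote(str(bcf)), shlex.quote(str(vcf))
    return f"bcftools view {b} | bgzip -@{threads} > {v}; tabix -f {v}"


if __name__ == "__main__":
    # Print one command per BCF in the current directory; index files
    # such as *.bcf.csi are skipped because they do not end in .bcf.
    for bcf in sorted(Path(".").glob("*.bcf")):
        print(bcf_to_vcf_command(bcf))
```

The script only prints the commands, so they can be reviewed before being run in a shell.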
For any questions, please submit an issue or contact @aabiddanda.