mccoy-lab/natera_genotyping

natera_genotyping

Overview

Environment

To run genotyping, we recommend creating a conda environment with the following commands:

mamba env create -f env.yaml
conda activate natera_genotyping

This environment should contain all of the software needed to run both genotyping and imputation of these samples.

Configuration

There are several values in config.yaml that are important for the initial genotyping routines. We detail some of them below:

  • parents: boolean indicating whether you are genotyping parents only or children separately

  • opticall_exec: location of optiCall for execution

  • metadata: merged metadata file (needs to have a column called "array" and "family_position")

  • snp_map: SNP Map from Natera

  • dbsnp_af_tsv: allele frequency filter from dbSNP for genotyping (to avoid rare PGT-M variants that are difficult to genotype)

  • chunks_file: CSV file giving the number of chunks to split each chromosome into (can be left as "")

  • nchunks: default number of chunks to split a chromosome into (default: 20)

  • alleles_file: Allele update file from https://www.well.ox.ac.uk/~wrayner/strand/ABtoTOPstrand.html (Note: to preserve continuity, we do not recommend altering these files)

  • strand_file: File indicating appropriate strand of alleles to reference (TOP) strand

  • strand_refalt: Mapping REF/ALT to strand for allele flips

  • hg38_fasta: reference FASTA from UCSC (e.g. "/data/rmccoy22/resources/references/GRCh38/primary_assembly/hg38.fa.gz")

  • hg37_hg38_chain: chain file to liftover from hg37 -> hg38

  • hg37_to_hg38_chroms: simple list going from numeric chromosomes to "chr{i}" notation

  • opticall_csvs:

    • 2012: File that lists all of the possible CSV files (csvs that contain x,y,b) for each POC array sample
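
Two of the entries above are easy to sanity-check or generate before running the workflow. The sketch below is not part of the repository: the function names are hypothetical, the metadata delimiter (tab or comma) is an assumption, and the two-column "old new" layout of the chromosome map is an assumption borrowed from the format accepted by bcftools annotate --rename-chrs. Only the autosomes are written, because how chrX and chrY are coded in the input (for example 23/24 versus X/Y) is not stated here.

```shell
# Sketch only: the delimiter and the map layout are assumptions,
# not taken from the workflow itself.

# Return 0 if the header line of a metadata file contains both required
# columns ("array" and "family_position"), 1 otherwise.
# Assumes a one-line header with tab- or comma-separated fields.
check_metadata_columns() {
    header="$(head -n 1 "$1" | tr -d '\r"' | tr ',\t' '\n\n')"
    missing=0
    for col in array family_position; do
        if ! printf '%s\n' "$header" | grep -qxF "$col"; then
            echo "missing required column: $col" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Print a two-column map from numeric autosome names to "chr{i}" names,
# one tab-separated pair per line (1 -> chr1, ..., 22 -> chr22).
write_autosome_map() {
    i=1
    while [ "$i" -le 22 ]; do
        printf '%s\tchr%s\n' "$i" "$i"
        i=$((i + 1))
    done
}

# Example usage (file names are placeholders):
#   check_metadata_columns metadata.tsv && echo "metadata OK"
#   write_autosome_map > hg37_to_hg38_chroms.txt
```

If the workflow expects a different layout for hg37_to_hg38_chroms, only the printf format string needs to change.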

To generate the CSV files, we used the scripts in utils/ to combine the .xy and .b files consistently (note that you may have to change the location passed to the --snp argument).

Genotyping

Following appropriate configuration, we can run the following rule to fully genotype all samples:

snakemake run_genotyping -j24 --cores 48 -p -n 

Remove the -n flag (dry run) to actually run this rule. This will generate a set of 25 VCF files (all autosomes + chrX + chrY polymorphisms) from the HumanCytoSNP v12 array.
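
One way to make the dry-run step routine is a small wrapper that runs a rule with -n first and launches the real run only if the dry run succeeds. This is a minimal sketch, not part of the repository: the function name dry_then_run is hypothetical, and the -j24 --cores 48 -p flags are simply copied from the command above.

```shell
# Sketch only: dry_then_run is a hypothetical helper.
# Usage: dry_then_run <rule>
dry_then_run() {
    rule="$1"
    # -n performs a dry run: Snakemake resolves the job graph and prints
    # the planned jobs (with -p, their shell commands) without executing.
    snakemake "$rule" -j24 --cores 48 -p -n || return 1
    # The dry run succeeded, so launch the real run.
    snakemake "$rule" -j24 --cores 48 -p
}

# Example usage:
#   dry_then_run run_genotyping
```

Because the rule name is an argument, the same wrapper can be used for any other rule in the workflow.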

Imputation

After all the appropriate samples have been properly genotyped, you can run imputation (on the autosomes + chromosome X) using:

snakemake run_imputation -j24 --cores 48 -p -n

This will use the autosomal and chrX reference panels specified via a manifest in the config.yaml file. We recommend using the combined HGDP+1KG reference panel.

For reference panel construction we used the following commands:

gsutil -m cp gs://gcp-public-data--gnomad/resources/hgdp_1kg/phased_haplotypes_v2/*.bcf* .
gdown 1RJ-HsjiWC-qX8oJkxkoBxEOW_2XtzfXa
bcftools view <bcf_file> | bgzip -@10 > <vcf_file>; tabix -f <vcf_file>
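
The conversion command above uses placeholders for a single file. A loop over every downloaded BCF might look like the sketch below. The function name convert_bcfs and the output naming (replacing the .bcf suffix with .vcf.gz) are assumptions, so the resulting names may need adjusting to match whatever the reference panel manifest expects.

```shell
# Sketch only: convert_bcfs and its output naming are assumptions.
# Usage: convert_bcfs [directory] [bgzip threads]
convert_bcfs() {
    dir="${1:-.}"
    threads="${2:-10}"
    for bcf in "$dir"/*.bcf; do
        # If nothing matches, the glob stays literal; skip that case.
        [ -e "$bcf" ] || continue
        vcf="${bcf%.bcf}.vcf.gz"
        # Same steps as the single-file command: decode the BCF,
        # bgzip-compress it, then (re)build the tabix index.
        bcftools view "$bcf" | bgzip -@ "$threads" > "$vcf" || return 1
        tabix -f "$vcf" || return 1
    done
}

# Example usage, run in the directory holding the downloaded files:
#   convert_bcfs . 10
```

The *.bcf glob deliberately skips the index files that the download's *.bcf* pattern also fetches.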

and to generate the full chrX reference panel, we used:

bcftools concat hgdp1kgp_chrX_par1.shapeit5_common.bcf hgdp1kgp_chrX_non_par.full.shapeit5_rare.bcf hgdp1kgp_chrX_par2.shapeit5_common.bcf | bgzip -@12 > hgdp1kgp_chrX.filtered.SNV_INDEL.phased.shapeit5.vcf.gz; tabix -f hgdp1kgp_chrX.filtered.SNV_INDEL.phased.shapeit5.vcf.gz

Writing the concatenated panel to this file name keeps the nomenclature of the reference panel consistent.

Contact

For any questions, please submit an issue or contact @aabiddanda.

About

Genotyping + imputation workflow for parental data in Natera Spectrum + Anora datasets
