natera_genotyping

Overview

Environment

To run genotyping, we recommend creating a conda environment as follows:

mamba env create -f env.yaml
conda activate natera_genotyping

This environment should include all the software needed to run both genotyping and imputation of these samples.

Configuration

There are several values in config.yaml that are important for the initial genotyping routines. We detail some of them below:

parents: boolean to indicate if you are only genotyping parents or children separately
opticall_exec: location of optiCall for execution
metadata: merged metadata file (must contain columns named "array" and "family_position")
snp_map: SNP Map from Natera
dbsnp_af_tsv: allele frequency filter from dbSNP for genotyping (to avoid rare PGT-M variants that are difficult to genotype)
chunks_file: csv file of number of chunks to split each chromosome into (can be left as "")
nchunks: default number of chunks to split a chromosome into (default: 20)
alleles_file: Allele update file from https://www.well.ox.ac.uk/~wrayner/strand/ABtoTOPstrand.html (Note: to preserve continuity, we do not recommend altering these files)
strand_file: File indicating appropriate strand of alleles to reference (TOP) strand
strand_refalt: Mapping REF/ALT to strand for allele flips
hg38_fasta: reference FASTA from UCSC (e.g. "/data/rmccoy22/resources/references/GRCh38/primary_assembly/hg38.fa.gz")
hg37_hg38_chain: chain file to liftover from hg37 -> hg38
hg37_to_hg38_chroms: simple list going from numeric chromosomes to "chr{i}" notation
opticall_csvs:
- 2012: File that lists all of the possible CSV files (csvs that contain x,y,b) for each POC array sample
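
Because the metadata file must contain the "array" and "family_position" columns, it can help to check the header before launching the workflow. The following is a minimal sketch, not part of the repository: the function name check_metadata_columns is hypothetical, and it assumes the first line of the file is a tab- or comma-delimited header.

```shell
# Hypothetical helper (not part of this repository): check that the header
# of a metadata file contains the columns named in the configuration notes.
# Assumption: the first line is a header delimited by tabs or commas.
check_metadata_columns() {
  # $1: path to the metadata file
  # Put one column name per line; strip carriage returns and double quotes.
  _header="$(head -n 1 "$1" | tr -d '\r"' | tr ',\t' '\n\n')"
  _status=0
  for _col in array family_position; do
    # grep -x requires the whole line to match the column name exactly.
    if ! printf '%s\n' "$_header" | grep -qx -- "$_col"; then
      echo "missing required column: $_col" >&2
      _status=1
    fi
  done
  return "$_status"
}

# Usage (the path is a placeholder):
#   check_metadata_columns path/to/metadata.tsv && echo "metadata header OK"
```

The function returns a non-zero status and names each missing column on stderr, so it can be chained in front of a snakemake call.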

To generate the CSV files, we used the scripts in utils/ to combine the .xy and .b files in a coherent way (note that you may have to change the location passed to the --snp argument).

Genotyping

Following appropriate configuration, we can run the following rule to fully genotype all samples:

snakemake run_genotyping -j24 --cores 48 -p -n

The -n flag requests a dry run; remove it to actually run the rule. This will generate a set of 25 VCF files (all autosomes + chrX + chrY polymorphisms) from the HumanCytoSNP v12 array.

Imputation

After all the appropriate samples have been properly genotyped, you can run imputation (on the autosomes + chromosome X) using:

snakemake run_imputation -j24 --cores 48 -p -n

This will use the autosomal and chrX reference panels specified in config.yaml via a manifest. We recommend using the combined HGDP+1KG reference panel.

For reference panel construction we used the following commands:

gsutil -m cp gs://gcp-public-data--gnomad/resources/hgdp_1kg/phased_haplotypes_v2/*.bcf* .
gdown 1RJ-HsjiWC-qX8oJkxkoBxEOW_2XtzfXa
bcftools view <bcf_file> | bgzip -@10 > <vcf_file>; tabix -f <vcf_file>
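
The conversion command above uses <bcf_file> and <vcf_file> placeholders, so it has to be repeated for each downloaded file. One way to do that is a loop like the sketch below. The function names are hypothetical, and deriving the output name by swapping the .bcf suffix for .vcf.gz is an assumption rather than something the commands above specify.

```shell
# Hypothetical helpers (not part of this repository).

# Derive an output name by swapping the .bcf suffix for .vcf.gz.
# This naming convention is an assumption.
vcf_name_for() {
  printf '%s\n' "${1%.bcf}.vcf.gz"
}

# Convert every .bcf file in the current directory with the same
# bcftools | bgzip | tabix commands shown above.
convert_all_bcfs() {
  for _bcf in *.bcf; do
    # When nothing matches, the pattern stays literal; skip it.
    [ -e "$_bcf" ] || continue
    _vcf="$(vcf_name_for "$_bcf")"
    # Index only if the compression step exited successfully.
    bcftools view "$_bcf" | bgzip -@10 > "$_vcf" && tabix -f "$_vcf"
  done
}

# Usage, from the directory holding the downloaded files:
#   convert_all_bcfs
```

The *.bcf pattern matches only names ending in .bcf, so any index files downloaded alongside the panels are left untouched.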

and to generate the full chrX reference panel, we used:

bcftools concat hgdp1kgp_chrX_par1.shapeit5_common.bcf hgdp1kgp_chrX_non_par.full.shapeit5_rare.bcf hgdp1kgp_chrX_par2.shapeit5_common.bcf | bgzip -@12 > hgdp1kgp_chrX.filtered.SNV_INDEL.phased.shapeit5.vcf.gz; tabix -f hgdp1kgp_chrX.filtered.SNV_INDEL.phased.shapeit5.vcf.gz

which helps keep the same nomenclature for the reference panel.

Contact

For any questions, please submit an issue or contact @aabiddanda.
