Skip to content

neurogenetics/PD_SV_SRS

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 

Repository files navigation

Genome-wide analysis of Structural Variants in Parkinsons Disease

Written By: Kimberley Billingsley, Cornelis Blauwendraat, Mike Nalls

Last Updated: June 2022

Quick Description: Detect and genotype SVs in the LNG PD genomes and run a GWAS with the GATK-SV bi-allelic autosomal 1% FDR SV calls.

Link to Manuscript

https://docs.google.com/document/d/1g9Kx_FMqsf_kEZg_EhwJ-wuEjxx0I0hTnV9WztJpoJ4/edit?usp=sharing

Background:

Parkinson’s disease (PD) is a complex neurodegenerative disorder, affecting approximately one million individuals in the USA alone. A significant proportion of risk for PD is driven by genetics. Despite this, the majority of the common genetic variation that contributes to PD is unknown, in-part because previous genetic studies have focussed solely on the contribution of single nucleotide variants (SNVs). Structural variants (SVs) represent a significant source of genetic variation in the human genome. Yet, SVs have not been cataloged on a genome-wide scale, and their contribution to the risk of PD remains unknown. For this analysis study we 1) leveraged the GATK-SV pipeline to detect and genotype SVs in 7772 short-read sequencing (SRS) data and 2) generated a subset of matched whole-genome Oxford Nanopore (ONT) long-read sequencing (LRS) data from the PPMI cohort to allow for comprehensive SV confirmation. We detected, genotyped, and tested 3154 “high-confidence” SV loci, representing over 412 million nucleotides of non-reference genetic variation. Utilizing the LRS data we validated three SVs that may drive the association signals at known PD risk loci, including a 2kb deletion within the gene LRRN4. Further we confirm that the majority of the SVs in the human genome cannot be not detected using SRS alone, encompassing on average around 4 million nucleotides of inaccessible sequence per genome. Therefore, although these data provide the most comprehensive survey of the contribution of SVs to the genetic risk of PD to date, this study highlights the need for large-scale long-read datasets to fully elucidate the role of SV in PD.

Structure of README:

1.Short-Read Sequencing (SRS) GATK-SV SV GWAS

1.1. Variant QC = Extract "PASS" SV

1.2. samples QC= Remove samples that are non-European, do not carry known PD mutations and that have missing pheno/age/sex

1.3. Make SV specific PCs

1.4. Run stepwise logistic regression to find associated PCs with SNV and SV

1.5. Run GATK-SV risk PD GWAS

1.6. Plot Manhatten (on the hudson) Vs Summary stats of the Nalls 2019 SNV Meta-analysis 5

1.7. Plot QQ-plot

2. Linkage Disequilibruim Analysis SNV & SV

2.1. Calc R2 for PR risk SNPs

2.2. Pull out LD SV of interest

3. Long-Read Sequencing (LRS) SV analysis with Sniffles2

3.1. Run Sniffle2 on hg38 mapped LRS bam

3.2. Run variant QC with SURVIVOR

4. SRS SV confirmation with LRS and Truvari

4.1. Create SRS GATK-SV vcf per sample

4.2. Run Truvari comparing SRS and LRS

1.Short-Read Sequencing (SRS) GATK-SV SV GWAS

1.1. Variant QC = Extract "PASS" SV

module load bcftools
bcftools view -f PASS gatksv_1percent_samples_for_gwas.vcf > pass_gatksv_1percent_samples_for_gwas.vcf

1.2. samples QC= Remove samples that are non-European, do not carry known PD mutations and that have missing pheno/age/sex

bcftools view -S sample_IDs_gwas.txt --force-samples pass_gatksv_1percent_samples_for_gwas.vcf -> clean_gatksv_1percent_samples_for_gwas.vcf
echo "How many samples are left for GWAS"
bcftools query -l  clean_gatksv_1percent_samples_for_gwas.vcf | wc -l

1.3. Make SV specific PCs

module load flashpca
plink --vcf clean_gatksv_1percent_samples_for_gwas.vcf --maf 0.01 --geno 0.01 --hwe 5e-6 --autosome --make-bed --out FILENAME_2 
plink --bfile FILENAME_2 --indep-pairwise 1000 10 0.02 --autosome --out pruned_data
plink --bfile FILENAME_2 --extract pruned_data.prune.in --make-bed --out FILENAME_3 
flashpca --bfile FILENAME_3 --suffix _SV_filter_pruned_forPCA.txt --numthreads 19
## in R
SV.PC <-read.table(file="pcs_SV_filter_pruned_forPCA.txt", header =T)
SNP.PC <-read.table(file="pcs_filter_pruned_forPCA.txt", header =T)
old.pheno <- read.table(file="pheno_for_tr.association_tr_snp_pcs.txt", header =T)
colnames(SV.PC) <- c("FID", "IID", "SV_PC1", "SV_PC2", "SV_PC3", "SV_PC4", "SV_PC5", "SV_PC6", "SV_PC7", "SV_PC8", "SV_PC9", "SV_PC10")
SVandSNPpc <- merge(SV.PC, SNP.PC, by="FID")
old_covar_needed <- old.pheno[c("FID","IID","SEX","PHENO", "AGE_ANALYSIS")] 
new.pheno <- merge(SVandSNPpc, old_covar_needed, by="FID")
new_covar <- new.pheno[c("FID","IID","SEX","PHENO", "AGE_ANALYSIS","SV_PC1","SV_PC2", "SV_PC3", "SV_PC4", "SV_PC5", "SV_PC6", "SV_PC7", "SV_PC8", "SV_PC9", "SV_PC10", "SNV_PC1", "SNV_PC2", "SNV_PC3", "SNV_PC4", "SNV_PC5", "SNV_PC6", "SNV_PC7", "SNV_PC8", "SNV_PC9", "SNV_PC10")] 
write.table(new_covar, file="pheno_for_SV.association_SV_snp_pcs.tab",sep="\t", row.names = F, quote=F)

1.4. Run stepwise logistic regression to find associated PCs with SNV and SV

## in R 

##Run Step wise regression to indentify what covariates are associated with PD(pheno) so we know what to include in association analysis 

library(MASS)
PCs <- read.table(file="pheno_for_SV.association_SV_snp_pcs.tab", header =T)
full.model <- lm(PHENO ~ SEX + AGE_ANALYSIS + SNV_PC1 + SNV_PC2 + SNV_PC3 + SNV_PC4 + SNV_PC5 + SNV_PC6 + SNV_PC7 + SNV_PC8 + SNV_PC9 + SNV_PC10 + SV_PC1 + SV_PC2 + SV_PC3  + SV_PC4 + SV_PC5 + SV_PC6 + SV_PC7 + SV_PC8 + SV_PC9 + SV_PC10, data = PCs)
summary(full.model)
# Stepwise regression model
step.model <- stepAIC(full.model, direction = "both", trace = FALSE)
summary(step.model)
## make a reduced covariate file based on all the siginicant PCs shown in the list 
for_reduced_covar <- PCs[c("FID","IID","SEX","AGE_ANALYSIS","SNV_PC1","SNV_PC2","SNV_PC4","SNV_PC5","SNV_PC6","SNV_PC7","SNV_PC8","SNV_PC9","SNV_PC10","SV_PC1","SV_PC2","SV_PC3","SV_PC4","SV_PC5","SV_PC6","SV_PC8","SV_PC9","SV_PC10")]
write.table(for_reduced_covar, file="1percent_all_reduced_covar_sv.tab",sep="\t", row.names = F, quote=F)

1.5. Run GATK-SV risk PD GWAS

module load plink
plink --vcf clean_gatksv_1percent_samples_for_gwas.vcf --maf 0.01 --geno 0.01 --hwe 5e-6 --autosome --make-bed --out 1percent_clean_PD_gaksv_pass_bnd_maf_0.01_hwe_5e6_geno0.01
module load plink/2.0_alpha_1_final
plink2 --covar 1percent_all_reduced_covar_sv.tab  --covar-variance-standardize --glm hide-covar --out gatksv_gwas_clean_1percent_calls  --bfile 1percent_clean_PD_gaksv_pass_bnd_maf_0.01_hwe_5e6_geno0.01  --pheno risk_pheno.tab
plink2 --adjust-file gatksv_gwas_clean_1percent_calls.PHENO.glm.logistic test=ADD --out adjusted. gatksv_gwas_clean_1percent_calls.PHENO.glm.logistic

1.6. Plot Manhatten (on the hudson) Vs Summary stats of the Nalls 2019 SNV Meta-analysis 5

awk '{print $3, $1, $2, $12}'gatksv_gwas_clean_1percent_calls.PHENO.glm.logistic > for_plotting_man_gatk_gwas.txt
#manually change header to SNP	CHR	POS	pvalue
## Do this in R/4.0
devtools::install_github('anastasia-lucas/hudson')
library(hudson)
meta5_plot <- read.table(file="/data/CARD/PD/GAK_SV/GWAS/plot_meta_5_hg38.tab", header =T)
gatksv_plot <- read.table(file="/data/CARD/PD/GATK-SV_22/GWAS/for_plotting_man_gatk_gwas.txt", header =T)
gangstr_plot <- read.table(file="/data/CARD/PD/GAK_SV/GWAS/plot_gangstr_hg38.tab", header =T)
gmirror(top=gatksv_plot, bottom=meta5_plot, tline=5.34E-6, bline=5E-8,background = "white", file = "new_gatksv_vs_meta_5_no23_HudsonPlot",highlight_p = c(0.05/nrow(gatksv_plot),0.05/nrow(meta5_plot)),highlighter="green3",toptitle="GAK-SV genome-wide association analysis", bottomtitle = " Meta 5 SNP genome-wide association analysis", hgt = 14, hgtratio = 0.5, wi = 24, res = 300)

1.7.Plot QQ-plot

vignette('qqman')
gaksv_plot <- read.table(file="for_plotting_man_gatk_gwas.txt", header =T)
qq(gaksv_plot$pvalue, main = "Q-Q plot of GATK_SV GWAS p-values", xlim = c(0, 10), ylim = c(0, 10), pch = 18, col = "blue4", cex = 1.5, las = 1)
## Lambda value
data = read.table("for_plotting_man_gatk_gwas.txt",header=T)
p <- data$pvalue
n <- length(data$pvalue)
x2obs <- qchisq(as.numeric(as.character(p)), 1, lower.tail = FALSE)
x2exp <- qchisq((1:n)/n, 1, lower.tail = FALSE)
lambda <- median(x2obs)/median(x2exp)
lambda

2. Linkage Disequilibruim Analysis SNV & SV

2.1. Calc R2 for PR risk SNPs

module load plink/1.9
for f in `cat PD_risk.SNVs.txt`
    do
         plink --bfile  1percent_clean_PD_gaksv_pass_bnd_maf_0.01_hwe_5e6_geno0.01 --ld-snp ${f} --out ${f}.risk.ld.pairs --r2 inter-chr dprime --silent 
    done

2.2. Pull out SV of interest

## Lets see how many SV are in LD
grep "PD_" *risk.ld.pairs.ld*

grep "PD_" risk.ld.pairs.ld > GATK_SV_in_ld_PD_risk_snvs_inter_chr_dprime.txt

 awk -v OFS='\t' '{print $7}' GATK_SV_in_ld_PD_risk_snvs_inter_chr_dprime.txt > GATK_SV_ID_in_ld_PD.txt

3.Long-Read Sequencing (LRS) SV analysis with Sniffles2

3.1. Run Sniffle2 on hg38 mapped LRS bam

module load sniffles/2.0.3
sniffles -i PPMI_3404_BLOOD_merged_sorted.bam --tandem-repeats  human_GRCh38_no_alt_analysis_set.bed -v PPMISI3404_sniffles2_tandemrepeats.vcf
sniffles -i PPMI_4018_BLOOD_merged_sorted.bam --tandem-repeats  human_GRCh38_no_alt_analysis_set.bed -v PPMISI4018_sniffles2_tandemrepeats.vcf
sniffles -i PPMI_3223_BLOOD_merged_sorted.bam --tandem-repeats  human_GRCh38_no_alt_analysis_set.bed -v PPMISI3223_sniffles2_tandemrepeats.vcf
sniffles -i PPMI_3173-BLOOD_merged_sorted.bam --tandem-repeats  human_GRCh38_no_alt_analysis_set.bed -v PPMISI3173_sniffles2_tandemrepeats.vcf
sniffles -i PPMI_3150_BLOOD_merged_sorted.bam --tandem-repeats  human_GRCh38_no_alt_analysis_set.bed -v PPMISI3150_sniffles2_tandemrepeats.vcf
sniffles -i PPMI_4011_BLOOD_merged_sorted.bam --tandem-repeats  human_GRCh38_no_alt_analysis_set.bed  -v PPMISI4011_sniffles2_tandemrepeats.vcf
sniffles -i PPMI_PPMI3951_Blood_test_merged_sorted.bam --tandem-repeats  human_GRCh38_no_alt_analysis_set.bed -v PPMISI3951_sniffles2_tandemrepeats.vcf
sniffles -i PPMI_PPMI3469_Blood_test_merged_sorted.bam --tandem-repeats  human_GRCh38_no_alt_analysis_set.bed -v PPMISI3469_sniffles2_tandemrepeats.vcf

3.2. Run variant QC with SURVIVOR

### The number of SV is still high per genome because we need to do more QC to remove short reads 
module load survivor 
# Remove SV > 50bp with SURVIVOR 
for f in `cat ppmi_samples_lrs_blood.txt`
    do
    SURVIVOR filter ${f}_sniffles2_tandemrepeats.vcf NA 50 -1 -1 -1 ${f}_no_50bpSV.vcf
    done
for f in `cat ppmi_samples_lrs_blood.txt`
    do
    grep -vc '#' ${f}_no_50bpSV.vcf
    done 
 for f in `cat ppmi_samples_lrs_blood.txt`
    do
    SURVIVOR stats ${f}_no_50bpSV.vcf -1 -1 -1 ${f}_no_50bpSV.sum.stats
    done
for f in `cat ppmi_samples_lrs_blood.txt`
    do
    bgzip -c  ${f}_no_50bpSV.vcf  > ${f}_no_50bpSV.vcf.gz
    tabix -p vcf ${f}_no_50bpSV.vcf.gz
done

4. SRS SV confirmation with LRS and Truvari

4.1. Create SRS GATK-SV vcf per sample

module load bcftools
for f in `cat ppmi_samples_lrs_blood.txt`
    do
    bcftools view -s ${f} clean_gatksv_1percent_samples_for_gwas.vcf > ${f}_pass_1percent_sv.vcf
    done
 for f in `cat ppmi_samples_lrs_blood.txt`
    do       
        bcftools view ${f}_pass_1percent_sv.vcf --min-ac=1 > carriers_${f}_pass_1percent_sv.vcf
        bgzip -c carriers_${f}.pass_1percent_sv.vcf > carriers_${f}.pass_1percent_sv.vcf.gz
        tabix -p vcf carriers_${f}.pass_1percent_sv.vcf.gz  
    done

4.2. Run Truvari comparing SRS and LRS

## GATK-SV "PASS" SV
source  /data/CARD/PD/GAK_SV/GWAS/GATK_ONT_VAPOR/vapor/conda/etc/profile.d/conda.sh
conda activate truvari
for f in `cat ppmi_samples_lrs_blood.txt`
    do
        truvari bench -b ${f}_no_50bpSV.vcf.gz  -c carriers_${f}.pass_1percent_sv.vcf.gz  --pctsim=0 -f /fdb/GENCODE/Gencode_human/release_38/GRCh38.primary_assembly.genome.fa -o ${f}_truvari_repeat_filteredLRSSV
    done

About

GATK-SV manuscript analysis

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published