cybrBSA

R package for analyzing sequencing data from multilevel bulk segregant analysis (BSA) experiments. For more information, please see the pre-print (doi: ) or our epicQTL GitHub repository. To install, run the following in R:

#install.packages("devtools")

require(devtools)
install_github("cbuzby/cybrBSA")

Pre-processing

Convert VCF to GATK table (pre-processing for cybrBSA)

cybrBSA takes in a data frame of variants (REF and ALT alleles) and their counts per sequencing bulk.

Below is the bash script to convert the VCF, keeping the specific fields listed after -GF. AD (allele depth) is required for all analyses, as these are the read counts analyzed in the glm. GQ (genotype quality) can be used as a filter in the cybrQualityFilter() function. DP and PL are optional. myfile is the name of the VCF to convert.

#!/bin/bash

module load gatk/4.2.0.0

gatk VariantsToTable \
     -V ${myfile} \
     -F CHROM -F POS -F REF -F ALT \
     -GF AD -GF DP -GF GQ -GF PL \
     -O ${myfile}.output.table


Analysis

Variant Table Processing

All test data are in the TestFiles/ folder; please download the entire folder to follow along.

The following file is a truncated version of the raw data that includes only chromosome XI. For the full dataset, please visit epicQTL and replace XI.SortedCat.vcf.output.table with the unzipped raw data.

mydatatotest = "XI.SortedCat.vcf.output.table"

cybrInputGATKTable(mydatatotest) %>% mutate(Coverage = as.numeric(AD.REF) + as.numeric(AD.ALT)) %>%
  select(POS, CHROM, Dataset, GQ, AD.REF, AD.ALT, Coverage) -> rawdata

If there are two alternate alleles relative to the reference, provide a parental alleles file (the Oak and Wine S. cerevisiae alleles are loaded automatically). Ensure that the parental allele files (Oak_vcf.txt and Wine_vcf.txt) are in the same folder as this script, then run the following to call the parent of origin for each ALT allele. This file should have columns for CHROM, POS, and the identity of the alternate allele.

parentSNPids <- cybrConvertParentalAlleles(Truncate = TRUE)

rawdata %>% 
  merge(parentSNPids) %>% 
  filter(grepl("HNGLVDRXY", Dataset)) %>% #Filter for smaller dataset
  mutate(REFW = as.numeric(Type == "Wine"), REFO = as.numeric(Type == "Oak")) %>%
  group_by(Dataset, CHROM, POS) %>%
  mutate(Wine = max(REFW * as.numeric(AD.REF), REFO * as.numeric(AD.ALT)),
         Oak = max(REFW * as.numeric(AD.ALT), REFO * as.numeric(AD.REF))) %>%
  select(Dataset, POS, CHROM, Coverage, Wine, Oak) %>%
  pivot_longer(c(Oak, Wine), names_to = "Allele", values_to = "Reads") -> rawdata_called
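
As a quick sanity check (our own sketch, not a package step), confirm that the parental table carries the CHROM, POS, and Type columns that the merge above relies on:

# The merge joins on CHROM and POS and uses Type ("Oak"/"Wine") to assign parent of origin
stopifnot(all(c("CHROM", "POS", "Type") %in% colnames(parentSNPids)))
head(rawdata_called)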

Smooth Data

Next, smooth the data using cybr_weightedgauss(), which applies a weighted Gaussian with a standard deviation of n/10 across each rolling window; a conceptual sketch of such a kernel follows the code below.

rawdata_called %>% group_by(Dataset, CHROM, Allele) %>% arrange(POS) %>%
  reframe(POS = POS, 
          SmoothCount = ceiling(frollapply(Reads, n = 200, FUN = cybr_weightedgauss, align = "center"))
          ) -> rawdata_smoothed 
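
For intuition, a window function similar in spirit to cybr_weightedgauss() could look like the sketch below (an illustration under our own assumptions, not the package's implementation): it returns the Gaussian-weighted mean of the window, with the standard deviation set to one tenth of the window length.

# Illustrative sketch only: Gaussian-weighted mean of a window of read counts,
# with sd = n/10, where n is the window size passed to frollapply()
weighted_gauss_sketch <- function(x) {
  n <- length(x)
  w <- dnorm(seq_along(x), mean = (n + 1) / 2, sd = n / 10)  # weights centered on the window
  weighted.mean(x, w, na.rm = TRUE)
}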

Convert factors and set contrasts

Format the output.table data frame for consistent analysis and convert the coefficients to factors. Here we select only the HNGLVDRXY sequencing bulk for demonstration.

# Pick out the specific groups of the dataset
rawdata_smoothed %>% 
    mutate(Dataset = gsub("HNGLVDRXY_n01_CuSO4_CSSI_", "", Dataset)) %>%
    mutate(Dataset = gsub(".fastq", "", Dataset)) %>%
    mutate(Dataset = gsub("Unselected", "Unselected_", Dataset)) %>%
    mutate(Dataset = gsub("Selected", "Selected_", Dataset)) %>%
  separate(Dataset, into = c("Bulk", "Parent"), sep = "_") %>%
  na.omit() %>% 
  mutate_if(is.character, as.factor) -> rawdata_glm_prep

Often the contrasts of the coefficient factors need to be adjusted, depending on the question being asked. Here we ensure that the unselected bulk is coded 0 and the selected bulk 1, for ease of interpreting the results.

contrasts(rawdata_glm_prep$Parent) <- matrix(c(-0.5, 0.5))
contrasts(rawdata_glm_prep$Bulk) <- matrix(c(1,0))
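
Factor levels are ordered alphabetically by default, so it is worth confirming that the contrast rows line up with the intended coding:

levels(rawdata_glm_prep$Bulk)     # expected: "Selected" "Unselected"
contrasts(rawdata_glm_prep$Bulk)  # Selected = 1, Unselected = 0
levels(rawdata_glm_prep$Parent)
contrasts(rawdata_glm_prep$Parent)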

Calculate z-scores for glm

First, isolate a single position and run the glm or glmer to verify the number of output coefficients and labels to include:

test <- rawdata_glm_prep[rawdata_glm_prep$POS == min(rawdata_glm_prep$POS),]
Form = "Allele ~ Bulk * Parent"

testg <- glm(data = test, formula = Form, weights = SmoothCount, family = "binomial")
summary(testg)

To run the logistic regression, use either glm_cb2_short() or glmer_cb2_short() (the latter requires the lme4 package); each conducts the logistic regression for EACH individual position. Note that we use dplyr's summarize() (or reframe()) to apply the model across all positions. Using a mixed model will take significantly more time. The outputlength parameter of glm_cb2_short() must match the number of labels (here "Factor") within summarize(); a conceptual sketch of what this function returns follows the code block below.

# Run full dataset
rawdata_glm_prep %>% na.omit() %>% 
  filter(CHROM %in% c("M", "I") == FALSE) %>%
  group_by(CHROM, POS) %>%
  mutate_if(is.character, as.factor) %>%
  summarize(summary = glm_cb2_short(Allele = Allele,
                             Bulk = Bulk,
                             Parent = Parent,
                             #Rep = Rep, #any parameters not used should NOT be included
                             W = SmoothCount, #use the weights variable from test
                             formula = "Allele ~ Bulk * Parent",
                             outputlength = 4),
            #MAKE SURE THIS IS THE SAME LENGTH AS OUTPUT LENGTH
            Factor = (c("Intercept", "Bulk", "Parent", "Interaction"))) -> processed_glm_all

Permutations for False Discovery Rate

Complete the following TWICE, once for each new pseudoreplicate "Bulk".

rawdata_glm_prep %>% na.omit() %>% 
  filter(Bulk == "Unselected") %>%
  filter(CHROM %in% c("M", "I") == FALSE) %>%
  group_by(CHROM, POS) %>%
  pivot_wider(names_from = Allele, values_from = SmoothCount) %>% unnest() %>%
  mutate(label = paste(CHROM, POS, sep = "_")) %>%
  group_by(Bulk, Parent) %>%
  summarize(NewPOS = sample(label),
            Oak = Oak,
            Wine = Wine,
            Parent = Parent) %>%
  separate(NewPOS, into = c("CHROM", "POS"), sep = "_") %>%
  mutate(Bulk = "Selected") -> rd_shuffled_selected

rawdata_glm_prep %>% na.omit() %>% 
  filter(Bulk == "Unselected") %>%
  filter(CHROM %in% c("M", "I") == FALSE) %>%
  group_by(CHROM, POS) %>%
  pivot_wider(names_from = Allele, values_from = SmoothCount) %>% unnest() %>%
  mutate(label = paste(CHROM, POS, sep = "_")) %>%
  group_by(Bulk, Parent) %>%
  summarize(NewPOS = sample(label),
            Oak = Oak,
            Wine = Wine,
            Parent = Parent) %>%
  separate(NewPOS, into = c("CHROM", "POS"), sep = "_") %>%
  mutate(Bulk = "Unselected") -> rd_shuffled_unselected

Combine the two permuted datasets and set the contrasts to match the original analysis.

rbind(rd_shuffled_selected,rd_shuffled_unselected) %>%
  ungroup() %>%
  pivot_longer(c(Oak, Wine), names_to = "Allele", values_to = "SmoothCount") %>% unnest() %>%
  mutate(POS = as.numeric(POS)) %>%
  mutate_if(is.character, as.factor) -> Perm1contrasts

#Set contrasts for experiment
contrasts(Perm1contrasts$Parent) <- matrix(c(-0.5, 0.5))
contrasts(Perm1contrasts$Bulk) <- matrix(c(1,0))

Conduct the analysis for permuted data as done above.

Perm1contrasts %>% na.omit() %>% 
  filter(CHROM %in% c("M", "I") == FALSE) %>%
  group_by(CHROM, POS) %>%
  mutate_if(is.character, as.factor) %>%
  summarize(summary = glm_cb2_short(Allele = Allele,
                             Bulk = Bulk,
                             Parent = Parent,
                             W = SmoothCount,
                             formula = "Allele ~ Bulk * Parent",
                             outputlength = 4),
            #MAKE SURE THIS IS THE SAME LENGTH AS OUTPUT LENGTH
            Factor = (c("Intercept", "Bulk", "Parent", "Interaction"))) -> permuted_glm_all

Find the 95th percentile (the upper 5% cutoff) of the permuted scores for each coefficient (labeled Factor):

permuted_glm_all %>% ungroup() %>% group_by(Factor) %>% summarize(q5 = quantile(summary, 0.95))
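
These permutation quantiles can then serve as significance cutoffs for the real (non-permuted) results. One possible follow-up, sketched here under our own assumptions (whether to take absolute values of the permuted scores first is a judgment call):

# Store the per-Factor cutoffs and keep positions whose real z-scores exceed them
permuted_glm_all %>% ungroup() %>% group_by(Factor) %>%
  summarize(q5 = quantile(summary, 0.95)) -> perm_cutoffs

processed_glm_all %>%
  left_join(perm_cutoffs, by = "Factor") %>%
  filter(abs(summary) > q5) -> candidate_positions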

Call Peaks

Rename columns for cybr_lmpeaks(), since multiple chromosomes and fixed chromosomes (CSS) might be used:

processed_glm_all$zscore <- processed_glm_all$summary
processed_glm_all$CSS <- "I"
processed_glm_all$label <- processed_glm_all$Factor

Script for finding Bulk peaks within 300 bp windows. This can be parallelized using rollapply() (from the zoo package).

testpeaks <- cybr_lmpeaks(processed_glm_all[processed_glm_all$Factor == "Bulk",], width = 300)

Visualizing Data

Manhattan Plot

Data from this example will include one chromosome, but it is important to facet by chromosome for datasets containing multiple chromosomes.

processed_glm_all %>% 
  ggplot(aes(x = POS, y = abs(summary), color = Factor)) + geom_line() +
  facet_grid(~CHROM, space = "free", scales = "free") + ggtitle("BSA Bulk Z Scores")
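
If the permutation cutoffs sketched above (perm_cutoffs) are available, they can be overlaid as dashed threshold lines, one per Factor:

processed_glm_all %>% 
  ggplot(aes(x = POS, y = abs(summary), color = Factor)) + geom_line() +
  geom_hline(data = perm_cutoffs, aes(yintercept = q5, color = Factor), linetype = "dashed") +
  facet_grid(~CHROM, space = "free", scales = "free") + 
  ggtitle("BSA Bulk Z Scores with Permutation Cutoffs")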
