This repository includes the data and analysis for the preprint: doi:TBD
All raw data are linked or included below, and analysis scripts are included in the two folders: SequencingScripts and BSA_Analysis. Supplemental figures and text can be found in PDF form in the publication (linked above).
- CuSO4_glmer_output.csv: Additive peaks called using the cybr_lmpeaks() function
- Interaction_Peaks.csv: Interaction peaks, which are the highest points on each chromosome above the 5% FDR
- SRA Repository PRJNA1175662
- output.table files in the Input folder
a. AllCuSO4.REF_.SortedCat.vcf.output.zip: raw data, zipped on Windows
b. Oak_VCF.txt and Wine_VCF.txt: text files for parent data
c. experiment_names_524.csv: index of experiment names, to be loaded into the analysis scripts
An interactive RShiny app is available at https://cbuzby.shinyapps.io/VisualizeGLM/. Please download the package by visiting the cybrBSA GitHub repository or by running the following code in R:
library(devtools)
install_github("cbuzby/cybrBSA")
Sequencing scripts are located in the SequencingScripts folder; we used Nextflow, but each step can be run individually. The workflow is as follows (and can be found in the Workflow segment of Aug2024_CGSB_template.nf):
#trim
java -jar /share/apps/trimmomatic/0.36/trimmomatic-0.36.jar SE -phred33 \
$reads $trimmed_file ILLUMINACLIP:TruSeq3-SE:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
#align
bwa mem -Y -K 100000000 -R \"${readGroup}\" \
$REF $read_1 > ${id}.aln-se.sam
#sort
java -jar $PICARD_JAR SortSam \
INPUT=$read_1 \
OUTPUT=${id}.aln-se.bam \
SORT_ORDER=coordinate
#MakeBQSRTable
gatk BaseRecalibrator -R $REF -I $read --known-sites $KNOWN_SITES -O ${id}_recal_data.table
#ApplyBQSR
gatk ApplyBQSR \
-R $REF \
-I ${bamfile} \
--bqsr-recal-file ${table} \
-O ${id}.bqsr.output.bam
Once raw sequences were processed into bam files for each bulk, we merged all bulks together and ran variant calling on each chromosome (containing all bulks) in parallel:
############## CB_2.3_merge.split.all.q #######################################
samtools merge AllCuSO4 $1 $2... ${70}
samtools index AllCuSO4
bamtools split -in AllCuSO4 -reference
############## CB_3.0_Index.q #################################################
samtools index $1
############## CB_4.0_CallVariants.q ##########################################
gatk HaplotypeCaller -I $1 -R $REF -ploidy 1 -O ${1}.vcf
############## CB_5.0_zip.concat.sort.q #######################################
for i in ${1}*vcf; do bgzip -c $i > ${i}.gz; done
echo ${1}*vcf.gz | xargs -n1 tabix -p vcf
bcftools concat -o unsortedcat.vcf -a -D ${1}*vcf.gz
bcftools sort -Oz -o ${1}.SortedCat.vcf unsortedcat.vcf
myfile=${1}.SortedCat.vcf
gatk VariantsToTable \
-V ${myfile} \
-F CHROM -F POS -F REF -F ALT \
-GF AD -GF DP -GF GQ -GF PL \
-O ${myfile}.output.table
Please see the README in the SequencingScripts folder for execution code.
Files for analyzing output tables for epicQTL are found in the BSA_Analysis folder, which contains Input and Output folders for data. All analysis is done using the CuPAPER_1byRep_Process.Rmd file, which utilizes the cybrBSA package; the script is located at /BSA_Analysis/CuPAPER_1byRep_Process.Rmd. Additional examples (and a sample dataset) can be found in the documentation for cybrBSA.
The output table is read in through cybrInputGATKTable(), and alleles are then "called" as parental using Wine.txt and Oak.txt (which are included in the supplemental data).
mydatatotest = "AllCuSO4.REF_.SortedCat.vcf.output"
cybrInputGATKTable(mydatatotest) %>% mutate(Coverage = as.numeric(AD.REF) + as.numeric(AD.ALT)) %>%
select(POS, CHROM, Dataset, GQ, AD.REF, AD.ALT, Coverage) -> rawdata
parentSNPids <- cybrConvertParentalAlleles(Truncate = TRUE)
rawdata %>%
merge(parentSNPids) %>%
filter(grepl("HNGLVDRXY", Dataset)) %>% #Filter for smaller dataset
mutate(REFW = as.numeric(Type == "Wine"), REFO = as.numeric(Type == "Oak")) %>%
group_by(Dataset, CHROM, POS) %>%
mutate(Wine = max(REFW * as.numeric(AD.REF), REFO * as.numeric(AD.ALT)),
Oak = max(REFW * as.numeric(AD.ALT), REFO * as.numeric(AD.REF))) %>%
select(Dataset, POS, CHROM, Coverage, Wine, Oak) %>%
pivot_longer(c(Oak, Wine), names_to = "Allele", values_to = "Reads") -> rawdata_called
Next, the data are smoothed with a weighted Gaussian to reduce per-position sequencing error:
rawdata_called %>% group_by(Dataset, CHROM, Allele) %>% arrange(POS) %>%
reframe(POS = POS,
SmoothCount = ceiling(frollapply(Reads, n = 200, FUN = cybr_weightedgauss, align = "center"))
) -> rawdata_smoothed
Finally, z-scores are calculated for each position using glmer_cb2_short(). The contrasts in our experiment are adjusted using the contrasts() function in R.
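As a minimal sketch of the contrasts adjustment (the data frame, factor levels, and the centered ±0.5 coding below are illustrative assumptions, not taken from the repository):

```r
# Sketch only: level names and coding values are assumptions for illustration.
# contrasts() replaces R's default treatment coding for a factor with a
# user-supplied coding matrix (n_levels rows, n_levels - 1 columns).
df <- data.frame(Bulk   = factor(c("Dilute", "CuSO4")),
                 Parent = factor(c("Oak", "Wine")))
contrasts(df$Bulk)   <- matrix(c(-0.5, 0.5))  # centered half-unit coding
contrasts(df$Parent) <- matrix(c(-0.5, 0.5))  # centered half-unit coding
```

With a centered coding, the model intercept corresponds to the grand mean across levels rather than to a reference level.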
rawdata_glm_prep %>% na.omit() %>%
filter(CHROM %in% c("M", "I") == FALSE) %>%
group_by(CHROM, POS) %>%
mutate_if(is.character, as.factor) %>%
summarize(summary = glmer_cb2_short(Allele = Allele,
Bulk = Bulk,
Parent = Parent,
Rep = Rep, #any parameters not used should NOT be included
W = SmoothCount,
formula = "Allele ~ Bulk * Parent + (1 | Rep)",
outputlength = 4),
#MAKE SURE THIS IS THE SAME LENGTH AS OUTPUT LENGTH
Factor = (c("Intercept", "Bulk", "Parent", "Interaction"))) -> processed_glm_all
Permutations for the 5% false discovery rate are performed as described in the Materials and Methods.
Scripts for each figure of the manuscript are included in /BSA_Analysis/CuPAPER_Figures.Rmd and utilize the Output data within the folder.