Skip to content

stothard-group/wgbs_workflow

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Whole Genome Bisulfite Sequencing Workflow

  • FILE: workflow/snakefile
  • AUTH: Emily Herman (eherman@ualberta.ca)
  • DATE: MAY 2, 2022
  • VERS: 1.0

This Snakemake workflow processes whole genome bisulfite sequence data and extracts single nucleotide variants. The input is short read WGBS data in fastq format, and the output is a genotyped VCF file for each sample.

For an explanation of each workflow step, see WGBS_workflow.md

README Sections

Installation

Setup

Input Files

Running the Workflow

Memory Requirements

Output Files

Installation

Download the repository

git clone https://github.com/stothard-group/wgbs_workflow.git

Setup

Requirements:

The absolute paths to the programs TrimGalore and CGMapTools must be supplied in the config/WGBS.config.yaml file.

If the workflow is not run on a Compute Canada cluster, install the following programs and remove module load ... lines from shell commands in the workflow/snakefile file. Programs must be installed in the user's path, or their path must be given in the shell command. Note that many of these tools require their own dependencies.

Install the following python modules:

  • pandas >= 1.3.5

Input files

The following inputs are required:

Raw fastq files should be gzip-compressed with the filename structure {sample}_R1.fastq.gz and {sample}_R2.fastq.gz. Read group ID information will be used to merge multiplexed samples.

The paths to the required input files and directories should be edited in the file config/WGBS.config.yaml. This file also contains species- and analysis-specific parameters affecting workflow function.

Running the workflow

The following command will run up to 20 workflow jobs on a SLURM cluster:

snakemake --snakefile workflow/snakefile --cluster "sbatch -N 1 -c {resources.cores} --mem={resources.mem_mb} --time={resources.runtime} --account=account" --rerun-incomplete --printshellcmds -j 20

Memory requirements

This workflow requires at most 16 cores and 64GB RAM for running Bismark alignment and deduplication steps. The longest job is Bismark alignment, for which 12 hours is allocated by default. Resource limits can be adjusted for individual rules within the workflow/snakefile file.


NOTE

If using NovaSeq6000 samples, add --2colour 20 to the shell command for rule trimgalore.

The core and RAM requirements for Bismark alignment (rule align) have a special relationship to the value of the threads parameter. Please consult the information on the --parallel option in Bismark documentation Appendix (II) before changing any of these values.


Output files

All intermediate and final results files are written to the results/ directory. For each sample, a quality-filtered genotyped VCF file in a GATK-like format is produced: snps/{sample}.goodSNPs.vcf.

Below is a list of all output directories and files created by the workflow in results/. Note that for non-multiplexed samples, {inputID} and {sample} values will be identical. Otherwise, {sample} name is determined by the read group ID file specified in config/WGBS.config.yaml.

Directory File name Description
split {inputID}_R{1,2}_part{n}.fastq.gz Raw fastq files split into n-parts (specified in config)
trimmed {inputID}_R{1,2}_part{n}_val_1.fq.gz Trimmed fastq files split into n-parts
align {inputID}_R1_part{n}_val_1_bismark_bt2_pe.bam Bam alignment file produced by Bismark for each n-part
align {inputID}.bismark.m.bam Merged bam file for each input ID
base_quality_recal {inputID}.deduplicated.bam Bam file following read deduplication
base_quality_recal {inputID}.deduplicated.rgid.bam Deduplicated bam file with read group IDs added
base_quality_recal {sample}.deduplicated.rgid.m.bam Bam files merged per sample if multiplexed
base_quality_recal {sample}.deduplicated.rgid.m.sorted.bam Sorted merged bam file
stat_and_coverage {sample}.average_coverage.txt Average sequencing depth across the sample
stat_and_coverage {sample}.bases_covered.txt Number of bases above the minimum depth specified in config
stat_and_coverage {sample}.alignment_stats.txt Read alignment statistics
cgmap {sample}.ATCGmap.gz ATCGmap file derived from sample bam
cgmap {sample}.bayes_dynamicP.SNPs.{vcf,snv} Variants in both VCF and SNV format files
cgmap {sample}.bayes_dynamicP.SNPs.reheader.vcf VCF file with correct sample info in header
snps {sample}.bayes_dynamicP.SNPs.rh.noNs.vcf VCF file with non-A/T/G/C variants removed
snps {sample}.rawSNPs.vcf VCF file containing only biallelic SNPs
snps {sample}.filtSNPs.vcf Soft-filtered VCF file with FAIL_DP filter tag
snps {sample}.goodSNPs.vcf VCF file containing quality-filtered SNPs

Benchmark files are created for every rule with cluster submission detailing job resource use. These files are found in the directory benchmark/ and have the structure {rule}.{inputFileBasename}.txt.

About

Workflow for finding SNPs in whole genome bisulfite sequencing data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published