JetBrains-Research/chipseq-smk-pipeline

chipseq-smk-pipeline

A Snakemake-based pipeline for processing ChIP-seq and ATAC-seq datasets, from raw data QC and alignment to visualization and peak calling.

Scheme

During the peak calling steps, chipseq-smk-pipeline automatically matches each signal file with a control file by file name similarity.
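
The idea of pairing a signal with a control by name can be illustrated with a small sketch. It is not the pipeline's actual matching code: the similarity measure used here (length of the shared name prefix) and the function names common_prefix_len and best_control are assumptions chosen for illustration, and the pipeline's real metric may differ.

```shell
#!/usr/bin/env bash
# Hypothetical illustration only: pick the control whose name shares the
# longest common prefix with the signal name.

# Print the number of leading characters two strings have in common.
common_prefix_len() {
    local a="$1" b="$2" i=0
    while [ "$i" -lt "${#a}" ] && [ "$i" -lt "${#b}" ] \
          && [ "${a:$i:1}" = "${b:$i:1}" ]; do
        i=$((i + 1))
    done
    echo "$i"
}

# Usage: best_control SIGNAL CONTROL...
# Prints the control candidate closest to SIGNAL; the first one wins on ties.
best_control() {
    local signal="$1"; shift
    local best="" best_len=-1 candidate len
    for candidate in "$@"; do
        len=$(common_prefix_len "$signal" "$candidate")
        if [ "$len" -gt "$best_len" ]; then
            best="$candidate"
            best_len="$len"
        fi
    done
    echo "$best"
}
```

With names that encode the sample first, such as GSM1_H3K4me3_donor1 and GSM1_input_donor1, this sketch would pair files from the same sample. Consistent naming of signal and control files makes any name-based matching more predictable.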

Input

Input FASTQ files

The pipeline aligns FASTQ or gzipped FASTQ reads; their location and extension are defined in config.yaml.
The reads folder is a path relative to the pipeline working directory and is set by the fastq_dir property.
The FASTQ file extension is set by the fastq_ext property, e.g. fq, fq.gz, fastq, or fastq.gz.
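
Before launching, it can help to check which files match a given fastq_dir and fastq_ext. The following is a minimal sketch, not part of the pipeline: the helper name list_fastq is hypothetical, and it assumes the reads sit directly in the reads folder rather than in subfolders.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): list the files in a reads
# folder whose names end with the given extension, mirroring the roles of
# the fastq_dir and fastq_ext properties.
list_fastq() {
    local fastq_dir="$1" fastq_ext="$2"
    # Non-recursive search; assumes reads are stored directly in fastq_dir.
    find "$fastq_dir" -maxdepth 1 -type f -name "*.${fastq_ext}" | sort
}
```

Calling list_fastq with the reads folder and fastq.gz prints one path per matching file. An empty result suggests that fastq_ext does not match the actual file names.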

Input BAM files

Use the start_with_bams=True config option to start from existing BAM files.
In this mode the pipeline expects the BAM files in the <work_dir>/bams folder.
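
One way to prepare the working directory for this mode is sketched below. The helper name stage_bams is hypothetical, and the use of symbolic links instead of copies is an assumption, as is the need to still pass genome on the command line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): link existing BAM files
# into <work_dir>/bams, the folder used when start_with_bams=True.
stage_bams() {
    local src_dir="$1" work_dir="$2" bam abs_dir
    mkdir -p "$work_dir/bams"
    for bam in "$src_dir"/*.bam; do
        [ -e "$bam" ] || continue    # nothing matched the glob
        # Resolve an absolute path so the link stays valid from any directory.
        abs_dir=$(cd "$(dirname "$bam")" && pwd)
        ln -sf "$abs_dir/$(basename "$bam")" "$work_dir/bams/"
    done
}

# The pipeline would then be launched along these lines
# (placeholders as elsewhere in this README; flags other than
# start_with_bams=True are assumed to match the FASTQ launch):
#   snakemake -p -s <chipseq-smk-pipeline>/Snakefile all --use-conda \
#       --directory <work_dir> --config start_with_bams=True genome=<genome>
```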

Files

Path                Description
config.yaml         Default pipeline options
trimmed             Trimmed FASTQ file, if trim_reads option is True.
bams                BAMs with aligned reads, MAPQ >= 30
bw                  BAM coverage visualization using DeepTools
<peak_caller_name>  Peaks provided by peak caller tool <peak_caller_name>
qc                  QC Reports
multiqc             MultiQC reports for different steps
logs                Shell commands logs

Requirements

The pipeline requires conda.

  • If conda is not installed, follow the instructions on the Conda website.
  • Navigate to the repository directory.

Create a Conda environment for snakemake:

$ conda env create --file environment.yaml --name snakemake

Activate the newly created environment:

$ source activate snakemake

On Ubuntu please ensure that gawk is installed:

$ sudo apt-get install gawk

Launch

Run the pipeline to start with fastq reads:

$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
    all [--cores <cores>] --use-conda --directory <work_dir> \
    --config fastq_dir=<fastq_dir> genome=<genome> --rerun-incomplete

By default, the pipeline neither creates coverage visualization nor launches peak callers.
Add bw=True and <peak_caller_name>=True to the --config options to create coverage bw files and call peaks with <peak_caller_name>.

See config.yaml for a complete list of parameters. Use --config to override default options from the config.yaml file.

Peak callers

Supported peak caller tools:

To launch MACS2 in --broad mode, use the following config:

$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
    all [--cores <cores>] --use-conda --directory <work_dir> \
    --config fastq_dir=<fastq_dir> genome=<genome> \
    macs2=True macs2_mode=broad macs2_params="--broad --broad-cutoff 0.1" macs2_suffix=broad0.1 \
    --rerun-incomplete

Peak callers installation

This section contains instructions for installing peak callers manually.

  • BayesPeak

    1. Install R
    mamba install -c conda-forge r-base=3.6.3

    2. In R console
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install(version = "3.10")  # Explicitly set correct Bioconductor version
    BiocManager::install(c("IRanges", "GenomicRanges"))

    3. Install BayesPeak
    wget https://www.bioconductor.org/packages//2.10/bioc/src/contrib/BayesPeak_1.8.0.tar.gz
    R CMD INSTALL BayesPeak_1.8.0.tar.gz
    
  • Hotspot

    1. Install required dependencies
    sudo apt-get install build-essential libgsl-dev

    2. Download and make
    wget https://github.com/StamLab/hotspot/archive/refs/tags/v4.1.1.zip
    unzip v4.1.1.zip
    cd hotspot-4.1.1/hotspot-distr/hotspot-deploy
    make
    
  • PeakSeq
    Download and make

    git clone https://github.com/gersteinlab/PeakSeq.git
    cd PeakSeq
    make
    

Rules

The rules DAG can be produced by adding the command line arguments --forceall --rulegraph to the snakemake call and piping the output through dot: --forceall --rulegraph | dot -Tpdf > rules.pdf
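
Put together as a full command, this could look like the sketch below. The wrapper name render_rulegraph is hypothetical, and the sketch assumes that the defaults from config.yaml are enough for Snakemake to resolve the all target; otherwise the usual --config and --directory options would need to be added. The dot tool comes from Graphviz.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: render the rule graph of a Snakefile to a PDF.
# Usage: render_rulegraph PATH_TO_SNAKEFILE [OUTPUT_PDF]
render_rulegraph() {
    local snakefile="$1" out_pdf="${2:-rules.pdf}"
    # --rulegraph prints the graph in dot format instead of running any jobs;
    # --forceall makes every rule appear, even if its outputs already exist.
    snakemake -s "$snakefile" all --forceall --rulegraph \
        | dot -Tpdf > "$out_pdf"
}
```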

Computational cluster QSUB/LSF

Configure a Snakemake profile named cluster for the required cluster system.

$ mkdir -p ~/.config/snakemake
$ cd ~/.config/snakemake
$ cookiecutter https://github.com/iromeo/generic.git

Example of ATAC-seq processing on qsub:

$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
    all --use-conda --directory <work_dir> \
    --profile cluster --cluster-config cluster_config.yaml --jobs 150 \
    --config fastq_dir=<fastq_dir> genome=<genome> \
    bowtie2_params="-X 2000 --dovetail" \
    macs2=True macs2_params="-q 0.05 -f BAMPE --nomodel --nolambda -B --call-summits" \
    omnipeak=True omnipeak_fragment=0 --rerun-incomplete

P.S. Use --config to override default options from the config.yaml file.

Try with test data

Please download the example fastq.gz files from the CD14_chr15_fastq folder.
These files are restricted to reads from human hg19 chr15 to reduce their size and make computations faster.

Launch chipseq-smk-pipeline:

$ snakemake -p -s <chipseq-smk-pipeline>/Snakefile \
    all --use-conda --cores all --directory <work_dir> \
    --config fastq_ext=fastq.gz fastq_dir=<work_dir> bw=True genome=hg19 macs2=True sicer=True omnipeak=True \
    --rerun-incomplete

Useful links
