Skip to content

Commit a6f4f8e

Browse files
katsikorakatarzyna.otylia.sikora@gmail.com
and
katarzyna.otylia.sikora@gmail.com
authored
chipqc (#1095)
* added chipqc * chipqc env * fixes #1051 * narrow and broad * internals * macs2 handling * chipqc wdir * use seacr stringent for chipqc * rm seacr relaxed * rm chipqc for Genrich * spikein chipqc * handle empty peak files csaw * histoneHMM chipqc * chipqc spikein * News.rst * cleanup snakefiles * chipqc docs * news * pytest * test dag --------- Co-authored-by: katarzyna.otylia.sikora@gmail.com <sikora@maximus.ie-freiburg.mpg.de>
1 parent 3d07957 commit a6f4f8e

File tree

17 files changed

+395
-362
lines changed

17 files changed

+395
-362
lines changed

.ci_stuff/test_dag.sh

Lines changed: 35 additions & 35 deletions
Large diffs are not rendered by default.

conda-recipe/meta.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
{% set version = "3.1.0" %}
1+
{% set version = "3.2.0" %}
22

33
package:
44
name: snakepipes

docs/content/News.rst

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,10 @@ snakePipes News
44
snakePipes 3.2.0
55
________________
66

7+
8+
* QC in the ChIPseq workflow is now performed with the ChIPQC R package
79
* added allelic-whatshap mode to mRNA seq
10+
* fixes #1048
811
* fixes #1085
912
* fixes #1083
1013
* fixes #1082

docs/content/workflows/ChIPseq.rst

Lines changed: 25 additions & 34 deletions
Original file line numberDiff line numberDiff line change
@@ -163,41 +163,28 @@ Understanding the outputs
163163
The ChIPseq pipeline will generate additional output as follows::
164164

165165
.
166+
├── AnnotatedResults_MACS2_diffChIP_k4me3
167+
├── Annotation
168+
├── bamCoverage
169+
├── CSAW_MACS2_diffChIP_k4me3
166170
├── deepTools_ChIP
167171
│   ├── bamCompare
168-
│   │   ├── sample1.filtered.log2ratio.over_SRR6761502.bw
169-
│   │   ├── sample1.filtered.subtract.SRR6761502.bw
170-
│   │   ├── sample2.filtered.log2ratio.over_SRR6761502.bw
171-
│   │   └── sample2.filtered.subtract.SRR6761502.bw
172172
│   └── plotFingerprint
173-
│   ├── plotFingerprint.metrics.txt
174-
│   └── plotFingerprint.png
173+
├── deepTools_qc
174+
│   ├── bamPEFragmentSize
175+
│   ├── estimateReadFiltering
176+
│   ├── multiBamSummary
177+
│   ├── plotCorrelation
178+
│   ├── plotCoverage
179+
│   └── plotPCA
180+
├── filtered_bam
175181
├── histoneHMM
176-
│   ├── sample2.filtered.histoneHMM-em-posterior.txt.gz
177-
│   ├── sample2.filtered.histoneHMM-regions.gff.gz
178-
│   ├── sample2.filtered.histoneHMM-regions.gff.gz.tbi
179-
│   ├── sample2.filtered.histoneHMM.txt.gz
180-
│   ├── sample2.filtered.histoneHMM-zinba-emfit.pdf
181-
│   ├── sample2.filtered.histoneHMM-zinba-params-em.RData
182-
│   └── sample2.filtered.histoneHMM-zinba-params-em.txt
183-
├── Genrich
184-
│   └── sample2.narrowPeak
185-
└── MACS2
186-
   ├── sample1.filtered.BAM_peaks.narrowPeak
187-
   ├── sample1.filtered.BAM_peaks.qc.txt
188-
   ├── sample1.filtered.BAM_peaks.xls
189-
   ├── sample1.filtered.BAMPE_peaks.narrowPeak
190-
   ├── sample1.filtered.BAMPE_peaks.xls
191-
   ├── sample1.filtered.BAMPE_summits.bed
192-
   ├── sample1.filtered.BAM_summits.bed
193-
   ├── sample2.filtered.BAM_peaks.broadPeak
194-
   ├── sample2.filtered.BAM_peaks.gappedPeak
195-
   ├── sample2.filtered.BAM_peaks.qc.txt
196-
   ├── sample2.filtered.BAM_peaks.xls
197-
   ├── sample2.filtered.BAMPE_peaks.broadPeak
198-
   ├── sample2.filtered.BAMPE_peaks.gappedPeak
199-
   └── sample2.filtered.BAMPE_peaks.xls
200-
182+
├── histoneHMM_chipqc
183+
├── logs
184+
├── MACS2
185+
├── MACS2_chipqc
186+
├── QC_report
187+
├── Sambamba
201188

202189

203190
Following up on the DNAmapping module results (see :doc:`DNAmapping`), the workflow produces the following output directories :
@@ -206,12 +193,16 @@ Following up on the DNAmapping module results (see :doc:`DNAmapping`), the workf
206193

207194
* **Genrich**: This folder contains the output of `Genrich <https://github.com/jsh58/Genrich>`__. This will only exist IF you specified ``--peakCaller Genrich`` and you have samples with non-broad peaks. The output is in narrowPeak format, like that from MACS2.
208195

209-
* **MACS2**: This folder contains the output of `MACS2 <https://github.com/taoliu/MACS>`__ on the ChIP samples, MACS2 would perform either a **narrow** or **broad** peak calling on the samples, as indicated by the ChIP sample configuration file (see :ref:`ChIPconfig`). The outputs files would contain the respective tags (**narrowPeak** or **broadPeak**). This folder will only exist if you have non-broad marks and use MACS2 for peak calling
196+
* **MACS2**: This folder contains the output of `MACS2 <https://github.com/taoliu/MACS>`__ on the ChIP samples, MACS2 would perform either a **narrow** or **broad** peak calling on the samples, as indicated by the ChIP sample configuration file (see :ref:`ChIPconfig`). The outputs files would contain the respective tags (**narrowPeak** or **broadPeak**). This folder will only exist if you have non-broad marks and use MACS2 for peak calling (default).
197+
198+
* **MACS2_chipqc**: This folder contains the output of `ChIPQC <https://bioconductor.org/packages/release/bioc/vignettes/ChIPQC/inst/doc/ChIPQC.pdf>`__ analysis of the peaks called by MACS2. If you used a different peak caller, a chipqc output folder with the peak caller in its name will be listed.
210199

211200
* **histoneHMM**: This folder contains the output of `histoneHMM <https://github.com/matthiasheinig/histoneHMM>`__. This folder will only exist if you have broad marks.
212201

213-
* **CSAW_sampleSheet**: This folder is created optionally, if you provide a sample sheet for differential binding analysis. (see :ref:`diffBinding`) CSAW will be run using peaks called by the chosen peak caller, and the output folder will be named accordingly.
214-
* **AnnotatedResults_sampleSheet**: This folder is created optionally, if you provide a sample sheet for differential binding analysis. (see :ref:`diffBinding`). Differentially bound regions annotated with distance to nearest gene are stored here.
202+
* **histoneHMM_chipqc**: This folder contains the output of `ChIPQC <https://bioconductor.org/packages/release/bioc/vignettes/ChIPQC/inst/doc/ChIPQC.pdf>`__ analysis of the peaks called by histoneHMM. This folder will only exist if you have broad marks.
203+
204+
* **CSAW_peakCaller_sampleSheet**: This folder is created optionally, if you provide a sample sheet for differential binding analysis. (see :ref:`diffBinding`) CSAW will be run using peaks called by the chosen peak caller, and the output folder will be named accordingly.
205+
* **AnnotatedResults_peakCaller_sampleSheet**: This folder is created optionally, if you provide a sample sheet for differential binding analysis. (see :ref:`diffBinding`). Differentially bound regions annotated with distance to nearest gene are stored here.
215206

216207
.. note:: Although in case of broad marks, we also perform the MACS2 `broadpeak` analysis (output available as ``MACS2/<sample>.filtered.BAM_peaks.broadPeak``), we would recommend using the histoneHMM outputs in these cases, since histoneHMM produces better results than MACS2 for broad peaks.
217208

pyproject.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,7 @@ build-backend = "setuptools.build_meta"
66
name = "snakePipes"
77
description = 'Snakemake workflows and wrappers for NGS data processing from the MPI-IE'
88
readme = "README.md"
9-
version = "3.1.0"
9+
version = "3.2.0"
1010
keywords = [
1111
"DNAmapping",
1212
"ChIPSeq",

snakePipes/common_functions.py

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -46,7 +46,8 @@ def set_env_yamls():
4646
'CONDA_SEACR_ENV': 'envs/chip_seacr.yaml',
4747
'CONDA_FQLINT_ENV': 'envs/fqlint.yaml',
4848
'CONDA_WHATSHAP_ENV': 'envs/whatshap.yaml',
49-
'CONDA_PICARD_ENV': 'envs/picard.yaml'
49+
'CONDA_PICARD_ENV': 'envs/picard.yaml',
50+
'CONDA_CHIPQC_ENV': 'envs/chipqc.yaml'
5051
}
5152

5253

@@ -783,6 +784,8 @@ def runAndCleanup(args, cmd, logfile_name):
783784
for _l in p.stdout:
784785
sys.stdout.write(_l.strip() + '\n')
785786
f.write(_l.strip() + '\n')
787+
sys.stdout.flush()
788+
f.flush()
786789
p.wait()
787790

788791
# Exit with an error if snakemake encountered an error

snakePipes/shared/rscripts/CSAW.R

Lines changed: 7 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -118,9 +118,13 @@ if (! external_bed) {
118118
})
119119
} else {
120120
allpeaks = lapply(snakemake@input[['peaks']], function(x) {
121-
bed = read.delim(paste0("../", x), header=FALSE)
122-
bed.gr = GRanges(seqnames = bed$V1, ranges = IRanges(start = bed$V2, end = bed$V3), name = bed$V4)
123-
return(bed.gr)
121+
peakfile<-paste0("../", x)
122+
if(file.exists(peakfile) & file.info(peakfile)$size > 0){
123+
bed = read.delim(peakfile, header=FALSE)
124+
bed.gr = GRanges(seqnames = bed$V1, ranges = IRanges(start = bed$V2, end = bed$V3), name = bed$V4)
125+
}else{message(paste0("Skipping peakfile ",peakfile))
126+
bed.gr=GRanges(c(seqnames=NULL,ranges=NULL,strand=NULL,name=NULL))}
127+
return(bed.gr)
124128
})
125129
}
126130
# merge

snakePipes/shared/rscripts/chipqc.R

Lines changed: 121 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,121 @@
1+
#!/usr/bin/env Rscript
2+
3+
library(GenomicRanges)
4+
library(rtracklayer)
5+
library(ChIPQC)
6+
library(yaml)
7+
library(stringr)
8+
library(purrr)
9+
10+
11+
#options(MulticoreParam=MulticoreParam(workers=8))
12+
register(MulticoreParam(8))
13+
registered()$MulticoreParam
14+
15+
bamdir<-unlist(snakemake@params[["bams"]])
16+
peakdir<-unlist(snakemake@params[["peaks"]])
17+
genome<-gsub("_.+","",snakemake@params[["genome"]])
18+
wdir <- snakemake@params[["outdir"]]
19+
blacklist<-snakemake@params[["blacklist"]]
20+
chipdict<-snakemake@input[["chipdict"]]
21+
22+
setwd(wdir)
23+
24+
spikein<-toupper(snakemake@params[["useSpikeinForNorm"]])
25+
message(paste0("useSpikeinForNorm is set to: ",spikein))
26+
if(spikein){
27+
ms<-"host"}else{ms<-"filtered"}
28+
29+
30+
sampleSheet<-snakemake@input[["sampleSheet"]]
31+
32+
#take samples,marks,replicates from the union of narrow samples and broad samples
33+
34+
#yaml<-read_yaml(chipdict,as.named.list=TRUE) #not used due to buggy conversion of yaml -> list with NULL entries -> data.frame: all-NULL entries are dropped entirely
35+
narrow_samples<-unlist(snakemake@params[["narrow_samples"]])
36+
broad_samples<-unlist(snakemake@params[["broad_samples"]])
37+
samples<-c(narrow_samples,broad_samples)
38+
ydat<-data.frame("sample"=samples,"broad"=c(rep(FALSE,length(narrow_samples)),rep(TRUE,length(broad_samples))))
39+
rownames(ydat)<-ydat$sample
40+
41+
ydat
42+
43+
#list of supported factors
44+
markv<-c("H3K4me1","H3K4me2","H3K4me3","H3K27ac","H3K27me3","H3K9me3","H3K36me3","H4K16ac","RAD21","CTCF","MSL2","BMAL1","CLOCK")
45+
a<-sapply(markv,function(X)grep(X,samples,ignore.case=TRUE),simplify=TRUE)
46+
a<-a[!lapply(a,length)<1]
47+
b<-unlist(a)
48+
names(b)<-sub("[0-9]$","",names(b))
49+
markv<-names(sort(b))
50+
if(all(is.na(markv))){
51+
markv<-rep("All",length(samples))
52+
}
53+
54+
if(all(grepl("rep",samples))){
55+
#regres<-regexpr("rep[0-9]?",samples)
56+
repv<-str_extract(samples,"rep[0-9]+")
57+
repv<-as.numeric(gsub("rep","",repv))
58+
}else{
59+
repv<-rep(1,length(samples))
60+
}
61+
62+
#check if sample sheet is NA or a file path
63+
#first implementation: ignore sample sheet and condition and replicates
64+
#if sample sheet is a file path: get condition and replicate information
65+
#the check that the sample sheet file exists is taken care of by the python wrapper
66+
if (!is.null(sampleSheet)){
67+
sampleinfo<-read.table(sampleSheet,header=TRUE,sep="\t",quote="")
68+
condv<-sampleinfo$condition[match(samples,sampleinfo$name)]
69+
}else{
70+
condv<-rep("All",length(samples))
71+
}
72+
73+
74+
sampledat<-data.frame("SampleID"=samples,"Condition"=condv,"Factor"=markv,"Replicate"=repv)
75+
76+
#ensure that samples,bamdir and peakdir are in the same order!
77+
78+
sampledat$bamReads<-bamdir[match(samples,sub(paste0("\\.",ms,".bam"),"",basename(bamdir)))]
79+
message(sprintf("Provided peak files: %s", unlist(peakdir)))
80+
##for MACS2, modify input peak files: .xls -> .narrowPeak, .broadPeak
81+
if(all(grepl("histoneHMM",peakdir))){
82+
sampledat$Peaks<-peakdir[match(samples,sub("_avgp0.5.bed","",basename(peakdir)))]
83+
}else{sampledat$Peaks<-peakdir[match(samples,sub("\\.filtered.+","",basename(peakdir)))]}
84+
85+
sampledat$PeakCaller<-"bed"
86+
sampledat$PeakFormat<-"bed"
87+
if(all(grepl("MACS2",sampledat$Peaks))){
88+
89+
#samples should be in the same order
90+
sampledat$Peaks[ydat$broad==TRUE]<-gsub(paste0(".",ms,".BAM_peaks.xls"),paste0(".",ms,".BAM_peaks.broadPeak"),sampledat$Peaks[ydat$broad==TRUE])
91+
sampledat$Peaks[ydat$broad==FALSE]<-gsub(paste0(".",ms,".BAM_peaks.xls"),paste0(".",ms,".BAM_peaks.narrowPeak"),sampledat$Peaks[ydat$broad==FALSE])
92+
sampledat$PeakFormat[ydat$broad==FALSE]<-"narrow"
93+
sampledat$PeakCaller[ydat$broad==FALSE]<-"narrow"
94+
}
95+
96+
sampledat
97+
98+
##annotation -> check for supported genome versions
99+
message(paste0("Provided genome: ",genome))
100+
supported_annotations<-c("hg19","hg18","mm10","mm9","ce6","dm3")
101+
extended_annotations<-c("GRCh38","GRCh37","GRCm38","GRCm37","ce6","dm3")
102+
#modify genome string
103+
if( genome %in% supported_annotations){
104+
105+
annotation<-genome
106+
} else if (genome %in% extended_annotations){
107+
108+
annotation<-supported_annotations[grep(genome,extended_annotations)]
109+
110+
}else {stop("No matching annotation was found.")}
111+
112+
if(file.exists(blacklist)){
113+
blist<-blacklist}else{blist<-NULL}
114+
115+
message(paste0("Using blacklist: ",blist))
116+
QC<-ChIPQC(sampledat,annotation=annotation,mapQCth=3,blacklist=blist)
117+
ChIPQCreport(QC,reportFolder=".",facet=FALSE,colourBy="Factor")
118+
119+
sink("sessionInfo.txt")
120+
sessionInfo()
121+
sink()
Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,27 @@
1+
library(GenomicRanges)
2+
3+
wdir <- snakemake@params[["outdir"]]
4+
setwd(wdir)
5+
6+
input_peaks <- snakemake@params[["input_peaks"]]
7+
8+
reslist<-lapply(input_peaks,function(X)rtracklayer::import.gff(X))
9+
names(reslist)<-gsub(".filtered.histoneHMM-regions.gff","",basename(input_peaks))
10+
for(i in seq_along(reslist)){
11+
png(paste0(names(reslist)[i],"_avg_posterior.hist.png"))
12+
hist(as.numeric(mcols(reslist[[i]])$avg_posterior),main=names(reslist)[i],xlab="Average posterior probability")
13+
abline(v=0.5,col="red")
14+
dev.off()
15+
}
16+
17+
filtlist<-lapply(reslist,function(X)X[mcols(X)$avg_posterior >= 0.5,])
18+
for(i in seq_along(filtlist)){
19+
rtracklayer::export.gff3(filtlist[[i]],paste0(names(filtlist)[i],"_avgp0.5.gff"))
20+
a<-filtlist[[i]]
21+
mcols(a)$score<-as.numeric(mcols(a)$avg_posterior)
22+
rtracklayer::export.bed(a,paste0(names(filtlist)[i],"_avgp0.5.bed"))
23+
}
24+
25+
sink("sessionInfo.txt")
26+
sessionInfo()
27+
sink()

0 commit comments

Comments
 (0)