Skip to content

Commit 5305ff4

Browse files
authored
Merge pull request #1096 from maxplanck-ie/develop
* QC in the ChIPseq workflow is now performed with the ChIPQC R package * added allelic-whatshap mode to mRNA seq * fixes #1048 * fixes #1085 * fixes #1083 * fixes #1082 * fixes #1063 * fixes #1058 * fixes #1024 * fixes minor issues with mRNAseq allelic-whatshap mode
2 parents 7df9bd3 + 8b5945d commit 5305ff4

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

55 files changed

+929
-640
lines changed

.ci_stuff/test_dag.sh

Lines changed: 82 additions & 74 deletions
Large diffs are not rendered by default.

conda-recipe/meta.yaml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
{% set version = "3.1.0" %}
1+
{% set version = "3.2.0" %}
22

33
package:
44
name: snakepipes
@@ -24,7 +24,7 @@ requirements:
2424
- snakemake-executor-plugin-cluster-generic >=1.0.9
2525
- pandas
2626
- thefuzz
27-
- pyyaml >=5.1
27+
- ruamel.yaml
2828

2929
test:
3030
commands:

docs/content/News.rst

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,22 @@
11
snakePipes News
22
===============
33

4+
snakePipes 3.2.0
5+
________________
6+
7+
8+
* QC in the ChIPseq workflow is now performed with the ChIPQC R package
9+
* added allelic-whatshap mode to mRNA seq
10+
* fixes #1048
11+
* fixes #1085
12+
* fixes #1083
13+
* fixes #1082
14+
* fixes #1063
15+
* fixes #1058
16+
* fixes #1024
17+
* fixes minor issues with mRNAseq allelic-whatshap mode
18+
19+
420
snakePipes 3.1.0
521
________________
622

docs/content/workflows/ChIPseq.rst

Lines changed: 25 additions & 34 deletions
Original file line numberDiff line numberDiff line change
@@ -163,41 +163,28 @@ Understanding the outputs
163163
The ChIPseq pipeline will generate additional output as follows::
164164

165165
.
166+
├── AnnotatedResults_MACS2_diffChIP_k4me3
167+
├── Annotation
168+
├── bamCoverage
169+
├── CSAW_MACS2_diffChIP_k4me3
166170
├── deepTools_ChIP
167171
│   ├── bamCompare
168-
│   │   ├── sample1.filtered.log2ratio.over_SRR6761502.bw
169-
│   │   ├── sample1.filtered.subtract.SRR6761502.bw
170-
│   │   ├── sample2.filtered.log2ratio.over_SRR6761502.bw
171-
│   │   └── sample2.filtered.subtract.SRR6761502.bw
172172
│   └── plotFingerprint
173-
│   ├── plotFingerprint.metrics.txt
174-
│   └── plotFingerprint.png
173+
├── deepTools_qc
174+
│   ├── bamPEFragmentSize
175+
│   ├── estimateReadFiltering
176+
│   ├── multiBamSummary
177+
│   ├── plotCorrelation
178+
│   ├── plotCoverage
179+
│   └── plotPCA
180+
├── filtered_bam
175181
├── histoneHMM
176-
│   ├── sample2.filtered.histoneHMM-em-posterior.txt.gz
177-
│   ├── sample2.filtered.histoneHMM-regions.gff.gz
178-
│   ├── sample2.filtered.histoneHMM-regions.gff.gz.tbi
179-
│   ├── sample2.filtered.histoneHMM.txt.gz
180-
│   ├── sample2.filtered.histoneHMM-zinba-emfit.pdf
181-
│   ├── sample2.filtered.histoneHMM-zinba-params-em.RData
182-
│   └── sample2.filtered.histoneHMM-zinba-params-em.txt
183-
├── Genrich
184-
│   └── sample2.narrowPeak
185-
└── MACS2
186-
   ├── sample1.filtered.BAM_peaks.narrowPeak
187-
   ├── sample1.filtered.BAM_peaks.qc.txt
188-
   ├── sample1.filtered.BAM_peaks.xls
189-
   ├── sample1.filtered.BAMPE_peaks.narrowPeak
190-
   ├── sample1.filtered.BAMPE_peaks.xls
191-
   ├── sample1.filtered.BAMPE_summits.bed
192-
   ├── sample1.filtered.BAM_summits.bed
193-
   ├── sample2.filtered.BAM_peaks.broadPeak
194-
   ├── sample2.filtered.BAM_peaks.gappedPeak
195-
   ├── sample2.filtered.BAM_peaks.qc.txt
196-
   ├── sample2.filtered.BAM_peaks.xls
197-
   ├── sample2.filtered.BAMPE_peaks.broadPeak
198-
   ├── sample2.filtered.BAMPE_peaks.gappedPeak
199-
   └── sample2.filtered.BAMPE_peaks.xls
200-
182+
├── histoneHMM_chipqc
183+
├── logs
184+
├── MACS2
185+
├── MACS2_chipqc
186+
├── QC_report
187+
├── Sambamba
201188

202189

203190
Following up on the DNAmapping module results (see :doc:`DNAmapping`), the workflow produces the following output directories :
@@ -206,12 +193,16 @@ Following up on the DNAmapping module results (see :doc:`DNAmapping`), the workf
206193

207194
* **Genrich**: This folder contains the output of `Genrich <https://github.com/jsh58/Genrich>`__. This will only exist IF you specified ``--peakCaller Genrich`` and you have samples with non-broad peaks. The output is in narrowPeak format, like that from MACS2.
208195

209-
* **MACS2**: This folder contains the output of `MACS2 <https://github.com/taoliu/MACS>`__ on the ChIP samples, MACS2 would perform either a **narrow** or **broad** peak calling on the samples, as indicated by the ChIP sample configuration file (see :ref:`ChIPconfig`). The outputs files would contain the respective tags (**narrowPeak** or **broadPeak**). This folder will only exist if you have non-broad marks and use MACS2 for peak calling
196+
* **MACS2**: This folder contains the output of `MACS2 <https://github.com/taoliu/MACS>`__ on the ChIP samples, MACS2 would perform either a **narrow** or **broad** peak calling on the samples, as indicated by the ChIP sample configuration file (see :ref:`ChIPconfig`). The outputs files would contain the respective tags (**narrowPeak** or **broadPeak**). This folder will only exist if you have non-broad marks and use MACS2 for peak calling (default).
197+
198+
* **MACS2_chipqc**: This folder contains the output of `ChIPQC <https://bioconductor.org/packages/release/bioc/vignettes/ChIPQC/inst/doc/ChIPQC.pdf>`__ analysis of the peaks called by MACS2. If you used a different peak caller, a chipqc output folder with the peak caller in its name will be listed.
210199

211200
* **histoneHMM**: This folder contains the output of `histoneHMM <https://github.com/matthiasheinig/histoneHMM>`__. This folder will only exist if you have broad marks.
212201

213-
* **CSAW_sampleSheet**: This folder is created optionally, if you provide a sample sheet for differential binding analysis. (see :ref:`diffBinding`) CSAW will be run using peaks called by the chosen peak caller, and the output folder will be named accordingly.
214-
* **AnnotatedResults_sampleSheet**: This folder is created optionally, if you provide a sample sheet for differential binding analysis. (see :ref:`diffBinding`). Differentially bound regions annotated with distance to nearest gene are stored here.
202+
* **histoneHMM_chipqc**: This folder contains the output of `ChIPQC <https://bioconductor.org/packages/release/bioc/vignettes/ChIPQC/inst/doc/ChIPQC.pdf>`__ analysis of the peaks called by histoneHMM. This folder will only exist if you have broad marks.
203+
204+
* **CSAW_peakCaller_sampleSheet**: This folder is created optionally, if you provide a sample sheet for differential binding analysis. (see :ref:`diffBinding`) CSAW will be run using peaks called by the chosen peak caller, and the output folder will be named accordingly.
205+
* **AnnotatedResults_peakCaller_sampleSheet**: This folder is created optionally, if you provide a sample sheet for differential binding analysis. (see :ref:`diffBinding`). Differentially bound regions annotated with distance to nearest gene are stored here.
215206

216207
.. note:: Although in case of broad marks, we also perform the MACS2 `broadpeak` analysis (output available as ``MACS2/<sample>.filtered.BAM_peaks.broadPeak``), we would recommend using the histoneHMM outputs in these cases, since histoneHMM produces better results than MACS2 for broad peaks.
217208

docs/content/workflows/mRNAseq.rst

Lines changed: 12 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -9,7 +9,7 @@ What it does
99
The snakePipes mRNAseq workflow allows users to process their single or paired-end
1010
mRNAseq fastq files upto the point of gene/transcript-counts and differential expression.
1111
It also allows full allele-specific mRNAseq analysis (up to allele-specific
12-
differential expression) using the *allelic-mapping* mode.
12+
differential expression) using the *allelic-mapping* or *allelic-whatshap* mode.
1313

1414
.. image:: ../images/RNAseq_pipeline.png
1515

@@ -94,6 +94,8 @@ There is a configuration file in ``snakePipes/workflows/mRNAseq/defaults.yaml``:
9494
# for allele-spcific mapping
9595
SNPFile:
9696
NMaskedIndex:
97+
# for allelic-whatshap
98+
pvcf:
9799
#### Flag to control the pipeline entry point
98100
fromBAM: False
99101
bamExt: '.bam'
@@ -202,6 +204,15 @@ Allele-specific, gene-level differential expression analysis is then performed u
202204

203205
**allelic-counting** mode requires the user to input, per sample, 4 bam files, corresponding to haplotype1, haplotype2, unassigned and haplotagged , e.g. as generated by whatshap. The respective suffixes ".genome1", ".genome2", ".unassigned", ".alelle_flagged" are required to be followed by the bam extention ".sorted.bam". This mode is mutually exclusive with "deepTools_qc". Only the allelic version of deepTools qc will be run, by default. Allelic version of featureCounts will be run by default. If sample sheet is provided, allelic DESeq2- or allelic Salmon-based differential gene expression analysis will be run.
204206

207+
208+
"allelic-whatshap"
209+
~~~~~~~~~~~~~~~~~~
210+
211+
**allelic-whatshap** mode applies a standard alignment to a nonmasked genome with STAR, followed by allele-specific splitting
212+
of mapped files with whatshap, requiring a phased vcf file as input ( ``--phased-vcf`` ). The file must be bzip-compressed and tabix-indexed as well. Gene-level quantification is performed for each allele using **featureCounts**.
213+
Allele-specific, gene-level differential expression analysis is then performed using **DESeq2**.
214+
215+
205216
"alignment-free"
206217
~~~~~~~~~~~~~~~~
207218

pyproject.toml

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,7 @@ build-backend = "setuptools.build_meta"
66
name = "snakePipes"
77
description = 'Snakemake workflows and wrappers for NGS data processing from the MPI-IE'
88
readme = "README.md"
9-
version = "3.1.0"
9+
version = "3.2.0"
1010
keywords = [
1111
"DNAmapping",
1212
"ChIPSeq",
@@ -35,7 +35,8 @@ dependencies = [
3535
"snakemake >= 8",
3636
"pandas",
3737
"thefuzz",
38-
"pyyaml >= 5.1",
38+
# "pyyaml >= 5.1",
39+
"ruamel.yaml",
3940
"snakemake-executor-plugin-cluster-generic >= 1.0.9",
4041
"graphviz"
4142
]

snakePipes/common_functions.py

Lines changed: 29 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,8 @@
55
import subprocess
66
import os
77
import re
8-
import yaml
8+
#import yaml
9+
from ruamel.yaml import YAML
910
import glob
1011
import sys
1112
import shutil
@@ -43,7 +44,10 @@ def set_env_yamls():
4344
'CONDA_SAMBAMBA_ENV': 'envs/sambamba.yaml',
4445
'CONDA_pysam_ENV': 'envs/pysam.yaml',
4546
'CONDA_SEACR_ENV': 'envs/chip_seacr.yaml',
46-
'CONDA_FQLINT_ENV': 'envs/fqlint.yaml'
47+
'CONDA_FQLINT_ENV': 'envs/fqlint.yaml',
48+
'CONDA_WHATSHAP_ENV': 'envs/whatshap.yaml',
49+
'CONDA_PICARD_ENV': 'envs/picard.yaml',
50+
'CONDA_CHIPQC_ENV': 'envs/chipqc.yaml'
4751
}
4852

4953

@@ -86,8 +90,10 @@ def namesOKinR(sampleNames):
8690

8791

8892
def load_configfile(configFiles, verbose, info='Config'):
93+
yaml=YAML(typ='safe')
8994
with open(configFiles, "r") as f:
90-
config = yaml.load(f, Loader=yaml.FullLoader)
95+
#config = yaml.load(f, Loader=yaml.FullLoader)
96+
config = yaml.load(f)
9197

9298
config = sanity_dict_clean(config)
9399

@@ -100,9 +106,15 @@ def load_configfile(configFiles, verbose, info='Config'):
100106
return config
101107

102108

103-
def write_configfile(configFile, config):
109+
def write_configfile(configFile, config, trafo):
110+
yaml=YAML(typ='safe')
111+
yaml.default_flow_style = False
104112
with open(configFile, 'w') as f:
105-
yaml.dump(config, f, default_flow_style=False)
113+
#yaml.dump(config, f, default_flow_style=False)
114+
if trafo:
115+
yaml.dump(config, f, transform=trafo)
116+
else:
117+
yaml.dump(config, f)
106118

107119

108120
# returns all key-value pairs that are different from dict1 to dict2
@@ -630,7 +642,7 @@ def commonYAMLandLogs(baseDir, workflowDir, defaults, args, callingScript):
630642
# save to configs.yaml in outdir
631643
config = defaults
632644
config.update(vars(args)) # This allows modifications of args after handling a user config file to still make it to the YAML given to snakemake!
633-
write_configfile(os.path.join(args.outdir, '{}.config.yaml'.format(workflowName)), config)
645+
write_configfile(os.path.join(args.outdir, '{}.config.yaml'.format(workflowName)), config, trafo=None)
634646

635647
# merge cluster config files: 1) global one, 2) workflow specific one, 3) user provided one
636648
cfg = load_configfile(os.path.join(baseDir, "shared", "defaults.yaml"), False, "defaults")
@@ -717,7 +729,7 @@ def print_DAG(args, snakemake_cmd, callingScript, defaults):
717729
config['verbose'] = False
718730
write_configfile(
719731
os.path.join(args.outdir,
720-
'{}.config.yaml'.format(workflowName)), config)
732+
'{}.config.yaml'.format(workflowName)), config, trafo=None)
721733

722734
DAGproc = subprocess.Popen(
723735
snakemake_cmd + " --rulegraph -q ",
@@ -732,7 +744,7 @@ def print_DAG(args, snakemake_cmd, callingScript, defaults):
732744
config['verbose'] = oldVerbose
733745
write_configfile(
734746
os.path.join(args.outdir, '{}.config.yaml'.format(workflowName)),
735-
config)
747+
config, trafo=None)
736748

737749

738750
def logAndExport(args, workflowName):
@@ -772,6 +784,8 @@ def runAndCleanup(args, cmd, logfile_name):
772784
for _l in p.stdout:
773785
sys.stdout.write(_l.strip() + '\n')
774786
f.write(_l.strip() + '\n')
787+
sys.stdout.flush()
788+
f.flush()
775789
p.wait()
776790

777791
# Exit with an error if snakemake encountered an error
@@ -792,6 +806,9 @@ def runAndCleanup(args, cmd, logfile_name):
792806
if args.emailAddress:
793807
sendEmail(args, 0)
794808

809+
def tr(s):
810+
return s.replace('null', 'None')
811+
795812

796813
def predict_chip_dict(wdir, input_pattern_str, bamExt, fromBAM=None):
797814
"""
@@ -854,14 +871,14 @@ def predict_chip_dict(wdir, input_pattern_str, bamExt, fromBAM=None):
854871
print("No control sample found!")
855872

856873
chip_dict_pred["chip_dict"][i] = {}
857-
chip_dict_pred["chip_dict"][i]['control'] = tmp
874+
chip_dict_pred["chip_dict"][i]['Control'] = tmp if tmp != "" else None
858875
if re.match(".*(H3K4me1|H3K36me3|H3K9me3|H3K27me3).*", i, re.IGNORECASE):
859-
chip_dict_pred["chip_dict"][i]['broad'] = True
876+
chip_dict_pred["chip_dict"][i]['Broad'] = True
860877
else:
861-
chip_dict_pred["chip_dict"][i]['broad'] = False
878+
chip_dict_pred["chip_dict"][i]['Broad'] = False
862879

863880
outfile = os.path.join(wdir, "chip_seq_sample_config.PREDICTED.yaml")
864-
write_configfile(outfile, chip_dict_pred)
881+
write_configfile(outfile, chip_dict_pred,trafo=tr)
865882
print("---------------------------------------------------------------------------------------")
866883
print("ChIPseq sample configuration is written to file ", outfile)
867884
print("Please check and modify this file - this is just a guess! Then run the workflow with it.")

snakePipes/shared/defaults.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3,7 +3,7 @@
33
# Note that due to limitations in yaml.dump, only very basic structures are
44
# permitted here.
55
################################################################################
6-
snakemakeOptions: ''
6+
snakemakeOptions: ' --keep-going '
77
organismsDir: 'shared/organisms'
88
snakemakeProfile: 'shared/profiles/local'
99
tempDir: /scratch/local

snakePipes/shared/organisms/GRCh38_gencode40.yaml

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -91,6 +91,5 @@ ignoreForNormalization: chrX chrY chrM GL000008.2 GL000009.2 GL000194.1 GL000195
9191
KI270748.1 KI270749.1 KI270750.1 KI270751.1 KI270752.1 KI270753.1 KI270754.1 KI270755.1
9292
KI270756.1 KI270757.1
9393
known_splicesites: /data/repository/organisms/GRCh38_gencode_40/gencode/release-40/HISAT2/genome.ss
94-
rmsk_file: ''
9594
star_index: /data/repository/organisms/GRCh38_gencode_40/Indices/STAR_2.7.10
9695
rmsk_file: /data/repository/organisms/GRCh38_gencode_40/repeatMasker/genome.fa.tbl

snakePipes/shared/profiles/local/config.yaml

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -60,6 +60,8 @@ set-resources:
6060
mem: 2G
6161
CollectInsertSizeMetrics:
6262
mem: 1G
63+
conversionRate:
64+
mem: 4G
6365
correct_matrix:
6466
mem: 7G
6567
CpG_report:
@@ -205,4 +207,4 @@ set-resources:
205207
velo_to_sce:
206208
mem: 30G
207209
velocyto:
208-
mem: 20G
210+
mem: 20G

snakePipes/shared/profiles/snakepipes_genericprofile/config.yaml

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,7 @@ cluster-generic-submit-cmd:
66
--cpus-per-task={threads} \
77
--mem-per-cpu={resources.mem} \
88
--job-name=snakePipes_{rule}-{wildcards} \
9-
--output=logs/{rule}/{rule}-{wildcards}-%j.out \
9+
--output=logs/{rule}/{rule}-{wildcards}-%j-%N.out \
1010
--time={resources.time} \
1111
--parsable"
1212
jobs: 20
@@ -72,6 +72,8 @@ set-resources:
7272
mem: 2G
7373
CollectInsertSizeMetrics:
7474
mem: 1G
75+
conversionRate:
76+
mem: 4G
7577
correct_matrix:
7678
mem: 7G
7779
CpG_report:

snakePipes/shared/rscripts/CSAW.R

Lines changed: 7 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -118,9 +118,13 @@ if (! external_bed) {
118118
})
119119
} else {
120120
allpeaks = lapply(snakemake@input[['peaks']], function(x) {
121-
bed = read.delim(paste0("../", x), header=FALSE)
122-
bed.gr = GRanges(seqnames = bed$V1, ranges = IRanges(start = bed$V2, end = bed$V3), name = bed$V4)
123-
return(bed.gr)
121+
peakfile<-paste0("../", x)
122+
if(file.exists(peakfile) & file.info(peakfile)$size > 0){
123+
bed = read.delim(peakfile, header=FALSE)
124+
bed.gr = GRanges(seqnames = bed$V1, ranges = IRanges(start = bed$V2, end = bed$V3), name = bed$V4)
125+
}else{message(paste0("Skipping peakfile ",peakfile))
126+
bed.gr=GRanges(c(seqnames=NULL,ranges=NULL,strand=NULL,name=NULL))}
127+
return(bed.gr)
124128
})
125129
}
126130
# merge

snakePipes/shared/rscripts/DB_functions.R

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -343,6 +343,8 @@ writeOutput_chip <- function(chipResultObject, outfile_prefix, fdrcutoff,lfccuto
343343
full_res[,2]<-full_res[,2]-1
344344
full_res[,2]<-format(full_res[,2], scientific = FALSE,trim=TRUE)
345345
full_res[,3]<-format(full_res[,3], scientific = FALSE,trim=TRUE)
346+
#write full results
347+
write.table(full_res,file="Full.results.bed",row.names=FALSE,col.names=FALSE,sep="\t",quote=FALSE)
346348
##filter full result for FDR and LFC and write to output
347349
full_res.filt<-subset(full_res,(FDR<=fdrcutoff)&(abs(best.logFC)>=lfccutoff))
348350
if(nrow(full_res.filt)>0){

0 commit comments

Comments
 (0)