Skip to content

Commit 189c835

Browse files
committed
Updates for issue #94
- fix issue #94 (#94) by passing unsorted raw Bismark alignment output to deduplication step - updated versions of all tools to latest - updated broken documentation links (updates to TrimGalore and Bismark included enhancements/new locations for tool documentation)
1 parent 90d6bb5 commit 189c835

File tree

1 file changed

+69
-40
lines changed

1 file changed

+69
-40
lines changed

Methyl-Seq/Pipeline_GL-DPPD-7113_Versions/GL-DPPD-7113.md

Lines changed: 69 additions & 40 deletions
Original file line numberDiff line numberDiff line change
@@ -63,19 +63,21 @@ Jonathan Galazka (GeneLab Project Scientist)
6363
# Software used
6464

6565
|Program|Version|Relevant Links|
66-
|:------|:-----:|------:|
67-
|FastQC| 0.11.9 |[https://www.bioinformatics.babraham.ac.uk/projects/fastqc/](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)|
68-
|MultiQC| 1.12 |[https://multiqc.info/](https://multiqc.info/)|
69-
|Cutadapt| 3.7 |[https://cutadapt.readthedocs.io/en/stable/](https://cutadapt.readthedocs.io/en/stable/)|
70-
|TrimGalore!| 0.6.7 |[https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)|
71-
|Bismark| 0.23.1 |[https://github.com/FelixKrueger/Bismark](https://github.com/FelixKrueger/Bismark)|
72-
|bowtie2| 2.4.4 |[https://github.com/BenLangmead/bowtie2#overview](https://github.com/BenLangmead/bowtie2#overview)|
73-
|hisat2| 2.2.1 | [https://github.com/DaehwanKimLab/hisat2](https://github.com/DaehwanKimLab/hisat2)|
74-
|samtools| 1.16.1 |[https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)|
75-
|qualimap| 2.2.2d |[http://qualimap.conesalab.org/](http://qualimap.conesalab.org/)|
76-
|methylKit| 1.20.0 |[https://bioconductor.org/packages/release/bioc/html/methylKit.html](https://bioconductor.org/packages/release/bioc/html/methylKit.html)|
77-
|tidyverse| 1.3.2 |[https://tidyverse.tidyverse.org/](https://tidyverse.tidyverse.org/)|
78-
|genomation| 1.26.0 |[https://bioconductor.org/packages/release/bioc/html/genomation.html](https://bioconductor.org/packages/release/bioc/html/genomation.html)|
66+
|:-----------|:------:|------:|
67+
|FastQC | 0.12.0 |[https://www.bioinformatics.babraham.ac.uk/projects/fastqc/](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)|
68+
|MultiQC | 1.22.1 |[https://multiqc.info/](https://multiqc.info/)|
69+
|Cutadapt | 4.8 |[https://cutadapt.readthedocs.io/en/stable/](https://cutadapt.readthedocs.io/en/stable/)|
70+
|TrimGalore! | 0.6.7 |[https://github.com/FelixKrueger/TrimGalore](https://github.com/FelixKrueger/TrimGalore)|
71+
|Bismark | 0.24.2 |[https://github.com/FelixKrueger/Bismark](https://github.com/FelixKrueger/Bismark)|
72+
|bowtie2 | 2.5.4 |[https://github.com/BenLangmead/bowtie2#overview](https://github.com/BenLangmead/bowtie2#overview)|
73+
|hisat2 | 2.2.1 |[https://github.com/DaehwanKimLab/hisat2](https://github.com/DaehwanKimLab/hisat2)|
74+
|samtools | 1.20 |[https://github.com/samtools/samtools#samtools](https://github.com/samtools/samtools#samtools)|
75+
|qualimap | 2.3 |[http://qualimap.conesalab.org/](http://qualimap.conesalab.org/)|
76+
|R | 4.4.0 |[https://www.r-project.org](https://www.r-project.org)|
77+
|tidyverse | 2.0.0 |[https://tidyverse.tidyverse.org/](https://tidyverse.tidyverse.org/)|
78+
|Bioconductor| 3.19 |[https://bioconductor.org](https://bioconductor.org)
79+
|methylKit | 1.30.0 |[https://bioconductor.org/packages/release/bioc/html/methylKit.html](https://bioconductor.org/packages/release/bioc/html/methylKit.html)|
80+
|genomation | 1.36.0 |[https://bioconductor.org/packages/release/bioc/html/genomation.html](https://bioconductor.org/packages/release/bioc/html/genomation.html)|
7981

8082
---
8183

@@ -139,12 +141,12 @@ multiqc --interactive -o raw_multiqc_data/ -n raw_multiqc -z raw_fastqc_output/
139141
---
140142

141143
## 2. Adapter trimming/quality filtering
142-
See `trim_galore --help` [menu](https://github.com/FelixKrueger/TrimGalore/blob/072ecf9a1f80f9eb41c8116c32284492f481cbbb/trim_galore#L3035) for more info on any of the below.
144+
See `trim_galore --help` or [TrimGalore User Guide](https://github.com/FelixKrueger/TrimGalore/blob/0.6.10/Docs/Trim_Galore_User_Guide.md) for more info on any of the below.
143145

144146
<br>
145147

146148
### If not RRBS or if RRBS using MseI digestion
147-
Note that the `--rrbs` option is **not** appropriate when RRBS (reduced representation bisulfite sequencing) libraries were prepared with MseI digestion (see `trim_galore --help` menu [(starting at this line)](https://github.com/FelixKrueger/TrimGalore/blob/072ecf9a1f80f9eb41c8116c32284492f481cbbb/trim_galore#L3337).
149+
Note that the `--rrbs` option is **not** appropriate when RRBS (reduced representation bisulfite sequencing) libraries were prepared with MseI digestion (see the TrimGalore User Guide [Note for RRBS using MseI](https://github.com/FelixKrueger/TrimGalore/blob/0.6.10/Docs/Trim_Galore_User_Guide.md#rrbs-specific-options-mspi-digested-material).
148150

149151
**Single-end example**
150152

@@ -177,7 +179,7 @@ mv sample-1_R2_raw_val_2.fq.gz sample-1_R2_trimmed.fastq.gz
177179
<br>
178180

179181
### If RRBS with MspI digestion
180-
Note that if the library preparation was non-directional, the `--non_directional` flag needs to be added to this command (whether single-end or paired-end; see `trim_galore --help` menu [e.g., here](https://github.com/FelixKrueger/TrimGalore/blob/072ecf9a1f80f9eb41c8116c32284492f481cbbb/trim_galore#L3315)).
182+
Note that if the library preparation was non-directional, the `--non_directional` flag needs to be added to this command (whether single-end or paired-end; see [TrimGalore User Guide](https://github.com/FelixKrueger/TrimGalore/blob/0.6.10/Docs/Trim_Galore_User_Guide.md#rrbs-specific-options-mspi-digested-material)).
181183

182184
**Single-end example**
183185

@@ -214,7 +216,7 @@ mv sample-1_R2_raw_val_2.fq.gz sample-1_R2_trimmed.fastq.gz
214216
### If RRBS with NuGEN ovation kit
215217
Libraries prepared with the NuGEN ovation kit need to be procesed with an additional script provided by the company's [github](https://github.com/nugentechnologies/NuMetRRBS#analysis-guide-for-nugen-ovation-rrbs-methyl-seq).
216218

217-
Following their instructions, we first run an adapter-trimming/quality-filtering step with trimgalore. Note that the `--rrbs` option is not appropriate to pass to trimgalore when this kit is used (see `trim_galore --help` menu [(starting at this line)](https://github.com/FelixKrueger/TrimGalore/blob/072ecf9a1f80f9eb41c8116c32284492f481cbbb/trim_galore#L3329). Then we utilize the company's script to remove the random diversity sequences added by the kit.
219+
Following their instructions, we first run an adapter-trimming/quality-filtering step with trimgalore. Note that the `--rrbs` option is not appropriate to pass to trimgalore when this kit is used (see Bismark documentation for [RRBS NuGEN Ovation Methyl-Seq System](http://felixkrueger.github.io/Bismark/bismark/library_types/#rrbs-nugen-ovation-methyl-seq-system). Then we utilize the company's script to remove the random diversity sequences added by the kit.
218220

219221
#### First adapter-trimming/quality-filtering with trimgalore
220222

@@ -424,7 +426,7 @@ bismark_genome_preparation --bowtie2 \
424426

425427
### 4b. Align
426428

427-
Note that if the library preparation was non-directional, the `--non_directional` flag needs to be added to this command (whether single-end or paired-end).
429+
Note that if the library preparation was non-directional, the `--non_directional` flag needs to be added to this command (whether single-end or paired-end). For a full list of alignment option recommendations library type and/or commercially available kit, please see the library page in the [Bismark documentation](http://felixkrueger.github.io/Bismark/bismark/library_types/)
428430

429431
**Single-end example**
430432

@@ -438,8 +440,8 @@ bismark --bowtie2 \
438440
--genome_folder bismark_reference_genome/ \
439441
sample-1_trimmed.fastq.gz
440442

441-
# renaming output files so they are cleaner and will work with sorted bam file/auto-detection
442-
# of bismark2summary later
443+
# renaming output files so they are cleaner and will work with sorted bam
444+
# file/auto-detection of bismark2summary later
443445
mv sample-1_trimmed_bismark_bt2_SE_report.txt sample-1_bismark_bt2_sorted_SE_report.txt
444446
mv sample-1_trimmed_bismark_bt2.nucleotide_stats.txt sample-1_bismark_bt2.nucleotide_stats.txt
445447
mv sample-1_trimmed_bismark_bt2.bam sample-1_bismark_bt2.bam
@@ -458,8 +460,8 @@ bismark --bowtie2 \
458460
-1 sample-1_R1_trimmed.fastq.gz \
459461
-2 sample-1_R2_trimmed.fastq.gz
460462

461-
# renaming output files so they are cleaner and will work with sorted bam file/auto-detection
462-
# of bismark2summary later
463+
# renaming output files so they are cleaner and will work with sorted bam
464+
# file/auto-detection of bismark2summary later
463465
mv sample-1_R1_trimmed_bismark_bt2_PE_report.txt sample-1_bismark_bt2_sorted_PE_report.txt
464466
mv sample-1_R1_trimmed_bismark_bt2.nucleotide_stats.txt sample-1_bismark_bt2.nucleotide_stats.txt
465467
mv sample-1_R1_trimmed_bismark_bt2_pe.bam sample-1_bismark_bt2_pe.bam
@@ -557,28 +559,53 @@ qualimap bamqc -bam sample-1_bismark_bt2_sorted.bam \
557559

558560
---
559561

560-
## 6. Deduplicate (skip if data are RRBS)
562+
## 6a. Deduplicate (skip if data are RRBS)
561563
> **NOTE**
562-
> This step should **not** be done if the data are RRBS (reduced representation bisulfite sequencing; see e.g., [bismark documentation](https://github.com/FelixKrueger/Bismark/tree/master/Docs#iii-running-deduplicate_bismark)).
564+
> This step should **not** be done if the data are RRBS (reduced representation bisulfite sequencing; see e.g., [bismark documentation](https://felixkrueger.github.io/Bismark/bismark/deduplication/)).
563565
564566
```bash
565-
deduplicate_bismark sample-1_bismark_bt2_sorted.bam
567+
deduplicate_bismark sample-1_bismark_bt2.bam
566568
```
567569

568570
**Parameter Definitions:**
569571

570-
* sample-1_bismark_bt2_sorted.bam - the input bam file, provided as a positional argument
572+
* sample-1_bismark_bt2.bam - the input bam file, provided as a positional argument
571573

572574
**Input data:**
573575

574-
* sample-1_bismark_bt2_sorted.bam (bismark bowtie2 alignment bam file sorted by chromosomal coordinates, output from [Step 4c](#4c-sort-alignment-files) above)
576+
* sample-1_bismark_bt2.bam (unsorted bismark bowtie2 alignment bam file, output from [Step 4b](#4b-align) above)
575577

576578
**Output data:**
577579

578-
* **\*.deduplicated.bam** (bismark bowtie2 alignment bam file sorted by chromosomal coordinates, with duplicates removed)
580+
* \*.deduplicated.bam (unsorted bismark bowtie2 alignment bam file, with duplicates removed)
579581
* **\*.deduplication_report.txt** (report file containing deduplication information)
580582

581583

584+
<br>
585+
586+
### 6b. Sort Deduplicated Alignment Files
587+
588+
```bash
589+
samtools sort -@ 4 \
590+
-o sample-1_bismark_bt2_sorted.deduplicated.bam \
591+
sample-1_bismark_bt2.deduplicated.bam
592+
```
593+
594+
**Parameter Definitions:**
595+
596+
* `sort` - specifies to use the `sort` sub-program of `samtools`
597+
* `-@` - specifies the number of threads to use
598+
* `-o` - specifies the output file name
599+
* sample-1_bismark_bt2.deduplicated.bam - the input bam file, provided as a positional argument
600+
601+
**Input data:**
602+
603+
* sample-1_bismark_bt2.deduplicated.bam (bismark bowtie2 alignment bam file, output from [Step 6a.](#6a-deduplicate-skip-if-data-are-rrbs) above)
604+
605+
**Output data:**
606+
607+
* **sample-1_bismark_bt2_sorted.deduplicated.bam** (bismark bowtie2 alignment bam file sorted by chromosomal coordinates, with duplicated removed)
608+
582609
<br>
583610

584611
---
@@ -597,8 +624,9 @@ bismark_methylation_extractor --parallel 4 \
597624
--cytosine_report \
598625
--genome_folder bismark_reference_genome/ \
599626
sample-1_bismark_bt2_sorted.bam
600-
# note, if *not working with RRBS data, input should be the
601-
# deduplicated version (sample-1_bismark_bt2_sorted.deduplicated.bam) produced in step 6 above
627+
# note, if *not working with RRBS data, input should be the deduplicated
628+
# version (sample-1_bismark_bt2_sorted.deduplicated.bam) produced in
629+
# step 6 above
602630
```
603631

604632
**Paired-end example**
@@ -614,37 +642,38 @@ bismark_methylation_extractor --parallel 4 \
614642
--ignore_r2 2 \
615643
--ignore_3prime_r2 2 \
616644
sample-1_bismark_bt2_sorted.bam
617-
# note, if *not working with RRBS data, input should be the
618-
# deduplicated version (sample-1_bismark_bt2_sorted.deduplicated.bam) produced in step 6 above
645+
# note, if *not working with RRBS data, input should be the deduplicated
646+
# version (sample-1_bismark_bt2_sorted.deduplicated.bam) produced in
647+
# step 6 above
619648
```
620649

621650

622651
**Parameter Definitions:**
623652

624653
* `--parallel` - specifies the number of cores to use for methylation extraction, note: the program will utilize ~3X the number specified
625-
* `--bedGraph` - instructs the program to generate a sorted bedGraph file that reports the position of a given cytosine and its methlyation state (by default, only methylated CpGs are reported - see bismark docs [here](https://github.com/FelixKrueger/Bismark/tree/master/Docs#optional-bedgraph-output) for more info)
654+
* `--bedGraph` - instructs the program to generate a sorted bedGraph file that reports the position of a given cytosine and its methlyation state (by default, only methylated CpGs are reported - see bedgraph options in [bismark documentation](https://felixkrueger.github.io/Bismark/options/methylation_extraction/#bedgraph-specific-options) for more info)
626655
* `--gzip` - specifies to gzip-compress the methylation extractor output files
627656
* `--comprehensive` - specifies to merge all four possible strand-specific methylation info into context-dependent output files
628657
* `--output_dir` - the output directory to store results
629658
* `--cytosine_report` - instructions the program to produce a genome-wide methylation report for all cytosines in the genome
630659
* `--genome_folder` - a directory holding the reference genome in fasta format (this pipeline version uses the Ensembl fasta file indicated in the `fasta` column of the [GL-DPPD-7110_annotations.csv](../../GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv) GeneLab Annotations file))
631-
* `--ignore_r2` - specifies how many bases to ignore from the 5' end of the reverse reads (bismark docs recommend 2, see [here](https://github.com/FelixKrueger/Bismark/tree/master/Docs#appendix-iii-bismark-methylation-extractor))
660+
* `--ignore_r2` - specifies how many bases to ignore from the 5' end of the reverse reads (bismark docs recommend 2, see [bismark documentation](https://felixkrueger.github.io/Bismark/options/methylation_extraction/#options))
632661
> Note: The first couple of bases in the reverse read of bisulfite sequence experiments show a severe bias towards non-methylation as a result of end-reparing sonicated fragments with unmentulated cytosines, so it is recommened to remove the first couple basepairs
633-
* `--ignore_3prime_r2` - specifies how many bases to ignore from the 3' end of the reverse reads to remove unwanted biases from the end of reads (this is utilized in the [nf-core methylseq workflow](https://nf-co.re/methylseq), set at [this line](https://github.com/nf-core/methylseq/blob/03972a686bedeb2920803cd575f4d671e9135af0/main.nf#L643))
662+
* `--ignore_3prime_r2` - specifies how many bases to ignore from the 3' end of the reverse reads to remove unwanted biases from the end of reads (For specific recommnendations see Bismark documentation on [Library Types](https://felixkrueger.github.io/Bismark/bismark/library_types/))
634663
* sample-1_bismark_bt2_sorted.bam - the input bam file, provided as a positional argument
635664

636665
**Input data:**
637666

638-
* sample-1_bismark_bt2_sorted*.bam (bismark bowtie2 alignment bam file sorted by chromosomal coordinates, output from [Step 4c](#4c-sort-alignment-files) above if data are RRBS, or deduplicated bam file from [step 6](#6-deduplicate-skip-if-data-are-rrbs) if data are not RRBS and the bam file was deduplicated (e.g., sample-1_bismark_bt2_sorted.deduplicated.bam from above))
667+
* sample-1_bismark_bt2_sorted*.bam (bismark bowtie2 alignment bam file sorted by chromosomal coordinates, output from [Step 4c](#4c-sort-alignment-files) above if data are RRBS, or deduplicated bam file from [step 6](#6b-sort-deduplicated-alignment-files) if data are not RRBS and the bam file was deduplicated (e.g., sample-1_bismark_bt2_sorted.deduplicated.bam from above))
639668
* a directory holding the reference genome in fasta format (this pipeline version uses the Ensembl fasta file indicated in the `fasta` column of the [GL-DPPD-7110_annotations.csv](../../GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv) GeneLab Annotations file))
640669

641670

642671
**Output data:**
643672

644-
* **\*\_context\_\*.txt.gz** (bismark methylation-call files for CpG, CHG, and CHH contexts that were detected; see [bismark documentation](https://github.com/FelixKrueger/Bismark/tree/master/Docs), namely [here](https://github.com/FelixKrueger/Bismark/tree/master/Docs#methylation-call) for symbols, and [here](https://github.com/FelixKrueger/Bismark/tree/master/Docs#iv-bismark-methylation-extractor) for file format)
645-
* **\*.bedGraph.gz** (gzip-compressed bedGraph-formatted file of methylation percentages of each CpG site; see bismark docs [here](https://github.com/FelixKrueger/Bismark/tree/master/Docs#optional-bedgraph-output))
646-
* **\*.bismark.cov.gz** (gzip-compressed bedGraph-formatted file like above "\*.bedGraph.gz", but also including 2 more columns of methylated and unmethylated counts at the specified position; see bismark docs [here](https://github.com/FelixKrueger/Bismark/tree/master/Docs#optional-bedgraph-output))
647-
* **\*.M-bias.txt** (text file with methylation information in the context of the position in reads, helpful for investigating bias as a function of base position in the read; see bismark documentation [here](https://github.com/FelixKrueger/Bismark/tree/master/Docs#m-bias-plot))
673+
* **\*\_context\_\*.txt.gz** (bismark methylation-call files for CpG, CHG, and CHH contexts that were detected; see [bismark documentation](https://felixkrueger.github.io/Bismark/), namely [methylation call](http://felixkrueger.github.io/Bismark/bismark/alignment/#methylation-call) for symbols, and [methylation extraction output](http://felixkrueger.github.io/Bismark/bismark/methylation_extraction/#the-methylation-extractor-output-looks-like-this-tab-separated) for file format)
674+
* **\*.bedGraph.gz** (gzip-compressed bedGraph-formatted file of methylation percentages of each CpG site; see [bismark documentation](https://github.com/FelixKrueger/Bismark/tree/master/Docs#optional-bedgraph-output))
675+
* **\*.bismark.cov.gz** (gzip-compressed bedGraph-formatted file like above "\*.bedGraph.gz", but also including 2 more columns of methylated and unmethylated counts at the specified position; see [bismark documentation](https://felixkrueger.github.io/Bismark/options/methylation_extraction/#bedgraph-specific-options))
676+
* **\*.M-bias.txt** (text file with methylation information in the context of the position in reads, helpful for investigating bias as a function of base position in the read; see [bismark documentation](http://felixkrueger.github.io/Bismark/bismark/methylation_extraction/#m-bias-plot))
648677
* **\*_splitting_report.txt** (text file containing general methylation detection information)
649678
* **\*.cytosine_context_summary.txt** (tsv file of detected cytosine-methylation information summed by nucleotide context)
650679
* **\*.CpG_report.txt.gz** (a genome-wide methylation report for all CpG cytosines)

0 commit comments

Comments
 (0)