Commit e1795c4

Merge pull request #7 from IARCbioinfo/dev
Dev
2 parents ae889d3 + ce432b9 commit e1795c4

18 files changed

+2391
-759
lines changed

Dockerfile

Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
1+
# Set the base image to Debian
2+
FROM debian:9.0
3+
4+
# File Author / Maintainer
5+
MAINTAINER nalcala <alcalan@fellows.iarc.fr>
6+
7+
RUN mkdir -p /var/cache/apt/archives/partial && \
8+
touch /var/cache/apt/archives/lock && \
9+
chmod 640 /var/cache/apt/archives/lock && \
10+
apt-get update -y &&\
11+
apt-get install -y gnupg2
12+
13+
RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys F76221572C52609D && \
14+
apt-get clean && \
15+
apt-get update -y && \
16+
17+
18+
# Install dependencies
19+
DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y \
20+
make \
21+
g++ \
22+
perl \
23+
default-jre \
24+
zlib1g-dev \
25+
libncurses5-dev \
26+
libncurses5 \
27+
git \
28+
wget \
29+
ca-certificates \
30+
python-dev \
31+
python-pip \
32+
bzip2 \
33+
libbz2-dev \
34+
liblzma-dev \
35+
libcurl4-openssl-dev \
36+
libfreetype6-dev \
37+
libpng-dev \
38+
unzip \
39+
r-base \
40+
r-cran-ggplot2 \
41+
r-cran-gplots \
42+
r-cran-reshape && \
43+
cp /usr/include/freetype2/*.h /usr/include/. && \
44+
45+
Rscript -e 'install.packages("gsalib",repos="http://cran.us.r-project.org")' && \
46+
47+
# Install samtools specific version manually
48+
wget https://github.com/samtools/samtools/releases/download/1.3.1/samtools-1.3.1.tar.bz2 && \
49+
tar -jxf samtools-1.3.1.tar.bz2 && \
50+
cd samtools-1.3.1 && \
51+
make && \
52+
make install && \
53+
cd .. && \
54+
rm -rf samtools-1.3.1 samtools-1.3.1.tar.bz2 && \
55+
56+
# Install FastQC
57+
wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip && \
58+
unzip fastqc_v0.11.5.zip && \
59+
chmod 755 FastQC/fastqc && \
60+
cp -r FastQC /usr/local/bin/. && \
61+
ln -s /usr/local/bin/FastQC/fastqc /usr/local/bin/ && \
62+
rm -rf fastqc_v0.11.5.zip FastQC && \
63+
64+
# Install cutadapt
65+
pip install cutadapt && \
66+
67+
# Install trim_galore
68+
wget https://github.com/FelixKrueger/TrimGalore/archive/0.4.3.tar.gz && \
69+
tar xvzf 0.4.3.tar.gz && \
70+
mv TrimGalore-0.4.3/trim_galore /usr/bin && \
71+
rm -rf TrimGalore-0.4.3 0.4.3.tar.gz && \
72+
73+
# Install hisat2
74+
75+
# Install htseq
76+
pip install numpy && \
77+
pip install setuptools && \
78+
pip install HTSeq && \
79+
80+
# Install multiqc
81+
pip install --upgrade --force-reinstall git+https://github.com/nalcala/MultiQC.git && \
82+
83+
# Install STAR specific version manually
84+
wget https://github.com/alexdobin/STAR/archive/2.5.3a.tar.gz && \
85+
tar -xzf 2.5.3a.tar.gz && \
86+
cp STAR-2.5.3a/bin/Linux_x86_64_static/STAR /usr/local/bin/. && \
87+
rm -rf 2.5.3a.tar.gz STAR-2.5.3a && \
88+
89+
# Install hisat2
90+
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.1.0-Linux_x86_64.zip && \
91+
unzip hisat2-2.1.0-Linux_x86_64.zip && \
92+
cp -r hisat2-2.1.0/. /usr/local/bin/. && \
93+
rm -rf hisat2-2.1.0-Linux_x86_64.zip hisat2-2.1.0 && \
94+
95+
# Install RSeQC
96+
pip install RSeQC && \
97+
98+
# Install samblaster specific version manually
99+
wget https://github.com/GregoryFaust/samblaster/releases/download/v.0.1.24/samblaster-v.0.1.24.tar.gz && \
100+
tar -xzf samblaster-v.0.1.24.tar.gz && \
101+
cd samblaster-v.0.1.24 && \
102+
make && \
103+
cp samblaster /usr/local/bin/. && \
104+
cd .. && \
105+
rm -rf samblaster-v.0.1.24.tar.gz samblaster-v.0.1.24 && \
106+
107+
# Install sambamba specific version manually
108+
wget https://github.com/lomereiter/sambamba/releases/download/v0.6.6/sambamba_v0.6.6_linux.tar.bz2 && \
109+
tar -jxf sambamba_v0.6.6_linux.tar.bz2 && \
110+
cp sambamba_v0.6.6 /usr/local/bin/sambamba && \
111+
rm -rf sambamba_v0.6.6_linux.tar.bz2 && \
112+
113+
# Remove unnecessary dependencies
114+
DEBIAN_FRONTEND=noninteractive apt-get remove -y \
115+
make \
116+
g++ \
117+
wget \
118+
bzip2 \
119+
git \
120+
zlib1g-dev \
121+
libncurses5-dev && \
122+
123+
# Clean
124+
DEBIAN_FRONTEND=noninteractive apt-get autoremove -y && \
125+
apt-get clean

README.md

Lines changed: 131 additions & 50 deletions
@@ -1,28 +1,31 @@
11
# RNAseq-nf
2-
RNAseq mapping, quality control, and reads counting nextflow pipeline
32

4-
## Overview of pipeline workflow
3+
## Nextflow pipeline for RNA seq processing
4+
55
![workflow](RNAseqpipeline.png?raw=true "Scheme of alignment/realignment Workflow")
66

7-
## Prerequisites
7+
## Description
8+
9+
Nextflow pipeline for RNA sequencing mapping, quality control, reads counting, and unsupervised analysis
10+
11+
## Dependencies
12+
13+
1. Nextflow : for common installation procedures see the [IARC-nf](https://github.com/IARCbioinfo/IARC-nf) repository.
814

9-
### General prerequisites
10-
The following programs need to be installed and in the PATH environment variable:
11-
- [*fastqc*](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/INSTALL.txt)
12-
- [*cutadapt*](http://cutadapt.readthedocs.io/en/stable/installation.html), which requires Python version > 2.7
13-
- [*trim_galore*](https://github.com/FelixKrueger/TrimGalore)
14-
- [*RESeQC*](http://rseqc.sourceforge.net/)
15-
- [*multiQC*](http://multiqc.info/docs/)
16-
- [*STAR*](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf)
17-
- [*htseq*](http://www-huber.embl.de/HTSeq/doc/install.html#install); the python script htseq-count must also be in the PATH
18-
- [*nextflow*](https://www.nextflow.io/docs/latest/getstarted.html)
15+
2. [*fastqc*](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/INSTALL.txt)
16+
3. [*cutadapt*](http://cutadapt.readthedocs.io/en/stable/installation.html), which requires Python version > 2.7
17+
4. [*trim_galore*](https://github.com/FelixKrueger/TrimGalore)
18+
5. [*RSeQC*](http://rseqc.sourceforge.net/)
19+
6. [*multiQC*](http://multiqc.info/docs/)
20+
7. [*STAR*](https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf)
21+
8. [*htseq*](http://www-huber.embl.de/HTSeq/doc/install.html#install); the python script htseq-count must also be in the PATH
1922

2023
In addition, STAR requires genome indices that can be generated from a genome fasta file ref.fa and a splice junction annotation file ref.gtf using the following command:
2124
```bash
2225
STAR --runThreadN n --runMode genomeGenerate --genomeDir ref --genomeFastaFiles ref.fa --sjdbGTFfile ref.gtf --sjdbOverhang 99
2326
```
2427

25-
### Prerequisites for alignment with hisat2
28+
### Alignment with hisat2
2629
In order to perform the optional alignment with hisat2, hisat2 must be installed:
2730
- [*hisat2*](https://ccb.jhu.edu/software/hisat2/index.shtml)
2831

@@ -33,7 +36,7 @@ extract_exons.py reference.gtf > genome.exon
3336
hisat2-build reference.fa --ss genome.ss --exon genome.exon genome_tran
3437
```
3538

36-
### Prerequisites for reads trimming at splice junctions
39+
### Reads trimming at splice junctions
3740
In order to perform the optional reads trimming at splice junctions, GATK must be installed:
3841
- GATK [*GenomeAnalysisTK.jar*](https://software.broadinstitute.org/gatk/guide/quickstart)
3942

@@ -43,58 +46,136 @@ samtools faidx ref.fa
4346
java -jar picard.jar CreateSequenceDictionary R= ref.fa O= ref.dict
4447
```
4548

46-
### Prerequisites for base quality score recalibration
49+
### Base quality score recalibration
50+
In order to perform the optional base quality score recalibration, several files are required:
4751
- GATK [*GenomeAnalysisTK.jar*](https://software.broadinstitute.org/gatk/guide/quickstart)
4852
- [GATK bundle](https://software.broadinstitute.org/gatk/download/bundle) VCF files with lists of indels and SNVs (recommended: 1000 genomes indels, Mills gold standard indels VCFs, dbsnp VCF)
4953
- bed file with intervals to be considered
5054

55+
### Clustering
56+
In order to perform the optional unsupervised analysis of read counts (PCA and consensus clustering), you need:
57+
- the unsupervised analysis R script [*RNAseq_unsupervised.R*](https://github.com/IARCbioinfo/RNAseq_analysis_scripts); this script must be in a folder listed in the PATH environment variable (e.g., in /usr/bin/)
58+
- [R and Rscript](https://cran.r-project.org) with packages ConsensusClusterPlus, ade4, DESeq2, fpc, and cluster
59+
60+
## Input
61+
| Type | Description |
62+
|-----------|---------------|
63+
| --input_folder | a folder with fastq files or bam files |
64+
65+
66+
## Parameters
67+
68+
* #### Mandatory
69+
| Name | Example value | Description |
70+
|-----------|--------------:|-------------|
71+
| --input_folder | . | input folder |
72+
|--ref_folder | ref | reference genome folder |
73+
|--gtf | Homo_sapiens.GRCh38.79.gtf | annotation GTF file |
74+
|--bed | gene.bed | bed file with genes for RSeQC |
75+
76+
77+
* #### Optional
78+
79+
| Name | Default value | Description |
80+
|-----------|--------------|-------------|
81+
|--cpu | 4 | number of CPUs |
82+
|--mem | 50 | memory for mapping|
83+
|--mem_QC | 2 | memory for QC and counting|
84+
|--fastq_ext | fq.gz | extension of fastq files|
85+
|--suffix1 | \_1 | suffix for first element of read files pair|
86+
|--suffix2 | \_2 | suffix for second element of read files pair|
87+
|--output_folder | . | output folder for aligned BAMs|
88+
|--ref | ref.fa | reference genome fasta file for GATK |
89+
|--GATK_jar | GenomeAnalysisTK.jar | path to jar file GenomeAnalysisTK.jar |
90+
|--GATK_bundle | GATK_bundle | folder with files for BQSR |
91+
|--RG | PL:ILLUMINA | string to be added to read group information in BAM file |
92+
|--stranded | no | Strand information for counting with htseq [no, yes, reverse] |
93+
|--hisat2_idx | genome_tran | index filename prefix for hisat2 |
94+
|--clustering_n | 500 | number of genes to use for clustering |
95+
|--clustering_t | "vst" | count transformation method; 'rld', 'vst', or 'auto' |
96+
|--clustering_c | "hc" | clustering algorithm to be passed to ConsensusClusterPlus |
97+
|--clustering_l | "complete" | method for hierarchical clustering to be passed to ConsensusClusterPlus |
98+
|--htseq_maxreads| null | maximum number of reads in the htseq buffer; if null, uses the default htseq value 30,000,000 |
99+
100+
* #### Flags
101+
102+
| Name | Description |
103+
|-----------|-------------|
104+
|--help | print usage and optional parameters |
105+
|--sjtrim | enable reads trimming at splice junctions |
106+
|--hisat2 | use hisat2 instead of STAR for mapping |
107+
|--recalibration | perform quality score recalibration (GATK)|
108+
|--clustering | perform unsupervised analyses of read counts data|
109+
110+
51111
## Usage
52-
To run the pipeline on a series of paired-end fastq files (with suffixes *_1* and *_2*) in folder *fastq*, and a reference genome with indexes in folder *ref_genome*, one can type:
112+
To run the pipeline on a series of paired-end fastq files (with suffixes *_1* and *_2*) in folder *fastq*, a reference genome with indexes in folder *ref_genome*, an annotation file ref.gtf, and a bed file ref.bed, one can type:
53113
```bash
54-
nextflow run iarcbioinfo/RNAseq-nf --input_folder fastq --gendir ref_genome --suffix1 _1 --suffix2 _2
114+
nextflow run iarcbioinfo/RNAseq-nf --input_folder fastq --ref_folder ref_genome --gtf ref.gtf --bed ref.bed
55115
```
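
Before launching, it can help to confirm that every read file in the input folder has its mate. The sketch below is not part of RNAseq-nf: `list_fastq_pairs` is a hypothetical helper, and it assumes file names follow the pattern `<sample><suffix>.<extension>` (e.g. `sample_1.fq.gz` and `sample_2.fq.gz`), built from the documented defaults of `--suffix1`, `--suffix2`, and `--fastq_ext`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of RNAseq-nf): list paired-end fastq files.
# Assumption: files are named <sample><suffix>.<ext>, e.g. sample_1.fq.gz.
list_fastq_pairs() {
    local input_folder="$1" suffix1="${2:-_1}" suffix2="${3:-_2}" ext="${4:-fq.gz}"
    local r1 r2 sample
    for r1 in "${input_folder}"/*"${suffix1}.${ext}"; do
        [ -e "$r1" ] || continue    # the glob matched nothing
        sample="$(basename "$r1" "${suffix1}.${ext}")"
        r2="${input_folder}/${sample}${suffix2}.${ext}"
        if [ -f "$r2" ]; then
            # One tab-separated line per complete pair: sample, mate 1, mate 2
            printf '%s\t%s\t%s\n' "$sample" "$r1" "$r2"
        else
            echo "no mate found for ${r1}" >&2
        fi
    done
}

# Example: check folder "fastq" with the default suffixes and extension
list_fastq_pairs fastq
```

Complete pairs are printed on standard output and unmatched files are reported on standard error, so the two can be inspected separately.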
56116
### Use hisat2 for mapping
57-
To use the reads trimming at splice junctions step, you must add the ***--hisat2* option**, specify the path to the folder containing the hisat2 index files, as well as satisfy the requirements above mentionned. For example:
117+
To use hisat2 instead of STAR for read mapping, you must add the ***--hisat2* option**, specify the path to the folder containing the hisat2 index files (genome_tran.1.ht2 to genome_tran.8.ht2), and satisfy the requirements mentioned above. For example:
58118
```bash
59-
nextflow run iarcbioinfo/RNAseq-nf --input_folder fastq --suffix1 _1 --suffix2 _2 --hisat2 --hisat2_idx /home/user/reference/genome_tran
119+
nextflow run iarcbioinfo/RNAseq-nf --input_folder fastq --ref_folder ref_genome --gtf ref.gtf --bed ref.bed --hisat2 --hisat2_idx genome_tran
60120
```
121+
Note that parameter '--hisat2_idx' is the prefix of the index files, not the entire path to .ht2 files.
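
As an illustration of the prefix convention, the following sketch checks that the eight index files exist for a given prefix. `check_hisat2_index` is a hypothetical helper, not part of RNAseq-nf, and it assumes the index files sit directly in the reference folder passed as its first argument.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of RNAseq-nf): verify that files
# <prefix>.1.ht2 to <prefix>.8.ht2 exist in the given folder.
# Assumption: the index files are stored directly in that folder.
check_hisat2_index() {
    local ref_folder="$1" prefix="$2" missing=0 i
    for i in 1 2 3 4 5 6 7 8; do
        if [ ! -f "${ref_folder}/${prefix}.${i}.ht2" ]; then
            echo "missing: ${ref_folder}/${prefix}.${i}.ht2" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: check the index before launching the pipeline
if check_hisat2_index ref_genome genome_tran; then
    echo "hisat2 index complete"
else
    echo "hisat2 index incomplete"
fi
```

The prefix given to the helper is the same string that would be passed to '--hisat2_idx'.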
122+
61123
### Enable reads trimming at splice junctions
62124
To use the reads trimming at splice junctions step, you must add the ***--sjtrim* option**, specify the path to the GenomeAnalysisTK jar file, and satisfy the requirements mentioned above. For example:
63125
```bash
64-
nextflow run iarcbioinfo/RNAseq-nf --input_folder fastq --gendir ref_genome --suffix1 _1 --suffix2 _2 --sjtrim --GATK_folder /home/user/GATK
126+
nextflow run iarcbioinfo/RNAseq-nf --input_folder fastq --ref_folder ref_genome --gtf ref.gtf --bed ref.bed --sjtrim --GATK_jar /home/user/GATK/GenomeAnalysisTK.jar
65127
```
66128

67129
### Enable Base Quality Score Recalibration
68130
To use the base quality score recalibration step, you must add the ***--recalibration* option**, specify the path to the GenomeAnalysisTK jar file, the path to the GATK bundle folder for your reference genome, and the path to the bed file with intervals to be considered, and satisfy the requirements mentioned above. For example:
69131
```bash
70-
nextflow run iarcbioinfo/RNAseq-nf --input_folder fastq --gendir ref_genome --suffix1 _1 --suffix2 _2 --bqsr --GATK_folder /home/user/GATK --GATK_bundle /home/user/GATKbundle --intervals intervals.bed
132+
nextflow run iarcbioinfo/RNAseq-nf --input_folder fastq --ref_folder ref_genome --gtf ref.gtf --bed ref.bed --recalibration --GATK_jar /home/user/GATK/GenomeAnalysisTK.jar --GATK_bundle /home/user/GATKbundle
71133
```
72134

73-
## All parameters
74-
| **PARAMETER** | **DEFAULT** | **DESCRIPTION** |
75-
|-----------|--------------:|-------------|
76-
| *--help* | null | print usage and optional parameters |
77-
*--input_folder* | . | input folder |
78-
*--output_folder* | . | output folder |
79-
*--gendir* | ref | reference genome folder |
80-
*--cpu* | 4 | number of CPUs |
81-
*--mem* | 50 | memory for mapping|
82-
*--memOther* | 2 | memory for QC and counting|
83-
*--fastq_ext* | fq.gz | extension of fastq files|
84-
*--suffix1* | \_1 | suffix for second element of read files pair|
85-
*--suffix2* | \_2 | suffix for second element of read files pair|
86-
*--output_folder* | . | output folder for aligned BAMs|
87-
*--annot_gtf* | Homo_sapiens.GRCh38.79.gtf | annotation GTF file |
88-
*--annot_gff* | Homo_sapiens.GRCh38.79.gff | annotation GFF file |
89-
*--fasta_ref* | ref.fa | reference genome fasta file for GATK |
90-
*--GATK_folder* | GATK | folder with jar file GenomeAnalysisTK.jar |
91-
*--GATK_bundle* | GATK_bundle | folder with files for BQSR |
92-
*--intervals* | intervals.bed | bed file with intervals for BQSR |
93-
*--RG* | PL:ILLUMINA | string to be added to read group information in BAM file |
94-
*--sjtrim* | false | enable reads trimming at splice junctions |
95-
*--bqsr* | false | enable base quality score recalibration |
96-
*--gene_bed* | gene.bed | bed file with genes for RESeQC |
97-
*--stranded* | no | Strand information for counting with htseq [no, yes, reverse] |
98-
*--stranded* | no | Strand information for counting with htseq [no, yes, reverse] |
99-
*--hisat2* | false | use hisat2 instead of STAR for mapping |
100-
*--hisat2_idx* | genome_tran | index filename prefix for hisat2 |
135+
### Perform unsupervised analysis
136+
To use the unsupervised analysis step, you must add the ***--clustering* option** and satisfy the requirements mentioned above. For example:
137+
```bash
138+
nextflow run iarcbioinfo/RNAseq-nf --input_folder fastq --ref_folder ref_genome --gtf ref.gtf --bed ref.bed --clustering
139+
```
140+
You can also specify options n, t, c, and l (see [*RNAseq_unsupervised.R*](https://github.com/IARCbioinfo/RNAseq_analysis_scripts)) of script RNAseq_unsupervised.R using options '--clustering_n', '--clustering_t', '--clustering_c', and '--clustering_l'.
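
For instance, a run that sets all four clustering options explicitly could be assembled as in the sketch below. The values shown are simply the documented defaults, and the snippet only prints the command (a dry run) so that it can be inspected before launching.

```bash
#!/usr/bin/env bash
# Assemble the command as an array so that each option stays one argument.
# The clustering values below are the defaults listed in the parameter table.
cmd=(nextflow run iarcbioinfo/RNAseq-nf
     --input_folder fastq --ref_folder ref_genome
     --gtf ref.gtf --bed ref.bed
     --clustering
     --clustering_n 500 --clustering_t vst
     --clustering_c hc --clustering_l complete)

# Dry run: print the command instead of executing it
printf '%s ' "${cmd[@]}"; echo

# To launch the pipeline, run the array itself instead: "${cmd[@]}"
```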
141+
142+
143+
## Output
144+
| Type | Description |
145+
|-----------|---------------|
146+
| file.bam | BAM files of alignments or realignments |
147+
| file.bam.bai | BAI files of alignments or realignments |
148+
| file_{12}.fq.gz_trimming_report.txt | trim_galore report |
149+
|multiqc_pretrim_report.html | multiqc report before trimming |
150+
|multiqc_pretrim_report_data | folder with data used to compute multiqc report before trimming |
151+
|multiqc_posttrim_report.html | multiqc report after trimming |
152+
|multiqc_posttrim_report_data | folder with data used to compute multiqc report after trimming |
153+
|STAR.file.Log.final.out| STAR log |
154+
|file_readdist.txt | RSeQC report |
155+
|file_count.txt | htseq-count output file |
156+
| file_target_intervals.list | list of intervals used |
157+
| file_recal.table | table of scores before recalibration |
158+
| file_post_recal.table | table of scores after recalibration |
159+
| file_recalibration_plots.pdf | before/after recalibration plots |
160+
161+
162+
## Directed Acyclic Graph
163+
164+
### With default options
165+
[![DAG STAR](dag_STAR.png)](http://htmlpreview.github.io/?https://github.com/IARCbioinfo/RNAseq-nf/blob/dev/dag_STAR.html)
166+
167+
### With option --hisat2
168+
[![DAG hisat2](dag_hisat2.png)](http://htmlpreview.github.io/?https://github.com/IARCbioinfo/RNAseq-nf/blob/dev/dag_hisat2.html)
169+
170+
### With options --sjtrim and --recalibration
171+
[![DAG STAR_sjtrim_recal](dag_STAR_sjtrim_recal.png)](http://htmlpreview.github.io/?https://github.com/IARCbioinfo/RNAseq-nf/blob/dev/dag_STAR_sjtrim_recal.html)
172+
173+
## Contributions
174+
175+
| Name | Email | Description |
176+
|-----------|---------------|-----------------|
177+
| Nicolas Alcala* | AlcalaN@fellows.iarc.fr | Developer to contact for support |
178+
| Noemie Leblay | LeblayN@students.iarc.fr | Tester |
179+
| Alexis Robitaille | RobitailleA@students.iarc.fr | Tester |
180+
181+
