Bulk RNA sequencing (bulk RNA-seq) allows the analysis of the average gene expression in a population of cells. The transcriptomic profile obtained from bulk RNA-seq represents an average across all cells, unlike single-cell RNA sequencing (scRNA-seq), which enables the study of gene expression at the level of individual cells.
This pipeline includes the following key steps:
- RNA Library Download
- Quality Control (QC)
- Read Trimming and Cleaning
- Concatenation of Reads
- Removal of Ribosomal RNA (rRNA)
- Alignment to the Reference Genome
- Gene Quantification
Step 1: Download Bulk RNA Library
Use the fastq-dump from the fastx_toolkit package to download the degradome library from the NCBI Sequence Read Archive SRA.
# For SINGLE-END sequencing
fastq-dump SRR000000
# For PAIRED-END sequencing
fastq-dump --split-files SRR000000
# Replace SRR000000 with the correct identification number for each library.
If the files are compressed, decompress them as needed:
# Decompress GZ files:
gunzip SRR000000.fastq.gz
# Decompress ZIP files:
unzip SRR000000.fastq.zip
Step 2: Quality control (QC)
Evaluate read quality using FastQC.
fastqc SRR000000.fastq # (version v0.10.1)
Review the generated reports to identify potential issues such as adapter contamination or low-quality regions.
Step 3: Library cleaning
We used the FastQC report to identify adapters and contaminating sequences. There are many trimming programs but in this analysis we chose Trimommatic because it allows us to trim the reads to the desired size, remove adapters and filter out low quality reads and contaminants. For the code we prepared a FASTA file called remove_sequences.fasta with contaminating sequences and adapters, which looked like this example:
>contaminant_1
CTTATATTAGGCTTCTCCTCAGCGAAAATCACTGGCCGTCGTTTTACATG
>contaminant_2
AGAGTATTAGGCTTCTCCTCAGCGGAATTCACTGGCCGTCGTTTTACATG
>TruSeq Adapter
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATTCCTTTATCTCGTAT
# For SINGLE-END reads:
java -jar trimmomatic-0.39.jar SE -phred33 \
SRR000000.fastq.gz SRR000000_trimmed.fastq.gz \
ILLUMINACLIP:remove_sequences.fasta:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:36
# For PAIRED-END reads:
java -jar trimmomatic-0.39.jar PE -phred33 \
SRR000000_1.fastq.gz SRR000000_2.fastq.gz \
SRR000000_forward_paired.fastq.gz SRR000000_forward_unpaired.fastq.gz \
SRR000000_reverse_paired.fastq.gz SRR000000_reverse_unpaired.fastq.gz \
ILLUMINACLIP:remove_sequences.fasta:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:36
Step 3.3: Quality control
Run FastQC again to re-evaluate quality with FastQC.
fastqc SRR000000_output.fastq.gz
fastqc SRR000000_forward_paired.fastq.gz
fastqc SRR000000_reverse_paired.fastq.gz
If the reads are split across multiple lanes or files, concatenate them:
# For SINGLE-END reads:
cat SRR000000_L1_R1_1.fastq.gz SRR000000_L1_R1_2.fastq.gz > SRR000000_concat.fastq.gz
# For PAIRED-END reads:
cat SRR000000_L1_R1.fastq.gz SRR000000_L2_R1.fastq.gz > SRR000000_forward.fastq.gz
cat SRR000000_L1_R2.fastq.gz SRR000000_L2_R2.fastq.gz > SRR000000_reverse.fastq.gz
Filter rRNA sequences with Bowtie. First, you need to create a FASTA file with all rRNA sequences or use an rRNA database. In this case, Sec_rRNA-all.fasta contained all the rRNA sequences from the organism from which the cells in the RNA isolation originated. -->
5.1: Index rRNA FASTA
bowtie-build Sec_rRNA-all.fasta INDEXrRNA
5.2: Alignment to Remove rRNA
# For SINGLE-END reads:
bowtie -v 1 --un SRR000000_Clean.fastq INDEXrRNA SRR000000_concat.fastq.gz --al SRR000000_Ribosomal.fastq.gz
# For PAIRED-END reads:
bowtie -v 1 --un SRR000000_Clean_Paired.fastq INDEXrRNA -1 SRR000000_forward.fastq.gz SRR000000_reverse.fastq.gz --al SRR000000_Ribosomales_Paired.fastq
Key Parameters:
-v 1
: Allows one mismatch in the alignment.--un
: Outputs reads that do not align (keep these for further analysis).--al
: Outputs reads that align to rRNA (discard these)
Expected report:
# reads processed: 66368980
# reads with at least one reported alignment: 814686 (1.23%)
# reads that failed to align: 65554294 (98.77%)
Reported 814686 alignments
Please note! After filtering ribosomal reads with Bowtie, the file you get for unaligned reads (SRR000000_Clean_Paired.fastq) is not separated into two files (_1 and _2) for forward and reverse reads. This is a problem when you try to use HISAT2, which requires paired-end reads to be provided as two separate files. To split the single file generated by Bowtie you can use:
seqtk seq -1 SRR000000_Clean_Paired.fastq > SRR000000_Clean_Paired_1.fastq
seqtk seq -2 SRR000000_Clean_Paired.fastq > SRR000000_Clean_Paired_2.fastq
Align reads to a reference genome using Hisat2. Download the genome from Ensembl or UCSC.
6.1: Index the Genome
hisat2-build hg38.fa IndexGenome # For example: hg38.fa.gz (human genome)
6.2: Alignment
# For SINGLE-END reads:
hisat2 --no-unal --no-softclip -x IndexGenome -U SRR000000_Clean.fastq -S SRR000000.sam -summary-file SRR000000_summary.txt
# For PAIRED-END reads:
hisat2 --no-unal --no-softclip --dta -x IndexGenome -1 SRR000000_Clean_Paired_1.fastq -2 SRR000000_Clean_Paired_2.fastq -S SRR000000_Paired.sam --summary-file SRR000000_Paired_summary.txt
Expected report:
# Single End:
67606074 reads; of these:
67606074 (100.00%) were unpaired; of these:
6461149 (9.56%) aligned 0 times
60061930 (88.84%) aligned exactly 1 time
1082995 (1.60%) aligned >1 times
90.44% overall alignment rate
# Paired End:
38153328 reads; of these:
38153328 (100.00%) were paired; of these:
2000000 (5.24%) aligned concordantly 0 times
35100000 (92.00%) aligned concordantly exactly 1 time
533328 (1.40%) aligned concordantly >1 times
----
2000000 pairs aligned concordantly 0 times; of these:
50000 (0.25%) aligned discordantly 1 time
----
1950000 pairs aligned 0 times concordantly or discordantly; of these:
3900000 mates make up the pairs; of these:
1000000 (25.64%) aligned 0 times
2500000 (64.10%) aligned exactly 1 time
400000 (10.26%) aligned >1 times
95.76% overall alignment rate
7.1 Sorting SAM and convert to a BAM format
samtools sort SRR000000.sam > SRR000000.bam
samtools sort SRR000000_Paired.sam > SRR000000_Paired.bam
7.2 Indexing the BAM file for counting
samtools index SRR000000.bam
samtools index SRR000000_Paired.bam
7.3 Counting
Use featureCounts for gene-level quantification with gene annotation file (GTF format). Download the filefrom Ensembl orUCSC.
# We aligned with human genome, so now we use human gene annotation.
# Make sure all BAM files stay in the same directory because we will generate a TAB file with the counts of all imputs (*).
# For SINGLE-END reads:
featureCounts -t exon -g gene_id -a hg38.ensGene.gtf -o Matrix_single.tab *.bam
# For PAIRED-END reads:
featureCounts -p -t exon -g gene_id -a hg38.ensGene.gtf -o Matrix_paired.tab *.bam