Skip to content

SansamLab-Pipelines-Genomics/QuantSeqPipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DOI

Alignment and counting of reads from QuantseqFWD Experiments

Introduction

This Quantseq pipeline converts short read .fastq files from 3' end sequencing into counts tables. The pipeline requires an index of the genome for the START aligner. After trimming to remove polyA sequences and adapters, the reads are aligned to the genome with STAR. Reads aligning to sequences adjacent to probable genomic internal polyA priming sites are removed. Reads are assigned to transcripts in a provided .gtf file and then a table a counts per transcript is generated. This table can be used for statistical analysis using DESeq2 or other statistical analysis software. This pipeline is comprised of shell scripts that can be executed on SLURM using the provided wrapper script. Conda environments are not provided, so please consult with your HPC administrator to run this.

Table of contents generated with markdown-toc

Scripts

Script Description
QuantSeqSlurmWrapper.sh Runs all necessary scripts to align and process quantseq reads with slurm. You must provide a pre-indexed genome for STAR.
TrimReads.sh Trims reads with bbmap
QuantSeqFilterAndCounts.sh This will filter reads from a QuantseqFWD experiment that end at a probable genomic poly-A priming site. It will then count the remaining reads over transcript feature coordinates defined in the gtf file.
AlignWithSTAR.sh Aligns reads to genome with STAR
makeTestSTARIndex.sh Makes the STAR index for the test data set.
GetInternalPrimingSites.R This takes a fasta of sequences +/-10bp around putative polyadenylation sites. It checks the genomic sequence in the region [–10..10] surrounding the polyadenylation events and filtered out alignments containing stretches of six consecutive A or with 70% A coverage in any 10-nt sub-window in this region. A bed file of false polyadenylation sites is returned. This is used in QuantSeqFilterAndCounts.sh.

Data files in repository

File Description
refFiles directory
polyA.fa.gz fasta with polyA for bbmap trimming
truseq.fa.gz truseq adapter sequences for bbmap trimming
testFiles directory
test.chrm.lengths chromosome length for test data set
test.fa genome sequence for test data set
test.fastq Quantseq sequencing reads for test data set
test.gtf transcript feature coordinates for test data set

Additional useful files not included

####note: these are not needed for the test run

  1. Zebrafish gtf file - from https://www.umassmed.edu/globalassets/lawson-lab/downloadfiles/v4.3.2.gtf
  2. Index of Zebrafish genome for STAR - from https://galaxyweb.umassmed.edu/pub/dnext_data/genome_data/zebrafish/GRCz11/refseq_v4.3.2/STARIndex/
  3. Genome sequence in fasta format

Example 1 - run individual scripts on included test files

Create a directory for test output

mkdir test_outputdir

Make test STAR index

mkdir starTestIndex
sbatch --cpus-per-task 6 --mem 48G scripts/makeTestSTARIndex.sh

Trim reads

sbatch scripts/TrimReads.sh \
testFiles/test.fastq \
test_outputdir \
refFiles/polyA.fa.gz,refFiles/truseq.fa.gz

Align reads to genome with STAR

sbatch --cpus-per-task 6 --mem 48G scripts/AlignWithSTAR.sh \
testFiles/test.fastq \
starTestIndex/ \
test_outputdir

Remove probable internal priming reads and count reads over transcripts

sbatch scripts/QuantSeqFilterAndCounts.sh \
test \
testFiles/test.chrm.lengths \
testFiles/test.fa \
testFiles/test.gtf \
test_outputdir

Example 2 - run the entire pipeline on a single sample with slurm wrapper script

Arguments:

Position Description Example for test
1 output directory name test2_outputdir
2 name of fastq file with Quantseq reads testFiles/test.fastq
3 reference files for trimming refFiles/polyA.fa.gz,refFiles/truseq.fa.gz
4 name of directory with the genome index for STAR starTestIndex/
5 chromosome sizes file testFiles/test.chrm.lengths
6 genome sequence fasta file testFiles/test.fa
7 gtf file for transcriptome testFiles/test.gtf
sbatch scripts/QuantSeqSlurmWrapper.sh \
test2_outputdir \
testFiles/test.fastq \
refFiles/polyA.fa.gz,refFiles/truseq.fa.gz \
starTestIndex/ \
testFiles/test.chrm.lengths \
testFiles/test.fa \
testFiles/test.gtf

Expected output of test

File Description
starTestIndex/ output from makeTestSTARIndex.sh; folder with files of STAR index
test_trimmed.fastq output from TrimReads.sh; trimmed version of the input fastq - test.fastq
testAligned.sortedByCoord.out.bam output from AlignWithSTAR.sh; aligned reads
testLog.final.out output from AlignWithSTAR.sh; see STAR manual
testLog.out output from AlignWithSTAR.sh; see STAR manual
testLog.progress.out output from AlignWithSTAR.sh; see STAR manual
testSJ.out.tab output from AlignWithSTAR.sh; see STAR manual
test_positive_strand.bam output from QuantSeqFilterAndCounts.sh; reads mapping to positive strand
test_positive_strand.bam.bai output from QuantSeqFilterAndCounts.sh
test_negative_strand.bam output from QuantSeqFilterAndCounts.sh; reads mapping to negative strand
test_negative_strand.bam.bai output from QuantSeqFilterAndCounts.sh
test_positive_strand_flank.bed coorindates +/- 10 bp around ends of positive strand mapped reads
test_positive_strand_flank.fasta output from QuantSeqFilterAndCounts.sh; genomic sequences of test_positive_strand_flank.bed
test_negative_strand_flank.bed output from QuantSeqFilterAndCounts.sh; coorindates +/- 10 bp around starts of positive strand mapped reads
test_negative_strand_flank.fasta output from QuantSeqFilterAndCounts.sh; genomic sequences of test_negative_strand_flank.bed
test_positive_strand_internalPriming.bed output from QuantSeqFilterAndCounts.sh and GetInternalPrimingSites.R; coordinates of positive strand read ends with likely genomic polyA internalpriming
test_negative_strand_internalPriming.bed output from QuantSeqFilterAndCounts.sh and GetInternalPrimingSites.R; ; coordinates of negative strand read starts with likely genomic polyA internal priming
test_positive_strand_internalPrimingReads.txt output from QuantSeqFilterAndCounts.sh; IDs of reads at likely internal priming sites
test_negative_strand_internalPrimingReads.txt output from QuantSeqFilterAndCounts.sh; IDs of reads at likely internal priming sites
test_positive_strand_filtered.bam output from QuantSeqFilterAndCounts.sh; mapped reads after removal of those listed in test_positive_strand_internalPrimingReads.txt
test_negative_strand_filtered.bam output from QuantSeqFilterAndCounts.sh; mapped reads after removal of those listed in test_negative_strand_internalPrimingReads.txt
test_filtered.bam output from QuantSeqFilterAndCounts.sh; merged test_positive_strand_filtered.bam and test_negative_strand_filtered.bam
test_filtered.bam.csi output from QuantSeqFilterAndCounts.sh
test_counts.tab output from QuantSeqFilterAndCounts.sh; read counts over transcripts; made with htseq-count

Packages used to test pipeline

Package Version
slurm 20.02
samtools 1.11
bedtools 2.30.0
bedops 2.4.3
picard 2.21.2
python 3.7.0
htseq 0.13.5
bbmap 36.99
star 2.6.1
R 4.0.3
zoo (R package) 1.8-8

About

QuantSeqPipeline

Resources

License

Stars

Watchers

Forks

Packages

No packages published