Impacts of gDNA contamination in RNAseq (C.virginica) on downstream analyses #1421

kubu4 · 2022-03-07T16:01:41Z

kubu4
Mar 7, 2022
Maintainer

Before I dive into this, @yaaminiv could you please remind me how the female/male gonads were processed for RNA isolation for Zymo project zr4059?

Here's the background story:

We have some C.virginica RNAseq data from female and male gonad samples. I aligned this data to the NCBI C.virginica genome and noticed that the overall alignment rate (aggregate of all the samples) was low, around 65% (normally, it should be >80%). Additionally, alignment rates in the male samples were drastically lower than in the female samples.

I reviewed the sequencing quality data and things looked fine. I suspected that rRNA contamination could be a possible culprit (rRNA is often difficult to align due to low complexity, which means reads get mapped to multiple locations and are then discarded because the mapping software can't definitively decide where they should be placed). Additionally, after looking at the documentation provided by ZymoResearch (which performed the library prep/sequencing), I discovered that they used an rRNA depletion system instead of an mRNA enrichment method. In our experience, rRNA depletion has generally been ineffective in marine molluscs.

I contacted ZymoResearch to see if they could provide me with data (specifically, a Bioanalyzer/TapeStation electropherogram) confirming that the rRNA depletion process was successful. As it turns out, they do not perform this step as part of their workflow.

During the exchanges with ZymoResearch, I also discovered that the library prep kit they use has a recommendation for trimming after sequencing that requires removal of an additional 10bp from the 5' ends of R2 reads. Simple adapter removal is insufficient.
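To make the extra trimming step concrete, here is a minimal sketch of clipping 10bp from the 5' end of each R2 read, after adapter removal. The function names and the toy record are made up for illustration; in practice a dedicated trimming tool would handle this (along with gzipped input and keeping R1/R2 in sync), so treat this as a description of the operation rather than the kit's recommended command.

```python
import io
from typing import Iterator, TextIO, Tuple

FastqRecord = Tuple[str, str, str, str]  # header, sequence, separator, qualities


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records from an open text handle."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        yield (
            header.rstrip("\n"),
            seq.rstrip("\n"),
            plus.rstrip("\n"),
            qual.rstrip("\n"),
        )


def trim_five_prime(record: FastqRecord, n_bases: int = 10) -> FastqRecord:
    """Drop the first n_bases from the sequence and its quality string."""
    header, seq, plus, qual = record
    return header, seq[n_bases:], plus, qual[n_bases:]


def trim_r2(in_handle: TextIO, out_handle: TextIO, n_bases: int = 10) -> int:
    """Trim every record in an R2 FASTQ stream; return the record count."""
    n_records = 0
    for record in read_fastq(in_handle):
        out_handle.write("\n".join(trim_five_prime(record, n_bases)) + "\n")
        n_records += 1
    return n_records


# Toy R2 record (made-up data): 15 bases, so 5 remain after trimming 10.
toy_r2 = io.StringIO("@read1/2\nACGTACGTACGTAAA\n+\nIIIIIIIIIIIIIII\n")
trimmed = io.StringIO()
n_records = trim_r2(toy_r2, trimmed)
```

Note that reads shorter than 10bp end up empty here; a real workflow would also need to drop those pairs from both R1 and R2.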

So, these two factors (rRNA contamination and insufficient trimming) seemed like they could explain the poor alignment rates. ZymoResearch were dubious that rRNA contamination would be present and/or that it would drastically impact alignment rates. They offered to run some of the data through their pipeline to see what they could find. They've shared the following MultiQC report (note: it may take a minute to load in your browser):

https://gannet.fish.washington.edu/Atumefaciens/20220302-cvir-RNAseq-gonad-zymo_multiqc/zr4059_multiqc_report_with_alignment.html

The ZymoResearch explanation of their reports is here:

https://github.com/Zymo-Research/service-pipeline-documentation/blob/master/docs/how_to_use_RNAseq_report.md

The big takeaway here is that all of the male samples (sample names ending with an M) have the following issues:

- significant amounts of gDNA
  - characterized by significant quantities of reads mapping to introns; see RSeQC section of report
  - characterized by significant quantities of reads mapping to sense strand; see Infer Experiment section of report
- possible contaminating sequence
  - characterized by two peaks in GC content; see Per Sequence GC Content section of report
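For anyone who wants to see what the "two peaks in GC content" signal amounts to, here is a rough sketch that bins reads by GC percentage and reports local maxima. The bin width, the 10% minimum-fraction cutoff, and the toy reads are all arbitrary assumptions for illustration; this is not how the QC report computes its plot.

```python
from collections import Counter
from typing import Iterable, List


def gc_percent(seq: str) -> float:
    """Percent G/C among called bases (A/C/G/T); N and other symbols are ignored."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return 0.0
    gc = sum(1 for base in called if base in "GC")
    return 100.0 * gc / len(called)


def gc_histogram(seqs: Iterable[str], bin_width: int = 5) -> Counter:
    """Count reads per GC bin; keys are the lower edge of each bin."""
    hist: Counter = Counter()
    for seq in seqs:
        bin_start = int(gc_percent(seq) // bin_width) * bin_width
        # Put reads with exactly 100% GC into the top bin.
        hist[min(bin_start, 100 - bin_width)] += 1
    return hist


def local_peaks(hist: Counter, bin_width: int = 5, min_fraction: float = 0.1) -> List[int]:
    """Return bins that are local maxima and hold at least min_fraction of all reads."""
    total = sum(hist.values())
    peaks = []
    for bin_start in range(0, 100, bin_width):
        count = hist.get(bin_start, 0)
        if count < min_fraction * total:
            continue
        left = hist.get(bin_start - bin_width, 0)
        right = hist.get(bin_start + bin_width, 0)
        if count > left and count >= right:
            peaks.append(bin_start)
    return peaks


# Made-up reads: five at 30% GC and five at 60% GC.
toy_reads = ["AAAAAAAGGG"] * 5 + ["GGGGGGAAAA"] * 5
toy_peaks = local_peaks(gc_histogram(toy_reads))
```

A single organism's transcriptome usually gives one broad GC peak, so two well-separated peaks is what hints at a second sequence source.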

So, with all of that in mind, does anyone have any thoughts/discussion on how gDNA contamination would impact:

- Differential gene expression?
- Transcriptome assembly?

Keep in mind that we have an annotated genome, which was used for aligning the RNAseq data. Will differential expression analysis take this into account and only deal with reads falling into regions annotated as RNA/CDS/exon/etc and ignore reads falling into intronic/intergenic regions? Same question applies for genome-guided transcriptome assembly (I'll actually hit up the Trinity developer(s) to see their thoughts).

Or, do we have to filter the data ourselves to ensure that downstream analyses are only using reads aligning in RNA/CDS/exon/etc?
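If we did end up filtering ourselves, the core check is just "does this alignment overlap an annotated exon?". Below is a minimal sketch of that check using half-open coordinates. The exon intervals and alignments are hypothetical toy values, and the function names are mine; a real run would pull intervals from the genome GFF and alignments from the BAM files.

```python
from bisect import bisect_right
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def build_index(exons_by_chrom: Dict[str, List[Interval]]) -> Dict[str, Tuple[List[int], List[Interval]]]:
    """Per chromosome, store merged exon intervals plus their start positions."""
    index = {}
    for chrom, intervals in exons_by_chrom.items():
        merged = merge_intervals(intervals)
        index[chrom] = ([start for start, _ in merged], merged)
    return index


def overlaps_exon(index, chrom: str, start: int, end: int) -> bool:
    """True if the alignment [start, end) shares at least one base with an exon."""
    if chrom not in index:
        return False
    starts, merged = index[chrom]
    # Last merged exon that begins before the alignment ends.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and merged[i][1] > start


def exonic_fraction(index, alignments: Iterable[Tuple[str, int, int]]) -> float:
    """Fraction of alignments that overlap any annotated exon."""
    alignments = list(alignments)
    if not alignments:
        return 0.0
    hits = sum(overlaps_exon(index, chrom, start, end) for chrom, start, end in alignments)
    return hits / len(alignments)


# Hypothetical annotation and alignments, for illustration only.
toy_index = build_index({"chr1": [(100, 200), (150, 300), (500, 600)]})
toy_alignments = [("chr1", 90, 110), ("chr1", 300, 400), ("chr1", 590, 640), ("chr2", 0, 50)]
toy_fraction = exonic_fraction(toy_index, toy_alignments)
```

One caveat with this approach: gDNA-derived reads that happen to fall inside exons pass the filter too, so exon-only counting reduces the intronic/intergenic noise but doesn't remove the contamination signal from exonic counts.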

I'd like to assume that downstream analysis will utilize only data which aligns to the parts of the genome that one would expect to generate transcripts, but we know what happens when we assume - we break the Golden Rule of Bioinformatics!

On a side note, that MultiQC report is pretty boss! I always forget about all of the modules available! Also, it looks like they used an RNAseq Nextflow pipeline to handle all of that data processing (including some differential gene expression) - definitely pretty slick!

kubu4 · 2022-03-07T16:03:09Z

kubu4
Mar 7, 2022
Maintainer Author

Oh, also, ZymoResearch couldn't make any conclusions regarding rRNA contamination because they indicated the C.virginica genome rRNA annotations are incomplete, so mapping data to rRNA for this project was unreliable.

yaaminiv · 2022-03-07T16:16:50Z

yaaminiv
Mar 7, 2022
Collaborator

> Before I dive into this, @yaaminiv could you please remind me how the female/male gonads were processed for RNA isolation for Zymo project zr4059?

I used the Zymo Quick DNA/RNA Microprep Plus Kit on frozen female gonad (mixed cell type) and frozen sperm. Some relevant lab notebook links (could also go to my lab notebook >> tags >> "labwork" >> it's the series "Virginica Gonad DNA Extractions")
