
Test cases

Thomas Cokelaer edited this page Aug 11, 2020 · 20 revisions

We present a couple of examples to emphasize that FastQC plots should be interpreted with care.

Authors: Thomas Cokelaer, Laure Lemee

GC plot exhibiting non-normal and non-similar shapes

phage streptococcus

Non-constant per-base sequence content

The per-base sequence content plot shows, on the y-axis, the percentage of each of the four normal DNA bases at each base position. FastQC reports a warning if the difference between A and T, or between G and C, is greater than 10% at any position, and an error if it is greater than 20%. Below, we have two genomes from the same library, and the left and right plots are quite different. On the left-hand side, FastQC reports a good run: the ACGT contents are constant along the read position. This is what is expected for a long genome (here 22 Mb). On the right-hand side, FastQC reports an error. Yet, as shown below, the quality is even better (Phred score > 35). In fact, there is nothing wrong here. The sequenced genome is just short and therefore lacks diversity when computing this kind of plot.
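To make the rule above concrete, here is a minimal Python sketch of how per-base content and a pass/warn/fail status could be computed from a list of read sequences. It is not FastQC's actual implementation: the function names `per_base_content` and `content_status` are hypothetical, and the thresholds simply restate the 10% and 20% limits quoted above.

```python
from collections import Counter


def per_base_content(reads):
    """Return one dict per read position with the percentage of A, C, G and T."""
    max_len = max(len(read) for read in reads)
    content = []
    for pos in range(max_len):
        # Only reads long enough to cover this position contribute.
        counts = Counter(read[pos] for read in reads if len(read) > pos)
        total = sum(counts[base] for base in "ACGT")
        content.append(
            {base: (100.0 * counts[base] / total if total else 0.0) for base in "ACGT"}
        )
    return content


def content_status(content, warn=10.0, fail=20.0):
    """Assumed rule: compare |A - T| and |G - C| at every position to the thresholds."""
    worst = max(
        max(abs(pos["A"] - pos["T"]), abs(pos["G"] - pos["C"])) for pos in content
    )
    if worst > fail:
        return "fail"
    if worst > warn:
        return "warn"
    return "pass"
```

With a diverse set of reads the four percentages stay close to each other and the status is "pass"; with a short genome or an amplicon library, a few sequences dominate each position and the same rule returns "fail" even though the sequencing itself is fine.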

Plasmodium Virome

Here is another example where FastQC reports an error although the run is perfectly fine. This concerns a 16S library; here again, because of the short length of the reads and the diversity of genomes, the ACGT lines are neither straight nor random.

16s acgt content

RNA-seq N's present in large proportions

Once a FastQC (and MultiQC) report is available, we usually look at the quality plot. These tools provide a green/orange/red light indicating a no warning/warning/error status. In this RNA-seq experiment with 6 samples, the per base sequence quality plot shows a drop in quality from position 0 to 40, which is pronounced in one of the samples. This gives the impression that one sample is totally wrong, since its quality is below 20 at the beginning of the reads.

A complementary plot is the per base N content, which is shown here below:

Here we see the same samples. The red curve corresponds to the sample that was red in the previous plot. This sample actually has 40% of Ns at the beginning of the reads and is therefore tagged in red (error), suggesting that it should be dropped.
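As an illustration of what the per base N content plot measures, the following Python sketch computes the percentage of N calls at each read position. The function name `per_base_n_content` is hypothetical and the input is assumed to be a plain list of read sequences; FastQC's own code may differ.

```python
def per_base_n_content(reads):
    """Return the percentage of N calls at each position across all reads."""
    max_len = max(len(read) for read in reads)
    percents = []
    for pos in range(max_len):
        # Only reads long enough to cover this position are counted.
        bases = [read[pos] for read in reads if len(read) > pos]
        percents.append(100.0 * bases.count("N") / len(bases))
    return percents
```

If 40% of the reads are short and made only of Ns, the curve sits at 40% over the first positions and falls to 0% beyond the length of those reads.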

In fact, the quality of the library was such that many adapter dimers were created. 40% of the reads actually contain no data: the sequencer created reads made of just Ns, with no genomic content. Yet the other 60% of the reads were totally correct and of high quality. Moreover, the reads made of Ns have a length of 35 bp. Coming back to the first plot, if we ignore the reads with Ns (which have poor quality), the rest of the data has the expected high quality.
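The reasoning above can be checked with a small sketch: set aside the reads dominated by Ns, then recompute the mean quality on the remaining reads. This is only an illustration under stated assumptions. The records are assumed to be `(sequence, quality)` pairs with Phred+33 encoded quality strings, and the names `mean_phred`, `split_n_reads` and the `max_n_fraction` cutoff are hypothetical, not taken from the original analysis.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of one quality string (Phred+33 encoding assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def split_n_reads(records, max_n_fraction=0.5):
    """Split (sequence, quality) records into kept reads and N-dominated reads.

    max_n_fraction is an arbitrary, illustrative cutoff.
    """
    kept, dropped = [], []
    for seq, qual in records:
        # Empty reads carry no information and are dropped as well.
        if not seq or seq.count("N") / len(seq) > max_n_fraction:
            dropped.append((seq, qual))
        else:
            kept.append((seq, qual))
    return kept, dropped


def mean_quality(records):
    """Average of the per-read mean Phred scores."""
    return sum(mean_phred(qual) for _, qual in records) / len(records)
```

Comparing `mean_quality` on all records and on the kept records only shows how much of the apparent quality drop comes from the N-only reads rather than from the reads that carry genomic content.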

Subsequent RNA-seq analysis, which ignored the reads with Ns, showed no difference between this sample and the other 5 samples.

Conclusion: even though the plots indicated very poor quality for one sample, once the Ns were ignored, and assuming the yield of reads is enough for the bioinformatics analysis, the reads were usable and the experiment was validated.
