This project reanalyzes the Affymetrix microarray data from the study "The genomic response of the retinal pigment epithelium to light damage and retinal detachment", which was originally conducted in 2008. The goal of that study was to understand the transcriptional changes that take place in the retinal pigment epithelium in response to ocular light damage. The study's goal was not the motivation for this project, however. The project is a demonstration, carried out to learn how to analyze and interpret Affymetrix microarray data reliably and meaningfully.
Ideally, good-quality Affymetrix microarray samples show similar, near-identical log-intensity peaks. However, the plot above shows that two of the six samples have markedly different peak heights from the rest.
The boxplot above shows the distribution of expression values for each sample. The medians vary from sample to sample. Members of the same replicate group should ideally have similar medians, but DARCR1 and LDRR1 have similar medians despite belonging to different replicate groups, and both differ markedly from the other members of their own groups.
The data were normalized with the robust multi-array average (RMA) method implemented in the rma() function of the affy package. Compared with the pre-normalization boxplot, all samples now have comparable medians, which allows an unbiased differential gene expression analysis.
The PCA plot shows that the first member of each replicate group lies farther from the other members of its own group than the two groups lie from each other. In other words, the within-group variance exceeds the between-group variance, which raised further questions about the quality of these two samples. To confirm this suspicion, we proceeded with hierarchical clustering.
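As a rough illustration of why the boxplots line up after the normalization step described above, the sketch below implements quantile normalization, one of the steps inside RMA. It is not the rma() implementation: RMA also performs background correction and median-polish summarization, which are omitted here. The function name and the intensity values are hypothetical.

```python
def quantile_normalize(samples):
    """Quantile-normalize equal-length intensity lists (one list per array)."""
    n = len(samples[0])
    # Sort each sample, then average across samples at each rank.
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [
        sum(s[i] for s in sorted_samples) / len(samples) for i in range(n)
    ]
    normalized = []
    for s in samples:
        # Probe indices in ascending order of intensity (ties broken by position).
        order = sorted(range(n), key=lambda i: s[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            # Replace each value with the cross-sample mean for its rank.
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical log2 intensities: three arrays, four probes each.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
```

After this step every array contains the same set of values, only assigned to different probes, so the per-sample distributions (and therefore the boxplots) become identical by construction.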
The hierarchical clustering confirmed our suspicion that something was wrong with samples DARCR1 and LDRR1. The quality control step was therefore revisited to uncover why these two samples, which come from different replicate groups, cluster together.
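Hierarchical clustering starts from a matrix of pairwise distances between samples. The sketch below computes one common choice, 1 minus the Pearson correlation, and reports the closest pair, which is the first pair an agglomerative method would merge. The distance measure is an assumption (the analysis does not state which one was used), and the sample names and values are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_distance(x, y):
    """Distance that is 0 for perfectly correlated profiles and 2 for opposite ones."""
    return 1.0 - pearson(x, y)


# Hypothetical profiles: the first replicate of each group looks alike.
profiles = {
    "A_rep1": [7.0, 7.2, 6.9, 7.1],
    "A_rep2": [5.0, 9.0, 6.0, 8.0],
    "B_rep1": [7.1, 7.3, 6.8, 7.2],
    "B_rep2": [9.0, 5.0, 8.0, 6.0],
}

# Pairwise distances between all samples.
distances = {
    pair: correlation_distance(profiles[pair[0]], profiles[pair[1]])
    for pair in combinations(profiles, 2)
}

# The pair with the smallest distance is merged first by agglomerative clustering.
closest = min(distances, key=distances.get)
```

In this toy data the closest pair spans the two groups, which is the same kind of pattern that flags a sample-quality problem rather than a biological effect.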
The simpleaffy package was used to perform a more comprehensive quality control procedure.
However, it is worth noting that simpleaffy is an older package that was removed from Bioconductor in version 3.13, so an earlier Bioconductor release must be installed to use it. This can be challenging, because the latest version of R (4.4.2 at the time of writing) is not compatible with those earlier Bioconductor releases, which require R 4.0.x. One way around this is to use the R Installation Manager (rig) on Ubuntu to install an earlier version of R. In addition, the renv package can manage an earlier Bioconductor release inside the project without interfering with the system-wide package library.
The RNA degradation plot shows a steep slope for every sample in the dataset. Two samples, each the first member of its replicate group, have even steeper slopes than the rest.
The QC statistic plot clearly shows that the present call percentages for all the samples in the dataset are very low. However, the samples DARCR1 and LDRR1 have much lower present calls relative to the other members of their respective replicate groups.
This indicates that while RNA degradation occurred in all the samples of the dataset, the first members of the two replicate groups were heavily affected by RNA degradation. This resulted in them having a similar transcriptional profile despite belonging to different replicate groups.
It is worth noting that the original study did not present any plots or report any low-quality samples. As we will see later, this may explain why the differential expression results are not completely reproducible: only three of the five genes reported as significantly upregulated remain significantly upregulated when the p-value is adjusted for multiple testing.
Of the four genes reported in the original study (Mmp3, Serpina3n, Serpinb1a, Osmr), three can be seen in the volcano plot, which shows that the earlier findings are largely reproduced. Additionally, the comprehensive quality control analysis explains why the samples from different replicate groups exhibited similar transcriptional profiles.
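To show how adjusting for multiple testing can reduce the number of significant genes, the sketch below applies the Benjamini-Hochberg procedure to a handful of p-values. Which adjustment method was used in the analysis is not stated, so Benjamini-Hochberg is an assumption here, and the p-values are hypothetical rather than taken from the dataset.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for six genes.
raw = [0.001, 0.008, 0.02, 0.045, 0.048, 0.30]
adjusted = benjamini_hochberg(raw)

alpha = 0.05
significant_raw = sum(p < alpha for p in raw)
significant_adjusted = sum(p < alpha for p in adjusted)
```

With these made-up values, more genes fall below the 0.05 threshold before adjustment than after it, which is the same mechanism by which a gene reported as significant on raw p-values can drop out once the adjustment is applied.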