This repository is the outcome of a project conducted at the Bioinformatics Institute. The project aims to comprehensively characterize the molecular changes associated with aging around the age of 37 through multiomics analysis. The omics data used include epigenomic (DNA methylation) and RNA-sequencing data.
Why 37 rather than 42, you might ask? Diverse data point to the importance of processes around this age (about 37 years), in particular changes in glucose metabolism and an increase in blood plasma markers of oxidative stress (MDA) from 37 years onward, as well as in the 20-25 year interval, for men and women [1-4]. But perhaps 42 is the correct answer 😊, who knows?

The workflow comprises the following steps:
- Data Collection: Find and download epigenomic and transcriptomic data from aging cohorts, including DNA methylation profiles.
- Batch Correction: Detect and correct for batch effects in the data to ensure accurate analysis.
- Sample Sex Identification: Determine the sex of the samples.
- Main Pipeline Creation: Develop main scripts for individual cohorts and meta-cohorts.
- CpG selection (retain only differing CpGs).
- Detection of CpG groups using various statistical calculations and clustering methods.
- Selection of the optimal statistical calculation and clustering method.
- Detection and plotting of trend lines for each CpG/RNAseq.
- Validation of identified groups on a PBMC cohort.
- Execution of Main Script: Run the main script for all methylation and microarray meta-cohorts, and additionally run the analyses separately for males and females.
- Merge Findings: Combine results for comprehensive omics analysis.
- Functional Gene Detection: Conduct enrichment analysis to elucidate functional pathways associated with identified trends.
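
The CpG selection and trend-detection steps above can be sketched as follows. This is a minimal illustration in plain Python, not the project's notebook code: the standard-deviation cutoff `min_sd`, the function names `select_variable_cpgs` and `linear_trend`, the toy CpG identifiers and values, and the grouping by slope sign are all assumptions made for the example. The project compares several statistical calculations and clustering methods, which this sketch does not reproduce.

```python
from statistics import mean, pstdev


def select_variable_cpgs(beta, min_sd=0.05):
    """Keep CpGs whose methylation differs across samples.

    `beta` maps a CpG id to its methylation values, one per sample.
    The standard-deviation cutoff is an assumed criterion, not the project's.
    """
    return {cpg: values for cpg, values in beta.items() if pstdev(values) >= min_sd}


def linear_trend(ages, values):
    """Return the least-squares (slope, intercept) of methylation against age."""
    age_mean, value_mean = mean(ages), mean(values)
    sxx = sum((a - age_mean) ** 2 for a in ages)
    sxy = sum((a - age_mean) * (v - value_mean) for a, v in zip(ages, values))
    slope = sxy / sxx
    return slope, value_mean - slope * age_mean


# Toy data: four samples with known ages and three hypothetical CpGs.
ages = [20, 30, 40, 50]
beta = {
    "cg_rising": [0.20, 0.30, 0.40, 0.50],
    "cg_flat": [0.50, 0.50, 0.50, 0.50],
    "cg_falling": [0.80, 0.70, 0.60, 0.50],
}

selected = select_variable_cpgs(beta)
trends = {cpg: linear_trend(ages, values) for cpg, values in selected.items()}

# Simplest possible stand-in for clustering: group CpGs by trend direction.
groups = {
    cpg: ("increasing" if slope > 0 else "decreasing")
    for cpg, (slope, _intercept) in trends.items()
}
print(groups)
```

A straight-line fit is only the simplest trend model. Changes concentrated around a particular age would need a non-linear fit or a piecewise model, which is why the workflow evaluates more than one method.
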
The entire process utilizes Jupyter notebooks, and the outputs are visually represented in the schematic image provided below:

DNA methylation is an epigenetic mechanism that involves the addition of a methyl group to DNA molecules, typically at cytosine bases within CpG dinucleotides. This modification plays a crucial role in gene regulation, development, and various cellular processes. In the context of aging, epigenomics explores how these modifications change over time and contribute to age-related processes and diseases.
Microarray methylation data refers to information obtained through the use of microarray technology to profile DNA methylation patterns across the genome. This method allows researchers to analyze DNA methylation levels at multiple genomic loci simultaneously, providing a comprehensive view of epigenetic modifications. Microarray-based DNA methylation profiling involves enriching unmethylated and methylated DNA fractions, which are then interrogated on microarrays containing probes specific to these regions.
More than 6000 samples were found for further analysis. For the initial step, approximately 2000 samples of PBMC were collected.

In transcriptomics, RNA sequencing (RNA-seq) and microarray analysis are essential techniques for understanding changes in gene expression and, as a consequence, metabolic changes in cells and the body.
RNA-seq is a powerful tool for measuring the abundance of RNA transcripts in a sample, providing insights into gene expression levels, alternative splicing, and transcript isoforms. This technique involves sequencing cDNA synthesized from RNA molecules extracted from cells or tissues, allowing researchers to quantify gene expression levels and identify differentially expressed genes under various conditions or disease states.
Microarray transcriptomics data refers to information obtained through the use of microarray technology to profile gene expression patterns across the genome. This method allows researchers to analyze gene expression levels at multiple genomic loci simultaneously, providing a comprehensive view of transcriptional activity. Microarray-based transcriptomic profiling involves hybridizing RNA samples onto microarrays containing probes specific to different genes or transcripts. By comparing gene expression patterns between samples or conditions, researchers can identify changes in gene expression and gain insights into cellular processes, biological pathways, and disease mechanisms.
Nine datasets were annotated and merged on their common genes: GSE56047, GSE16717, GSE67220, GSE56033, GSE30483, GSE47353, GSE68759, GSE7551, GSE65907.
The merged data contain 4588 samples of different ages and genders. The distribution of the individual datasets within the merged dataset is shown below.
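
Merging datasets on their common genes can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions: the function name `merge_on_common_genes`, the nested-dictionary layout, the sample ids, and the expression values are hypothetical, and only two of the dataset ids are used as labels. Real data would also need probe-to-gene annotation and normalization before merging, which this sketch does not cover.

```python
def merge_on_common_genes(datasets):
    """Merge expression datasets, keeping only genes present in every sample.

    `datasets` maps a dataset id to {sample_id: {gene: expression}}.
    Returns the sorted list of common genes and the merged samples,
    keyed as "dataset_id:sample_id" so sample origins stay traceable.
    """
    gene_sets = [
        set(expression)
        for samples in datasets.values()
        for expression in samples.values()
    ]
    if not gene_sets:
        return [], {}
    common = sorted(set.intersection(*gene_sets))

    merged = {}
    for dataset_id, samples in datasets.items():
        for sample_id, expression in samples.items():
            merged[f"{dataset_id}:{sample_id}"] = {g: expression[g] for g in common}
    return common, merged


# Toy input: values and sample ids are invented for illustration only.
datasets = {
    "GSE56047": {
        "sample_a": {"TP53": 7.1, "FLT3": 5.4, "CDKN2A": 6.0},
        "sample_b": {"TP53": 6.9, "FLT3": 5.8, "CDKN2A": 6.3},
    },
    "GSE16717": {
        "sample_c": {"TP53": 7.4, "FLT3": 5.1},
    },
}

common, merged = merge_on_common_genes(datasets)
print(common)
print(len(merged), "samples in the merged dataset")
```

Keeping the dataset id in each sample key makes it straightforward to check for batch effects between source datasets after the merge.
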
Using STRINGDB, we observed that these genes are responsible for changes in cellular regulation, autophagy, cell cycles, and more. Enrichr analysis showed that the top two enrichments were in FLT3 signaling and cellular senescence.
For future investigations, collecting more datasets, including lipidomics and metabolomics data, is essential for a comprehensive understanding of the molecular changes of aging. As next steps, reviewing our analytical pipeline for transcriptomic data and improving the correlation calculations for all data could also enhance the analysis. Even at this stage, our study contributes to understanding aging by identifying key genes and pathways that undergo significant changes around 24 years of age.
If you have any questions or suggestions, or encounter issues with the pipeline or methylation data, feel free to reach out to CaptnClementine 💛. You can also contact me directly on Telegram via this link.
[2] https://pubmed.ncbi.nlm.nih.gov/21451205/
[3] https://sci-hub.ru/https://www.nature.com/articles/s41591-019-0673-2