This repository provides an R-based machine learning pipeline for classifying Acute Myeloid Leukemia (AML) subtypes from gene expression data in the GSE13159 dataset (Affymetrix HG-U133 Plus 2.0 platform). The pipeline trains Random Forest (RF) and Support Vector Machine (SVM) classifiers and combines them with feature selection, class balancing, and hyperparameter tuning to improve classification accuracy.
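As a rough illustration of the modelling step, the sketch below fits an RF and an SVM classifier on a samples-by-genes matrix and compares their held-out accuracy. It is not the repository's script: the data are synthetic, the three class labels, the 70/30 split, `ntree = 200`, and the linear kernel are all assumptions chosen for the example, and feature selection, class balancing, and hyperparameter tuning are left out for brevity.

```r
library(randomForest)
library(e1071)

set.seed(42)

# Synthetic stand-in for an expression matrix (assumption, not GSE13159 data):
# 120 samples x 50 "genes", three equally sized classes.
n_per_class <- 40
n_genes <- 50
labels <- factor(rep(c("A", "B", "C"), each = n_per_class))
expr <- matrix(rnorm(length(labels) * n_genes), nrow = length(labels))
colnames(expr) <- paste0("gene", seq_len(n_genes))

# Shift the first five genes by class so the classifiers have signal to learn.
expr[, 1:5] <- expr[, 1:5] + as.integer(labels) * 1.5

# Simple random 70/30 train/test split.
train_idx <- sample(seq_along(labels), size = round(0.7 * length(labels)))
x_train <- expr[train_idx, ]
y_train <- labels[train_idx]
x_test <- expr[-train_idx, ]
y_test <- labels[-train_idx]

# Fit both model families with fixed, untuned settings.
rf_fit <- randomForest(x = x_train, y = y_train, ntree = 200)
svm_fit <- svm(x = x_train, y = y_train, kernel = "linear")

# Held-out accuracy for each model.
rf_acc <- mean(predict(rf_fit, x_test) == y_test)
svm_acc <- mean(predict(svm_fit, x_test) == y_test)

cat("RF accuracy: ", rf_acc, "\n")
cat("SVM accuracy:", svm_acc, "\n")
```

On real expression data the feature matrix would come from the preprocessed GSE13159 samples, and the fixed settings above would be replaced by tuned values.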
The repository includes an R script that automates data preprocessing, model training, and evaluation. It requires RStudio (version 2024.12.0-467) and the following R libraries: GEOquery, limma, randomForest, caret, glmnet, ggplot2, pheatmap, e1071, smotefamily, AnnotationDbi, and reshape2. Running the script executes the full AML classification pipeline and generates biomarker selection outputs, classification accuracy results, and visualizations of feature importance and model performance. Users working in bioinformatics, computational biology, or cancer classification can adapt this pipeline for further research.
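Before running the script, it can help to confirm that the listed libraries are installed. The following is a minimal sketch, not part of the repository: the helper name `missing_packages` is hypothetical, and the grouping reflects that GEOquery, limma, and AnnotationDbi are distributed through Bioconductor while the others are on CRAN.

```r
# Libraries named in this README, grouped by where they are distributed.
cran_pkgs <- c("randomForest", "caret", "glmnet", "ggplot2", "pheatmap",
               "e1071", "smotefamily", "reshape2")
bioc_pkgs <- c("GEOquery", "limma", "AnnotationDbi")

# Hypothetical helper: return the packages in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

missing_cran <- missing_packages(cran_pkgs)
missing_bioc <- missing_packages(bioc_pkgs)

# Report what is missing and how to install it; nothing is installed here.
if (length(missing_cran) > 0) {
  message("Missing CRAN packages: ", paste(missing_cran, collapse = ", "))
  message("Install with: install.packages(c(",
          paste0('"', missing_cran, '"', collapse = ", "), "))")
}
if (length(missing_bioc) > 0) {
  message("Missing Bioconductor packages: ", paste(missing_bioc, collapse = ", "))
  message("Install with: BiocManager::install(c(",
          paste0('"', missing_bioc, '"', collapse = ", "), "))")
}
if (length(missing_cran) == 0 && length(missing_bioc) == 0) {
  message("All listed packages are installed.")
}
```

The check only reports missing packages so that installation stays an explicit step under the user's control.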
This project is released under the MIT License, which permits free use, modification, and distribution. Researchers and data scientists are encouraged to contribute improvements, explore additional datasets, and integrate alternative machine learning models to enhance AML classification accuracy. To reproduce the results, download the dataset from the GEO database (accession GSE13159); all preprocessing steps are documented within the code. Further inquiries or contributions can be directed to the repository owner at lelouis.lnv@gmail.com.
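For readers who want to fetch the dataset programmatically, the sketch below shows one way to retrieve GSE13159 with GEOquery. It is an illustration under assumptions rather than the repository's own loading code: the function name `load_gse13159` is hypothetical, it assumes the series matrix of interest is the first element returned by `getGEO`, and it relies on Biobase (a GEOquery dependency not listed above) to extract the expression matrix and sample annotations. The download is large, so the function is defined here but not called.

```r
# Hypothetical helper: download a GEO series and return its expression
# matrix and sample annotations. Requires network access when called.
load_gse13159 <- function(accession = "GSE13159", destdir = tempdir()) {
  if (!requireNamespace("GEOquery", quietly = TRUE) ||
      !requireNamespace("Biobase", quietly = TRUE)) {
    stop("GEOquery and Biobase must be installed to download GEO data.")
  }

  # getGEO() returns a list of ExpressionSet objects, one per platform.
  gse <- GEOquery::getGEO(accession, GSEMatrix = TRUE, destdir = destdir)

  # Assumption: the first element is the series matrix of interest.
  eset <- gse[[1]]

  list(
    expression = Biobase::exprs(eset),  # probes x samples matrix
    phenotype  = Biobase::pData(eset)   # per-sample annotations
  )
}

# Example usage (commented out because it triggers a large download):
# geo_data <- load_gse13159()
# dim(geo_data$expression)
```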