p-ClustVal: A Novel p-adic Approach for Enhanced Clustering of High-Dimensional Single Cell RNASeq Data

p-ClustVal is a novel data transformation technique inspired by p-adic number theory that enhances cluster discernibility in genomics data, specifically single-cell RNA sequencing (scRNASeq). By leveraging the p-adic valuation, p-ClustVal integrates with and augments widely used clustering algorithms and dimension reduction techniques, making them more effective at discovering meaningful structure in data. The transformation uses a data-centric heuristic to determine optimal parameters without relying on ground truth labels, which makes it easier to use. p-ClustVal reduces overlap between clusters by employing alternate metric spaces inspired by the p-adic valuation, a significant shift from conventional methods. Our evaluation, spanning 30 experiments and over 1400 observations, shows that p-ClustVal improves performance in 91% of cases and boosts the performance of both classical and state-of-the-art (SOTA) methods. This work contributes to data analytics and genomics by introducing a unique data transformation approach, enhancing downstream clustering algorithms, and providing empirical evidence of p-ClustVal's efficacy. The study concludes with insights into the limitations of p-ClustVal and future research directions.
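For readers unfamiliar with the underlying idea, the following is a minimal sketch of the standard p-adic valuation and the p-adic distance it induces on integers. It only illustrates the number-theoretic notion; it is not the p-ClustVal transform itself, and the function names `padic_valuation` and `padic_distance` are hypothetical helpers that do not come from this repository.

```python
from fractions import Fraction


def padic_valuation(n: int, p: int) -> int:
    """Return the largest k such that p**k divides the nonzero integer n."""
    if p < 2:
        raise ValueError("p must be a prime (at least 2)")
    if n == 0:
        # By convention the valuation of 0 is infinite; we refuse it here.
        raise ValueError("the p-adic valuation of 0 is infinite")
    n = abs(n)
    k = 0
    while n % p == 0:
        n //= p
        k += 1
    return k


def padic_distance(a: int, b: int, p: int) -> Fraction:
    """Return |a - b|_p = p**(-v_p(a - b)), or 0 when a == b."""
    if a == b:
        return Fraction(0)
    return Fraction(1, p ** padic_valuation(a - b, p))


if __name__ == "__main__":
    # 12 = 2^2 * 3, so the 2-adic valuation of 12 is 2.
    print(padic_valuation(12, 2))
    # 9 - 1 = 8 = 2^3, so 1 and 9 are 2-adically close (distance 1/8).
    print(padic_distance(1, 9, 2))
```

Under this metric, two integers are close when their difference is divisible by a high power of p, which is quite different from the usual absolute-value distance.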

Workflow of p-ClustVal transformation

Figure 1: Algorithmic workflow in p-ClustVal. Steps 1-3: processing the raw data to filter low-quality cells and genes, followed by data normalization and scaling. Step 4: finding the optimal parameters in a data-centric manner and applying the p-ClustVal transform. Step 5: applying clustering on the transformed data.

Dependency

The code base is written in Python 3, and illustrative scripts are shared in the respective directories. For benchmarking other published packages, the following R packages need to be installed. They can be installed from the standard R command line.

- DR.SC (install.packages("DR.SC"))

- RaceID (install.packages("RaceID"))

- SIMLR (install.packages("SIMLR"))

- Seurat (install.packages("Seurat"))

How to replicate the results

  • Python scripts for running specific experiments are present in relevant directories, for example:
  1. benchmark_accuracy: contains scripts for running the clustering experiments.
  2. benchmark_dim_reduction: contains scripts for running the dimensionality reduction experiments.
  3. benchmark_scRNA_packages: contains the scripts for benchmarking state-of-the-art packages for clustering single cell data.

We show an example of reproducing the results for the clustering experiment. The process for other experiments is the same, except for the name of the script that needs to be run. To generate the results shown in Figure 6 of the manuscript, follow these steps:

1. Create a new folder named 'raw_data'. All datasets should be stored inside this folder. Links to the datasets used in this study are given in the [Paper](https://link.springer.com/article/10.1007/s41060-024-00709-4).

2. Change directory to 'benchmark_accuracy' and copy the script 'benchmark_kpp_all_algos.py' into the directory that contains 'raw_data'.

```
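cd benchmark_accuracy
# Copy the script into the folder that contains the 'raw_data' directory.
# The destination below is a placeholder; replace it with your own path.
cp benchmark_kpp_all_algos.py /path/to/folder_containing_raw_data/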
```

3. Run the script on the command line from the directory that contains 'raw_data'.

```
python3 benchmark_kpp_all_algos.py
```
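Before running the script, a quick check of the directory layout can catch setup mistakes. The following is a minimal sketch and is not part of the repository: the function name `check_layout` is hypothetical, and it assumes only what the steps above state, namely that 'benchmark_kpp_all_algos.py' and the 'raw_data' folder sit side by side.

```python
from pathlib import Path


def check_layout(base_dir, script_name="benchmark_kpp_all_algos.py",
                 data_dir="raw_data"):
    """Return a list of human-readable problems found in base_dir.

    An empty list means the script and the data folder are both present.
    """
    base = Path(base_dir)
    problems = []
    if not (base / data_dir).is_dir():
        problems.append(f"missing data folder: {base / data_dir}")
    if not (base / script_name).is_file():
        problems.append(f"missing script: {base / script_name}")
    return problems


if __name__ == "__main__":
    # Check the current working directory and report anything missing.
    issues = check_layout(".")
    if issues:
        for issue in issues:
            print(issue)
    else:
        print("Layout looks fine.")
```

Run it from the folder where you plan to launch the benchmark. It does not verify that the datasets inside 'raw_data' are complete.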

Related Work

Check out the preprint on bioRxiv.

Check out the paper on the journal website: [Paper](https://link.springer.com/article/10.1007/s41060-024-00709-4)

Contact Details

For help running the code or to report issues, please contact us at parishar[at]iu[dot]edu. We would be happy to help you out.
