GitHub - parichit/p-ClustVal-Enhanced-ML-Clustering-for-High-Dimensional-RNA-Seq

p-ClustVal: A Novel p-adic Approach for Enhanced Clustering of High-Dimensional Single Cell RNASeq Data

p-ClustVal is a novel data transformation technique inspired by p-adic number theory that significantly enhances cluster discernibility in genomics data, specifically Single Cell RNA Sequencing (scRNASeq). By leveraging p-adic valuation, p-ClustVal integrates with and augments widely used clustering algorithms and dimension reduction techniques, amplifying their effectiveness in discovering meaningful structure from data. The transformation uses a data-centric heuristic to determine optimal parameters without relying on ground truth labels, which makes it more user-friendly. p-ClustVal reduces overlap between clusters by employing alternate metric spaces inspired by p-adic valuation, a significant shift from conventional methods. Our comprehensive evaluation, spanning 30 experiments and over 1400 observations, shows that p-ClustVal improves performance in 91% of cases and boosts the performance of classical and state-of-the-art (SOTA) methods. This work contributes to data analytics and genomics by introducing a unique data transformation approach, enhancing downstream clustering algorithms, and providing empirical evidence of p-ClustVal’s efficacy. The study concludes with insights into the limitations of p-ClustVal and future research directions.
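For readers unfamiliar with the underlying idea: the p-adic valuation of a nonzero integer n is the largest exponent k such that p^k divides n, and the p-adic absolute value is p^(-k). The snippet below is a minimal sketch of these textbook definitions only. It is not the p-ClustVal transform itself, and the function names `p_adic_valuation` and `p_adic_abs` are hypothetical, chosen for illustration.

```python
def p_adic_valuation(n: int, p: int) -> int:
    """Return the largest k such that p**k divides n (n must be nonzero)."""
    if p < 2:
        raise ValueError("p must be an integer >= 2 (normally a prime)")
    if n == 0:
        raise ValueError("the valuation of 0 is infinite by convention")
    n = abs(n)
    k = 0
    # Strip factors of p one at a time and count them.
    while n % p == 0:
        n //= p
        k += 1
    return k


def p_adic_abs(n: int, p: int) -> float:
    """p-adic absolute value |n|_p = p ** (-v_p(n)), with |0|_p = 0."""
    if n == 0:
        return 0.0
    return float(p) ** (-p_adic_valuation(n, p))
```

For example, 24 = 2^3 * 3, so `p_adic_valuation(24, 2)` is 3 and `p_adic_abs(24, 2)` is 0.125. Numbers that are highly divisible by p are "small" in this metric, which is what makes the induced geometry differ from the usual Euclidean one.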

Workflow of p-ClustVal transformation

Figure 1: Algorithmic workflow in p-ClustVal. Steps 1-3: process the raw data to filter low-quality cells and genes, followed by data normalization and scaling. Step 4: find the optimal parameters in a data-centric manner and apply the p-ClustVal transform. Step 5: apply clustering on the transformed data.
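Steps 1-3 of the workflow can be sketched with NumPy as follows. This is a generic preprocessing sketch under stated assumptions, not the repository's implementation: the function name `preprocess_counts`, the thresholds, and the choice of library-size normalization with `log1p` are all illustrative. Steps 4 and 5 (the p-ClustVal transform and clustering) are deliberately left out.

```python
import numpy as np


def preprocess_counts(counts, min_genes_per_cell=1, min_cells_per_gene=1,
                      target_sum=1e4):
    """Filter, normalize, and scale a cells x genes count matrix (illustrative)."""
    counts = np.asarray(counts, dtype=float)

    # Step 1: drop cells that express too few genes (assumed threshold).
    keep_cells = (counts > 0).sum(axis=1) >= min_genes_per_cell
    counts = counts[keep_cells]

    # Step 1 (cont.): drop genes detected in too few of the remaining cells.
    keep_genes = (counts > 0).sum(axis=0) >= min_cells_per_gene
    counts = counts[:, keep_genes]

    # Step 2: library-size normalization followed by log1p (an assumed choice).
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # guard against cells left empty after filtering
    normalized = np.log1p(counts / totals * target_sum)

    # Step 3: scale each gene to zero mean and unit variance.
    mean = normalized.mean(axis=0)
    std = normalized.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant genes
    return (normalized - mean) / std
```

The returned matrix has one row per retained cell and one column per retained gene, and is the kind of input that a transform and a clustering algorithm would then consume.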

Dependency

The code base is written in Python 3, and illustrative scripts are shared in the respective directories. To benchmark other published packages, the following R packages need to be installed. They can be installed from the standard R command line.

- DR.SC (install.packages("DR.SC"))

- RaceID (install.packages("RaceID"))

- SIMLR (install.packages("SIMLR"))

- Seurat (install.packages("Seurat"))
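Before running the R-based benchmarks, it can help to confirm that R can load each package. The following is a minimal sketch, assuming `Rscript` is on the PATH; the helper names `r_check_command` and `missing_r_packages` are hypothetical and not part of this repository.

```python
import shutil
import subprocess

R_PACKAGES = ["DR.SC", "RaceID", "SIMLR", "Seurat"]


def r_check_command(package):
    """Build an Rscript call that prints TRUE or FALSE for one package."""
    expr = f'cat(requireNamespace("{package}", quietly = TRUE))'
    return ["Rscript", "-e", expr]


def missing_r_packages(packages=R_PACKAGES):
    """Return the packages that R cannot load; requires Rscript on PATH."""
    if shutil.which("Rscript") is None:
        raise RuntimeError("Rscript not found on PATH; install R first")
    missing = []
    for pkg in packages:
        result = subprocess.run(r_check_command(pkg),
                                capture_output=True, text=True)
        # requireNamespace() returns a logical, which cat() prints as TRUE/FALSE.
        if result.stdout.strip() != "TRUE":
            missing.append(pkg)
    return missing
```

Calling `missing_r_packages()` returns an empty list when all four packages are available, and otherwise the names that still need to be installed.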

How to replicate the results

Python scripts for running specific experiments are present in relevant directories, for example:

benchmark_accuracy: contains scripts for running the clustering experiments.
benchmark_dim_reduction: contains scripts for running the dimensionality reduction experiments.
benchmark_scRNA_packages: contains the scripts for benchmarking state-of-the-art packages for clustering single cell data.

We will show an example of reproducing the results for the clustering experiment. The process for the other experiments is the same, except for the name of the script that needs to be run. To generate the results shown in Figure 6 of the manuscript, follow these steps:

1. Create a new folder named 'raw_data'. All datasets should be stored inside this folder. Links to the datasets used in this study are given in the [Paper](https://link.springer.com/article/10.1007/s41060-024-00709-4).

2. Change the directory to 'benchmark_accuracy' and copy the script 'benchmark_kpp_all_algos.py' to the directory that contains 'raw_data'.

```
cd benchmark_accuracy
# Copy the benchmark_kpp_all_algos.py script to the folder containing the 'raw_data' directory.
```

3. Run the script on the command line

```
python3 benchmark_kpp_all_algos.py
```
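The three steps above can also be scripted. The sketch below assumes the repository has been cloned locally and that the datasets are downloaded into 'raw_data' by hand before the run; the function names `prepare_workspace` and `run_benchmark` are hypothetical helpers, not part of the repository.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def prepare_workspace(repo_root, workspace):
    """Create raw_data/ in the workspace and copy the benchmark script beside it."""
    repo_root, workspace = Path(repo_root), Path(workspace)

    # Step 1: the folder that will hold the downloaded datasets.
    raw_data = workspace / "raw_data"
    raw_data.mkdir(parents=True, exist_ok=True)

    # Step 2: place the script in the directory that contains raw_data/.
    script = repo_root / "benchmark_accuracy" / "benchmark_kpp_all_algos.py"
    target = workspace / script.name
    shutil.copy2(script, target)
    return target


def run_benchmark(script_path):
    """Step 3: run the script with the current interpreter from its own folder."""
    script_path = Path(script_path)
    subprocess.run([sys.executable, script_path.name],
                   cwd=script_path.parent, check=True)
```

A typical call would be `run_benchmark(prepare_workspace(".", "experiments"))`, where "experiments" is an arbitrary folder name chosen for this example.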

Related Work

Check out the preprint on BioArchive: Paper

Check out the paper on the journal website: [Paper](https://link.springer.com/article/10.1007/s41060-024-00709-4)

Contact Details

For help with running the code, or to report issues, please contact us at parishar[at]iu[dot]edu. We would be happy to help.

