MOFUN-CCC

Multi Omics FUsion neural Network - Computational Cell Counting

MOFUN-CCC is a multi-modal deep learning algorithm that operates under a supervised framework, leveraging intermediate fusion techniques to process bulk gene expression and bulk DNA methylation data. Its primary objective is to generate absolute cell counts as its output.

Table of Contents

Introduction
Getting Started
- Prerequisites
- Installation
Usage
Roadmap
Contributing
License
Contact
Acknowledgments

Introduction

MOFUN-CCC (Multi Omics FUsion Neural network - Computational Cell Counting) is a supervised multi-modal deep learning algorithm that uses intermediate fusion to process bulk gene expression and bulk DNA methylation data. Its output is a set of absolute cell counts.

Before training, informative features are selected using marginal linear regression. To reduce the distribution mismatch between the two modalities, the gene expression data (TPM for RNA-seq or intensity for microarray) are log-transformed, and the DNA methylation data are represented as M values. Both modalities are then min-max scaled so that values fall in the range 0 to 1. To mitigate differences in count magnitude across cell types, the counts of each cell type are further divided by that cell type's median count, so that every cell type has an approximate mean count of 1.
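The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function names, the pseudocount of 1, the clipping epsilon for Beta values, and the choice to min-max scale one feature across samples are all assumptions.

```python
import math
from statistics import median


def log_tpm(tpm, pseudocount=1.0):
    """Log-transform a TPM value (the pseudocount is an assumption)."""
    return math.log2(tpm + pseudocount)


def beta_to_m(beta, eps=1e-6):
    """Convert a methylation Beta value to an M value: log2(beta / (1 - beta))."""
    beta = min(max(beta, eps), 1.0 - eps)  # clip to avoid division by zero
    return math.log2(beta / (1.0 - beta))


def min_max(values):
    """Rescale a list of values to the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def median_scale(counts):
    """Divide one cell type's counts by their median; return scaled counts and the factor."""
    factor = median(counts)
    return [c / factor for c in counts], factor


# Toy data: one gene, one CpG and one cell type measured in four samples.
gene_tpm = [0.0, 3.0, 7.0, 15.0]
cpg_beta = [0.2, 0.5, 0.8, 0.5]
cell_counts = [2000.0, 4000.0, 6000.0, 8000.0]

gene_scaled = min_max([log_tpm(v) for v in gene_tpm])
cpg_scaled = min_max([beta_to_m(b) for b in cpg_beta])
counts_scaled, factor = median_scale(cell_counts)

# Predictions made in the scaled space map back to raw counts by multiplying by the factor.
counts_restored = [c * factor for c in counts_scaled]
```

Keeping the median factor of every cell type alongside the trained model is what makes the final back-transformation to raw counts possible.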

A data augmentation step runs before training. It rests on the natural assumption that the gene expression and DNAm data from the same sample share the same underlying cellular composition. Based on that, the approach triples the training data by generating two additional zero-masked copies of each sample, each copy keeping only one modality as input. This expands the training set, which mitigates the risk of overfitting, and it also equips the model to make predictions when only a single modality is available.
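A minimal sketch of the zero-mask augmentation described above, assuming each training sample is a `(rna, dnam, counts)` tuple of plain lists. The function name and data layout are hypothetical and chosen only for illustration.

```python
def zero_mask_augment(samples):
    """Triple a paired dataset by adding single-modality copies.

    For every (rna, dnam, counts) sample, keep the original pair and add
    one copy with the DNAm features zeroed (RNA only) and one copy with
    the RNA features zeroed (DNAm only). The target counts are unchanged.
    """
    augmented = []
    for rna, dnam, counts in samples:
        augmented.append((list(rna), list(dnam), list(counts)))         # both modalities
        augmented.append((list(rna), [0.0] * len(dnam), list(counts)))  # RNA only
        augmented.append(([0.0] * len(rna), list(dnam), list(counts)))  # DNAm only
    return augmented


train = [
    ([0.1, 0.9], [0.3, 0.4, 0.5], [1.2]),
    ([0.7, 0.2], [0.6, 0.1, 0.8], [0.8]),
]
augmented = zero_mask_augment(train)
print(len(train), "->", len(augmented))  # 2 -> 6
```

Because the masked copies keep the same targets, the model sees the same cell counts explained by both modalities, by RNA alone, and by DNAm alone.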

The embedding module reduces each high-dimensional input to a low-dimensional feature vector. An intermediate fusion module enables information exchange between the gene and DNAm modalities, while shortcut connections let the fused features retain information from the original features, which makes the model more robust when only a single modality is available. Multiple fusion blocks generate mixed-information features, which are concatenated and passed to the output module for the final prediction. The predicted cell counts are then transformed back to the raw scale using the median scaling factors.
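To make the data flow concrete, here is a toy forward pass written in plain Python. It is only a sketch under assumptions: the layer sizes, the way a fusion block mixes the two embeddings, and the random untrained weights are all invented for illustration, and the real architecture is defined by the model structure file used by the training script.

```python
import random


def linear(x, weights, bias):
    """Dense layer: weights holds one row of input weights per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(rng, n_in, n_out):
    """Random weights and zero biases for an untrained layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def fusion_block(h_rna, h_dnam, params):
    """Mix both embeddings, then add each original back (shortcut connection)."""
    mixed = relu(linear(h_rna + h_dnam, *params))  # list concatenation of the two inputs
    d = len(h_rna)
    new_rna = [m + h for m, h in zip(mixed[:d], h_rna)]
    new_dnam = [m + h for m, h in zip(mixed[d:], h_dnam)]
    return new_rna, new_dnam


def forward(rna, dnam, model):
    h_rna = relu(linear(rna, *model["embed_rna"]))     # embedding module, RNA branch
    h_dnam = relu(linear(dnam, *model["embed_dnam"]))  # embedding module, DNAm branch
    fused = []
    for params in model["fusion"]:
        h_rna, h_dnam = fusion_block(h_rna, h_dnam, params)
        fused.extend(h_rna + h_dnam)  # concatenate the output of every fusion block
    return linear(fused, *model["output"])  # output module: one value per cell type


rng = random.Random(424)
n_rna, n_dnam, dim, n_blocks, n_cell_types = 6, 8, 4, 2, 5
model = {
    "embed_rna": init_layer(rng, n_rna, dim),
    "embed_dnam": init_layer(rng, n_dnam, dim),
    "fusion": [init_layer(rng, 2 * dim, 2 * dim) for _ in range(n_blocks)],
    "output": init_layer(rng, n_blocks * 2 * dim, n_cell_types),
}

rna = [rng.random() for _ in range(n_rna)]
dnam = [rng.random() for _ in range(n_dnam)]
both = forward(rna, dnam, model)
rna_only = forward(rna, [0.0] * n_dnam, model)  # zero-masked DNAm input
```

The shortcut addition means that even if the mixing layer contributes nothing, each branch still carries its own embedding forward, which is the property the single-modality case relies on.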


Getting Started

Installation

Clone the repo

# Clone the repository and enter it
git clone https://github.com/yuemolin/MOFUN-CCC.git
cd MOFUN-CCC

Environment setup

# Create and activate the conda environment
conda create -n MOFUN_CCC python=3.10.8 -y
conda activate MOFUN_CCC

Install required packages

# Install PyTorch, then the remaining requirements (run from the repository root, where requirements.txt lives)
pip3 install torch torchvision torchaudio
pip install -r requirements.txt


Usage

MOFUN-CCC has two primary functions:

Predict Cell Counts from the trained model:
Utilize pretrained models to predict cell counts from both bulk gene expression and bulk DNA methylation data. Our algorithm is optimized for the most accurate predictions when both gene expression and DNA methylation data are provided. However, it also performs robustly with single modality input.

Input: (both, or at least one of them)
- Gene expression matrix (csv file, genes as rows, samples as columns)
- DNA methylation matrix (csv file, CpG sites as rows, samples as columns)
Output:
- Predicted count matrix (csv file, samples as rows, the 5 cell types as columns)

python Main_predict.py \
--RNA <Your_RNA_file.csv> \
--DNAm <Your_DNAm_file.csv> \
--Model_Path <Your_model_folder> \
--Output <Prediction_results.txt>

Train Custom Models:
You can also train your own model from your local datasets.

Input:
- Gene expression data
- Methylation data
- Cell counts data
Output:

A PyTorch model .pth file that you can use later to generate predictions

  python Main_train.py \
  --Count <Your_Count_file.csv> \
  --RNA <Your_RNA_file.csv> \
  --DNAm <Your_DNAm_file.csv> \
  --Output <Your_output_folder>

Please note that the input data trio must originate from the same individuals for accurate model training.
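Before training, it can help to confirm that the three inputs really cover the same samples. The sketch below does this with the standard `csv` module. It assumes each file has a header row and a leading row-name column, and the helper names, cell type names, and in-memory example data are hypothetical.

```python
import csv
import io


def column_samples(handle, delimiter=","):
    """Sample IDs from the header row of a feature-by-sample matrix."""
    header = next(csv.reader(handle, delimiter=delimiter))
    return header[1:]  # the first cell is assumed to label the row-name column


def row_samples(handle, delimiter=","):
    """Sample IDs from the first column of a sample-by-cell-type matrix."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader)  # skip the header holding the cell type names
    return [row[0] for row in reader if row]


def shared_samples(rna_ids, dnam_ids, count_ids):
    """Samples present in all three inputs, in the order of the count file."""
    common = set(rna_ids) & set(dnam_ids)
    return [s for s in count_ids if s in common]


# In-memory stand-ins for the three files; use open(path, newline="") for real data.
rna_file = io.StringIO("gene,S1,S2,S3\nCD3E,1.0,2.0,3.0\n")
dnam_file = io.StringIO("cpg,S2,S3,S4\ncg00000029,0.1,0.2,0.3\n")
count_file = io.StringIO(
    "sample,Neutrophil,Lymphocyte\nS1,4000,2000\nS2,3500,1800\nS3,5000,2500\n"
)

common = shared_samples(
    column_samples(rna_file), column_samples(dnam_file), row_samples(count_file)
)
print(common)  # ['S2', 'S3']
```

Pass `delimiter="\t"` if your files are tab-separated.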

Because gene expression and DNA methylation data are high-dimensional, MOFUN-CCC first runs a marginal linear regression to prefilter the markers during training. This step is time-consuming, so you may instead provide marker files directly. Check the current folder for the expected format.

  python Main_train.py \
  --Count <Your_Count_file.csv> \
  --RNA <Your_RNA_file.csv> \
  --DNAm <Your_DNAm_file.csv> \
  --GEP_Marker <The_GEP_Marker.txt> \
  --DNAm_Marker <The_DNAm_Marker.txt> \
  --Output <Your_output_folder>
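For intuition about the prefiltering step, marginal linear regression fits one simple regression per feature against the cell counts and keeps the strongest features. The sketch below ranks features by the absolute t-statistic of the slope, which for a fixed number of samples gives the same ordering as the two-sided p-value. It is an illustration under assumptions, not the selection code used by MOFUN-CCC, and the function and feature names are hypothetical.

```python
import math


def marginal_t_stat(x, y):
    """t-statistic of the slope from a simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant feature or constant target carries no signal
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    if rss == 0.0:
        return math.copysign(math.inf, slope)  # perfect linear fit
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    return slope / standard_error


def select_markers(features, counts, top_n):
    """Keep the top_n features with the largest absolute marginal t-statistic."""
    scores = {name: abs(marginal_t_stat(values, counts)) for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


counts = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "gene_A": [1.1, 1.9, 3.2, 3.9, 5.1],  # rises with the counts
    "gene_B": [2.0, 1.0, 2.0, 1.0, 2.0],  # unrelated to the counts
    "gene_C": [5.0, 4.2, 2.9, 2.1, 0.8],  # falls as the counts rise
}
markers = select_markers(features, counts, top_n=2)
```

Each feature needs its own fit, so the cost grows with the number of genes and CpGs, which is why supplying precomputed marker files saves time.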


Parameters Detail

Main_train.py

Input/Output Parameters

--Count (str, required):
Path to a tab-separated text file containing cell counts data (rows: Samples, columns: Cell types).

--RNA (str, required):
Path to a tab-separated text file containing gene TPM data (rows: Genes, columns: Samples).

--DNAm (str, required):
Path to a tab-separated text file containing CpG Beta data (rows: CpGs, columns: Samples).

--Output (str, required):
Output folder path for saving the trained model.

Marker Parameters

--GEP_Marker (str):
Path to the marker file, a list of gene names, or an association matrix (generated when left blank).

--DNAm_Marker (str):
Path to the marker file, a list of CpG names, or an association matrix (generated when left blank).

--Marker_Method (str, choices: ["FC", "P"]):
Marker selection method, where "FC" stands for fold change, and "P" stands for p-value.

--RNA_Marker_num (int, default: 6000):
Number of RNA markers to select based on the Marker_Method.

--DNAm_Marker_num (int, default: 6000):
Number of DNAm markers to select based on the Marker_Method.

Training Data Operation Parameters

--RNA_transform (str, default: "Range", choices: ["Identity", "MeanStd", "Range"]):
RNA transformation method.

--DNAm_transform (str, default: "Range", choices: ["Identity", "MeanStd", "Range", "Beta"]):
DNAm transformation method.

--transform_by_feature (boolean):
Transform by feature if set; otherwise, by sample.

--Data_augmentation (str, default: "Zero", choices: ["Zero", "Noise", "No"]):
Data augmentation method.

--scale_cellcounts (boolean, default: True):
Scale cell counts by total cell counts.
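The difference between transforming by feature and by sample is easiest to see on a small matrix. The sketch below assumes that "Range" means min-max scaling, "MeanStd" means z-scoring, and "Identity" leaves values unchanged. These readings of the option names are assumptions, and the `transform` helper is hypothetical.

```python
from statistics import mean, pstdev


def range_scale(values):
    """Min-max scale to the range 0 to 1 (constant input maps to 0)."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def mean_std_scale(values):
    """Z-score using the population standard deviation (an assumption)."""
    mu, sd = mean(values), pstdev(values)
    return [0.0 if sd == 0 else (v - mu) / sd for v in values]


def transform(matrix, method, by_feature):
    """matrix: list of rows, one per feature; columns are samples."""
    scale = {"Identity": list, "Range": range_scale, "MeanStd": mean_std_scale}[method]
    if by_feature:
        return [scale(row) for row in matrix]
    # By sample: scale each column, then restore the original orientation.
    columns = [scale(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]


matrix = [
    [0.0, 10.0],   # feature 1 in samples S1 and S2
    [5.0, 20.0],   # feature 2
    [10.0, 30.0],  # feature 3
]
by_feature = transform(matrix, "Range", by_feature=True)
by_sample = transform(matrix, "Range", by_feature=False)
print(by_feature)  # [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
print(by_sample)   # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```

Scaling by feature removes differences in magnitude between features, while scaling by sample removes overall differences between samples, so the two settings are not interchangeable.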

Model Parameters

--Model (str, default: "./Models/Default_structure.json"):
Model structure file path (JSON format).

--Loss (str, default: "L1loss", choices: ["L1loss", "L2loss", "CrossEntropy"]):
Loss function.

--Activation (str, default: "relu", choices: ["relu", "leakyRelu", "Elu", "Celu", "Gelu"]):
Activation function.

--Dropout (float, default: 0.2):
Dropout rate of the first layer.

--Learning_rate (float, default: 5e-5):
Learning rate.

--Batch_num (int, default: 80):
Number of batches.

--Epochs (int, default: 100):
Number of epochs.

Other Parameters

--device (str, default: "detect", choices: ["cuda", "cpu", "detect"]):
Device to use for training.

--seed (int, default: 424):
Random seed.

Main_predict.py

Input Parameters

--RNA (str):
Path to a tab-separated text file containing gene TPM data (rows: Genes, columns: Samples).

--DNAm (str):
Path to a tab-separated text file containing CpG Beta data (rows: CpGs, columns: Samples).

Output Parameter

--Output (str, required):
Output filename for storing prediction results.

Model Parameters

--Model_Path (str, required):
Path to the saved model, which should contain at least the following files: Model_str.pkl, Model_dict.pt, and TrainShell.pkl.
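A quick way to check a model folder before running prediction is to test for the three files listed above. The helper below is a hypothetical convenience and is not part of MOFUN-CCC. The temporary folder only stands in for a real model directory.

```python
from pathlib import Path
import tempfile

REQUIRED_FILES = ("Model_str.pkl", "Model_dict.pt", "TrainShell.pkl")


def missing_model_files(model_path):
    """Return the required model files that are absent from model_path."""
    folder = Path(model_path)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


# Demonstration with a temporary folder that lacks one of the files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Model_str.pkl").touch()
    (Path(tmp) / "Model_dict.pt").touch()
    missing = missing_model_files(tmp)

print(missing)  # ['TrainShell.pkl']
```

An empty list means the folder contains at least the files that the prediction script expects.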

Roadmap

Add detailed parameters
Add Shiny app address
Add warnings (large files, linear regression time)
Add data shortcut to show actual data
Add data format (Beta and log2 TPM)
Introduce the two models
Add paper links


Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

Fork the Project
Create your Feature Branch (git checkout -b feature/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feature/AmazingFeature)
Open a Pull Request


License

Distributed under the MIT License. See LICENSE for more information.


Contact

Molin Yue - website - moy6@pitt.edu

Project Link: https://github.com/yuemolin/MOFUN-CCC
