Uncovering the Genetic Basis of Glioblastoma Heterogeneity through Multimodal Analysis of Whole Slide Images and RNA Sequencing Data
Ahmad Berjaoui, Eduardo Hugo Sanchez, Louis Roussel, Elizabeth Moyal Cohen-Jonathan
This repository contains the code and scripts to download and preprocess the data mentioned in our paper. The latest version of the paper can be found at this address.
Data is available here. The data cannot be used for commercial purposes. Please cite the paper if you use the data. The data is organized as follows:
- WSI: Whole Slide Images used for training the AI models.
- RNA-seq: The RNA-seq encoder was pre-trained using data from The Cancer Genome Atlas (TCGA), which needs to be downloaded to `gbdata/rna/tcga`. Oncopole RNA-seq raw data is in `gbdata/rna/oncopole_rna_raw.tsv`. Normalized Oncopole RNA-seq data needs to be downloaded to `gbdata/rna/oncopole`. All RNA-seq data has already been normalized using the `DESeq2` package.
- Gene clustering: Gene clustering uses the protein-protein interaction (PPI) graph from the Precision Medicine Knowledge Graph. Only genes that are present in both the PPI knowledge graph and the RNA-seq data have been kept. The filtered PPI graph CSV file can be found in `gbdata/gene_clustering`. The cluster-to-genes mapping is in `gbdata/gene_clustering/ppi_cluster_to_ensembl_genes.json`. The paper compares this clustering against a plain chromosome-to-gene mapping, which can be found in `gbdata/gene_clustering/chromosome_to_ensembl_genes.json`.
- Survival: Patient survival data, overall survival (OS) and progression-free survival (PFS), is in `gbdata/survival/patients.csv`. Mouse survival data (lineage-dependent) is in `gbdata/survival/lineage.csv`.
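Before running any script, it can help to confirm that the data layout is in place. The following is a minimal sketch and not part of the repository: the helper name `missing_data_paths` is hypothetical, and the path list only mirrors the locations named in this README.

```python
from pathlib import Path

# Locations named in this README, relative to the repository root.
EXPECTED_PATHS = (
    "gbdata/wsi",
    "gbdata/rna/tcga",
    "gbdata/rna/oncopole_rna_raw.tsv",
    "gbdata/rna/oncopole",
    "gbdata/gene_clustering/ppi_cluster_to_ensembl_genes.json",
    "gbdata/gene_clustering/chromosome_to_ensembl_genes.json",
    "gbdata/survival/patients.csv",
    "gbdata/survival/lineage.csv",
)


def missing_data_paths(repo_root: str = ".") -> list[str]:
    """Return the expected data paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_data_paths()
    if missing:
        print("Missing data paths:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All expected data paths are present.")
```

Run from the repository root, the script lists whatever still needs to be downloaded or generated.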
- Clone the repository: `git clone https://github.com/ma3oun/gbheterogeneity.git`
- Install the dependencies using `uv sync`.
- Download the data and place `tcga.tar.gz` in `gbdata/rna` and `wsi.tar.gz` in `gbdata/wsi`. Extract the archives in their respective folders using `tar -xvf tcga.tar.gz` and `tar -xvf wsi.tar.gz`.
- You can also download trained models to avoid retraining them from scratch. Place them in the `trained_models` folder.
- Set PYTHONPATH to the root of the repository: `export PYTHONPATH=$(pwd)`
- Download the PPI graph from the Precision Medicine Knowledge Graph and save it as `gbdata/gene_clustering/kg.csv`.
- Run `uv run gene_clustering/ppi_filtering.py` to filter the PPI graph, keeping only genes that are present in both the PPI graph and the RNA-seq data. This creates the `gbdata/gene_clustering/ppi_filtered.csv` file.
- Run `uv run gene_clustering/ppi_clustering.py` to cluster the genes using the PPI graph and the Louvain algorithm. This creates the `gbdata/gene_clustering/ppi_cluster_to_ensembl_genes.json` file.
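The filtering and clustering steps can be illustrated on toy data. This is a rough sketch under stated assumptions, not the repository's implementation: the function names are hypothetical, the gene IDs are made up, edges are passed as plain tuples rather than read from `kg.csv` (whose column layout is not described here), and the output shape is only inferred from the `ppi_cluster_to_ensembl_genes.json` file name. It relies on `networkx` (2.8 or later) for the Louvain algorithm, which the repository may or may not use.

```python
import json

import networkx as nx


def filter_ppi_edges(edges, rna_genes):
    """Keep only edges whose two endpoints both appear in the RNA-seq gene set."""
    rna_genes = set(rna_genes)
    return [(a, b) for a, b in edges if a in rna_genes and b in rna_genes]


def louvain_gene_clusters(edges, seed=0):
    """Cluster genes with Louvain and return {cluster_id: sorted gene list}."""
    graph = nx.Graph()
    graph.add_edges_from(edges)
    communities = nx.community.louvain_communities(graph, seed=seed)
    # Order communities by their smallest gene ID so cluster numbering is stable.
    ordered = sorted((sorted(c) for c in communities), key=lambda genes: genes[0])
    return {str(i): genes for i, genes in enumerate(ordered)}


if __name__ == "__main__":
    # Toy PPI edges with made-up gene IDs; "G7" is absent from the RNA-seq genes.
    ppi_edges = [
        ("G1", "G2"), ("G2", "G3"), ("G1", "G3"),
        ("G3", "G7"), ("G7", "G4"),
        ("G4", "G5"), ("G5", "G6"), ("G4", "G6"),
    ]
    rna_genes = ["G1", "G2", "G3", "G4", "G5", "G6"]
    filtered = filter_ppi_edges(ppi_edges, rna_genes)
    print(json.dumps(louvain_gene_clusters(filtered), indent=2))
```

Dropping `G7` removes the only path between the two triangles, so Louvain places each triangle in its own cluster.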
A pretrained RNA encoder is available in `trained_models/rna_encoder.bin`. To train the RNA encoder from scratch, run `uv run scripts/pretrain_rna.py`. This creates the `trained_models/rna_encoder.bin` file. Note that training uses PPI clustering by default. To use chromosome clustering, modify the script to point to `gbdata/gene_clustering/chromosome_cluster_to_ensembl_genes.json` instead.
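When switching between the two mappings, a quick look at cluster sizes can confirm that the intended file is being used. The sketch below is illustrative only: the helper names are hypothetical, and it assumes each mapping file is a JSON object whose keys are cluster identifiers and whose values are lists of Ensembl gene IDs, which is inferred from the file names rather than documented here.

```python
import json
from pathlib import Path


def load_cluster_mapping(path):
    """Load a cluster-to-genes JSON file as {cluster_id: [gene_id, ...]}."""
    with Path(path).open(encoding="utf-8") as handle:
        mapping = json.load(handle)
    # Assumed schema: a JSON object whose values are lists of gene IDs.
    if not isinstance(mapping, dict) or not all(
        isinstance(genes, list) for genes in mapping.values()
    ):
        raise ValueError(f"{path} does not look like a cluster-to-genes mapping")
    return mapping


def cluster_sizes(mapping):
    """Return the number of genes in each cluster."""
    return {cluster_id: len(genes) for cluster_id, genes in mapping.items()}


if __name__ == "__main__":
    mapping = load_cluster_mapping(
        "gbdata/gene_clustering/ppi_cluster_to_ensembl_genes.json"
    )
    sizes = cluster_sizes(mapping)
    print(f"{len(sizes)} clusters, {sum(sizes.values())} genes in total")
```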
A pretrained WSI encoder is available in `trained_models/wsi_encoder.bin`. To train the WSI encoder from scratch, run `uv run scripts/pretrain_wsi.py`. This creates the `trained_models/wsi_encoder.bin` file.
Multimodal training uses distributed data parallelism to train the model on multiple GPUs. To train the model, first activate the environment with `source .venv/bin/activate`, then run `torchrun --standalone --nnodes=1 --nproc_per_node=<number_of_gpus> scripts/train_multimodal_full.py`. To use a specific GPU, set the `CUDA_VISIBLE_DEVICES` environment variable before running the command. For example, to use GPU 0, run `CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes=1 --nproc_per_node=1 scripts/train_multimodal_best.py`.
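For context, `torchrun` starts one process per GPU and passes each process its position through the `RANK`, `LOCAL_RANK` and `WORLD_SIZE` environment variables. The sketch below shows how a script can read them; the `DistributedContext` class and `read_torchrun_env` function are hypothetical names, and how the training scripts in this repository actually handle these variables is not shown here.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistributedContext:
    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on its node, usually the GPU index
    world_size: int  # total number of processes


def read_torchrun_env(env=None):
    """Read the variables set by torchrun, defaulting to a single process."""
    env = os.environ if env is None else env
    return DistributedContext(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


if __name__ == "__main__":
    ctx = read_torchrun_env()
    print(f"process {ctx.rank} of {ctx.world_size}, local GPU index {ctx.local_rank}")
```

With `--nproc_per_node=1`, every process sees a world size of 1, which is why the single-GPU command above still goes through `torchrun`.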
Multimodal evaluation for RNA retrieval can be run with `uv run scripts/evaluate_retrieval.py`. Use MLflow to track the results.
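The metric computed by the evaluation script is not stated here. As an illustration of how cross-modal retrieval is commonly scored, the sketch below computes recall@k from a similarity matrix, assuming that row `i` holds the scores of query `i` against every candidate and that candidate `i` is its true match. The function name and the toy scores are hypothetical.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching item (same index) ranks in the top k."""
    hits = 0
    for query_index, scores in enumerate(similarity):
        # Candidate indices ordered from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_index in ranked[:k]:
            hits += 1
    return hits / len(similarity)


if __name__ == "__main__":
    # Toy scores: rows are WSI queries, columns are RNA candidates.
    toy_similarity = [
        [0.9, 0.1, 0.0],
        [0.2, 0.3, 0.8],
        [0.1, 0.7, 0.6],
    ]
    for k in (1, 2):
        print(f"recall@{k} = {recall_at_k(toy_similarity, k):.2f}")
```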
Grad-CAM visualization can be run with `uv run scripts/compute_rna_gradcams.py` and `uv run scripts/compute_wsi_gradcams.py`. Use MLflow to view the results.
The `scripts/variants` directory contains scripts to train and evaluate variants of the model, including the model with frozen encoders (RNA and WSI) and the model using chromosome-based clustering for the RNA encoder.
The results shown here are in part based upon data generated by the TCGA Research Network. The authors would like to thank Caroline Delmas and Antony Lemarié (INSERM UMR 1037 Cancer Research Center of Toulouse (CRCT), University Paul Sabatier Toulouse III) for WSI and RNA-seq data acquisition.
If you use this code or data as part of your workflow in a scientific publication, please consider citing the 🗞️ official paper:
@article{berjaoui2024uncovering,
title={Uncovering the Genetic Basis of Glioblastoma Heterogeneity through Multimodal Analysis of Whole Slide Images and RNA Sequencing Data},
author={Berjaoui, Ahmad and Roussel, Louis and Sanchez, Eduardo Hugo and Moyal, Elizabeth Cohen-Jonathan},
journal={arXiv preprint arXiv:2410.18710},
year={2024}
}
This code is licensed under the GPL v3.0 license.