[ChrisCell] Illuminating cell states by a comprehensive and interpretable single cell foundation model
ChrisCell is a comprehensive and interpretable single-cell foundation model. ChrisCell innovatively integrates a Single-Cell Discretization (SCD) module into the single-cell foundation model. This module utilizes a unified cell codebook to transform the cell representation into a cell code and uses an SCD cell embedding derived from the cell codebook to represent the cell corresponding to the code. ChrisCell employs an encoder-SCD-decoder architecture, encompassing 511 million parameters and pretrained on over 68 million single-cell data points.
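The discretization step described above can be illustrated with a small sketch. It assumes the SCD module behaves like a standard vector-quantization lookup, where each cell representation is assigned the index of its nearest codebook vector (the cell code) and that vector serves as the SCD cell embedding. The function name `quantize`, the Euclidean distance, and the toy codebook are assumptions made for illustration; they are not taken from the ChrisCell code base.

```python
import numpy as np


def quantize(cell_repr: np.ndarray, codebook: np.ndarray):
    """Assign each cell representation to its nearest codebook entry.

    cell_repr: (n_cells, dim) continuous encoder outputs.
    codebook:  (n_codes, dim) shared cell codebook.
    Returns the integer cell codes and the codebook vectors they select.
    """
    # Squared Euclidean distance between every cell and every code
    # (an assumed metric; the real module may use a different one).
    diff = cell_repr[:, None, :] - codebook[None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    codes = dist.argmin(axis=1)      # one discrete code per cell
    embeddings = codebook[codes]     # embedding looked up from the codebook
    return codes, embeddings


# Toy 2-D codebook with four entries (hypothetical sizes, not ChrisCell's).
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
cells = np.array([[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]])

codes, emb = quantize(cells, codebook)
print(codes)  # [1 2 0]
```

Each cell is reduced to a single integer, which is what makes per-code statistics straightforward to compute afterwards.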
Advances in high-throughput single-cell sequencing techniques have enabled AI-driven methods to harness extensive datasets, resulting in the emergence of robust single-cell foundation models that demonstrate exceptional representation capabilities across various downstream applications. However, current evidence suggests that the practical utility and generalizability of these foundation models are significantly constrained by the sparsity and heterogeneity of real-world data, as well as their limited interpretability. To truly enhance single-cell research and facilitate drug discovery, a foundation model must exhibit improved generalizability across diverse real-world data sources and provide clear interpretations for novel biological insights.
- Advanced Cell Representation and Generalizability: ChrisCell can be applied to a wide range of single-cell tasks, grouped into prediction and analysis tasks. On prediction tasks, which include cell clustering, annotation, property prediction, gene and drug perturbation prediction, and drug response prediction, ChrisCell demonstrates significant advancements over other foundation models.
- Effective Quantization: ChrisCell effectively represents high-dimensional and multi-modal single-cell data using a single token, minimizing information loss.
- Interpretability: The VQ and ChrisCell-Graph modules provide quantized statistics that help interpret the significance of each gene or property in relation to the cell state, which greatly empowers the analysis tasks. By integrating ChrisCell and ChrisCell-Graph, the models can be used in four distinct tasks: cell state discovery, gene discovery, gene regulatory network (GRN) analysis, and multimodal analysis.
For more details on the performance and benchmarking, please refer to our paper.
To get started with ChrisCell, follow these steps:
- Clone the repository:

      git clone https://github.com/A4Bio/ChrisCell.git
      cd ChrisCell

- Create and activate the conda environment:

      conda create -n chriscell python=3.9.17
      conda activate chriscell
      ./install.sh

- Download the pretrained model: We provide a pretrained model for ChrisCell. Download it here and place it in the pretrained_models directory. The test data is also available: download it here.

- Download the pretraining dataset: To download the pretraining dataset, refer to the link.
To quickly try out ChrisCell using an example dataset, run the following command:
bash run_example.sh
This script runs the `inference.py` script with sample data provided in the `examples` folder.
We also provide an example tutorial in `quick_start.ipynb`.
The `inference.py` script supports several command-line arguments:
| Argument | Description | Default |
|---|---|---|
| `--data_path` | Path to the dataset. | None |
| `--model_path` | Path to the pretrained model checkpoint. | `pretrained_model/checkpoint.pt` |
| `--save_path` | Path to save the output of ChrisCell. | `example` |
| `--device` | Device to run the model on (`cpu` or `cuda`). | `cuda` |
| `--verbose` | Enable verbose output for debugging. | Disabled |
| `--mode` | Model mode. | `m1` |
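For readers who want to see how these options fit together, the following is a minimal sketch of an `argparse` parser that mirrors the table. It is an illustration only: the actual parser inside `inference.py` may be written differently, and treating `--verbose` as a boolean flag and restricting `--device` to `cpu` or `cuda` are assumptions drawn from the table, not from the source code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(
        description="ChrisCell inference options (illustrative sketch)."
    )
    parser.add_argument("--data_path", type=str, default=None,
                        help="Path to the dataset.")
    parser.add_argument("--model_path", type=str,
                        default="pretrained_model/checkpoint.pt",
                        help="Path to the pretrained model checkpoint.")
    parser.add_argument("--save_path", type=str, default="example",
                        help="Path to save the output of ChrisCell.")
    # Assumption: only the two devices named in the table are accepted.
    parser.add_argument("--device", type=str, default="cuda",
                        choices=["cpu", "cuda"],
                        help="Device to run the model on.")
    # Assumption: a default of "Disabled" means a store_true flag.
    parser.add_argument("--verbose", action="store_true",
                        help="Enable verbose output for debugging.")
    parser.add_argument("--mode", type=str, default="m1",
                        help="Model mode.")
    return parser


# Parse an explicit argument list; unspecified options keep their defaults.
args = build_parser().parse_args(
    ["--data_path", "/path/to/your/dataset", "--device", "cpu"]
)
print(args.device, args.mode, args.verbose)  # cpu m1 False
```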
You can run the example directly from the command line:
python inference.py
To use ChrisCell with your own data, provide an scRNA-seq or scATAC-seq dataset. For example:
python inference.py --data_path /path/to/your/dataset --save_path /path/to/your/save/path --mode m1 --device cuda
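When several datasets need to be processed, the same command can be assembled programmatically. The sketch below builds one `inference.py` invocation per dataset using only the flags documented above. The helper names `inference_command` and `run_all`, the placeholder dataset names, and the per-dataset output folder layout are hypothetical. With `dry_run=True` nothing is executed, so the commands can be inspected first; actual runs should be started from the repository root.

```python
import subprocess
import sys
from pathlib import Path


def inference_command(data_path, save_path, mode="m1", device="cuda"):
    """Build the argument list for a single inference.py run."""
    return [
        sys.executable, "inference.py",
        "--data_path", str(data_path),
        "--save_path", str(save_path),
        "--mode", mode,
        "--device", device,
    ]


def run_all(data_paths, save_root, dry_run=True, **kwargs):
    """Create (and optionally execute) one command per dataset."""
    commands = []
    for data_path in data_paths:
        # Hypothetical layout: one output folder per dataset name.
        save_path = Path(save_root) / Path(data_path).stem
        cmd = inference_command(data_path, save_path, **kwargs)
        commands.append(cmd)
        if not dry_run:
            # check=True raises CalledProcessError if a run fails.
            subprocess.run(cmd, check=True)
    return commands


# Placeholder paths; replace them with real dataset locations.
cmds = run_all(
    ["/path/to/dataset_a", "/path/to/dataset_b"],
    "/path/to/results",
    device="cpu",
)
for cmd in cmds:
    print(" ".join(cmd))
```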
For a complete description of the method, see:
TBD
Please submit any bug reports, feature requests, or general usage feedback as a GitHub issue or discussion.
- Jue Wang (wangjue@westlake.edu.cn)
- Cheng Tan (tancheng@westlake.edu.cn)
- Zhangyang Gao (gaozhangyang@westlake.edu.cn)
This project is licensed under the MIT License.