Transformers In Genomics Papers

A curated repository of academic papers showcasing the use of Transformer models in genomics. This repository aims to guide researchers, data scientists, and enthusiasts in finding relevant literature and understanding the applications of Transformers in various genomic contexts.

Summary Statistics

| Data Type | Original Papers | Benchmarking Papers | Review/Perspective Papers |
| --- | --- | --- | --- |
| Single-Cell Genomics (SCG) | 19 | 4 | 1 |
| DNA | 18 | 1 | 2 |
| Spatial Transcriptomics (ST) | 4 | 0 | 0 |
| Hybrid of SCG, DNA, and ST | 50 | 0 | 0 |

Table of Contents

  1. Single-Cell Genomics (SCG) Models

  2. DNA Models

  3. Spatial Transcriptomics (ST) Models

  4. Hybrids of SCG, DNA, and ST Models

Legend

  • 💡: Pretrained Model
  • 🔍: Peer-reviewed

Single-Cell Genomics (SCG) Models

Papers that utilize Transformer models to analyze single-cell genomic data.
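
Many of the models below start from the same preprocessing idea: a cell becomes a token sequence by ranking its genes from highest to lowest expression (tGPT and iSEEEK, for instance, build directly on such gene rankings). A minimal sketch in Python, with toy gene names and a hypothetical vocabulary:

```python
import numpy as np

def cell_to_tokens(expression, gene_names, vocab, max_len=2048):
    """Rank-value encoding: order genes by descending expression and map
    each expressed gene symbol to an integer id. Gene names and vocabulary
    here are toy stand-ins, not any model's actual vocabulary."""
    order = np.argsort(expression)[::-1]               # highest-expressed first
    ranked = [gene_names[i] for i in order if expression[i] > 0]
    return [vocab[g] for g in ranked[:max_len] if g in vocab]

genes = ["CD3D", "MS4A1", "LYZ", "NKG7"]
expr = np.array([5.0, 0.0, 12.0, 3.0])
vocab = {g: i for i, g in enumerate(genes)}
print(cell_to_tokens(expr, genes, vocab))              # [2, 0, 3] -> LYZ, CD3D, NKG7
```

One appeal of rank encoding, often cited by these papers, is that it is invariant to monotone rescaling of counts, so much of read-depth normalization drops out.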

Original Papers

🧠 Model 📄 Paper 💻 Code 🛠️ Architecture 🌟 Highlights/Main Focus 🧬 No. of Cells 📊 No. of Datasets 🎯 Loss Function(s) 📝 Downstream Tasks/Evaluations
scFoundation 💡🔍 Large-scale foundation model on single-cell transcriptomics. Minsheng Hao et al. Nature Methods (2024) GitHub Repository Transformer encoder, Performer decoder Foundation model for single-cell analysis, built on the xTrimoGene architecture with read-depth-aware (RDA) pretraining across 50 million profiles 50M 7 Mean square error loss Cell clustering; Cell type annotation; Perturbation prediction; Drug response prediction
scGREAT 🔍 scGREAT: Transformer-based deep-language model for gene regulatory network inference from single-cell transcriptomics. Yuchen Wang et al. iScience (2024) GitHub Repository Transformer Infers gene regulatory networks (GRNs) from single-cell transcriptomics data and textual information about genes using a transformer-based model 4K 7 Cross entropy loss Gene regulatory network inference
tGPT 💡🔍 Generative pretraining from large-scale transcriptomes for single-cell deciphering. Hongru Shen et al. iScience (2023) GitHub Repository Transformer Generative pretraining on 22.3 million single-cell transcriptomes yields representations that align with established cell labels and states, suitable for both single-cell and bulk analysis. 22.3M 4 Cross entropy loss Single-cell clustering; Inference of developmental lineage; Feature representation analysis of bulk tissues
TOSICA 🔍 Transformer for one stop interpretable cell type annotation. Jiawei Chen et al. Nature Communications (2023) GitHub Repository Transformer An efficient cell type annotator trained on scRNA-seq data that shows high accuracy across diverse datasets and enables new cell type discovery. 536K 6 Cross entropy loss Cell type annotation; Data integration; Cell differentiation trajectory inference
STGRNS 🔍 STGRNS: an interpretable transformer-based method for inferring gene regulatory networks from single-cell transcriptomic data. Jing Xu et al. Bioinformatics (2023) GitHub Repository Transformer Enhances gene regulatory network inference from single-cell transcriptomic data using a proposed gene expression motif technique, applicable across various scRNA-seq data types. 154K+ 48 Cross entropy loss Gene regulatory network inference
scBERT 💡🔍 scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Fan Yang et al. Nature Machine Intelligence (2022) GitHub Repository Transformer (BERT-based model) A BERT-based model pre-trained on large amounts of unlabeled scRNA-seq data for cell type annotation, demonstrating superior performance. 1M 10 Cross entropy loss Cell type annotation; Novel cell type prediction
CIForm 🔍 CIForm as a Transformer-based model for cell-type annotation of large-scale single-cell RNA-seq data. Jing Xu et al. Briefings in Bioinformatics (2023) GitHub Repository Transformer Developed for cell-type annotation of large-scale single-cell RNA-seq data, aiming to overcome batch effects and efficiently process large datasets. 12M 16 Cross entropy loss Cell type annotation
TransCluster 🔍 TransCluster: A Cell-Type Identification Method for single-cell RNA-Seq data using deep learning based on transformer. Tao Song et al. Frontiers in Genetics (2022) GitHub Repository Transformer Combines linear discriminant analysis and a modified Transformer to enhance cell-type identification accuracy and robustness across various human tissue datasets. 51K 2 Cross entropy loss Cell type annotation
iSEEEK 💡🔍 A universal approach for integrating super large-scale single-cell transcriptomes by exploring gene rankings. Hongru Shen et al. Briefings in Bioinformatics (2022) GitHub Repository Transformer Introduces iSEEEK, an approach for integrating super large-scale single-cell RNA sequencing data by exploring rankings of top-expressing genes. 11.9M 60 Cross entropy loss Cell cluster delineation; Marker gene identification; Cell developmental trajectory exploration; Cluster-specific gene-gene interaction module exploration
Exceiver 💡 A single-cell gene expression language model. Connell et al. arXiv (2022) GitHub Repository Transformer Introduced discrete noise masking for self-supervised learning on unlabeled datasets and developed a scRNA-seq framework to enhance downstream tasks in gene regulation and phenotype prediction. 500K 1 Cross entropy loss + Mean square error Drug response prediction
xTrimoGene 💡🔍 xTrimoGene: An Efficient and Scalable Representation Learner for Single-Cell RNA-Seq Data. Jing Gong et al. Conference on Neural Information Processing Systems (NeurIPS) (2023) Unpublished Asymmetric encoder-decoder transformer Introduced a transformer variant for scRNA-seq data that significantly reduces computational and memory usage while preserving accuracy, with tailored pre-trained models for single-cell data. 5M - Mean square error Cell type annotation; Perturbation response prediction; Synergistic drug combination prediction
CellLM 💡 Large-Scale Cell Representation Learning via Divide-and-Conquer Contrastive Learning. Suyuan Zhao et al. arXiv (2023) GitHub Repository Performer Presented a divide-and-conquer contrastive learning strategy that decouples batch size from GPU memory constraints in cell representation learning. 2M 2 Masked language modeling with cross-entropy loss, cell type discrimination with binary cross-entropy loss, and divide-and-conquer contrastive loss Cell type annotation; Drug sensitivity prediction
CellFM 💡 CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells. Yuansong Zeng et al. bioRxiv (2024) GitHub Repository Transformer An 800-million-parameter single-cell model trained on ~100 million human cells, outperforming existing models in applications like cell annotation and gene function prediction. 100M 20 Mean square error loss Cell type annotation; Perturbation prediction; Gene function prediction
scTransSort 💡🔍 scTransSort: Transformers for Intelligent Annotation of Cell Types by Gene Embeddings. Linfang Jiao et al. Biomolecules (2023) GitHub Repository Transformer Cell-type annotation using transformers, pre-trained on single-cell transcriptomics data. 185K 47 Sparse categorical cross entropy Cell type annotation
scFormer scFormer: A Universal Representation Learning Approach for Single-Cell Data Using Transformers. Haotian Cui et al. bioRxiv (2022) GitHub Repository Transformer Transformer-based deep learning framework employing self-attention to jointly optimize unsupervised cell and gene embeddings. 27K 3 Cross entropy loss Integration; Perturbation prediction
scTT 🔍 Representation Learning and Translation between the Mouse and Human Brain using a Deep Transformer Architecture. Minxing Pang & Jesper Tegnér. International Conference on Machine Learning (ICML) Workshop on Computational Biology (2020) Unpublished Transformer Transformer-based architecture that translates single-cell genomic data between mouse and human, with enhanced clustering accuracy. 170K 2 Mean square error Clustering; Alignment
scPRINT 💡 scPRINT: pre-training on 50 million cells allows robust gene network predictions. Jérémie Kalfon et al. bioRxiv (2024) GitHub Repository Transformer A large transformer-based cell model pre-trained on over 50 million cells, designed to infer gene networks and uncover complex cellular biology. 50M+ 800+ A combination of negative log-likelihood loss and contrastive loss Gene network inference
ScRAT 🔍 Phenotype prediction from single-cell RNA-seq data using attention-based neural networks. Yuzhen Mao et al. Bioinformatics (2024) GitHub Repository Multi-head attention mechanism Predicts phenotypes without requiring cell type annotations; uses sample mixup for data augmentation; identifies critical cell types driving phenotypes. 10K per pseudo-sample 3 Cross entropy loss Phenotype prediction; Identification of disease-critical cell types
scPlantFormer 💡🔍 scPlantFormer: A Lightweight Foundation Model for Plant Single-Cell Omics Analysis. Xiujun Zhang et al. Preprint (2024) GitHub Repository Transformer (CellMAE pretraining) Pretrained on 1M Arabidopsis thaliana scRNA-seq profiles; integrates plant datasets, enhances cross-species cell annotation, and resolves batch effects. 1M 23 Mean square error loss Cell type annotation; Cross-dataset integration; Cross-species analysis; Large-scale atlas construction
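
Several of the foundation models above (scFoundation, xTrimoGene, CellFM) pretrain by regressing masked expression values with a mean-square-error objective rather than classifying discrete tokens. A minimal sketch of that objective, assuming any encoder that maps a corrupted profile back to per-gene predictions; the masking rate and zero-fill corruption are illustrative, not any one paper's exact recipe:

```python
import torch

def masked_expression_loss(model, expr, mask_frac=0.15):
    """Hide a random subset of expression values, zero-fill them, and
    regress the originals. `model` is any profile-to-profile encoder;
    the 15% rate and zero-fill scheme are illustrative choices."""
    mask = torch.rand_like(expr) < mask_frac       # genes to hide, per cell
    corrupted = expr.masked_fill(mask, 0.0)
    pred = model(corrupted)
    return torch.nn.functional.mse_loss(pred[mask], expr[mask])

expr = torch.rand(8, 2000)                         # 8 cells x 2000 genes
stand_in_model = torch.nn.Linear(2000, 2000)       # placeholder for a transformer
masked_expression_loss(stand_in_model, expr).backward()
```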

Benchmarking Papers

📄 Paper 💻 Code 🧠 Benchmarking Models 🌟 Main Focus 📝 Results & Insights
Evaluating the Utilities of Foundation Models in Single-cell Data Analysis. Tianyu Liu et al. bioRxiv (2024) GitHub Repository scGPT, scFoundation, tGPT, GeneCompass, SCimilarity, UCE, and CellPLM This paper evaluates the performance of foundation models (FMs) in single-cell sequencing data analysis, comparing them to task-specific methods across eight downstream tasks and proposing a systematic evaluation framework (scEval) for training and fine-tuning single-cell FMs. The study highlights that while single-cell FMs may not always outperform task-specific methods, they show promise in cross-species/cross-modality transfer learning and possess unique emergent abilities. Open-source single-cell FMs generally outperform closed-source ones due to their accessibility and the community feedback they receive; pre-training significantly enhances model performance in tasks like Cell-type Annotation and Gene Function Prediction. However, the study also found limitations in the stability and performance of single-cell FMs across certain tasks, suggesting the need for more nuanced training and fine-tuning processes, and indicating substantial room for improvement in their development.
Foundation Models Meet Imbalanced Single-Cell Data When Learning Cell Type Annotations. Abdel Rahman Alsabbagh et al. bioRxiv (2023) GitHub Repository scGPT, scBERT, and Geneformer The paper focuses on evaluating the performance of three single-cell foundation models (scGPT, scBERT, and Geneformer) when trained on datasets with imbalanced cell-type distributions. It explores how these models handle skewed data distributions, particularly in the context of cell-type annotation. scGPT and scBERT perform comparably well in cell-type annotation tasks, while Geneformer lags, presumably due to its unique gene tokenization method, with all models benefiting from random oversampling to address data imbalances. Additionally, scGPT offers the fastest computational speed using FlashAttention, whereas scBERT is the most memory-efficient, highlighting trade-offs between speed and memory usage in these foundation models. The paper suggests that future directions should explore enhanced data representation strategies and algorithmic innovations, including tokenization and sampling techniques, to further mitigate imbalanced learning challenges in single-cell foundation models, aiming to improve their robustness across diverse biological datasets.
Reusability report: Learning the transcriptional grammar in single-cell RNA-sequencing data using transformers. Sumeer Ahmad Khan et al. Nature Machine Intelligence (2023) GitHub Repository scBERT This paper focuses on evaluating the reusability and generalizability of the scBERT method, originally designed for cell-type annotation in single-cell RNA-sequencing data, beyond its initial datasets. It highlights the significant impact of cell-type distribution on scBERT's performance and introduces a subsampling technique to mitigate imbalanced data distribution, offering insights for optimizing transformer models in single-cell genomics. While scBERT can reproduce the main results in cell-type annotation, its performance is significantly affected by the distribution of cells per cell type, particularly struggling with novel cell types in imbalanced datasets. Addressing this distributional sensitivity is crucial, suggesting future work should focus on developing methods to handle class imbalance and leveraging domain knowledge to enhance transformer models in single-cell genomics.
Assessing the limits of zero-shot foundation models in single-cell biology. Kasia Z. Kedzierska et al. bioRxiv (2023) GitHub Repository Geneformer and scGPT The main focus of this paper is to rigorously evaluate the zero-shot performance of foundation models, specifically Geneformer and scGPT, in single-cell biology to determine their efficacy in tasks like cell type clustering and batch effect correction. Geneformer and scGPT exhibit inconsistent and often underwhelming performance in zero-shot settings for single-cell biology tasks like cell type clustering and batch effect correction, often falling behind simpler methods like scVI and highly variable gene selection. Pretraining these models on larger and more diverse datasets offers limited benefits, underscoring the need for more focused research to improve the robustness and utility of foundation models in single-cell biology.

Review/Perspective Papers

📄 Paper 🌟 Highlights/Main Focus 📝 Remarks & Conclusion
Translating single-cell genomics into cell types. Jesper N. Tegnér. Nature Machine Intelligence (2023) This paper emphasizes the successful adaptation of machine translation models, particularly transformers like BERT, for the task of cell type annotation in single-cell genomics. It highlights the development of scBERT, which leverages pretraining and self-supervised learning to create robust cell embeddings that are less sensitive to batch effects and capable of detecting subtle dependencies such as rare cell types. Despite demonstrating strong performance across diverse datasets and tasks, the paper acknowledges limitations, such as the need for embedding binning and the lack of integration with underlying biological processes like gene-regulatory networks. The authors suggest future research directions, including improving the generalization of embeddings to continuous values and developing more nuanced masking strategies. The paper concludes by noting the potential for transformers to be applied to other tasks in single-cell biology and anticipates growing interest in integrating AI methods beyond computer vision into bioinformatics and single-cell genomics.

DNA Models

Papers focused on the application of Transformer models in DNA sequence analysis.
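
Two tokenization schemes recur in the table below: overlapping k-mers (DNABERT) and learned byte-pair encodings over nucleotides (DNABERT-2, GROVER). A minimal sketch of the former:

```python
def kmer_tokenize(seq, k=6):
    """Overlapping k-mer tokenization as used by DNABERT-style models:
    every window of length k becomes one token, so adjacent tokens
    share k-1 bases."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmer_tokenize("ATGGCTA"))    # ['ATGGCT', 'TGGCTA']
```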

Original Papers

🧠 Model 📄 Paper 💻 Code 🛠️ Architecture 🌟 Highlights/Main Focus 🧬 No. of Genomes 📊 No. of Datasets 🎯 Loss Function(s) 📝 Downstream Tasks/Evaluations
DNABERT DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics (2021) GitHub Repository Transformer (BERT) A pretrained BERT model adapted for DNA sequences that captures the complex regulatory code of genomes by leveraging upstream and downstream nucleotide contexts. 1 1 Cross-entropy loss Proximal and core promoter prediction, transcription factor binding site prediction, splice site prediction, functional genetic variant identification, and cross-organism generalization.
GENA-LM GENA-LM: A Family of Open-Source Foundational DNA Language Models for Long Sequences. bioRxiv (2023) GitHub Repository Transformer (BERT, BigBird) A suite of foundational DNA language models leveraging recurrent memory and sparse attention for long-range context modeling in genomic sequences. Handles input lengths up to 36,000 bp and supports species-specific models. 472+ 4+ Cross-entropy loss Promoter activity prediction, splicing, chromatin profiles, enhancer annotations, clinical variant assessment, species classification.
GROVER GROVER: DNA Language Model Learns Sequence Context in the Human Genome. Nature Machine Intelligence (2024) Zenodo Repository Transformer (BERT) A DNA language model trained on the human genome, using byte-pair encoding for balanced token representation. It captures genome language rules and performs well on various genome biology tasks. 1 1+ Cross-entropy loss Promoter identification, protein-DNA binding (CTCF binding sites), splice site prediction.
Nucleotide Transformer The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics. bioRxiv (2024) GitHub Repository Transformer (50M–2.5B params) Pretrained on 3,202 human genomes and 850 additional species for robust DNA sequence representation. Scales from 50M to 2.5B parameters for comprehensive downstream applications. 4,052+ 18 Cross-entropy, probing loss Promoter prediction, splicing, chromatin accessibility, enhancer prediction, TF binding, variant effect prediction.
Borzoi Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. bioRxiv (2023) GitHub Repository Transformer + Convolution + U-Net Predicts RNA-seq coverage from DNA sequence to interpret regulatory variants impacting transcription, splicing, and polyadenylation. Not specified 1,456+ datasets (ENCODE, GTEx) Poisson, Multinomial loss RNA-seq coverage prediction, gene expression, enhancer prediction, variant effect prediction.
msBERT-Promoter msBERT-Promoter: A Multi-Scale Ensemble Predictor Based on BERT Pre-trained Model for the Two-Stage Prediction of DNA Promoters and Their Strengths. BMC Biology (2024) GitHub Repository BERT-based Ensemble Predicts promoter sequences and their strengths using a multi-scale BERT-based ensemble with soft voting for improved accuracy. Not specified 1 Cross-entropy, binary cross-entropy Promoter identification, promoter strength prediction.
DNABERT-2 DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genomes. International Conference on Learning Representations (ICLR) (2024) GitHub Repository Transformer (BPE-based) Multi-species genome foundation model using BPE tokenization, enhancing efficiency and accuracy in genomic tasks. 135 species 36 Cross-entropy Promoter detection, transcription factor prediction, splice site detection, enhancer-promoter interaction.
BigBird Big Bird: Transformers for Longer Sequences. NeurIPS (2020) GitHub Repository Sparse Transformer Sparse attention mechanism enabling longer sequence handling with linear complexity, applied to genomics and NLP tasks. Not specified Multiple datasets (NLP and genomics) Cross-entropy Promoter region prediction, chromatin profiling, QA, document summarization, classification.
EBERT Epigenomic language models powered by Cerebras. arXiv (2021) GitHub Repository BERT-based (with epigenetic states) Incorporates epigenetic information alongside DNA sequences for better cell type-specific gene regulation modeling. Enabled by Cerebras CS-1 for efficient training. 127 cell types (IDEAS states) 13 datasets (ENCODE-DREAM) Weighted cross-entropy Transcription factor binding prediction, chromatin accessibility, gene regulation.
LOGO Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution. Nucleic Acids Research (2022) GitHub Repository Transformer + Convolution Lightweight genome language model with convolution and self-attention layers, designed for base-resolution non-coding region interpretation. Human genome (hg19) 3+ datasets Cross-entropy Promoter prediction, enhancer-promoter interaction, chromatin feature prediction, SNP prioritization.
ViBE ViBE: a hierarchical BERT model to identify eukaryotic viruses using metagenome sequencing data. Briefings in Bioinformatics (2022) GitHub Repository Hierarchical BERT Hierarchical model to classify eukaryotic viral taxa using domain-level and order-level classification with metagenomic sequencing data. 10,119 viral genomes 5 experimental datasets Mean squared error Domain-level and order-level virus classification, identification of novel virus subtypes.
INHERIT Identification of bacteriophage genome sequences with representation learning. Bioinformatics (2022) GitHub Repository DNABERT-based Transformer Combines database-based and alignment-free approaches for phage identification using a pre-trained DNABERT model. 4,124 bacterial genomes, 26,920 phage sequences 3+ datasets Cross-entropy, AUROC Phage-bacteria classification, sequence-level phage identification, robust across sequence lengths.
GenSLMs GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics. bioRxiv (2022) GitHub Repository Hierarchical Transformer + Diffusion Model Trained on 110M prokaryotic gene sequences and fine-tuned on 1.5M SARS-CoV-2 genomes for variant detection and evolutionary analysis. 1.5M SARS-CoV-2 genomes 2+ datasets (BV-BRC, Houston Methodist) Cross-entropy Variant detection, evolutionary dynamics, phylogenetic analysis.
SpliceBERT Self-supervised learning on millions of primary RNA sequences from 72 vertebrates improves sequence-based RNA splicing prediction. Briefings in Bioinformatics (2024) GitHub Repository BERT-based Transformer Pretrained on RNA sequences from 72 vertebrates for evolutionary conservation and RNA splicing predictions. 72 vertebrates 2 million sequences Cross-entropy Splice site prediction, branchpoint detection, variant effect on splicing.
SpeciesLM Species-aware DNA language models capture regulatory elements and their evolution. Genome Biology (2024) GitHub Repository DNABERT-based Transformer Trained on 806 fungal species spanning 500 million years of evolution, identifying conserved regulatory elements and their evolution in non-coding DNA sequences. 806 species 1,500 genomes Cross-entropy Motif discovery, gene expression prediction, RNA half-life prediction, TSS localization.
DNAGPT DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence Analysis Tasks. bioRxiv (2023) GitHub Repository Transformer-based GPT Trained on over 200 billion base pairs from mammalian genomes; supports multi-task DNA sequence and numerical data analysis for various downstream applications. All mammals 10+ datasets Cross-entropy, MSE Genomic signal recognition, mRNA abundance prediction, synthetic genome generation.
megaDNA Transformer Model Generated Bacteriophage Genomes are Compositionally Distinct from Natural Sequences. bioRxiv (2024) GitHub Repository MEGABYTE Transformer Generates synthetic bacteriophage genomes, showing compositional differences from natural sequences, useful for biosecurity analysis. 4,969 natural, 1,002 synthetic RefSeq, geNomad Cross-entropy Bacteriophage genome generation, viral classification, biosecurity applications.
SpeciesLM Nucleotide dependency analysis of DNA language models reveals genomic functional elements. bioRxiv (2024) GitHub Repository Transformer with species-aware tokenization Analyzes nucleotide dependencies in genomic sequences to identify regulatory elements, RNA structural contacts, and transcription factor motifs across species. 494 metazoan, 1000+ fungal species 14 datasets Cross-entropy TF binding site detection, variant effect prediction, RNA structure prediction, splice site analysis.
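
A common zero-shot use of these models, underlying the variant-effect columns above, is log-likelihood-ratio (LLR) scoring: mask the variant position and compare the model's probabilities for the reference and alternate bases. A sketch assuming a HuggingFace-style masked LM whose tokenizer emits one token per nucleotide; real models with k-mer or BPE tokenizers need extra bookkeeping to align bases to tokens:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

def llr_variant_score(model, tokenizer, seq, pos, ref, alt):
    """Mask the variant position and return log P(alt) - log P(ref).
    Assumes one token per base plus a single leading special token;
    this offset is an assumption, not a universal convention."""
    enc = tokenizer(seq, return_tensors="pt")
    idx = pos + 1                                   # skip the [CLS]-style token
    enc["input_ids"][0, idx] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(**enc).logits[0, idx]
    logp = torch.log_softmax(logits, dim=-1)
    ref_id = tokenizer.convert_tokens_to_ids(ref)
    alt_id = tokenizer.convert_tokens_to_ids(alt)
    return (logp[alt_id] - logp[ref_id]).item()     # negative suggests deleterious

# model = AutoModelForMaskedLM.from_pretrained(...)  # checkpoint depends on the chosen gLM
# tokenizer = AutoTokenizer.from_pretrained(...)
```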

Benchmarking Papers

📄 Paper 💻 Code 🧠 Benchmarking Models 🌟 Main Focus 📝 Results & Insights
BEND: Benchmarking DNA Language Models on biologically meaningful tasks. Frederikke Isa Marin et al. arXiv (2024) GitHub Repository AWD-LSTM, Dilated ResNet, Nucleotide Transformer (NT-MS, NT-V2, NT-1000G), DNABERT, DNABERT-2, GENA-LM (BERT, BigBird), HyenaDNA (large, small), GROVER, and Basset The paper introduces BEND, a benchmark designed to evaluate DNA language models (LMs) using realistic, biologically meaningful tasks on the human genome. BEND includes seven tasks that assess the models' ability to capture functional elements across various length scales. The main results of the BEND benchmark reveal that DNA language models (LMs) show promising but mixed performance across different tasks. Nucleotide Transformer (NT-MS) performed best overall, particularly in gene finding, histone modification, and CpG methylation tasks. DNABERT excelled in chromatin accessibility prediction, matching the performance of the Basset model. However, no model consistently outperformed all others, and long-range tasks like enhancer annotation remained challenging for all models. The study highlighted the need for further improvement in capturing long-range dependencies in genomic data.

Review/Perspective Papers

📄 Paper 🌟 Highlights/Main Focus 📝 Remarks & Conclusion
To Transformers and Beyond: Large Language Models for the Genome. Micaela E. Consens et al. arXiv (2024) This paper explores the revolutionary impact of Large Language Models (LLMs) on genomics, focusing on their capacity to tackle the complexities of DNA, RNA, and single-cell sequencing data. By adapting the transformer architecture, traditionally used in natural language processing, LLMs offer a novel approach to uncover genomic patterns, predict functional elements, and enhance genomic data interpretation. The review delves into transformer-hybrid models and emerging architectures beyond transformers, outlining their applications, benefits, and limitations in genomic data analysis. The goal is to bridge gaps between computational biology and machine learning in the evolving field of genomics. The paper emphasizes that while transformer-based LLMs have significantly advanced genomic modeling, challenges like scaling to larger contexts and maintaining interpretability remain. Innovations such as the Hyena layer promise to address computational inefficiencies, further pushing the boundaries of genomic data analysis. Future research should focus on improving context length, integrating multi-omic data, and refining interpretability to fully realize the potential of LLMs. Overall, the review highlights the transformative potential of these models in genomics, pointing toward an exciting future for computational biology.
Genomic Language Models: Opportunities and Challenges. Gonzalo Benegas et al. arXiv (2024) This paper provides a comprehensive review of genomic language models (gLMs) and their potential to advance understanding of genomes by applying large language models to DNA sequences. Key applications include functional constraint prediction, sequence design, and leveraging transfer learning for cross-species genomics analysis. The review highlights the need to adapt AI-driven NLP techniques for genomic complexity, offering insights into current models like GPN, regLM, and HyenaDNA, which tackle genome-wide variant effects and long-range sequence modeling. The paper underscores the transformative potential of gLMs while acknowledging technical challenges in model efficiency, context scaling, and interpretability. Future directions involve refining data curation, improving context representation for non-coding regions, and establishing robust benchmarks. This work positions gLMs as powerful yet evolving tools in computational genomics, bridging gaps between biology and machine learning.

Spatial Transcriptomics (ST) Models

Papers applying Transformer models to spatial transcriptomics data.
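
A recurring ingredient in these models is treating each cell or spot as a "spatial token" whose embedding mixes expression with an encoding of its 2-D coordinates (SpaFormer, for example, leans on positional encodings). A minimal sketch of one common choice, fixed sinusoidal encodings extended to two axes; the exact formulation varies per paper:

```python
import torch

def sinusoidal_2d_positions(xy, dim=64):
    """Fixed sinusoidal encoding of 2-D spot coordinates; dim is split
    evenly across sin/cos and the two axes (so dim must be divisible
    by 4 in this illustrative version)."""
    half = dim // 2
    freqs = 1.0 / (10000 ** (torch.arange(0, half, 2) / half))
    enc = []
    for axis in range(2):                       # x then y
        angles = xy[:, axis:axis + 1] * freqs   # (n_spots, half/2)
        enc += [torch.sin(angles), torch.cos(angles)]
    return torch.cat(enc, dim=-1)               # (n_spots, dim)

xy = torch.rand(100, 2) * 1000                  # 100 spots, pixel coordinates
pos = sinusoidal_2d_positions(xy)
# added to expression embeddings before the transformer: tokens = expr_emb + pos
```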

Original Papers

🧠 Model 📄 Paper 💻 Code 🛠️ Architecture 🌟 Highlights/Main Focus 🧬 No. of Cells 📊 No. of Datasets 🎯 Loss Function(s) 📝 Downstream Tasks/Evaluations
SpaFormer 💡 Single Cells Are Spatial Tokens: Transformers for Spatial Transcriptomic Data Denoising. Proceedings of the ACM Conference (2024) GitHub Repository Transformer (Performer) Transformer-based model leveraging positional encodings for spatial transcriptomic data denoising and imputation. Excels at handling long-range cellular interactions with high computational efficiency. 466K+ 3 MSE, ZINB loss Spatial transcriptomic data imputation, clustering, and scaling analysis.
stEnTrans 💡🔍 stEnTrans: Transformer-based deep learning for spatial transcriptomics enhancement. Shuailin Xue et al. ISBRA (2024) GitHub Repository Transformer Self-supervised model that enhances gene expression in unmeasured tissue areas, with superior accuracy and resolution. Not specified 6 Mean squared error Gene expression interpolation, spatial pattern discovery, biological pathway enrichment analysis
GRFST (stFormer) 💡 A framework for gene representation on spatial transcriptomics. Shenghao Cao et al. bioRxiv (2024) GitHub Repository Transformer with cross-attention for ligand-receptor information Integrates ligand-receptor interaction data for better spatial gene clustering, with hierarchy and membership encoding in gene networks ~580K 2 Mean squared error (MSE) Cell-type clustering, ligand-receptor interaction inference, receptor-dependent gene network analysis, in silico perturbation simulation
stBERT 💡🔍 stBERT: A Pretrained Model for Spatial Domain Identification of Spatial Transcriptomics. IEEE Access (2024) GitHub Repository BERT with graph embeddings BERT-based pretraining model using masked language modeling (MLM) for spatial domain identification in spatial transcriptomics. Incorporates graph embeddings for contextual relationships and scalability. ~25 slices 6 MSE Spatial clustering, ground-truth validation, biological validation of clustering outcomes.

Benchmarking Papers

📄 Paper 💻 Code 🧠 Benchmarking Models 🌟 Main Focus 📝 Results & Insights
None yet.

Review/Perspective Papers

📄 Paper 🌟 Highlights/Main Focus 📝 Remarks & Conclusion
None yet.

Hybrids of SCG, DNA, and ST Models

Papers that combine approaches and modalities from SCG, DNA, and ST using Transformers.
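
Several hybrid models below (GenePT's sentence variant, Cell2Sentence) bridge transcriptomes and language models by rendering a cell as plain text: gene symbols ordered by descending expression. A minimal sketch with toy values:

```python
def cell_sentence(expression, gene_names, top_k=100):
    """Build a 'cell sentence' in the spirit of Cell2Sentence/GenePT-s:
    expressed gene symbols ordered by descending expression, joined as
    plain text an LLM can consume. Gene names here are toy stand-ins."""
    ranked = sorted(zip(expression, gene_names), reverse=True)
    return " ".join(g for e, g in ranked[:top_k] if e > 0)

print(cell_sentence([5.0, 0.0, 12.0, 3.0], ["CD3D", "MS4A1", "LYZ", "NKG7"]))
# -> "LYZ CD3D NKG7"
```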

Original Papers

🧠 Model 📄 Paper 💻 Code 🔬 Omic Input Modalities 📊 Data, Cells, Tissues, Species 🔗 Tokenization/Encoding 🧩 Input Embedding 🛠️ Architecture 🎯 Outputs Trained For (Prediction/Data Integration) 🚀 Zero-Shot Tasks 🔍 Interpretation Method
GenePT 💡🔍 GenePT: A Simple But Effective Foundation Model for Genes and Cells Built From ChatGPT. Yiqun Chen and James Zou. bioRxiv (2023) GitHub Repository scRNA-seq, text 33,000 genes (NCBI summaries); ~6 datasets (aorta, pancreas, bone, lupus), human/mouse Gene text summaries with GPT-3.5; ranked expression tokens as text sentences GPT-3.5 embeddings; normalized scRNA via weighted average GenePT-w (weighted embeddings), GenePT-s (ordered sentences) Predict cell types, gene interactions, batch effect removal Cross-dataset clustering, disease-specific gene programs Attention maps, UMAP for clusters, AUC, ARI
SpaDiT 💡 SpaDiT: Diffusion Transformer for Spatial Gene Expression Imputation. Xiaoyu Li et al. Briefings in Bioinformatics (2024) GitHub Repository scRNA-seq, spatial transcriptomics 10 paired datasets (mouse, human); ~1.4k–8.5k cells/spots Shared and unique genes; Flash-attention for low-dim representations Flash-attention modules Diffusion Transformer (DiT) with conditional embeddings Predict missing spatial gene expression patterns Align scRNA and ST; robustness to sparsity UMAP, PCC, JS divergence
Nicheformer 💡🔍 Nicheformer: a foundation model for single-cell and spatial omics. Anna C. Schaar, Alejandro Tejada-Lapuerta et al. bioRxiv (2024) GitHub Repository scRNA-seq, spatial transcriptomics SpatialCorpus-110M (57M dissociated + 53.8M spatially resolved cells) Gene ranking tokens; orthologous concatenation; metadata tokens 512-dimensional transformer embeddings 12-layer transformer, 16 attention heads; cross-modal context embedding Spatial label prediction, niche annotation Spatial context transfer, composition prediction Attention weights, UMAP visualization, silhouette scores
CellWhisperer 💡🔍 Multimodal learning of transcriptomes and text enables interactive single-cell RNA-seq data exploration with natural-language chats. Moritz Schaefer et al. bioRxiv (2024) GitHub Repository scRNA-seq, bulk RNA-seq, text 1.08M transcriptomes (705k GEO, 377k CELLxGENE); Tabula Sapiens Multimodal embeddings via Geneformer and BioBERT 2048-dimensional multimodal embeddings CLIP-inspired architecture; Mistral 7B for text chat Cell-type annotation, transcriptome-based chat analysis Predict cell types, disease associations UMAP embeddings, ROC-AUC, perplexity evaluation
scChat 💡 scChat: A Large Language Model-Powered Co-Pilot for Contextualized Single-Cell RNA Sequencing Analysis. Yen-Chun Lu et al. arXiv (2024) GitHub Repository scRNA-seq, text Glioblastoma datasets; ~70k cells Gene markers annotated via GPT-4o queries + RAG GPT-4o embeddings; RAG for contextualized markers GPT-4o orchestrated, retrieval-augmented function calls Annotate cell types, predict T-cell markers Suggest experimental next steps, mechanistic hypotheses Gene-marker enrichment, literature validation
Cell2Sentence (C2S) 💡🔍 Cell2Sentence: Teaching Large Language Models the Language of Biology. Daniel Levine et al. ICML (2024) GitHub Repository scRNA-seq, text 273k immune cells, 37M multi-tissue cells Rank-ordered genes as 'cell sentences' + annotations 768-dimensional gene embeddings via GPT-2 GPT-2 fine-tuned with causal language modeling loss Predict cell types, gene perturbation insights Generate cell abstracts, align natural language & transcriptomics Attention analysis, cosine similarity
ChatNT 💡🔍 ChatNT: A Multimodal Conversational Agent for DNA, RNA and Protein Tasks. Guillaume Richard et al. bioRxiv (2024) GitHub Repository DNA, RNA, protein sequences, text 18 tasks (~605M DNA tokens); curated genomics/proteomics tasks Hybrid embedding aligns DNA vocabularies with LLaMA tokenizer DNA embeddings projected to 7B Vicuna space Perceiver encoder; Vicuna-7B decoder for generation Sequence classification, enhancer detection Predict RNA degradation rates, protein features UMAP, Pearson correlation
CD-GPT 💡🔍 CD-GPT: A Biological Foundation Model Bridging the Gap between Molecular Sequences Through Central Dogma. Xiao Zhu et al. bioRxiv (2024) GitHub Repository DNA, RNA, protein sequences, protein structure data 353M mono-sequences; 337M paired sequences (RefSeq
LucaOne 💡🔍 LucaOne: Generalized Biological Foundation Model with Unified Multi-Omics Data. Zhang et al. bioRxiv (2024) GitHub Repository DNA, RNA, protein sequences, structured data 169,861 species; nucleic acids, proteins, 3D structures (RCSB-PDB, AlphaFold2) Tokens for nucleotides, amino acids; rotary position embeddings for long sequences 2560-dim embeddings; structure-aware embedding for 3D protein data 20-layer transformer encoder with pre-layer normalization Predict taxonomy, RNA-protein interactions, protein stability Nucleotide taxonomy, ncRNA classification, influenza antigenicity Attention maps, T-SNE embeddings, F1 score, accuracy
CELLama 💡🔍 CELLama: Cross-Platform Single-Cell Data Integration Using Pretrained Language Models. Choi et al. arXiv (2024) GitHub Repository scRNA-seq, spatial transcriptomics Tabula Sapiens subsample (10%, 57k cells); COVID-19 scRNA lung (20k); pancreas (16k cells) Top-k ranked genes with enriched metadata (tissue, spatial neighbors) 384-dim pretrained sentence transformer embeddings Sentence transformer (all-MiniLM-L12-v2 base) Multi-platform data integration; zero-shot cell typing Infer niche context in ST datasets, annotate novel cell types UMAP, cosine similarity, confusion matrix, niche-aware marker analysis
CellPLM 💡🔍 CellPLM: Pre-training of Cell Language Model Beyond Single Cells. Hongzhi Wen et al. International Conference on Learning Representations (ICLR) (2024) GitHub Repository scRNA-seq, spatial transcriptomics 9M scRNA cells, 2M SRT cells; cross-species datasets Genes embedded as vectors; positional encoding for spatial SRT data Gaussian mixture latent space; gene embeddings aggregated to cells Transformer encoder with Flowformer layers Denoise gene expression, infer cell-cell relationships Spatial imputation, perturbation predictions Attention maps, UMAP, clustering metrics (ARI, NMI)
scmFormer 💡🔍 scmFormer: Transformer-Based Model for Single-Cell Multi-Omics Integration. Tang et al. arXiv (2024) GitHub Repository scRNA-seq, ATAC-seq, proteomics, spatial omics 24 datasets, 1.48M cells; human and mouse; multi-batch integration Gene/protein vectors split into uniform-length patches; positional encodings Dense layers with batch normalization Multi-head scm-attention transformer decoder Multi-omics integration, batch correction Generate protein data, integrate spatial omics Attention prioritization, UMAP, Pearson correlation, F1 score
scInterpreter 💡🔍 scInterpreter: Interpretable Deep Learning Framework for Single-Cell RNA-Seq Analysis. Li et al. Genome Biology (2024) GitHub Repository scRNA-seq, text HUMAN-10k (10k cells, 61 cell types); MOUSE-13k (13k cells, 37 types) Top-2048 genes; gene descriptions tokenized with GPT-3.5 Gene embeddings projected to 5120 dimensions Llama-13b frozen, MLP projection; class-token outputs Annotate cell types, enhance gene-cell representations Annotate novel cell types, interpret gene-cell relationships UMAP, attention confusion matrix, clustering metrics
MarsGT 💡 MarsGT: Multi-omics analysis for rare population inference using single-cell graph transformer. Xiaoying Wang et al. Nature Communications (2024) GitHub Repository scRNA-seq, scATAC-seq 550 simulated datasets, 4 human PBMC datasets; species: human, mouse Genes/peaks tokenized by quartile-based accessibility/expression 512-dim embeddings for cells, genes, peaks Heterogeneous Graph Transformer (HGT) with multi-head attention Identify rare/major populations, peak-gene networks Cross-species rare population inference, cancer applications UMAP, pathway enrichment, regulatory network analysis
scCLIP 💡🔍 scCLIP: Contrastive Learning Integrates Multi-Omics Single-Cell Data. Zhang et al. bioRxiv (2024) GitHub Repository scATAC-seq, scRNA-seq Fetal atlas (~377k cells), AD brain dataset (~10k cells) ATAC: chromosome-based patches; RNA: genes tokenized as patches Patches embedded via dense layers into shared latent space Dual transformer encoders; cross-modal contrastive learning Joint embedding of ATAC and RNA; cell type integration Atlas-level tissue integration, unseen data predictions UMAP, ARI, NMI, silhouette scores
C.Origami 💡 Cell type-specific prediction of 3D chromatin organization enables high-throughput in silico genetic screening. Nature Biotechnology (2023) GitHub Repository DNA sequence, CTCF binding, chromatin accessibility Seven Hi-C datasets (IMR-90, GM12878, H1-hESC, K562, etc.) DNA: one-hot; ATAC/CTCF: dense bigWig profiles Conv1D for DNA and feature encoding Transformer + Conv2D residual decoder Predict Hi-C contact matrices, genome folding features Predict chromatin changes, cis-/trans-regulator perturbations Saliency maps, impact scores (ISGS), attention maps
DeepMAPS 💡🔍 DeepMAPS: Deep Learning-Based Multi-Omics Data Integration for Single-Cell Profiling. bioRxiv (2024) GitHub Repository scRNA-seq, scATAC-seq, CITE-seq 10 datasets (3 scRNA, 3 CITE-seq, 4 scMulti-omics); PBMC, lung tumor Cells/genes as graph nodes; edges: gene-cell relations Two-layer GNN-based embeddings iteratively updated Heterogeneous Graph Transformer (HGT) with attention Cell clustering, GRN inference, cell communication GRN prediction across tissues Attention scores, centrality metrics, UMAP
scMVP 💡 scMVP: Single-Cell Multi-View Representation Learning with Transformer. Genome Biology (2022) GitHub Repository scRNA-seq, scATAC-seq SNARE-seq, sci-CAR, SHARE-seq; human/mouse datasets RNA counts (raw); ATAC TF-IDF transformed 128-dim RNA/ATAC embeddings combined into shared latent space Asymmetric variational autoencoder; multi-head attention Denoise RNA/ATAC; trajectory inference, CRE predictions Predict rare populations, cis-regulatory associations ARI clustering, UMAP, attention-weight visualization
AgroNT 💡🔍 A Foundational Large Language Model for Edible Plant Genomes. Javier Mendoza-Revilla et al. Communications Biology (2024) GitHub Repository DNA sequences Pretraining: ~10.5M sequences across 48 plant species; Fine-tuning: 8 tasks Non-overlapping 6-mers (6000 bp chunks, 15% masked for MLM) 1500-dimensional embeddings (token + positional embeddings) Transformer, 40 attention blocks, 1B parameters Predict polyadenylation sites, splicing, chromatin accessibility, tissue-specific expression Functional variant impacts, tissue expression variance Token importance, LLR, in silico mutagenesis
gLM2 💡 gLM2: Genomic Language Model for Multi-Task Learning in Genomics. bioRxiv (2024) GitHub Repository DNA sequences OMG dataset: 3.1T bp, 3.3B CDS, 2.8B IGS CDS: amino acids; IGS: nucleotides; strand orientation tokens 640–1280 dimensions, RoPE positional embeddings Transformer-based, SwiGLU layers, FlashAttention-2 Protein-protein interactions, regulatory annotations Binding interface prediction, motif learning Categorical Jacobian, UMAP
MarkerGeneBERT 💡🔍 MarkerGeneBERT: A Transformer-Based Model for Single-Cell Marker Gene Identification. bioRxiv (2024) GitHub Repository scRNA-seq 3702 studies; 7901 markers for humans, 8223 for mice Tokenized marker sentences; SciBERT preprocessing Sentence embeddings, SciBERT refinements Transformer-based NLP with SciBERT Extract cell markers, annotate scRNA-seq Predict novel markers, cluster annotation Attention weights, precision-recall
UTR-LM 💡🔍 A 5′ UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions. bioRxiv (2023) GitHub Repository 5′ UTRs of mRNA 214k UTRs (5 species), 280k synthetic libraries Masked nucleotide prediction 128-dimensional nucleotide embedding Six-layer transformer, 16 attention heads MRL, TE, EL, IRES prediction Luciferase fitness, unseen UTR prediction Motif analysis, UMAP
scGPT 💡🔍 scGPT: A Generative Pre-trained Transformer for Single-Cell Omics Data. bioRxiv (2023) GitHub Repository scRNA-seq 33M human cells, 441 studies, 51 tissues/organs Gene expression ranked encoding, metadata tokens 512-dimensional gene-cell embeddings 12-layer transformer, masked multi-head attention Cell type annotation, batch correction Perturbation prediction, multi-omics integration Attention weights, UMAP visualization
THItoGene THItoGene: Integrating Histological Images and Spatial Transcriptomics for Gene Expression Prediction. bioRxiv (2023) GitHub Repository Histological images HER2+ breast cancer (32 sections, 9,612 spots, 785 genes) Spots tokenized via positional encoding; 112×112 patches for histology Dynamic convolution with ViT and GAT integration Hybrid: dynamic convolution, Efficient-CapsNet, ViT, GAT Spatial gene expression patterns, tumor-related gene identification Reconstruct spatial domains, predict enrichment in unseen tissues Attention weights, ARI clustering, Pearson correlation
scTranslator 💡 scTranslator: A Transformer-Based Model for Single-Cell RNA-Seq Data Integration. bioRxiv (2023) GitHub Repository scRNA-seq Bulk datasets (31 cancer types, 18,227 samples), Single-cell datasets (161,764 PBMCs, 65,698 pan-cancer myeloid cells) Gene IDs via re-indexed GPE; RNA expression values as tokens 128-dim GPE embeddings + RNA embeddings Transformer encoder-decoder, 2 layers, FAVOR+ attention Protein abundance prediction, batch correction, pseudo-knockout analysis Predict missing proteomics, tumor/normal cell origins Attention matrices, pseudo-knockout analysis, ARI clustering
GPN-MSA 💡🔍 GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction. bioRxiv (2023) GitHub Repository DNA sequences Whole-genome MSA of 100 vertebrates (~9B variants) One-hot encoding across MSA columns; weighted token sampling Contextual embeddings from MSA 12-layer Transformer with RoFormer; weighted cross-entropy loss Variant deleteriousness scores, novel region annotation Predict deleterious variants, annotate non-coding regions UMAP, phastCons/phyloP correlation, epigenetic enrichment
FloraBERT 💡🔍 FloraBERT: cross-species transfer learning with attention-based neural networks for gene expression prediction. Research Square (2022) GitHub Repository Plant DNA sequences ~7.9M plant promoters (93 species); maize fine-tuning (25 genomes, 9 tissues) Byte Pair Encoding (5,000-token vocabulary) 768-dim token + positional embeddings RoBERTa-based Transformer, 6 encoder layers, 6 attention heads Gene expression prediction across tissues Regulatory potential in unseen species, cross-species similarity Positional importance, UMAP embedding visualization, R² metrics
Enformer 💡🔍 Effective gene expression prediction from sequence by integrating long-range interactions. Nature Methods (2021) GitHub Repository DNA sequences Human genome (34k training, 2k validation), mouse genome (29k training) One-hot nucleotide encoding, spatial positional encodings Convolutional embedding for initial sequence processing 7 convolutional layers + 11 transformer layers Gene expression, enhancer-promoter interactions, variant effects Variant prioritization, enhancer-gene annotation Attention weights, SLDP, gradient × input for impact
CpGPT 💡🔍 CpGPT: A Transformer-Based Model for Predicting DNA Methylation States. bioRxiv (2023) GitHub Repository DNA methylation 1,500+ datasets, 100,000+ samples, various tissues and species DNA sequence embeddings, methylation beta values, dual positional encodings Pretrained DNA language model embeddings; epigenetic state embeddings Transformer++ with dual positional encoding Imputation, array conversion, age prediction, mortality prediction, tissue classification Missing data imputation, array conversion, zero-shot reference mapping Attention weights for CpG site importance, UMAP for sample embeddings
Hist2ST 🔍 Hist2ST: Integrating Histology and Spatial Transcriptomics for Spatial Gene Expression Prediction. bioRxiv (2023) GitHub Repository Histology, spatial transcriptomics 8 datasets (HER2+, cSCC, Alzheimer's, mouse olfactory bulb, etc.) Image patches (Convmixer), positional encodings, graph nodes 1024-dimensional embeddings (Convmixer, Transformer, GNN) Convmixer + Transformer + Graph Neural Network (GNN) Spatial gene expression prediction, clustering, spatial region identification Cross-dataset prediction, annotation transfer Attention maps, ARI, UMAP, Pearson correlation
Precious3GPT 💡🔍 Precious3GPT: Multimodal Multi-Species Multi-Omics Multi-Tissue Transformer for Aging Research and Drug Discovery. bioRxiv (2024) Hugging Face Repository Multi-omics (gene expression, DNA methylation, proteomics) 1,500+ datasets, 100,000+ samples, various tissues and species Structured cell sentences (c-sentences) combining gene expression, metadata, and task prompts 360-dimensional embeddings capturing multi-omics context Transformer-based architecture with 89 million parameters Age prediction, target discovery, tissue classification, drug sensitivity prediction Predict biological and phenotypic responses to compound treatments Attention weights, SHAP value feature importance analysis
BioFormers 💡 BioFormers: A scalable framework for exploring biostates using transformers. Siham Amara-Belgadi et al. bioRxiv (2023) GitHub Repository scRNA-seq, multi-omics PBMC 8k, Perturb-seq datasets (~12k cells, 5k genes); multi-omics data including genomic, proteomic, transcriptomic Biomolecular tokens, value binning for expression levels Transformer-based embeddings; biomolecular and sample embeddings Encoder-only and decoder-only transformer models; self-attention mechanism Cell clustering, masked gene modeling, GRN inference, genetic perturbation prediction Zero-shot cell type discovery, cross-species transfer learning Attention maps, gene embeddings, cosine similarity, CHIP-Atlas validation
DeepLncLoc 💡🔍 DeepLncLoc: A deep learning framework for long non-coding RNA subcellular localization prediction based on subsequence embedding. Min Zeng et al. Briefings in Bioinformatics (2022) GitHub Repository lncRNA sequences RNALocate database; 857 samples, 5 subcellular localizations (cytoplasm, nucleus, ribosome, cytosol, exosome) Subsequence embedding using k-mer splitting; Word2Vec TextCNN for high-level feature extraction TextCNN with subsequence embedding and pooling layers Subcellular localization prediction for lncRNAs Standalone generalization to new species Attention visualization, feature comparisons
EPBDxDNABERT-2 💡🔍 DNA breathing integration with deep learning foundational model advances genome-wide binding prediction of human transcription factors. Anowarul Kabir et al. Nucleic Acids Research (2024) Not available Genomic DNA sequences 690 ChIP-seq experiments (161 transcription factors, 91 human cell types); HT-SELEX data (215 TFs, 27 families) Byte Pair Encoding (BPE) for genomic sequences; flanking region integration Transformer embeddings; EPBD features for DNA breathing Transformer architecture with cross-attention integration of DNABERT-2 and EPBD dynamics Predict TF-DNA binding affinity, motif discovery, and binding response to mutations Cross-species binding prediction, interpretability via cross-attention weights Cross-attention heatmaps, motif validation via JASPAR database
Evo 💡🔍 Sequence modeling and design from molecular to genome scale with Evo. Eric Nguyen et al. Science (2024) Not available Genomic DNA, RNA, and protein sequences 2.7 million prokaryotic and phage genomes (~300 billion nucleotides) Single-nucleotide byte-level tokenization StripedHyena hybrid embeddings; 7 billion parameters; 131k token context StripedHyena architecture with convolutional and attention layers Predict fitness effects of mutations, functional CRISPR-Cas systems, transposon generation Cross-species functional prediction, genome-scale design Positional entropy, structure prediction, TUD clustering
GeneBERT 💡🔍 Multi-modal self-supervised pre-training for regulatory genome across cell types. Shentong Mo et al. arXiv (2021) Not available Genomic DNA sequences, transcription factor binding matrices ATAC-seq data, 17 million sequences, 17 cell types k-mer tokenization (3-6mers); transcription factor binding matrices BERT-based embeddings for sequences, Swin transformer for regions Transformer-based model combining sequence and region representations Promoter classification, TFBS prediction, disease risk estimation, RNA splicing site prediction Cross-cell type prediction of regulatory elements Attention maps, t-SNE visualizations, ablation studies
GeneCompass 💡🔍 GeneCompass: deciphering universal gene regulatory mechanisms with a knowledge-informed cross-species foundation model. Xiaodong Yang et al. Cell Research (2024) Not available scRNA-seq, multi-omics scCompass-126M corpus with 120M+ single-cell transcriptomes from human and mouse; 101.76M cells post-filtering Ranked 2048-gene tokens; prior knowledge integration with GRN, promoter, gene families, and co-expression 12-layer transformer, 768-dimensional embeddings; species token prepending Transformer architecture with self-attention and masked language modeling Cell type annotation, GRN inference, drug response prediction, perturbation effects, cell fate predictions Cross-species cell annotation, regulatory network predictions Attention maps, cosine similarity, embedding space analysis
LangCell 💡🔍 LangCell: Language-Cell Pre-training for Cell Identity Understanding. Suyuan Zhao et al. Proceedings of the 41st International Conference on Machine Learning (2024) GitHub Repository scRNA-seq, multi-modal data 27.5M scRNA-seq samples, human cells with metadata from CELLxGENE Rank value encoding; textual descriptions generated from OBO Foundry Geneformer-based embeddings; BERT-based text encoder Multi-task transformer model with contrastive learning and cross-attention Cell type annotation, pathway identification, batch effect correction, novel disease-related tasks Zero-shot cell type annotation, cross-type cell-text retrieval UMAP visualizations, cross-attention scores, ablation studies
MOT 💡🔍 MOT: A Multi-Omics Transformer for Multiclass Classification of Tumour Types. Mazid Abiodoun Osseni et al. BIOSTEC Proceedings (2023) GitHub Repository Multi-omics (mRNA, miRNA, DNA methylation, CNVs, proteomics) TCGA Pan-Cancer dataset (33 cancer types, 5 omics, imbalanced samples) Per-omic tokenization with MAD and mutual info for feature selection Embeddings with multi-head attention for omics integration Transformer encoder-decoder without positional encoding Tumor type classification, robustness to missing omics views Cross-omics classification, interpretability of omic contributions Attention heatmaps, omics impact analysis via ablation
MuSe-GNN 💡🔍 MuSe-GNN: Learning Unified Gene Representation From Multimodal Biological Graph Data. Tianyu Liu et al. NeurIPS (2023) GitHub Repository scRNA-seq, spatial data, scATAC-seq 82 datasets across 10 tissues, 3 sequencing techniques, 3 species HVG filtering, scTransform, SPARK-X; multimodal graph co-expression Graph embeddings with TransformerConv layers; weight-sharing GNNs Cross-graph Transformer integrating contrastive and similarity learning Gene embeddings for functional similarity, pathway enrichment, GRN inference, disease analysis Cross-species functional predictions, COVID and cancer gene analyses UMAPs, causal network analysis, GOEA, IPA
Pathformer 💡🔍 Pathformer: A biological pathway-informed transformer for disease diagnosis and prognosis using multi-omics data. Xiaofan Liu et al. Bioinformatics (2024) GitHub Repository Multi-omics (RNA expression, DNA methylation, CNVs, splicing, editing) TCGA (33 cancer types), plasma cfRNA, platelet RNA datasets; 10 tissue and liquid biopsy datasets Multi-modal vector embedding at gene level, pathway sparse neural network Pathway embeddings updated via criss-cross attention Transformer with crosstalk-aware attention, sparse NN for pathway integration Cancer diagnosis, stage prediction, drug response, survival prognosis Cross-modal cancer screening, pathway-level interpretability SHAP values, attention maps, crosstalk network visualization
RhoFold+ 💡🔍 Accurate RNA 3D structure prediction using a language model-based deep learning approach. Tao Shen et al. Nature Methods (2024) Not available RNA sequences 23.7M RNA sequences, 800k species, 5,583 chains; RNA-Puzzles, CASP15 datasets RNA-specific tokenization with MSA embeddings Rhoformer transformer with IPA for geometry-aware embeddings Transformer-based architecture with secondary and tertiary structural constraints RNA 3D structure prediction, secondary structure inference, interhelical angle calculation Cross-type RNA predictions, artifact corrections, construct engineering Attention maps, IHAD (interhelical angle difference), RMSD analysis
SATURN 💡🔍 Toward universal cell embeddings: integrating single-cell RNA-seq datasets across species with SATURN. Yanay Rosen et al. Nature Methods (2024) Not available scRNA-seq, protein sequences 335,000 cells from 3 species (Tabula Sapiens, Tabula Microcebus, Tabula Muris), 97,000 frog cells, 63,000 zebrafish cells k-means clustering of protein embeddings into macrogenes Macrogene-based embeddings derived from protein language models Pretrained autoencoder with ZINB loss, fine-tuned using triplet margin loss Cross-species dataset integration, differential macrogene expression, species-specific cell type discovery Zero-shot cross-species annotation, integration of remote evolutionary datasets UMAP visualization, GO term enrichment, protein embedding analysis
scELMo 💡🔍 scELMo: Embeddings from Language Models are Good Learners for Single-cell Data Analysis. Tianyu Liu et al. bioRxiv (2024) Not available scRNA-seq, multi-omics 20 datasets across scRNA-seq, proteomics, and multi-omics data; diverse species Text embeddings from GPT-3.5 metadata summaries; weighted average and arithmetic mean cell embeddings Lightweight neural networks; contrastive learning for task-specific fine-tuning Zero-shot framework with embeddings and fine-tuning for diverse tasks Cell clustering, batch effect correction, cell-type annotation, in-silico treatment analysis Cross-dataset integration, perturbation prediction UMAP visualizations, cosine similarity, pathway enrichment (GOEA, IPA)
scLong 💡 scLong: A Billion-Parameter Foundation Model for Capturing Long-Range Gene Context in Single-Cell Transcriptomics. Ding Bai et al. bioRxiv (2024) Not available scRNA-seq, multi-omics 48M cells, 27,874 genes from 1,618 datasets, covering diverse tissues and cell types Full transcriptome self-attention; Gene Ontology integration with GCNs Dual encoder for high- and low-expression genes; contextual representations via Performer Transformer with self-attention, graph convolution for gene knowledge integration Gene regulatory network inference, transcriptional response prediction, drug synergy analysis Cross-species gene annotations, transcriptional shifts prediction Attention maps, hierarchical clustering, GO-based feature analysis
scMoFormer 💡🔍 Single-Cell Multimodal Prediction via Transformers. Wenzhuo Tang et al. CIKM (2023) GitHub Repository scRNA-seq, surface protein data NeurIPS 2021 and 2022 competition datasets (GEX2ADT, CITE-seq); CBMC dataset Graph construction with STRING database; SVD for RNA denoising Multimodal transformers and graph-based embeddings Cell, gene, and protein transformers with graph-based cross-modality aggregation Surface protein abundance prediction, multimodal integration Generalization to unseen modalities and datasets Attention maps, RMSE, MAE, Pearson correlation coefficient
SpatialDiffusion 💡 SpatialDiffusion: Predicting Spatial Transcriptomics with Denoising Diffusion Probabilistic Models. Sumeer Ahmad Khan et al. bioRxiv (2024) Not available Spatial transcriptomics MERFISH (12 slices, mouse hypothalamic preoptic region, ~73,655 spots, 161 genes); Starmap (mouse visual cortex, 984 spots, 1,020 genes); DLPFC (human dorsolateral prefrontal cortex, ~3,431 spots, 3,000 genes) Embedding and linear transformations of spatial and cell-type features Diffusion embeddings for spatial relationships; contextualized latent representations Denoising Diffusion Probabilistic Model (DDPM) with enhanced embeddings In silico slice interpolation, transcriptomic profile reconstruction Cross-slice interpolation, structure preservation across regions Spearman correlation, neighborhood enrichment, normalized MSE
TransformerST 💡🔍 Innovative super-resolution in spatial transcriptomics: a transformer model exploiting histology images and spatial gene expression. Chongyue Zhao et al. Briefings in Bioinformatics (2024) GitHub Repository Spatial transcriptomics, histology images Human dorsolateral prefrontal cortex (LIBD), melanoma, IDC (HER2+ breast cancer), mouse lung tissues Spot-centric and sliding-window patch extraction; positional encodings Vision Transformer for image patches; Graph Transformer for spatial embeddings Cross-scale graph network for super-resolution; adaptive graph transformer for clustering Tissue identification, gene expression reconstruction at single-cell resolution Super-resolution without scRNA-seq references; cross-platform adaptability Adjusted Rand Index (ARI), clustering accuracy, UMAP visualizations
UCE 💡🔍 Universal Cell Embedding: A Foundation Model for Cell Biology. Yanay Rosen et al. bioRxiv (2024) Not available scRNA-seq, protein sequences 36 million cells, 1,000+ cell types, 300 datasets, 50 tissues, 8 species (e.g., human, mouse, zebrafish) Protein embeddings with ESM2, expression-weighted sampling Transformer-based embeddings with 33 layers and 650M parameters Transformer architecture integrating protein and expression data Zero-shot cell type prediction, dataset integration, species-level gene alignment Cross-species embedding, atlas-scale cell annotation, disease cell mapping UMAP visualizations, silhouette width, adjusted Rand Index
scMulan 💡🔍 scMulan: A Multitask Generative Pre-Trained Language Model for Single-Cell Analysis. Haiyang Bian et al. Research in Computational Molecular Biology (RECOMB) (2024) GitHub scRNA-seq, multi-omics hECA-10M (~10 million human single cells); 42,117 genes with meta-attributes Unified c-sentences encoding meta-attributes and expression levels Transformer decoder with shuffled token embeddings Generative pretraining using masked c-sentences; 368M parameters Cell type annotation, batch integration, conditional cell generation Zero-shot cell type annotation, batch integration, conditional cell generation UMAP visualizations, pseudo-time embeddings, cosine similarity
Geneformer 💡🔍 Transfer learning enables predictions in network biology. Christina V. Theodoris et al. Nature (2023) Hugging Face Repository; GitHub Repository scRNA-seq Genecorpus-30M (29.9M human single-cell transcriptomes); 561 datasets, diverse tissues Rank value encoding of transcriptomes; context-aware self-attention Transformer encoder (6 layers, 4 attention heads, 256 dimensions) Pretrained transformer for contextual embeddings, fine-tuned for network biology tasks Gene dosage prediction, chromatin dynamics, cell type annotations, disease modeling Context-aware predictions for rare diseases, cross-tissue integration Attention maps, in silico perturbation, embedding space clustering
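
Geneformer's "in silico perturbation" (last row), like scTranslator's pseudo-knockouts, illustrates a read-out shared by several models above: delete or alter a gene's token, re-embed the cell, and treat the embedding shift as the gene's predicted impact. A minimal sketch, with a stand-in encoder and mean pooling in place of any specific model's pooling scheme:

```python
import torch
import torch.nn.functional as F

def in_silico_deletion_shift(encoder, token_ids, gene_token):
    """Remove one gene's token from a rank-encoded cell, re-embed, and
    measure how far the cell embedding moves. `encoder` maps token ids
    to per-token embeddings; mean pooling and cosine shift are
    illustrative choices, not any paper's exact procedure."""
    perturbed = [t for t in token_ids if t != gene_token]
    with torch.no_grad():
        base = encoder(torch.tensor([token_ids])).mean(dim=1)
        pert = encoder(torch.tensor([perturbed])).mean(dim=1)
    return 1.0 - F.cosine_similarity(base, pert).item()  # larger = bigger effect

encoder = torch.nn.Embedding(30000, 256)          # stand-in for a trained model
print(in_silico_deletion_shift(encoder, [17, 4021, 9, 233], gene_token=9))
```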

Benchmarking Papers

📄 Paper 💻 Code 🧠 Benchmarking Models 🌟 Main Focus 📝 Results & Insights
None yet.

Review/Perspective Papers

📄 Paper 🌟 Highlights/Main Focus 📝 Remarks & Conclusion
None yet.
