- [2025/07] We release the dataset and checkpoints on Zenodo!
- [2025/07] Initial release of the code for the RESM-150M and RESM-650M models, with comprehensive documentation.
RESM (RNA Evolution-Scale Modeling) is a state-of-the-art RNA language model that leverages protein language model knowledge to overcome the limited evolutionary information in RNA's four-letter alphabet. By mapping RNA sequences to pseudo-protein representations and adapting the ESM2 protein language model, RESM provides a robust foundation for deciphering RNA sequence-structure-function relationships.
- Pseudo-protein Mapping: Novel approach to convert RNA's 4-letter alphabet into protein-like representations
- Knowledge Transfer: Leverages the powerful representations learned by ESM protein language models
- Dual-task Excellence: First RNA model to achieve state-of-the-art performance on both structural and functional prediction tasks
- Zero-shot Capability: Outperforms 12 RNA language models in zero-shot evaluation without task-specific training
- Benchmark Performance: Demonstrates superior results across 8 downstream tasks, surpassing 60+ models
- Long RNA Breakthrough: 81.3% accuracy improvement and >1000× speedup over MSA-based approaches on sequences up to 4,000 nucleotides
- Flexible Architecture: Available in 150M and 650M parameter versions
| Resource | Description | Size | Link |
|---|---|---|---|
| Datasets | Pre-training and downstream datasets | ~6.4 GB | Download |
| RESM-150M | Model checkpoint | ~1.8 GB | Download |
| RESM-650M | Model checkpoint | ~2.6 GB | Download |
- Python 3.8+
- PyTorch 1.10+
- CUDA 11.0+ (for GPU support)
- Clone the repository:

```bash
git clone https://github.com/yourusername/RESM.git
cd RESM
```

- Create and activate the conda environment:

```bash
# Create the conda environment from the yml file
conda env create -f environment.yml

# Activate the environment
conda activate resm
```
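If the environment resolves correctly, a quick sanity check (a minimal sketch run inside the `resm` environment) confirms that PyTorch imports and a GPU is visible:

```python
# Sanity check for the resm environment: PyTorch version and GPU visibility
import torch

print(torch.__version__)           # expect 1.10+ per the requirements above
print(torch.cuda.is_available())   # True if CUDA 11.0+ and a GPU are set up
```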
Extract RNA embeddings and attention maps from your RNA sequences:

```bash
# For the RESM-150M model (default paths)
python resm_inference.py \
    --base_model RESM_150M \
    --data_path /path/to/your/data \
    --output_dir /path/to/output \
    --device cuda

# For the RESM-650M model (default paths)
python resm_inference.py \
    --base_model RESM_650M \
    --data_path /path/to/your/data \
    --output_dir /path/to/output \
    --device cuda

# Use a custom checkpoint path
python resm_inference.py \
    --base_model RESM_150M \
    --model_path /path/to/custom/checkpoint.ckpt \
    --data_path /path/to/your/data \
    --output_dir /path/to/output \
    --device cuda
```
The model expects RNA sequences in FASTA format or as a text file with RNA IDs. Place your data in the following structure:
```
data/
└── dsdata/
    ├── msa/                          # MSA files (optional; single sequences also work)
    └── extract_ss_data_alphaid.txt   # List of RNA IDs
```
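A minimal sketch of preparing such inputs is shown below; the FASTA file name and the RNA IDs are illustrative only, and the placement of the FASTA file is an assumption rather than a documented convention:

```python
# Hypothetical input layout: a FASTA file of RNA sequences plus the ID list
# read by the inference script (all names here are illustrative).
from pathlib import Path

Path("data/dsdata/msa").mkdir(parents=True, exist_ok=True)
Path("data/dsdata/example.fasta").write_text(
    ">rna_001\nGGGAAACUCAGGGUGCGC\n"
    ">rna_002\nACGUACGUACGUACGU\n"
)
Path("data/dsdata/extract_ss_data_alphaid.txt").write_text("rna_001\nrna_002\n")
```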
The model outputs two types of features for each RNA sequence:

- Embeddings (`*_emb.npy`):
  - RESM-150M: shape `(L, 640)`, where L is the sequence length
  - RESM-650M: shape `(L, 1280)`, where L is the sequence length
- Attention Maps (`*_atp.npy`):
  - RESM-150M: shape `(600, L, L)` (30 layers × 20 heads)
  - RESM-650M: shape `(660, L, L)` (33 layers × 20 heads)
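A minimal sketch of consuming these outputs with NumPy, assuming the `*_emb.npy` / `*_atp.npy` naming above; the RNA ID `rna_001` is illustrative, and the layer-major ordering of the reshape is an assumption:

```python
# Load the per-sequence features written by resm_inference.py
import numpy as np

emb = np.load("output/rna_001_emb.npy")   # (L, 640) for RESM-150M, (L, 1280) for RESM-650M
atp = np.load("output/rna_001_atp.npy")   # (600, L, L) or (660, L, L)

L = emb.shape[0]
assert atp.shape[1:] == (L, L)            # one L x L map per layer-head pair

# Recover a (layers, heads, L, L) view, e.g. 30 x 20 for RESM-150M
layers, heads = 30, 20
atp_lh = atp.reshape(layers, heads, L, L)
```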
RESM builds upon the ESM2 architecture with RNA-specific adaptations:

RESM-150M:
- Base Model: `esm2_t30_150M_UR50D`
- Layers: 30 transformer layers
- Embedding Dimension: 640
- Attention Heads: 20
- Parameters: ~150M

RESM-650M:
- Base Model: `esm2_t33_650M_UR50D`
- Layers: 33 transformer layers
- Embedding Dimension: 1280
- Attention Heads: 20
- Parameters: ~650M
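For reference, the corresponding ESM2 protein backbone can be instantiated from the `fair-esm` package; the sketch below loads only the unmodified protein model, not the RESM checkpoint itself (use `resm_inference.py` with `--model_path` for that):

```python
# Load the ESM2 protein backbone that RESM-150M adapts (requires fair-esm)
import esm

model, alphabet = esm.pretrained.esm2_t30_150M_UR50D()
print(model.num_layers)        # 30 transformer layers
print(model.embed_dim)         # 640-dimensional embeddings
print(model.attention_heads)   # 20 attention heads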
- RNA Secondary Structure Prediction: Use extracted attention maps for predicting RNA base pairs with state-of-the-art accuracy
- RNA Function Classification: Leverage embeddings for functional annotation of novel RNA sequences
- Gene Expression Prediction: Apply RESM features for mRNA expression level prediction
- Ribosome Loading Efficiency: Predict translation efficiency from mRNA sequences
- RNA Similarity Search: Compare RNA sequences using embedding similarity (a minimal sketch follows this list)
- Transfer Learning: Fine-tune on your specific RNA task for enhanced performance
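As a concrete example of the similarity use case, one common recipe (an assumption, not a prescribed RESM API) is to mean-pool each per-residue embedding over the length dimension and compare the pooled vectors by cosine similarity:

```python
# Compare two RNAs by cosine similarity of mean-pooled RESM embeddings
# (paths are illustrative; embeddings come from resm_inference.py outputs).
import numpy as np

def pooled(path: str) -> np.ndarray:
    emb = np.load(path)       # (L, D) per-residue embeddings
    return emb.mean(axis=0)   # (D,) sequence-level vector

a = pooled("output/rna_001_emb.npy")
b = pooled("output/rna_002_emb.npy")
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {cos:.3f}")
```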
If you use RESM in your research, please cite our paper:
```bibtex
@article{Zhang2025.08.09.669469,
    author = {Zhang, Yikun and Zhang, Hao and Li, Guo-Wei and Wang, He and Zhang, Xing and Hong, Xu and Zhang, Tingting and Wen, Liangsheng and Zhao, Yu and Jiang, Jiuhong and Chen, Jie and Chen, Yanjun and Liu, Liwei and Zhan, Jian and Zhou, Yaoqi},
    title = {RESM: Capturing sequence and structure encoding of RNAs by mapped transfer learning from ESM (evolutionary scale modeling) protein language model},
    elocation-id = {2025.08.09.669469},
    year = {2025},
    doi = {10.1101/2025.08.09.669469},
    publisher = {Cold Spring Harbor Laboratory},
    abstract = {RNA sequences exhibit lower evolutionary conservation than proteins due to their informationally constrained four-letter alphabet, compared to the 20-letter code of proteins. More limited information makes unsupervised learning of structural and functional evolutionary patterns more challenging from single RNA sequences. We overcame this limitation by mapping RNA sequences to pseudo-protein sequences to allow effective transfer training from a protein language model (protein Evolution-Scale Model 2, protESM-2). The resulting RNA ESM (RESM) outperforms 12 existing RNA language models in zero-shot prediction, not only in sequence classification but also in RNA secondary structure and RNA-RNA interaction prediction. Further supervised fine-tuning demonstrates RESM{\textquoteright}s generalizability and superior performance over the existing models compared across multiple downstream tasks, including mRNA ribosome loading efficiency and gene expression prediction, despite RESM being trained exclusively on noncoding RNAs. Moreover, RESM can generalize to unseen sequences beyond its 1,024-nucleotide training limit, achieving 81.3\% improvement over state-of-the-art methods in supervised secondary structure prediction for RNAs up to 4,000 nucleotides, limited only by the available GPU memory, while providing \>1000-fold speedup compared to MSA-based approaches. RESM provides a robust foundation for deciphering RNA sequence-structure-function relationships, with broad implications for RNA biology.},
    URL = {https://www.biorxiv.org/content/early/2025/08/10/2025.08.09.669469},
    eprint = {https://www.biorxiv.org/content/early/2025/08/10/2025.08.09.669469.full.pdf},
    journal = {bioRxiv}
}
```
This project is licensed under the MIT License - see the LICENSE file for details.
- [ESM](https://github.com/facebookresearch/esm): the codebase we built upon.
We welcome contributions! Please feel free to submit issues or pull requests.
For questions or collaborations, please contact: yikun.zhang@stu.pku.edu.cn