🧬 RESM: RNA Evolution-Scale Modeling


📣 News

  • [2025/07] 🎉 We release the dataset and checkpoints on Zenodo!
  • [2025/07] 📊 Initial release of the RESM-150M and RESM-650M model code with comprehensive documentation.

⚡ Overview

RESM (RNA Evolution-Scale Modeling) is a state-of-the-art RNA language model that leverages protein language model knowledge to overcome the limited evolutionary information carried by RNA's four-letter alphabet. By mapping RNA sequences to pseudo-protein representations and adapting the ESM2 protein language model, RESM provides a robust foundation for deciphering RNA sequence-structure-function relationships.

Key Features:

  • Pseudo-protein Mapping: Novel approach to convert RNA's 4-letter alphabet into protein-like representations
  • Knowledge Transfer: Leverages the powerful representations learned by ESM protein language models
  • Dual-task Excellence: First RNA model to achieve state-of-the-art performance on both structural and functional prediction tasks
  • Zero-shot Capability: Outperforms 12 RNA language models in zero-shot evaluation without task-specific training
  • Benchmark Performance: Demonstrates superior results across 8 downstream tasks, surpassing 60+ models
  • Long RNA Breakthrough: 81% accuracy gain and 1000× speedup on sequences up to 4,000 nucleotides
  • Flexible Architecture: Available in 150M and 650M parameter versions

📥 Download URL

Resource     Description                             Size      Link
Datasets     Pre-training and downstream datasets    ~6.4 GB   Download
RESM-150M    Model checkpoint                        ~1.8 GB   Download
RESM-650M    Model checkpoint                        ~2.6 GB   Download

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • PyTorch 1.10+
  • CUDA 11.0+ (for GPU support)

Installation

  1. Clone the repository:
git clone https://github.com/yikunpku/RESM.git
cd RESM
  2. Create and activate the conda environment:
# Create conda environment from yml file
conda env create -f environment.yml

# Activate the environment
conda activate resm
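
Optionally, verify the environment with a minimal sanity check (assuming environment.yml installs PyTorch):

# verify_env.py — optional check; prints the PyTorch version and whether a CUDA GPU is visible
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())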

📊 Usage

Feature Extraction

Extract RNA embeddings and attention maps from your RNA sequences:

# For RESM-150M model (default paths)
python resm_inference.py \
    --base_model RESM_150M \
    --data_path /path/to/your/data \
    --output_dir /path/to/output \
    --device cuda

# For RESM-650M model (default paths)
python resm_inference.py \
    --base_model RESM_650M \
    --data_path /path/to/your/data \
    --output_dir /path/to/output \
    --device cuda

# Use custom checkpoint path
python resm_inference.py \
    --base_model RESM_150M \
    --model_path /path/to/custom/checkpoint.ckpt \
    --data_path /path/to/your/data \
    --output_dir /path/to/output \
    --device cuda

Input Data Format

The model expects RNA sequences in FASTA format or as a text file with RNA IDs. Place your data in the following structure:

data/
├── dsdata/
│   ├── msa/                          # MSA files (optional, can use single sequences)
│   └── extract_ss_data_alphaid.txt   # List of RNA IDs
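
For orientation only, the sketch below shows what these inputs might contain. The identifiers, file names, and the one-ID-per-line layout are illustrative assumptions, not a format specification from the repository:

# extract_ss_data_alphaid.txt — assumed to list one RNA ID per line (IDs are hypothetical)
rna_0001
rna_0002

# msa/rna_0001.fasta — a single-sequence FASTA record (sequence shown is illustrative)
>rna_0001
GGGCUAUUAGCUCAGUUGGUUAGAGCGCACCCCUGAUAAGGGUG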

Output Format

The model outputs two types of features for each RNA sequence:

  1. Embeddings (*_emb.npy):

    • RESM-150M: Shape (L, 640) where L is sequence length
    • RESM-650M: Shape (L, 1280) where L is sequence length
  2. Attention Maps (*_atp.npy):

    • RESM-150M: Shape (600, L, L) (30 layers × 20 heads)
    • RESM-650M: Shape (660, L, L) (33 layers × 20 heads)
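
Once inference finishes, the saved arrays can be inspected with NumPy. A minimal sketch, assuming an RNA with the hypothetical ID rna_0001 processed by RESM-150M and written to ./output:

import numpy as np

# File names are illustrative; they follow the *_emb.npy / *_atp.npy convention described above.
emb = np.load("output/rna_0001_emb.npy")   # (L, 640) for RESM-150M, (L, 1280) for RESM-650M
atp = np.load("output/rna_0001_atp.npy")   # (600, L, L) for RESM-150M, (660, L, L) for RESM-650M

print("sequence length:", emb.shape[0])
print("embedding dimension:", emb.shape[1])
print("attention channels (layers x heads):", atp.shape[0])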

πŸ—οΈ Model Architecture

RESM builds upon ESM2 architecture with RNA-specific adaptations:

RESM-150M (Based on ESM2-150M)

  • Base Model: esm2_t30_150M_UR50D
  • Layers: 30 transformer layers
  • Embedding Dimension: 640
  • Attention Heads: 20
  • Parameters: ~150M

RESM-650M (Based on ESM2-650M)

  • Base Model: esm2_t33_650M_UR50D
  • Layers: 33 transformer layers
  • Embedding Dimension: 1280
  • Attention Heads: 20
  • Parameters: ~650M
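
The per-variant numbers above can be collected into a small configuration dictionary; the key names below are illustrative and simply restate the specifications listed in this section:

# Variant summary (key names are illustrative; values restate the specs above).
RESM_CONFIGS = {
    "RESM_150M": {"base_model": "esm2_t30_150M_UR50D", "layers": 30, "embed_dim": 640, "heads": 20},
    "RESM_650M": {"base_model": "esm2_t33_650M_UR50D", "layers": 33, "embed_dim": 1280, "heads": 20},
}

# The attention-map channel count equals layers x heads:
# 30 x 20 = 600 channels for RESM-150M, 33 x 20 = 660 channels for RESM-650M.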

πŸ” Example Use Cases

  1. RNA Secondary Structure Prediction: Use extracted attention maps for predicting RNA base pairs with state-of-the-art accuracy
  2. RNA Function Classification: Leverage embeddings for functional annotation of novel RNA sequences
  3. Gene Expression Prediction: Apply RESM features for mRNA expression level prediction
  4. Ribosome Loading Efficiency: Predict translation efficiency from mRNA sequences
  5. RNA Similarity Search: Compare RNA sequences using embedding similarity (see the sketch after this list)
  6. Transfer Learning: Fine-tune on your specific RNA task for enhanced performance
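
As a concrete illustration of the similarity-search use case (item 5), the sketch below mean-pools the saved per-residue embeddings of two RNAs and compares them with cosine similarity. The file paths follow the *_emb.npy convention above but are otherwise hypothetical:

import numpy as np

def mean_pooled_embedding(path):
    """Load a per-residue embedding of shape (L, D) and mean-pool it to a single (D,) vector."""
    return np.load(path).mean(axis=0)

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Paths are hypothetical examples of RESM inference outputs.
vec_a = mean_pooled_embedding("output/rna_0001_emb.npy")
vec_b = mean_pooled_embedding("output/rna_0002_emb.npy")
print("cosine similarity:", cosine_similarity(vec_a, vec_b))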

πŸ“ Citation

If you use RESM in your research, please cite our paper:

@article {Zhang2025.08.09.669469,
	author = {Zhang, Yikun and Zhang, Hao and Li, Guo-Wei and Wang, He and Zhang, Xing and Hong, Xu and Zhang, Tingting and Wen, Liangsheng and Zhao, Yu and Jiang, Jiuhong and Chen, Jie and Chen, Yanjun and Liu, Liwei and Zhan, Jian and Zhou, Yaoqi},
	title = {RESM: Capturing sequence and structure encoding of RNAs by mapped transfer learning from ESM (evolutionary scale modeling) protein language model},
	elocation-id = {2025.08.09.669469},
	year = {2025},
	doi = {10.1101/2025.08.09.669469},
	publisher = {Cold Spring Harbor Laboratory},
	abstract = {RNA sequences exhibit lower evolutionary conservation than proteins due to their informationally constrained four-letter alphabet, compared to the 20-letter code of proteins. More limited information makes unsupervised learning of structural and functional evolutionary patterns more challenging from single RNA sequences. We overcame this limitation by mapping RNA sequences to pseudo-protein sequences to allow effective transfer training from a protein language model (protein Evolution-Scale Model 2, protESM-2). The resulting RNA ESM (RESM) outperforms 12 existing RNA language models in zero-shot prediction, not only in sequence classification but also in RNA secondary structure and RNA-RNA interaction prediction. Further supervised fine-tuning demonstrates RESM{\textquoteright}s generalizability and superior performance over the existing models compared across multiple downstream tasks, including mRNA ribosome loading efficiency and gene expression prediction, despite RESM being trained exclusively on noncoding RNAs. Moreover, RESM can generalize to unseen sequences beyond its 1,024-nucleotide training limit, achieving 81.3\% improvement over state-of-the-art methods in supervised secondary structure prediction for RNAs up to 4,000 nucleotides, limited only by the available GPU memory, while providing \>1000-fold speedup compared to MSA-based approaches. RESM provides a robust foundation for deciphering RNA sequence-structure-function relationships, with broad implications for RNA biology.Competing Interest StatementPatent applications related to RESM and downstream tasks were submitted by China Mobile Research Institute and Shenzhen Bay Laboratory. LW,YC, \& TZ are affiliated with China Mobile Research Institute. YiZ, HW, JZ, \& YaZ are affiliated with Shenzhen Bay Laboratory. JZ and YaZ are the CEO and the chair of the scientific advisory board for Ribopeutic, respectively. All other authors declare no competing interests.},
	URL = {https://www.biorxiv.org/content/early/2025/08/10/2025.08.09.669469},
	eprint = {https://www.biorxiv.org/content/early/2025/08/10/2025.08.09.669469.full.pdf},
	journal = {bioRxiv}
}

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ‘ Acknowledgments

🤝 Contributing

We welcome contributions! Please feel free to submit issues or pull requests.

📧 Contact

For questions or collaborations, please contact: yikun.zhang@stu.pku.edu.cn
