
MarksLab-DasLab/RNAGym


RNAGym

Overview

RNAGym is an extensive benchmark suite for RNA fitness and structure prediction. This code repository provides unified access to all baselines leveraged in our paper, as well as to the underlying datasets used to assess their respective fitness and/or structure prediction performance.

Baselines

Fitness prediction

We have currently implemented the following baselines (see ./fitness):

| Model name | Model type | Reference |
| --- | --- | --- |
| Evo 1 | Genomic Language Model | Nguyen et al. "Sequence modeling and design from molecular to genome scale with Evo". Science, 2024. |
| Evo 2 | Genomic Language Model | Brixi et al. "Genome modeling and design across all domains of life with Evo 2". bioRxiv, 2025. |
| RNAErnie | Genomic Language Model | Wang et al. "Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning". Nature Machine Intelligence, 2024. |
| RiNALMo | RNA Language Model | Penić et al. "RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks". arXiv, 2024. |
| RNA-FM | RNA Language Model | Chen et al. "Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions". arXiv, 2022. |
| Nucleotide Transformer | Genomic Language Model | Dalla-Torre et al. "The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics". Nature Methods, 2024. |
| GenSLM | Genomic Language Model | Zvyagin et al. "GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics". The International Journal of High Performance Computing Applications, 2023. |

Secondary structure prediction

We have currently implemented the following baselines (see ./2d):

| Model name | Model type | Reference |
| --- | --- | --- |
| RibonanzaNet | Structure prediction | He et al. "Ribonanza: deep learning of RNA structure through dual crowdsourcing". bioRxiv, 2024. |
| EternaFold | Structure prediction | Wayment-Steele et al. "RNA secondary structure packages evaluated and improved by high-throughput experiments". Nature Methods, 2022. |
| CONTRAfold | Structure prediction | Do et al. "CONTRAfold: RNA secondary structure prediction without physics-based models". Bioinformatics, 22(14):e90–e98, 2006. |
| Vienna | Structure prediction | Gruber et al. "The Vienna RNA Websuite". Nucleic Acids Research, 36:W70–W74, 2008. |
| RNAstructure | Structure prediction | Reuter et al. "RNAstructure: software for RNA secondary structure prediction and analysis". BMC Bioinformatics, 11:129, 2010. |
| RNA-FM | RNA Language Model | Chen et al. "Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions". arXiv, 2022. |
| UFold | Structure prediction | Fu et al. "UFold: fast and accurate RNA secondary structure prediction with deep learning". Nucleic Acids Research, 2022. |
| Mxfold2 | Structure prediction | Sato et al. "RNA secondary structure prediction using deep learning with thermodynamic integration". Nature Methods, 2021. |

Tertiary structure prediction

We have also developed a data curation pipeline, assembled structure datasets, and implemented several baselines for RNA tertiary structure prediction (see ./3d):

| Model name | Model type | Reference |
| --- | --- | --- |
| AlphaFold3 | Structure prediction | Abramson et al. "Accurate structure prediction of biomolecular interactions with AlphaFold 3". Nature, 2024. |
| NuFold | Structure prediction | Kagaya et al. "NuFold: end-to-end approach for RNA tertiary structure prediction with flexible nucleobase center representation". Nature Communications, 2025. |
| RhoFold+ | Structure prediction | Shen et al. "Accurate RNA 3D structure prediction using a language model-based deep learning approach". Nature Methods, 2024. |
| RoseTTAFoldNA | Structure prediction | Baek et al. "Accurate prediction of protein–nucleic acid complexes using RoseTTAFoldNA". Nature Methods, 2023. |
| trRosettaRNA | Structure prediction | Wang et al. "trRosettaRNA: automated prediction of RNA 3D structure with transformer network". Nature Communications, 2023. |

Setup

The RNAGym environment may be created via conda and the provided rnagym_env.yml file as follows:

conda env create -f rnagym_env.yml
conda activate rnagym_env

For the fitness prediction task, we recommend the following folder structure:

fitness_prediction/
├── processed_DMS_files/
├── model_predictions/
└── model_checkpoints/
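
As a minimal sketch, the recommended layout can be created with a few lines of Python. The helper name `create_fitness_layout` and the constant `FITNESS_SUBFOLDERS` are hypothetical and not part of the RNAGym codebase; only the folder names come from the tree above.

```python
from pathlib import Path

# Subfolder names taken from the recommended layout above.
FITNESS_SUBFOLDERS = ("processed_DMS_files", "model_predictions", "model_checkpoints")


def create_fitness_layout(root="fitness_prediction"):
    """Create the recommended fitness folder tree under `root` and return its path.

    Hypothetical helper for illustration; not part of the RNAGym repository.
    """
    root_path = Path(root)
    for name in FITNESS_SUBFOLDERS:
        # parents=True creates `root` if needed; exist_ok=True makes reruns safe.
        (root_path / name).mkdir(parents=True, exist_ok=True)
    return root_path
```

Calling `create_fitness_layout()` from the directory where you keep the data creates `fitness_prediction/` with its three subfolders. Rerunning it leaves existing folders untouched.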

For the structure prediction task, the data processing and scoring scripts expect the following folder structure:

structure_prediction/
├── test_data/
├── raw_data/
├── model_predictions/
└── models/

The content for models, model_predictions, raw_data and test_data may all be downloaded via the links in the next section. The data_folder argument in the data and scoring scripts should be set to the location of the structure_prediction folder.
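
Before passing a path as data_folder, it can help to confirm that the expected subfolders are in place. The sketch below assumes only the four folder names listed above; the function `missing_subfolders` is a hypothetical helper, not something the RNAGym scripts provide.

```python
from pathlib import Path

# Subfolder names taken from the expected structure_prediction layout above.
EXPECTED_SUBFOLDERS = ("test_data", "raw_data", "model_predictions", "models")


def missing_subfolders(data_folder):
    """Return the expected subfolders that are absent from `data_folder`.

    Hypothetical helper for illustration; not part of the RNAGym repository.
    """
    root = Path(data_folder)
    return [name for name in EXPECTED_SUBFOLDERS if not (root / name).is_dir()]
```

An empty list means all four subfolders exist. Any returned names point to archives that still need to be downloaded and unzipped.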

Resources

To download and unzip the data, run the following commands for each data source you need from the table below. For example, to download and unzip the zero-shot predictions of all baselines on all DMS substitution assays:

curl -o model_predictions.zip https://marks.hms.harvard.edu/rnagym/fitness_prediction/model_predictions.zip
unzip model_predictions.zip && rm model_predictions.zip

| Task | Data | Size (unzipped) | Link |
| --- | --- | --- | --- |
| Fitness prediction | Processed assay data | 73MB | https://marks.hms.harvard.edu/rnagym/fitness_prediction/fitness_processed_assays.zip |
| Fitness prediction | Raw assay data | 88MB | https://marks.hms.harvard.edu/rnagym/fitness_prediction/fitness_raw_data.zip |
| Fitness prediction | Model predictions | 438MB | https://marks.hms.harvard.edu/rnagym/fitness_prediction/model_predictions.zip |
| Fitness prediction | Alignments | 319KB | https://marks.hms.harvard.edu/rnagym/fitness_prediction/fitness_MSAs.zip |
| Fitness prediction | 3D structures | 621KB | https://marks.hms.harvard.edu/rnagym/fitness_prediction/fitness_assays_3D_structures.zip |
| Fitness prediction | CV splits (supervised) | 145MB | https://marks.hms.harvard.edu/rnagym/fitness_prediction/fitness_CV_splits.zip |
| Structure prediction | Processed eval data | 3.2GB | https://marks.hms.harvard.edu/rnagym/structure_prediction/test_data.zip |
| Structure prediction | Raw assay data | 5.1GB | https://marks.hms.harvard.edu/rnagym/structure_prediction/raw_data.zip |
| Structure prediction | Model predictions | 34GB | https://marks.hms.harvard.edu/rnagym/structure_prediction/model_predictions.zip |
| Structure prediction | Model files | 441MB | https://marks.hms.harvard.edu/rnagym/structure_prediction/models.zip |
| Structure prediction | Training data (supervised) | 8.1GB | https://marks.hms.harvard.edu/rnagym/structure_prediction/train_data.zip |
| Structure prediction | Additional annotations (PDB, Rfam, PseudoBase) | 29MB | https://marks.hms.harvard.edu/rnagym/structure_prediction/test_sequences_annotated.zip |
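
For several archives, the download-and-unzip steps can also be scripted. The following is a rough sketch using only the Python standard library. It assumes that every link follows the pattern `<base>/<task>/<archive>.zip` seen in the table; the function names `archive_url`, `unzip_and_remove` and `download_archive` are hypothetical and not part of the RNAGym codebase.

```python
import urllib.request
import zipfile
from pathlib import Path

# Base URL shared by every link in the table above.
BASE_URL = "https://marks.hms.harvard.edu/rnagym"


def archive_url(task, archive_name):
    """Build a download URL, e.g. task="fitness_prediction", archive_name="fitness_MSAs"."""
    return f"{BASE_URL}/{task}/{archive_name}.zip"


def unzip_and_remove(zip_path, dest_dir):
    """Extract `zip_path` into `dest_dir`, then delete the archive (like `unzip && rm`)."""
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    zip_path.unlink()


def download_archive(task, archive_name, dest_dir="."):
    """Download one archive from the table, extract it, and remove the zip file."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    zip_path = dest / f"{archive_name}.zip"
    # urlretrieve streams the remote file to the given local path.
    urllib.request.urlretrieve(archive_url(task, archive_name), str(zip_path))
    unzip_and_remove(zip_path, dest)
```

For instance, `download_archive("fitness_prediction", "fitness_MSAs")` would fetch and unpack the alignments into the current directory. The larger structure prediction archives need tens of gigabytes of free disk space, so check the sizes in the table first.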

Model files for the structure prediction task were prepared for a 64-bit Linux system. For other systems, refer to the Arnie repo.

The data are also available on our Hugging Face page.

Acknowledgements

Our codebase leveraged code from the following repositories to compute baselines:

| Model | Repo |
| --- | --- |
| arnie | https://github.com/DasLab/arnie |

License

This project is available under the MIT license found in the LICENSE file in this GitHub repository.
