An official PyTorch implementation of the paper Beyond Sequence: Impact of Geometric Context for RNA Property Prediction.
- We introduce a diverse collection of RNA datasets, including newly annotated 2D and 3D structures, covering various prediction tasks at nucleotide and sequence levels across multiple species.
- We provide a unified testing environment to evaluate different types of machine learning models for RNA property prediction, including sequence models for 1D, graph neural networks for 2D, and equivariant geometric networks for 3D RNA representations.
- We conduct a comprehensive analysis of how different models perform under various conditions, such as limited data and labels, different types of sequencing errors, and out-of-distribution scenarios. We highlight the trade-offs and contexts in which each modeling approach is most effective, guiding researchers in selecting suitable models for specific RNA analysis challenges.
- We also introduce novel modifications to existing 3D geometric models based on biological priors, specifically optimizing them for large-scale point cloud RNA data, which significantly improves the efficiency and performance of 3D models.
Option 1: Install dependencies manually in a new virtual environment.
conda create --name rna python=3.11
conda activate rna
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.3.0+cu121.html
pip install biopython rdkit optuna gdown pandas
Option 2: Install dependencies from the conda environment file.
conda env create -f environment.yml
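After either installation route, a quick check can confirm that the main packages are importable. The sketch below is not part of the repository: the helper name `check_packages` is hypothetical, and it only tests whether each module can be found, not whether its version matches the pins above.

```python
import importlib.util


def check_packages(names):
    """Return {module_name: bool} telling whether each top-level module can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Import names of the packages installed above (Biopython is imported as "Bio").
    required = ["torch", "torch_geometric", "Bio", "rdkit", "optuna", "pandas"]
    for name, found in check_packages(required).items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

Any package reported as missing can be reinstalled with the corresponding `pip install` line from Option 1.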
Data can be downloaded from the original sources and preprocessed as explained in rnadatasets:
Scripts ending with `_1d.py`, `_1d2d.py`, `_2d.py`, and `_3d.py` handle 1D, 1D2D, 2D, and 3D data, respectively. The methods supported by each script are summarized in the following table.
Script | Method |
---|---|
run_1d.py | Transformer1D |
run_1d2d.py | Transformer1D2D |
run_2d.py | GCN, GAT, ChebNet, GraphGPS, GraphTransformer |
run_3d.py | EGNN, SchNet |
We provide running scripts and all optimized hyperparameters (as listed in Table 5 of the paper) in the `scripts.sh` file.
For example, to run Transformer1D on CovidVaccine1D:
python run_1d.py --model tsfm --dataset CovidVaccine1D --lr 0.001 --weight_decay 0 --d_model 128 --dim_feedforward 128 --nhead 8 --num_encoder_layers 8
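When launching many runs, it can be convenient to assemble the same command from a dictionary of hyperparameters. The sketch below is an illustration only: `build_command` is a hypothetical helper, not something the repository provides, and it uses only the flags that appear in the example above.

```python
import shlex
import sys


def build_command(script, **options):
    """Build an argument list of the form [python, script, --key, value, ...]."""
    cmd = [sys.executable, script]
    for key, value in options.items():
        cmd += [f"--{key}", str(value)]
    return cmd


# Hyperparameters copied from the Transformer1D / CovidVaccine1D example above.
cmd = build_command(
    "run_1d.py",
    model="tsfm",
    dataset="CovidVaccine1D",
    lr=0.001,
    weight_decay=0,
    d_model=128,
    dim_feedforward=128,
    nhead=8,
    num_encoder_layers=8,
)
# Show the command without the interpreter path.
print(shlex.join(cmd[1:]))
```

To start the run from Python, the list can be passed to `subprocess.run(cmd, check=True)`.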
Replace the dataset in 3.1 with the corresponding noisy dataset and set the noise level with `--noise_level`. For example, run Transformer1D on CovidVaccine1DNoisy with noise level 0.1:
python run_1d.py --model tsfm --dataset CovidVaccine1DNoisy --noise_level 0.1 --lr 0.001 --weight_decay 0 --d_model 128 --dim_feedforward 128 --nhead 8 --num_encoder_layers 8
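To give a feel for what a noise level can mean at the sequence level, here is a minimal sketch that assumes noise is applied as independent random substitutions, each nucleotide being replaced with probability `noise_level`. This is an assumption made for illustration: the function `substitute_nucleotides` is hypothetical, and the noisy datasets in this repository may be generated differently (for example, with other types of sequencing errors).

```python
import random

NUCLEOTIDES = "ACGU"


def substitute_nucleotides(sequence, noise_level, seed=0):
    """Replace each nucleotide with a different one with probability noise_level."""
    rng = random.Random(seed)
    noisy = []
    for base in sequence:
        if base in NUCLEOTIDES and rng.random() < noise_level:
            # Pick uniformly among the three other nucleotides.
            noisy.append(rng.choice([b for b in NUCLEOTIDES if b != base]))
        else:
            noisy.append(base)
    return "".join(noisy)


clean = "GGAAAUCCGAUAGCUAAGCU"
noisy = substitute_nucleotides(clean, noise_level=0.1)
changed = sum(a != b for a, b in zip(clean, noisy))
print(noisy, f"({changed} substitutions)")
```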
Replace the `run` prefix of the script name in 3.1 with `clean_noisy`. The script reports results on 6 different noisy datasets. For example, run Transformer1D on CovidVaccine1D:
python clean_noisy_1d.py --model tsfm --dataset CovidVaccine1D --lr 0.001 --weight_decay 0 --d_model 128 --dim_feedforward 128 --nhead 8 --num_encoder_layers 8
Add the `--train_ratio` argument to control the fraction of data used for training. For example, run Transformer1D on CovidVaccine1D with 20% of the data used for training and 10% each for validation and testing:
python run_1d.py --model tsfm --dataset CovidVaccine1D --lr 0.001 --weight_decay 0 --d_model 128 --dim_feedforward 128 --nhead 8 --num_encoder_layers 8 --train_ratio 0.2 --val_ratio 0.1 --test_ratio 0.1
Note:
- For the Tc-riboswitches and Fungal datasets, we follow the original paper and use fixed splits. Therefore, `--train_ratio` only varies the training data, while the validation and test data remain the same.
- Here `train_ratio` is a fraction of the entire dataset (training, validation, and test data combined), not a fraction of the training split alone.
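The note on ratios can be made concrete with a small sketch. It assumes a simple shuffled split in which every ratio is taken relative to the full dataset; `split_indices` is a hypothetical helper, and the repository's actual splitting code may differ (in particular for the fixed-split datasets mentioned above).

```python
import random


def split_indices(n, train_ratio, val_ratio, test_ratio, seed=0):
    """Shuffle range(n) and cut it into train/val/test by fractions of the whole dataset."""
    if train_ratio + val_ratio + test_ratio > 1 + 1e-9:
        raise ValueError("ratios must sum to at most 1")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(n * train_ratio)
    n_val = int(n * val_ratio)
    n_test = int(n * test_ratio)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


# Ratios from the example command: 0.2 / 0.1 / 0.1 of all samples.
train, val, test = split_indices(1000, train_ratio=0.2, val_ratio=0.1, test_ratio=0.1)
print(len(train), len(val), len(test))
```

Under this assumption, the remaining samples are simply left unused, which is how lowering `train_ratio` reduces the training set while the validation and test sets keep their size.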
Run the `label_ratio` scripts with the `--seq_label_ratio` argument to control the fraction of sequences that are labeled. For example, run Transformer1D on CovidVaccine1D with 20% of sequences labeled:
python label_ratio_1d.py --model tsfm --dataset CovidVaccine1D --lr 0.001 --weight_decay 0 --d_model 128 --dim_feedforward 128 --nhead 8 --num_encoder_layers 8 --seq_label_ratio 0.2
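One way to picture partial sequence labeling is a boolean mask over the training sequences, where only the selected sequences contribute their labels. The sketch below is an assumption about the general idea, not the repository's implementation; `labeled_mask` is a hypothetical helper, and how unlabeled sequences are treated during training is not specified here.

```python
import random


def labeled_mask(num_sequences, seq_label_ratio, seed=0):
    """Return a list of booleans; True marks sequences whose labels are kept."""
    num_labeled = round(num_sequences * seq_label_ratio)
    keep = set(random.Random(seed).sample(range(num_sequences), num_labeled))
    return [i in keep for i in range(num_sequences)]


mask = labeled_mask(num_sequences=10, seq_label_ratio=0.2)
print(sum(mask), "of", len(mask), "sequences keep their labels")
```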
If you encounter any problems, please file an issue along with a detailed description.
If you find this work useful, please consider citing:
@inproceedings{xu2024beyond,
title={Beyond Sequence: Impact of Geometric Context for RNA Property Prediction},
author={Xu, Junjie and Moskalev, Artem and Mansi, Tommaso and Prakash, Mangal and Liao, Rui},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=9htTvHkUhh}
}
@inproceedings{xu2025harmony,
title={{HARMONY}: A Multi-Representation Framework for {RNA} Property Prediction},
author={Junjie Xu and Artem Moskalev and Tommaso Mansi and Mangal Prakash and Rui Liao},
booktitle={ICLR 2025 Workshop on AI for Nucleic Acids},
year={2025},
url={https://openreview.net/forum?id=nzUsRhtnBa}
}