Beyond Sequence: Impact of Geometric Context for RNA Property Prediction

The official PyTorch implementation of the paper "Beyond Sequence: Impact of Geometric Context for RNA Property Prediction".

Contribution

We introduce a diverse collection of RNA datasets, including newly annotated 2D and 3D structures, covering various prediction tasks at nucleotide and sequence levels across multiple species.
We provide a unified testing environment to evaluate different types of machine learning models for RNA property prediction, including sequence models for 1D, graph neural networks for 2D, and equivariant geometric networks for 3D RNA representations.
We conduct a comprehensive analysis of how different models perform under various conditions, such as limited data and labels, different types of sequencing errors, and out-of-distribution scenarios. We highlight the trade-offs and contexts in which each modeling approach is most effective, guiding researchers in selecting suitable models for specific RNA analysis challenges.
We also introduce novel modifications to existing 3D geometric models based on biological priors, optimizing them specifically for large-scale point cloud RNA data and thereby significantly improving the efficiency and performance of 3D models.

1. Installation

Option 1: Install dependencies manually in a new conda environment.

conda create --name rna python=3.11
conda activate rna

pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.3.0+cu121.html
pip install biopython rdkit optuna gdown pandas

Option 2: Install dependencies from the provided conda environment file.

conda env create -f environment.yml
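
After either option, it can help to confirm that the packages are visible from the active environment before launching a run. The following is a minimal sketch, not part of the repository: `REQUIRED_MODULES` and `missing_modules` are hypothetical names, and the list simply mirrors the pip commands of Option 1 (note that biopython is imported as `Bio`).

```python
from importlib.util import find_spec

# Import names for the packages installed in Option 1.
# biopython is imported as "Bio"; the others match their pip names.
REQUIRED_MODULES = [
    "torch", "torchvision", "torchaudio", "torch_geometric",
    "pyg_lib", "torch_scatter", "torch_sparse", "torch_cluster",
    "torch_spline_conv", "Bio", "rdkit", "optuna", "gdown", "pandas",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the names in `names` that cannot be located in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules()
print("All dependencies found." if not missing else "Missing: " + ", ".join(missing))
```

The check only locates the modules without importing them, so it does not verify that the CUDA build matches your driver.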

2. Data

Data can be downloaded from the original sources and preprocessed as explained in rnadatasets.

3. Experiments

Scripts whose names end in _1d.py, _1d2d.py, _2d.py, and _3d.py operate on 1D, 1D2D, 2D, and 3D data, respectively. The following table lists the methods available in each run script.

Script	Methods
`run_1d.py`	Transformer1D
`run_1d2d.py`	Transformer1D2D
`run_2d.py`	GCN, GAT, ChebNet, GraphGPS, GraphTransformer
`run_3d.py`	EGNN, SchNet
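
For scripted sweeps, the table above can be held as a small lookup. This is an illustrative sketch only: `SCRIPT_METHODS` and `script_for_method` are hypothetical names that do not exist in the repository, and the method names are the display names from the table, not necessarily the values accepted by `--model`.

```python
# Script-to-method mapping copied from the table above.
SCRIPT_METHODS = {
    "run_1d.py": ["Transformer1D"],
    "run_1d2d.py": ["Transformer1D2D"],
    "run_2d.py": ["GCN", "GAT", "ChebNet", "GraphGPS", "GraphTransformer"],
    "run_3d.py": ["EGNN", "SchNet"],
}


def script_for_method(method):
    """Return the run script that hosts `method` (hypothetical helper)."""
    for script, methods in SCRIPT_METHODS.items():
        if method in methods:
            return script
    raise KeyError(f"Unknown method: {method}")


print(script_for_method("GraphGPS"))
```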

We provide running scripts and all optimized hyperparameters (as listed in Table 5 of the paper) in the scripts.sh file.

3.1 Run Main Experiments

Take running Transformer1D on CovidVaccine1D for example:

python run_1d.py --model tsfm --dataset CovidVaccine1D --lr 0.001 --weight_decay 0 --d_model 128 --dim_feedforward 128 --nhead 8 --num_encoder_layers 8
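
When launching many runs, the same command can be assembled programmatically. The sketch below is an assumption about how one might wrap the script and is not code from the repository: `build_command` is a hypothetical helper, while the flags and values are the ones from the command above.

```python
import shlex
import sys


def build_command(script, **options):
    """Turn keyword options into `python script --key value ...` arguments."""
    args = [sys.executable, script]
    for key, value in options.items():
        args += [f"--{key}", str(value)]
    return args


cmd = build_command(
    "run_1d.py",
    model="tsfm", dataset="CovidVaccine1D", lr=0.001, weight_decay=0,
    d_model=128, dim_feedforward=128, nhead=8, num_encoder_layers=8,
)
print(shlex.join(cmd))
# To launch the run from Python: subprocess.run(cmd, check=True)
```

Using `sys.executable` keeps the run inside the currently active environment.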

3.2 Run Robustness Experiments

Replace the dataset in 3.1 with the corresponding noisy dataset and set the noise level with --noise_level. For example, to run Transformer1D on CovidVaccine1DNoisy with noise level 0.1:

python run_1d.py --model tsfm --dataset CovidVaccine1DNoisy --noise_level 0.1 --lr 0.001 --weight_decay 0 --d_model 128 --dim_feedforward 128 --nhead 8 --num_encoder_layers 8
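
To build intuition for what a noise level of 0.1 could mean, the sketch below corrupts a sequence by substituting each nucleotide independently with probability `noise_level`. This is an assumed, simplified noise model for illustration only: the repository's noisy datasets may be generated differently (for example, with insertions and deletions), and `add_substitution_noise` is a hypothetical function.

```python
import random

NUCLEOTIDES = "ACGU"


def add_substitution_noise(sequence, noise_level, seed=0):
    """Replace each nucleotide with a different one with probability `noise_level`."""
    rng = random.Random(seed)
    noisy = []
    for base in sequence:
        if rng.random() < noise_level:
            # Choose uniformly among the nucleotides other than the current one.
            noisy.append(rng.choice([b for b in NUCLEOTIDES if b != base]))
        else:
            noisy.append(base)
    return "".join(noisy)


print(add_substitution_noise("GGAAACUGCCUGAUGGAGGG", noise_level=0.1))
```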

3.3 Run Generalization Experiments

Replace the run prefix of the script name in 3.1 with clean_noisy (for example, run_1d.py becomes clean_noisy_1d.py). The script reports results on each of 6 different noisy datasets. For example, to run Transformer1D on CovidVaccine1D:

python clean_noisy_1d.py --model tsfm --dataset CovidVaccine1D --lr 0.001 --weight_decay 0 --d_model 128 --dim_feedforward 128 --nhead 8 --num_encoder_layers 8

3.4 Fraction of Training Data

Add the --train_ratio argument to control the fraction of data used for training. For example, to run Transformer1D on CovidVaccine1D with 20% of the data used for training:

python run_1d.py --model tsfm --dataset CovidVaccine1D --lr 0.001 --weight_decay 0 --d_model 128 --dim_feedforward 128 --nhead 8 --num_encoder_layers 8 --train_ratio 0.2 --val_ratio 0.1 --test_ratio 0.1

Note:

For the Tc-riboswitches and Fungal datasets, we follow the original paper and use fixed splits. Therefore, --train_ratio only varies the amount of training data, while the validation and test data remain the same.
The train_ratio is expressed as a fraction of all data, that is, training, validation, and test data combined.
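
As a worked example of the ratio arithmetic, the sketch below cuts a dataset of 1000 samples with the ratios from the command above. It is an assumption about how such a split could be implemented, not the repository's code: `split_indices` is a hypothetical helper, and leaving the remaining samples unused is a choice made here for illustration.

```python
import random


def split_indices(n_samples, train_ratio, val_ratio, test_ratio, seed=0):
    """Shuffle indices and cut disjoint train/val/test subsets.

    Each ratio is a fraction of the whole dataset, so the three may sum to
    less than 1; in this sketch the remaining samples are left unused.
    """
    if train_ratio + val_ratio + test_ratio > 1 + 1e-9:
        raise ValueError("ratios must sum to at most 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_ratio)
    n_val = round(n_samples * val_ratio)
    n_test = round(n_samples * test_ratio)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


train, val, test = split_indices(1000, 0.2, 0.1, 0.1)
print(len(train), len(val), len(test))  # 200 100 100
```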

3.5 Fraction of Sequence labelling

Run the label_ratio scripts with the --seq_label_ratio argument to control the fraction of sequence labelling. For example, to run Transformer1D on CovidVaccine1D with 20% sequence labelling:

python label_ratio_1d.py --model tsfm --dataset CovidVaccine1D --lr 0.001 --weight_decay 0 --d_model 128 --dim_feedforward 128 --nhead 8 --num_encoder_layers 8 --seq_label_ratio 0.2
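
One way to read partial labelling is that only a fraction of the positions in each sequence carries a label, and the loss is computed on those positions alone. The sketch below illustrates that reading in plain Python; it is an assumption, not the repository's implementation, and `label_mask` and `masked_mse` are hypothetical helpers.

```python
import random


def label_mask(seq_len, seq_label_ratio, seed=0):
    """Mark round(seq_len * seq_label_ratio) randomly chosen positions as labelled."""
    n_labelled = round(seq_len * seq_label_ratio)
    labelled = set(random.Random(seed).sample(range(seq_len), n_labelled))
    return [i in labelled for i in range(seq_len)]


def masked_mse(predictions, targets, mask):
    """Mean squared error over labelled positions only."""
    errors = [(p - t) ** 2 for p, t, m in zip(predictions, targets, mask) if m]
    return sum(errors) / len(errors) if errors else 0.0


mask = label_mask(seq_len=10, seq_label_ratio=0.2)
print(sum(mask), "of", len(mask), "positions labelled")
```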

Issues

If you encounter any problems, please file an issue along with a detailed description.

Citation

If you find this work useful, please consider citing:

@inproceedings{xu2024beyond,
    title={Beyond Sequence: Impact of Geometric Context for RNA Property Prediction},
    author={Xu, Junjie and Moskalev, Artem and Mansi, Tommaso and Prakash, Mangal and Liao, Rui},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=9htTvHkUhh}
}

@inproceedings{xu2025harmony,
    title={{HARMONY}: A Multi-Representation Framework for {RNA} Property Prediction},
    author={Junjie Xu and Artem Moskalev and Tommaso Mansi and Mangal Prakash and Rui Liao},
    booktitle={ICLR 2025 Workshop on AI for Nucleic Acids},
    year={2025},
    url={https://openreview.net/forum?id=nzUsRhtnBa}
}
