johnsonandjohnson/Beyond-Sequence

Beyond Sequence: Impact of Geometric Context for RNA Property Prediction

An official PyTorch implementation of the paper Beyond Sequence: Impact of Geometric Context for RNA Property Prediction.

Contribution

  • We introduce a diverse collection of RNA datasets, including newly annotated 2D and 3D structures, covering various prediction tasks at nucleotide and sequence levels across multiple species.
  • We provide a unified testing environment to evaluate different types of machine learning models for RNA property prediction, including sequence models for 1D, graph neural networks for 2D, and equivariant geometric networks for 3D RNA representations.
  • We conduct a comprehensive analysis of how different models perform under various conditions, such as limited data and labels, different types of sequencing errors, and out-of-distribution scenarios. We highlight the trade-offs and contexts in which each modeling approach is most effective, guiding researchers in selecting suitable models for specific RNA analysis challenges.
  • We also introduce novel modifications to existing 3D geometric models based on biological priors, optimizing them for large-scale point cloud RNA data and significantly improving the efficiency and performance of 3D models.

1. Installation

Option 1: Install dependencies manually in a new conda environment.

conda create --name rna python=3.11
conda activate rna
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.3.0+cu121.html
pip install biopython rdkit optuna gdown pandas

Option 2: Create the conda environment from the provided environment.yml file.

conda env create -f environment.yml
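
After either option, a quick check can confirm that the packages listed above are visible to the interpreter. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and not part of this repository, and the module list is simply the import names of the packages from Option 1 (biopython is imported as `Bio`). It checks that each module can be located, not that it loads correctly against your CUDA setup.

```python
from importlib.util import find_spec

# Import names for the packages installed in Option 1.
# Note: the biopython distribution is imported as "Bio".
REQUIRED_MODULES = [
    "torch", "torchvision", "torchaudio",
    "torch_geometric", "pyg_lib", "torch_scatter", "torch_sparse",
    "torch_cluster", "torch_spline_conv",
    "Bio", "rdkit", "optuna", "gdown", "pandas",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the names that cannot be located on the import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages were found.")
```

Run it inside the activated environment; an empty result means every listed module was found.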

2. Data

Data can be downloaded from the original sources and preprocessed as explained in rnadatasets.


3. Experiments

Scripts ending in _1d.py, _1d2d.py, _2d.py, and _3d.py operate on 1D, 1D2D, 2D, and 3D data, respectively. The following table maps each script to the methods it runs.

| Script | Dimension | Methods |
| --- | --- | --- |
| run_1d.py | 1D | Transformer1D |
| run_1d2d.py | 1D2D | Transformer1D2D |
| run_2d.py | 2D | GCN, GAT, ChebNet, GraphGPS, GraphTransformer |
| run_3d.py | 3D | EGNN, SchNet |
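
When scripting several runs, the same mapping can be kept as a small lookup. This is only an illustrative sketch: the dictionary is transcribed from the table above, and `script_for` is a hypothetical helper that does not exist in the repository. The method names are the display names used in this README, not necessarily the values accepted by `--model`.

```python
# Script-to-method mapping transcribed from this README.
SCRIPT_METHODS = {
    "run_1d.py": ["Transformer1D"],
    "run_1d2d.py": ["Transformer1D2D"],
    "run_2d.py": ["GCN", "GAT", "ChebNet", "GraphGPS", "GraphTransformer"],
    "run_3d.py": ["EGNN", "SchNet"],
}


def script_for(method):
    """Return the script that runs `method` (hypothetical helper)."""
    for script, methods in SCRIPT_METHODS.items():
        if method in methods:
            return script
    raise KeyError(f"unknown method: {method}")


print(script_for("EGNN"))  # run_3d.py
```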

We provide running scripts and all optimized hyperparameters (as listed in Table 5 of the paper) in the scripts.sh file.

3.1 Run Main Experiments

For example, to run Transformer1D on CovidVaccine1D:

python run_1d.py --model tsfm --dataset CovidVaccine1D --lr 0.001 --weight_decay 0 --d_model 128 --dim_feedforward 128 --nhead 8 --num_encoder_layers 8
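
If you prefer to launch runs from Python, for instance inside a larger sweep, the command above can be assembled from keyword arguments. This is a sketch under stated assumptions: `build_command` is a hypothetical helper, not part of the repository, and the flags and values are exactly those shown in the command above.

```python
import shlex
import sys


def build_command(script, **options):
    """Assemble an argv list: <python> <script> --key value ... (hypothetical helper)."""
    argv = [sys.executable, script]
    for key, value in options.items():
        argv += [f"--{key}", str(value)]
    return argv


cmd = build_command(
    "run_1d.py",
    model="tsfm", dataset="CovidVaccine1D", lr=0.001, weight_decay=0,
    d_model=128, dim_feedforward=128, nhead=8, num_encoder_layers=8,
)

# Print the command without the interpreter path, for comparison with the README.
print(shlex.join(cmd[1:]))

# To actually launch it from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting issues when values contain spaces.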

3.2 Run Robustness Experiments

Replace the dataset in 3.1 with the corresponding noisy dataset and set the noise level with --noise_level. For example, run Transformer1D on CovidVaccine1DNoisy with noise level 0.1:

python run_1d.py --model tsfm --dataset CovidVaccine1DNoisy --noise_level 0.1 --lr 0.001 --weight_decay 0 --d_model 128 --dim_feedforward 128 --nhead 8 --num_encoder_layers 8
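
To compare several noise levels, the command can be generated in a loop. The sketch below only prints the command lines; the hyperparameters are copied from the command above, while the list of noise levels is an assumption made for illustration (only 0.1 appears in this README), and `noisy_commands` is a hypothetical helper.

```python
import shlex

# Hyperparameters copied from the robustness command in this README.
BASE_ARGS = [
    "--model", "tsfm", "--dataset", "CovidVaccine1DNoisy",
    "--lr", "0.001", "--weight_decay", "0",
    "--d_model", "128", "--dim_feedforward", "128",
    "--nhead", "8", "--num_encoder_layers", "8",
]

# Assumed sweep values; only 0.1 is shown in the README.
NOISE_LEVELS = [0.05, 0.1, 0.2]


def noisy_commands(levels=NOISE_LEVELS):
    """Return one command line per noise level (hypothetical helper)."""
    return [
        shlex.join(["python", "run_1d.py", *BASE_ARGS, "--noise_level", str(level)])
        for level in levels
    ]


for line in noisy_commands():
    print(line)
```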

3.3 Run Generalization Experiments

Replace the run prefix of the script name in 3.1 with clean_noisy (for example, run_1d.py becomes clean_noisy_1d.py). The script reports results on each of 6 different noisy datasets. For example, run Transformer1D on CovidVaccine1D:

python clean_noisy_1d.py --model tsfm --dataset CovidVaccine1D --lr 0.001 --weight_decay 0 --d_model 128 --dim_feedforward 128 --nhead 8 --num_encoder_layers 8

3.4 Fraction of Training Data

Add the --train_ratio argument to control the fraction of data used for training. For example, run Transformer1D on CovidVaccine1D with 20% of the data used for training and 10% each for validation and test:

python run_1d.py --model tsfm --dataset CovidVaccine1D --lr 0.001 --weight_decay 0 --d_model 128 --dim_feedforward 128 --nhead 8 --num_encoder_layers 8 --train_ratio 0.2 --val_ratio 0.1 --test_ratio 0.1

Note:

  1. For the Tc-riboswitches and Fungal datasets, we follow the original paper and use fixed splits. For these datasets, --train_ratio changes only the amount of training data; the validation and test sets stay the same.
  2. --train_ratio is a fraction of the entire dataset (training, validation, and test combined), not of the training split alone.
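
To make the second note concrete, the arithmetic can be sketched as follows. This is an illustration only: `split_sizes` is a hypothetical helper, the dataset size of 1000 is made up, and the rounding used by the actual scripts may differ.

```python
def split_sizes(n_samples, train_ratio, val_ratio, test_ratio):
    """Split sizes when every ratio is a fraction of the full dataset."""
    if train_ratio + val_ratio + test_ratio > 1 + 1e-9:
        raise ValueError("ratios must sum to at most 1")
    n_train = round(n_samples * train_ratio)
    n_val = round(n_samples * val_ratio)
    n_test = round(n_samples * test_ratio)
    # Samples not assigned to any split when the ratios sum to less than 1.
    unused = n_samples - n_train - n_val - n_test
    return {"train": n_train, "val": n_val, "test": n_test, "unused": unused}


# Ratios from the command above, applied to a hypothetical 1000-sample dataset.
print(split_sizes(1000, 0.2, 0.1, 0.1))
# {'train': 200, 'val': 100, 'test': 100, 'unused': 600}
```

With these ratios, 20% of all samples go to training, so the training set is twice the size of the validation or test set rather than 20% of some pre-existing training split.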

3.5 Fraction of Sequence Labelling

Run the label_ratio script (for example, label_ratio_1d.py) with the --seq_label_ratio argument to control the fraction of sequence labelling. For example, run Transformer1D on CovidVaccine1D with 20% sequence labelling:

python label_ratio_1d.py --model tsfm --dataset CovidVaccine1D --lr 0.001 --weight_decay 0 --d_model 128 --dim_feedforward 128 --nhead 8 --num_encoder_layers 8 --seq_label_ratio 0.2

Issues

If you encounter any problems, please file an issue along with a detailed description.

Citation

If you find this work useful, please consider citing:

@inproceedings{xu2024beyond,
    title={Beyond Sequence: Impact of Geometric Context for RNA Property Prediction},
    author={Xu, Junjie and Moskalev, Artem and Mansi, Tommaso and Prakash, Mangal and Liao, Rui},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=9htTvHkUhh}
}
@inproceedings{xu2025harmony,
    title={{HARMONY}: A Multi-Representation Framework for {RNA} Property Prediction},
    author={Junjie Xu and Artem Moskalev and Tommaso Mansi and Mangal Prakash and Rui Liao},
    booktitle={ICLR 2025 Workshop on AI for Nucleic Acids},
    year={2025},
    url={https://openreview.net/forum?id=nzUsRhtnBa}
}
