Main Repository for Data Science Master Thesis at University of Vienna 2024-25:
"Benchmarking and Optimizing Deep Learning Architectures for Protein-to-mRNA Ratio Prediction"
- Reference paper: Hernandez-Alias et al. (2023), "Using protein-per-mRNA differences among human tissues in codon optimization"
- GENCODE data format glossary
- PTR ratios data source
Create virtual environment with mamba/conda
mamba env create -f environment_files/environment_linux_py3.10.yml
mamba activate master-env
Create a project folder containing one folder for training data (data) and one for model data (runs), then specify the project path in the function set_project_path() in src/utils/utils.py.
root
├── project folder
│ ├── data
│ ├── runs
│ │ ├── dev
│ │ │ ├── logs
│ │ │ ├── weights
│ │ │ ├── weights_best
│ │ ├── lstm
│ │ │ ├── ...
│ │ ├── xlstm
│ │ │ ├── ...
│ │ ├── gru
│ │ │ ├── ...
│ │ ├── transformer
│ │ │ ├── ...
│ │ ├── best_model (PTRNet)
│ │ │ ├── ...
└── master-thesis (this repo)
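The layout above can also be created programmatically. The following is a minimal sketch, not part of the repository: the helper name create_project_layout is hypothetical, and it assumes every model folder under runs uses the same logs, weights, and weights_best subfolders that the tree shows for dev.

```python
from pathlib import Path

# Model folders taken from the tree above.
MODEL_DIRS = ["dev", "lstm", "xlstm", "gru", "transformer", "best_model"]
# Subfolders shown under runs/dev; assumed here to be the same for every model.
RUN_SUBDIRS = ["logs", "weights", "weights_best"]


def create_project_layout(project_path):
    """Create data/ and runs/<model>/<subdir> below project_path (hypothetical helper)."""
    project = Path(project_path)
    (project / "data").mkdir(parents=True, exist_ok=True)
    created = []
    for model in MODEL_DIRS:
        for sub in RUN_SUBDIRS:
            target = project / "runs" / model / sub
            # exist_ok=True makes the call safe to repeat on an existing layout.
            target.mkdir(parents=True, exist_ok=True)
            created.append(target)
    return created
```

Calling create_project_layout("/path/to/project_folder") leaves existing folders untouched, so it can be rerun after adding a new model name to MODEL_DIRS.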
This repository supports a deep learning benchmark study for predicting protein-to-mRNA (PTR) ratios using mRNA sequence and structure features.
- src/config/: YAML configs for model architecture, training, and hyperparameter tuning (e.g., for Mamba, LSTM, Transformer).
- src/data_handling/: Scripts for preprocessing, structure prediction, codon/nucleotide dataset creation, and stratified splitting.
- src/models/: Implementations of the deep learning models (MLP, CNN, RNNs, Transformer, xLSTM, Mamba, LegNet, PTRNet), modularized by model type with shared predictor logic.
- src/pretraining/: Tools for masked language model (MLM) pretraining and motif discovery.
- src/training/: Training logic, early stopping, learning rate scheduling, and Optuna-based tuning.
- src/evaluation/: Model evaluation, metrics, predictions, and plotting utilities.
- src/utils/, src/log/: Helper functions, logging setup, and device management.
- src/main.py: Main script for running training or tuning, configurable via CLI flags.
- src/multi_run*.sh: Example scripts to train multiple models sequentially.
Install the folding algorithms used for secondary structure prediction by following the arnie tutorial.
The bpRNA code for loop type predictions is already in the repo.
Export environment dependencies
mamba env export -n master-env > environment_files/environment_linux_py3.10.yml
Update environment dependencies
mamba env update -n master-env -f environment_files/environment_linux_py3.10.yml
Start the Aim UI to browse tracked runs
aim up
For searching and filtering runs in Aim, see https://aimstack.readthedocs.io/en/latest/using/search.html
From within the Optuna data folder, start the dashboard:
optuna-dashboard sqlite:////path/to/optuna/model_name.db
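The four slashes in the storage URL come from the sqlite:/// scheme prefix followed by an absolute path that itself starts with a slash. A minimal sketch of building such a URL from a file path, assuming a Linux or other POSIX system; the helper name sqlite_storage_url is hypothetical and not part of the repository or of Optuna:

```python
from pathlib import Path


def sqlite_storage_url(db_path):
    """Build an SQLAlchemy-style SQLite URL for a database file (hypothetical helper)."""
    # resolve() turns relative paths into absolute ones, even if the file does not exist yet.
    absolute = Path(db_path).expanduser().resolve()
    # "sqlite:///" plus an absolute POSIX path (leading "/") yields four slashes in total.
    return "sqlite:///" + absolute.as_posix()
```

The resulting string can be passed to optuna-dashboard on the command line, or used as the storage argument when loading a study in Python.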
Start a Jupyter notebook server without opening a browser
jupyter notebook --no-browser --port=8888
Count files in a directory
ls -1 | wc -l
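The same count can be obtained from Python, for example inside a notebook. This is a small sketch with a hypothetical helper name, count_entries; like ls -1, it skips hidden entries (names starting with a dot) and counts subdirectories as well as files.

```python
from pathlib import Path


def count_entries(directory="."):
    """Count non-hidden entries in a directory, mirroring `ls -1 | wc -l`."""
    return sum(
        1
        for entry in Path(directory).iterdir()
        if not entry.name.startswith(".")
    )
```

For example, count_entries("runs/dev/weights") would report how many checkpoints a run has written, assuming that folder exists relative to the current working directory.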