GNOME

GNOME — Graph-based Neural Organometallic Magnetic (Shift) Estimator

Downloading and Processing Data with `dataset.py`

This guide walks you through downloading and processing the NMR dataset with the `dataset.py` script.

Prerequisites

Before you begin, ensure you have the following installed:

Python 3.11
PyTorch (with CUDA if available)
PyTorch Geometric
RDKit (for molecular data processing)
torch-scatter (for data processing)

Either use the provided environment.yml:

conda env create -f environment.yml
conda activate GNOME

Or run the following (adjust pytorch-cuda to match your CUDA version):

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
conda install pyg -c pyg
pip install torch-scatter
conda install conda-forge::rdkit
conda install conda-forge::tensorboard
conda install anaconda::h5py
conda install anaconda::scikit-learn
conda install conda-forge::mendeleev

If torch-scatter fails to install, you can continue without it: a fallback is in place for the MPGNN model.
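For context, torch-scatter supplies grouped reductions, such as summing per-edge messages into per-node totals. The sketch below shows that operation in plain Python purely as an illustration. It is not the repository's fallback, and the function name and argument layout are hypothetical.

```python
def scatter_sum(messages, target_index, num_nodes):
    """Sum each message into the slot of the node it is addressed to.

    messages:     one value per edge (hypothetical flat layout)
    target_index: for each edge, the index of the receiving node
    num_nodes:    total number of nodes in the graph
    """
    totals = [0.0] * num_nodes
    for value, node in zip(messages, target_index):
        totals[node] += value
    return totals


# Three edge messages; the first and third both target node 0.
per_node = scatter_sum([1.0, 2.0, 3.0], [0, 1, 0], num_nodes=2)
```

A loop like this is far slower than a vectorised kernel, which is why a dedicated package is normally preferred when it installs cleanly.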

Step 1: Clone the Repository

First, clone the repository to your local machine:

git clone https://github.com/varghele/GNOME.git
cd GNOME

Step 2: Download the Raw Data

To download the raw NMR dataset, run the following command:

python dataset.py --data_dir data --download

This will download the nmrshiftdb2withsignals.sd file from SourceForge and save it in the data/raw directory.

Step 3: Process the Data

Once the raw data is downloaded, you can process it into PyTorch Geometric format by running:

python dataset.py --data_dir data --process

This step takes roughly 5 minutes and will:

- Load the .sd file using RDKit: The script uses RDKit to parse the .sd file and extract molecular structures.
- Extract atom features, bond features, and NMR shifts: For each molecule, the script extracts:
  - Atom features: Atomic number, degree, formal charge, number of hydrogens, aromaticity, and hybridization.
  - Bond features: Bond type, conjugation, ring membership, and bond length.
  - NMR shifts: 13C NMR shifts for each atom (if available).
- Create ghost bonds: The script adds ghost bonds between atoms that are not connected by real bonds. These bonds are labeled with a bond type of 4 and include the distance between the atoms as a feature.
- Save the processed data: The processed data is saved as processed_data.pt in the data/processed directory.
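The ghost-bond step described above can be sketched in plain Python. This is a minimal illustration, not the code in dataset.py: the function name, the input layout (a list of 3D coordinates plus a dict of real bonds), and the choice to emit one entry per unordered atom pair are all assumptions. Only the bond-type label 4 and the distance feature come from the description above.

```python
from itertools import combinations
from math import dist

GHOST_BOND_TYPE = 4  # label used for ghost bonds, as stated in the text above


def add_ghost_bonds(coords, real_bonds):
    """Return (i, j, bond_type, distance) for every atom pair.

    coords:     list of (x, y, z) atom positions (hypothetical layout)
    real_bonds: dict mapping an atom-index pair (i, j) to its bond type
    """
    # Normalise pair order so (1, 0) and (0, 1) refer to the same bond.
    bonded = {tuple(sorted(pair)): btype for pair, btype in real_bonds.items()}

    edges = []
    for i, j in combinations(range(len(coords)), 2):
        distance = dist(coords[i], coords[j])
        # Pairs without a real bond receive the ghost label.
        btype = bonded.get((i, j), GHOST_BOND_TYPE)
        edges.append((i, j, btype, distance))
    return edges


# Three atoms in a chain 0-1-2; atoms 0 and 2 share no real bond.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 2.0, 0.0)]
edges = add_ghost_bonds(coords, {(0, 1): 1, (2, 1): 1})
```

In this example the pair (0, 2) comes back with bond type 4, so the resulting graph is fully connected and every edge carries an interatomic distance.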

Step 4: Verify the Dataset

To verify that the dataset was processed correctly, you can check the number of molecules and inspect the first molecule:

python dataset.py --data_dir data

This will print:

The number of molecules in the dataset.
The first molecule in the dataset (as a PyTorch Geometric Data object).
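The commands in Steps 2 to 4 differ only in their flags. As a rough sketch of how such a command line could be parsed with argparse, assuming `--download` and `--process` are boolean switches and `--data_dir` defaults to `data` (the actual dataset.py may differ, and `planned_actions` is a hypothetical helper):

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Download and process the NMR dataset")
    parser.add_argument("--data_dir", default="data",
                        help="root folder holding raw/ and processed/")
    parser.add_argument("--download", action="store_true",
                        help="fetch the raw .sd file")
    parser.add_argument("--process", action="store_true",
                        help="convert the raw file to PyTorch Geometric format")
    return parser


def planned_actions(argv):
    """Translate command-line flags into an ordered list of actions."""
    args = build_parser().parse_args(argv)
    actions = []
    if args.download:
        actions.append(f"download into {args.data_dir}/raw")
    if args.process:
        actions.append(f"process into {args.data_dir}/processed")
    # With or without flags, the dataset is loaded and summarised last.
    actions.append(f"load dataset from {args.data_dir}")
    return actions


steps = planned_actions(["--data_dir", "data", "--process"])
```

Here `steps` lists the processing action followed by the load action, mirroring the Step 3 command.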

Step 5: Run Training

See args.py for the arguments you can pass to the pipeline. Note that not all models share all arguments; some are model-specific.

python main.py [**kwargs]

Step 6: Monitor with TensorBoard

During training, model performance is logged to TensorBoard for tracking and hyperparameter tuning. From the main directory, run:

tensorboard --logdir checkpoints
