This repository contains code for predicting the aqueous solubility of organic molecules using machine learning models. The models and dataset are based on the research paper: Predicting Aqueous Solubility of Organic Molecules Using Deep Learning Models with Varied Molecular Representations.
- Pull Original Code
- Pull the pnnlsolpaper folder from the original repository:
# pull the original PNNL codebase
git submodule init
git submodule update
- Then apply the patch set:
bash apply_patches.bash
- Download Data: Download the dataset file named dataset.csv from this link and save it as data.csv in the ./data folder.
- Generate Features:
- Generate Pybel coordinates and molecular descriptor (MDM) features by running create_data.py in the ./data folder:
cd ./pnnlsolpaper/data
python create_data.py
- Then return to the root folder:
cd ../..
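Before running create_data.py, it can help to confirm that the dataset was saved where the download step expects it. The following is a minimal sketch, not part of the repository: `summarize_csv` is a hypothetical helper, and it makes no assumption about the column names in data.csv; it only reports the header and the number of data rows.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, row_count) for a CSV file.

    Hypothetical helper for a quick sanity check; it does not validate
    the columns that create_data.py may require.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"dataset not found: {path}")
    with path.open(newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)  # first line is treated as the header
        if header is None:
            raise ValueError(f"dataset is empty: {path}")
        row_count = sum(1 for _ in reader)  # remaining lines are data rows
    return header, row_count
```

Calling `summarize_csv("data/data.csv")` from the root folder would be one way to use it; the path is an assumption and should be adjusted to wherever data.csv was actually saved.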
- Train Models:
- To train the MDM model, run pnnlsolpaper/mdm/train.py as a module from the root directory:
python -m pnnlsolpaper.mdm.train
- To train the GNN model, run pnnlsolpaper/gnn/train.py:
python -m pnnlsolpaper.gnn.train
- To train the SMI model, run pnnlsolpaper/smi/train.py:
python -m pnnlsolpaper.smi.train
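The three training commands can also be run back to back from a small driver script. This is only a sketch under the assumption that each module runs without extra arguments, as in the commands above; `build_train_command` and `train_all` are hypothetical names, not part of the repository.

```python
import subprocess
import sys

# Module names taken from the training commands in this README.
TRAIN_MODULES = (
    "pnnlsolpaper.mdm.train",
    "pnnlsolpaper.gnn.train",
    "pnnlsolpaper.smi.train",
)


def build_train_command(module, python=sys.executable):
    """Build the argument list equivalent to `python -m <module>`."""
    return [python, "-m", module]


def train_all(modules=TRAIN_MODULES, runner=subprocess.run):
    """Run each training module in order, stopping at the first failure."""
    for module in modules:
        command = build_train_command(module)
        print("running:", " ".join(command))
        # check=True raises CalledProcessError if the script exits non-zero.
        runner(command, check=True)
```

Calling `train_all()` from the root directory would train MDM, GNN, and SMI in that order. The `runner` parameter exists so the loop can be exercised without launching real training jobs.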
- Make Predictions (optional):
- Use the predict.ipynb files in each model's folder to make predictions:
cd pnnlsolpaper/mdm/
jupyter notebook predict.ipynb
Repeat the above steps for the gnn and smi folders.
- Afterwards, return to the root directory:
cd ../..
- Ensemble Models:
- To ensemble the models, run the following scripts from the ensemble folder:
cd ensemble/
python CV.py
python Optuna.py
python KNN.py
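For readers unfamiliar with ensembling, the general idea can be illustrated with a weighted average of the per-model predictions. This sketch is an assumption for illustration only: it does not reproduce what CV.py, Optuna.py, or KNN.py do, and `weighted_ensemble` is a hypothetical function.

```python
def weighted_ensemble(predictions, weights=None):
    """Combine per-model predictions into one value per molecule.

    predictions: dict mapping a model name to a list of predicted values,
                 with every list in the same molecule order.
    weights:     dict mapping a model name to a non-negative weight;
                 defaults to equal weights for all models.
    """
    names = list(predictions)
    if not names:
        raise ValueError("no predictions given")
    length = len(predictions[names[0]])
    if any(len(predictions[name]) != length for name in names):
        raise ValueError("all models must predict the same number of molecules")
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted mean across models, computed separately for each molecule.
    return [
        sum(weights[name] * predictions[name][i] for name in names) / total
        for i in range(length)
    ]
```

With equal weights this reduces to a plain mean of the MDM, GNN, and SMI outputs. Tuning the weights on held-out data is one common refinement, but whether the repository's scripts do so is not stated here.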
- Compare Predictions:
- To compare predictions from individual models with ensemble methods, use the ensemble_prediction.ipynb notebook:
jupyter notebook ensemble_prediction.ipynb
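A comparison of this kind usually comes down to scoring each set of predictions against the measured values. The sketch below assumes root-mean-square error as the metric, which is an assumption and may differ from what ensemble_prediction.ipynb reports; `rmse` and `rank_models` are hypothetical helpers.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def rank_models(y_true, predictions):
    """Return (name, rmse) pairs sorted from lowest to highest error.

    predictions: dict mapping a model or ensemble name to its predicted values.
    """
    scores = [(name, rmse(y_true, y_pred)) for name, y_pred in predictions.items()]
    return sorted(scores, key=lambda item: item[1])
```

Passing the individual model outputs together with an ensemble output to `rank_models` gives a quick ordering by error on the same set of molecules.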
For detailed instructions on how to run the models, featurize the data, and other specifics, please refer to the original research paper linked above. The methods and techniques described in the paper are critical for understanding and effectively using this repository.