Predicting the aqueous solubility of small organic molecules is a critical step in early-stage drug discovery, influencing a compound's absorption, distribution, metabolism, and excretion (ADME) properties. This project explores and compares two primary computational approaches for classifying molecular solubility:
- Descriptor-based Quantitative Structure-Property Relationship (QSPR) Modeling: Utilizes molecular descriptors generated by RDKit with a LightGBM classifier.
- Graph Convolutional Neural Networks (GCNs): Employs graph representations of molecules with PyTorch Geometric to learn predictive features.
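The graph-convolution idea behind the second approach can be illustrated with a small, dependency-free sketch. It is not the project's PyTorch Geometric model. It assumes the commonly used symmetric normalization with self-loops, and the helper names (`matmul`, `gcn_layer`, `mean_pool`), the two-feature atom encoding, and the weight values are all hypothetical choices made for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adjacency, features, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adjacency)
    # Add self-loops so every atom keeps part of its own features.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # Node degrees of the self-looped graph.
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization keeps feature scales comparable across atoms.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    # Aggregate neighbor features, apply the linear map, then ReLU.
    out = matmul(matmul(norm, features), weights)
    return [[max(0.0, v) for v in row] for row in out]


def mean_pool(node_features):
    """Average node vectors into a single graph-level vector."""
    n = len(node_features)
    return [sum(col) / n for col in zip(*node_features)]


# Heavy-atom graph of ethanol (C-C-O); hydrogens are left implicit.
adjacency = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
# Hypothetical atom features: [is_carbon, is_oxygen].
atom_features = [
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
]
# Arbitrary illustrative weights; a trained model would learn these.
weights = [
    [1.0, -1.0],
    [0.5, 2.0],
]

node_embeddings = gcn_layer(adjacency, atom_features, weights)
graph_vector = mean_pool(node_embeddings)
print(graph_vector)
```

In a full model, several such layers would typically be stacked, and the pooled graph vector would be passed to a classification head that outputs the solubility class.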
- Aqueous solubility dataset covering approximately 9,000 unique molecules, published in Nature Scientific Data.
Example molecules from the dataset.
Visualization from GCN hyperparameter tuning.
- PyTorch
- PyTorch-Geometric
- RDKit
- DeepChem
- Scikit-Learn
- LightGBM
- Pandas
- NumPy
- Seaborn
- Matplotlib
- tqdm
- Wandb (for hyperparameters search.ipynb)
- analytics.ipynb: Data exploration, feature analysis, and visualization.
- descriptor-based-model.ipynb: Implementation and evaluation of the QSPR model using molecular descriptors and LightGBM.
- hyperparameters search.ipynb: Hyperparameter optimization for the GCN model using Weights & Biases.
- model training and evaluation.ipynb: Training, evaluation, and analysis of the GCN model.
- custom_dataset.py: Defines the custom PyTorch Geometric dataset for loading molecular graphs.
- model.py: Contains the GCN model architecture definition.
- trainer.py: Implements the training and validation loop for the GCN model.
- utils.py: Utility functions used across notebooks (e.g., for evaluation metrics, plotting).
- data/: Directory containing the dataset files.
A more readable write-up of this project is available here: Prediction of Aqueous Solubility