GitHub - RekerLab/active-subsampling: Using active learning for data curation

Improving Molecular Machine Learning Through Adaptive Subsampling with Active Learning

We use active machine learning as an autonomous and adaptive data subsampling strategy and show that active learning-based subsampling can lead to better molecular machine learning performance when compared to both training models on the complete training data and 19 state-of-the-art subsampling strategies. We find that active learning is robust to errors in the data, highlighting the utility of this approach for low-quality datasets. Taken together, we here describe a new, adaptive machine learning pre-processing approach and provide novel insights into the behavior and robustness of active machine learning for molecular sciences.

For more information, please refer to: Improving Molecular Machine Learning Through Adaptive Subsampling with Active Learning

If you use this data or code, please kindly cite: Wen, Y., Li, Z., Xiang, Y., & Reker, D. (2023). Improving molecular machine learning through adaptive subsampling with active learning. Digital Discovery, 2(4), 1134-1142.

Files

Example_workflow_for_AL_Subsampling.ipynb is an example notebook that runs the workflow on the BBBP dataset. It runs out of the box on a local machine or on Google Colab and can be adapted to apply this technique to new datasets.

Installation

pip install git+https://github.com/RekerLab/active-subsampling.git

Quickstart

Datasets can be loaded from DeepChem

# load data
import deepchem as dc
# splitter=None returns the complete, unsplit dataset as the only element of `data`
tasks, data, transformers = dc.molnet.load_bbbp(splitter=None)
bbbp = data[0]

A model and a performance metric need to be initialized. We recommend random forest models and the Matthews correlation coefficient (MCC).

# initialize model and performance metric
from sklearn.metrics import matthews_corrcoef as mcc
from sklearn.ensemble import RandomForestClassifier as RF
model = RF()
metric = mcc
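
For readers unfamiliar with MCC, the following minimal sketch shows how the score is derived from the four cells of a binary confusion matrix. It is plain Python for illustration only: `confusion_counts` and `mcc_from_counts` are hypothetical helper names that are not part of this repository, and the quickstart itself uses scikit-learn's `matthews_corrcoef`. The labels are made up.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # MCC is undefined here; returning 0 is a common convention.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Made-up labels: 3 positives, 5 negatives, one error of each kind.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
score = mcc_from_counts(*confusion_counts(y_true, y_pred))
print(round(score, 4))
```

The score ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction correct), and it uses all four confusion-matrix cells rather than only the correctly predicted ones.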

Active learning subsampling can be directly called using the al_subsampling function

# run active learning
from active_subsampling import ALSubsampling
result = ALSubsampling.al_subsampling(model, bbbp, metric, 5)
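
To give a feel for what active learning-based subsampling does, here is a minimal sketch of a pool-based loop on synthetic data. It is not the implementation behind `al_subsampling`: the helper name `active_learning_curve`, the uncertainty criterion (predicted probability closest to 0.5), the two-example starting set, and the synthetic dataset are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split


def active_learning_curve(X_pool, y_pool, X_val, y_val, n_steps, seed=0):
    """Hypothetical helper: greedy uncertainty sampling from a labelled pool."""
    rng = np.random.default_rng(seed)
    # Start from one randomly chosen example of each class.
    selected = [int(rng.choice(np.flatnonzero(y_pool == c))) for c in (0, 1)]
    remaining = [i for i in range(len(y_pool)) if i not in selected]
    scores = []
    for _ in range(n_steps):
        model = RandomForestClassifier(n_estimators=50, random_state=seed)
        model.fit(X_pool[selected], y_pool[selected])
        # Track validation performance of the current training subset.
        scores.append(matthews_corrcoef(y_val, model.predict(X_val)))
        # Add the pool example the model is least certain about.
        proba = model.predict_proba(X_pool[remaining])[:, 1]
        pick = remaining[int(np.argmin(np.abs(proba - 0.5)))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, scores


# Synthetic stand-in for a molecular dataset (assumption, not BBBP).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_pool, X_val, y_pool, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
selected, scores = active_learning_curve(X_pool, y_pool, X_val, y_val, n_steps=20)

# The subsample is the training set that gave the best validation score.
best_step = int(np.argmax(scores))
subsample_idx = selected[: best_step + 2]
print(len(subsample_idx), scores[best_step])
```

In this sketch the learning curve is the list `scores`, and the subsample is whichever prefix of the selection order scored best on the validation set, which may be much smaller than the full pool.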

Results can be visualized by plotting the learning curve

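# visualize learning curve (result[0] is all MCC values on validation set)
import numpy as np
import matplotlib.pyplot as pl
# average the curves along axis 0 before plotting
pl.plot(np.mean(result[0], axis=0))
pl.savefig("learning_curve.pdf")
pl.close()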
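
The averaging step can be illustrated on a toy array. The sketch below assumes that `result[0]` is a two-dimensional array with one row per repeat and one column per active learning step, which is what the `axis=0` mean suggests but is not stated explicitly here. The numbers are made up.

```python
import numpy as np

# Toy stand-in for result[0]: 3 repeats x 5 active learning steps (made-up values).
curves = np.array([
    [0.10, 0.40, 0.60, 0.55, 0.50],
    [0.20, 0.50, 0.70, 0.65, 0.60],
    [0.00, 0.30, 0.50, 0.45, 0.40],
])

mean_curve = curves.mean(axis=0)  # average over repeats, one value per step
std_curve = curves.std(axis=0)    # spread across repeats at each step
peak_step = int(np.argmax(mean_curve))  # step with the best average score

print(mean_curve, peak_step)
```

The standard deviation can be drawn as a shaded band around the mean curve, and `peak_step` marks where adding more data stops improving the averaged validation score in this toy example.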

Delta performance can be directly calculated from the resulting curves

# report deltaPerformance 
print(ALSubsampling.calc_deltaPerformances(result))
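
The exact definition used by `calc_deltaPerformances` is not spelled out here. One plausible reading, assumed for this sketch and not confirmed by this README, is the gap between the best score reached along a learning curve and the score at its end, where the model has seen all training data. The helper name `delta_performance` and the numbers are hypothetical.

```python
import numpy as np


def delta_performance(curve):
    """Hypothetical helper: peak score minus final score of one learning curve."""
    curve = np.asarray(curve, dtype=float)
    return float(curve.max() - curve[-1])


# Made-up learning curves: 3 repeats x 5 active learning steps.
curves = np.array([
    [0.10, 0.40, 0.60, 0.55, 0.50],
    [0.20, 0.50, 0.70, 0.65, 0.60],
    [0.00, 0.30, 0.50, 0.45, 0.40],
])

deltas = [delta_performance(c) for c in curves]
print(deltas)
```

Under this reading, a positive value means a subsample outperformed training on the complete data, and a value of zero means the curve peaked only at the very end.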

Subsampled data can be extracted by calling the subsample_data function

# extract AL subsample data
# pass the same dataset object (bbbp) that was used for al_subsampling
subsample = ALSubsampling.subsample_data(model, bbbp, metric, 5)
