Finding Drug Candidate Hits With a Hundred Samples: Ultra-low Data Screening With Active Learning

This repository contains data and analysis scripts for the (preprint) paper Finding Drug Candidate Hits With a Hundred Samples: Ultra-low Data Screening With Active Learning.

Grapher.ipynb – Generates figures and performs analysis for the main paper.
GrapherSIDtp.ipynb – Analyzes and visualizes data for the DTP section of the Supporting Information.
GrapherSIDds10.ipynb – Analyzes and visualizes data for the DDS10 section of the Supporting Information.
GrapherCode.py – Utility functions for generating the graphs.

CSV Result Files

All results are stored in .csv files. Each file contains the following columns:

replicate: Current replicate number (1–30).
rank: Active learning iteration (1–5).
top-100/1320 model: Number of true top-100/1320 molecules predicted by the model (not used in current figures).
top-100/1320 acquired: Fraction of top-100/1320 molecules found in the acquired set (e.g., 0.003030 * 1320 ≈ 4 molecules for the DTP dataset).
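As a rough illustration of how the acquired fraction translates into molecule counts, the sketch below averages the number of acquired top-N molecules per active learning iteration across replicates. The exact header string (`top-1320 acquired` here) and the sample rows are assumptions made for the example; check the header of the actual .csv file before using it.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_hits_per_iteration(csv_text, column, top_n):
    """Average number of top-N molecules acquired at each iteration (rank)."""
    by_rank = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The column stores a fraction of the top-N set; scale it to a count.
        by_rank[int(row["rank"])].append(float(row[column]) * top_n)
    return {rank: mean(values) for rank, values in sorted(by_rank.items())}


# Made-up rows for illustration only; the column name is an assumption.
sample = """replicate,rank,top-1320 acquired
1,1,0.003030
2,1,0.001515
1,2,0.006061
2,2,0.004545
"""

hits = mean_hits_per_iteration(sample, "top-1320 acquired", 1320)
for rank, count in hits.items():
    print(f"iteration {rank}: {count:.2f} molecules on average")
```

For the DDS10 results the same function would be called with `top_n=100` and the corresponding column name.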

Directory Overview

all/ – Results without PADRE data augmentation. 10k/ corresponds to DDS10; 130k/ to DTP.
all_PADRE/ – Results with PADRE data augmentation. 10k/ is DDS10; 130k/ is DTP.
01_BestCombosDTP/ – A curated collection of the best descriptor/model combinations for the DTP dataset, as shown in the main paper.
02_BestCombosDDS10/ – Same as above, but for the DDS10 dataset.
starting_SMILE_sets/ – The starting sets for the active learning experiments.
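The naming convention for all/ and all_PADRE/ can be captured in a small lookup, which may help when labelling results programmatically. This is only a sketch: the helper name `describe_run` is hypothetical, and it covers just the two top-level directories and the 10k/ and 130k/ subdirectories described above.

```python
from pathlib import PurePosixPath

# Mappings taken from the directory overview above.
DATASETS = {"10k": "DDS10", "130k": "DTP"}
AUGMENTATION = {"all": False, "all_PADRE": True}


def describe_run(path):
    """Return the dataset name and PADRE flag implied by a result path."""
    parts = PurePosixPath(path).parts
    if len(parts) < 2 or parts[0] not in AUGMENTATION or parts[1] not in DATASETS:
        raise ValueError(f"not a path under all/ or all_PADRE/: {path}")
    return {"dataset": DATASETS[parts[1]], "padre": AUGMENTATION[parts[0]]}


info = describe_run(
    "all_PADRE/130k/acquisition130k_CDDD_pair-240920-162255-5d/settings.yaml"
)
print(info)
```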

Reproducing Experiments

This project uses the MDRMF package to run active learning experiments from YAML configuration files.

To reproduce an experiment (e.g., acquisition function tests on DTP):

Locate the relevant settings file, for example:

all_PADRE/130k/acquisition130k_CDDD_pair-240920-162255-5d/settings.yaml

Install the MDRMF package. See MDRMF for details.
Download the dataset(s) (see below).

Run the experiment with:

python -m MDRMF.experimenter path/to/settings.yaml
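To rerun several experiments in one go, the settings files can be collected first. The sketch below only builds the command shown above for every settings.yaml found under a directory and does not launch anything itself. The helper name `build_commands` is hypothetical, and it assumes MDRMF is installed in the interpreter that runs the script.

```python
import sys
from pathlib import Path


def build_commands(root):
    """Return one command (as an argument list) per settings.yaml under root."""
    return [
        [sys.executable, "-m", "MDRMF.experimenter", str(path)]
        for path in sorted(Path(root).rglob("settings.yaml"))
    ]


if __name__ == "__main__":
    # Print the commands for the DTP runs with PADRE augmentation.
    for cmd in build_commands("all_PADRE/130k"):
        print(" ".join(cmd))
        # To actually start a run: subprocess.run(cmd, check=True)
```

Printing the commands first makes it easy to review which experiments would be started before committing compute time to them.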

💿 Datasets can be downloaded from here.

Starting Molecule Sets

The 30-molecule starting sets are available in the settings.yaml files, at the very top under unique_initial_sample. For convenience, the SMILES are also provided as lists in starting_SMILE_sets/, along with the starting sets for the data enrichment tests.

These directories contain lists of SMILES including the molecules used for enrichment.

📦 Built with MDRMF
