Structure-based virtual screening of ultra-large chemical libraries is a powerful strategy in early-stage drug discovery. However, the computational expense of physics-based docking necessitates more efficient methods for navigating chemical space. Building upon the active learning framework proposed by Zhou et al. in their work "An artificial intelligence accelerated virtual screening platform for drug discovery", 1 we present AiDockFlow, a fully-automated pipeline designed to improve the efficiency and robustness of hit discovery campaigns.
Our methodology introduces several key enhancements to the data curation and model training stages. The process begins by curating a high-fidelity set of known actives for a given target, which are partitioned using a scaffold-based split to ensure robust training and validation sets. This is coupled with the generation of a diverse "druglike-centroid" library from ZINC15 to facilitate broad exploration of chemical space, as per the approach proposed by Zhou et al.. These high-fidelity actives are labelled as 'binders' and are used for initial training of the surrogate Graph Neural Network (GNN), along with a large sample of the diverse centroids (labelled as 'non-binders').
Another core contribution of our work is to refine the active learning loop where a Graph Neural Network (GNN) surrogate is iteratively retrained. We introduce a three-tiered labelling strategy based on RosetteVS docking scores where the top 10% of docked compounds are labelled as "binders", the bottom 10% are labelled as "non-binders", and the ambiguous middle 50% are ignored to ensure the model is trained only on high-confidence data. Additionally, for reduced computational expense, we replace the cumulative training dataset approach with a "replay buffer" (a smaller, random sample of previously seen compounds) to prevent catastrophic forgetting of model learning. Our final contribution is the addition of PAINS and Brenk filters in addition to the suite of medicinal chemistry filters proposed by Zhou et al.
1. Data Ingestion & Curation
1.1 Retrieve High-Fidelity Actives
1.2 Split High-Fidelity Actives Data
1.3 Build “Druglike-Centroid Library”
1.4 Create “round-0” Candidate Training Dataset, & Validation/Testing Datasets
1.5 Prepare Target Structure
2. Featurization (Round 0 Initially)
2.1 Canonicalize Ligands and Add Explicit Hydrogens
2.2 Generate Graph Objects & Attach Initial Labels
2.3 Create Cumulative Training Dataset (graphs_master.pt
)
2.4 Create 3-D Conformers
2.5 Cache Fingerprints (1024-Bit ECFP4)
2.4 Create “Seen” Hash-Set
3. Surrogate GNN Model Training & Tuning
3.1 Model Choice (Round 0 Only)
3.2 Bayesian/Monte-Carlo Hyperparameter Search (Round 0 Only)
3.3 Training/Fine-Tuning
3.3.1 Round 0
3.3.2 Rounds ≥ 1 — “Light Fine-Tune”
3.4 Inference
4. Active Learning Loop (VSX-Derived Labels)
4.1 Candidate Selection
4.2 Featurise Previously Unseen Ligands (Canonicalize, Generate Graph Object, Create 3-D Conformers, Cache Fingerprints & Update “Seen” Hash-Set)
4.3 VSX Docking (Fast Rosetta Run)
4.4 Calculate ΔG & Add Labels
4.5 Add Labels to Cumulative Training Pool
4.6 Convergence Check
4.7 Build Next Round Training Dataset (If Not Converged)
5. High-Precision Docking & Post-Processing
5.1 Candidate List
5.2 VSH Docking
5.3 Medicinal Chemistry Filters
5.4 Clustering & Diversity
5.5 Exports
6. Visualization & Reporting
- Target to be defined by Uniprot Accession Number in training config file.
- Gather SMILES and bioactivity measure values for compounds that are known binders to the target from public sources (ChEMBL, PubChem, etc.) via Python APIs or bulk downloads (SDF/CSV).
- Likely parameters to use to determine known binders include inhibition constant (Ki) and dissociation constant (Kd). Set threshold value (in nM) for binding parameter. These will be defined in training config file.
- When extracting compounds, keep InChIKey, binding parameter (Ki or Kd), and SMILES string.
- If a compound appears multiple times keep the lowest value and record count, mean, median, and σ.
- Merge compounds from each public source and deduplicate based on InChIKey.
- Clean and normalize/standardize molecules:
- Remove salts.
- Protonation/tautomer standardization at pH 7.4 with
dimorphite-dl
. - Normalize binding parameter to nM.
- Calculate InChIKey if not present in public sources.
- Canonicalize SMILES with RDKit.
- Save compounds as
validated_actives.parquet
file:- Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
- Parquet table (columnar, compressed) will have the following schema:
column | dtype | comment |
---|---|---|
inchi_key |
string | 27‑char standard InChIKey (UU‑case) |
smiles |
string | RDKit canonical isomeric SMILES |
measure |
category | “Ki” or “Kd” |
value_nM` |
float32 | Best (lowest) experimental value in nM |
n_measurements |
int16 | Count of independent measurements |
source_db |
category | “ChEMBL” / “PubChem” |
target_uniprot |
string | Target protein accession |
- Do random stratified split 70/15/15% (train, validation, test). Exact split ratio to be defined in training config file:
- Group actives by Bemis‑Murcko scaffold (
rdMolDescriptors.GetScaffoldForMol
). - Shuffle scaffolds with fixed RNG seed.
- Allocate 70 % of scaffolds to train, 15 % to validation, 15 % to test.
- Group actives by Bemis‑Murcko scaffold (
- Write three Parquet files with same schema as before -
train_pos.parquet
,val_pos.parquet
,test_pos.parquet
.
Based on the paper by Zhou et al., we will build a specialized subset of ~13 million molecules, referred to as the druglike-centroid library. These will be extracted from the ZINC15 3D druglike database (~493.6 million molecules) by:
- Clustering similar molecules from the ZINC 3D druglike database, using a cutoff of 0.6 Tanimoto similarity (defined in training config file).
- From each cluster, the smallest moleculewill be selected and added to the library, serving as the centroid of the cluster.
- This process will result in the formation of the druglike-centroid library, which will contain ~13 million molecules.
The purpose of creating the druglike-centroid library is to ensure that the model is exposed to a wide range of chemical space during the initial iteration. The steps for this include:
- Download ZINC15 drug-like 3D SMILES dataset.
- Generate 1024‑bit ECFP4 fingerprints in streamed fashion (RDKit).
- Cluster with Tanimoto 0.6 (defined in training config file) using Butina‑like algorithm (Faiss for speed).
- Remove any centroid whose Tanimoto similarity to any of the high-fidelity actives is 1.0 (i.e. they are the same molecule).
- Calculate InChIKey for each molecule and save druglike-centroid library to
centroid_pool.parquet
the following schema:inchi_key
(str)smiles
(str)
In the training workflow from Zhou et al., for the first iteration, 0.5 million and 1 million molecules were randomly selected from the centroid library as the training and testing datasets respectively.
We will slightly alter this to sample 0.5 million molecules, followed by a random stratified 70/15/15% split (with both the sample and split to be tunable in the config file). The thinking is:
- Training Dataset:
- We do not want to dilute the high-fidelity Ki/Kd positives in the training dataset too much, but at the same time dilute them enough with negatives from the
centroid_pool.parquet
pool so that the model learns from broad chemical space. - Therefore, ~350,000 negatives from the
centroid_pool.parquet
pool should be a good starting point.
- We do not want to dilute the high-fidelity Ki/Kd positives in the training dataset too much, but at the same time dilute them enough with negatives from the
- Validation/Testing Dataset:
- We will start with ~75,000 negatives in both the validation and testing datasets.
- It is hoped that ~75,000 negatives, plus the positives in each will be enough to get a stable AUROC estimate.
The steps to create the round 0 training dataset, and the validation/testing datasets are:
- Randomly sample 500,000 compounds (defined in training config file) from
centroid_pool.parquet
(exploration) and save toround0_sampled_centroid_pool.parquet
. - Save unsampled pool to
round0_unsampled_centroid_pool.parquet
. This will be used in Step 4.1. to sample 250,000 undocked ligands for active learning loop at the end of round 0. - Do a random stratified 70/15/15% training/validation/testing split of
round0_sampled_centroid_pool.parquet
, saving totrain_neg.parquet
,val_neg.parquet
, andtest_neg.parquet
. - Append all positives from
train_pos.parquet
,val_pos.parquet
, andtest_pos.parquet
to the respective negative files from step 3 (exploitation). - We should now have
round0_full_train.parquet
,full_val.parquet
, andfull_test.parquet
. - Deduplicate all three datasets based on InChiKeys, shuffle and write to
.smi
files, followed by compression with gzip (e.g.round0_full_train.smi
, thenround0_full_train.smi.gzip
). - N.B. To prevent any data leakage validation and test datasets:
- Are to be frozen throughout the training regime. This means they should not be relabeled, nor should they be replaced with a new set of molecules.
- Should not be added to
graphs_master.pt
(see later) and should never be included in the active learning loop.
A .smi
file is plain text, tab-separated list of molecules:
- First column – SMILES string
- Second column – InchiKey
- Target to be defined by Uniprot Accession Number in training config file.
- Download PDB or AlphaFold model (target source database to be defined in training config file).
- Use PDBFixer to add missing residues/atoms, assign bonds, and remove water molecules.
- Use
propka
to add pH 7.4 protonation. - Use OpenBabel to add hydrogens.
- Energy-minimize side chains with Rosetta FastRelax (2,000 cycles, coordinates constraints).
- Convert to PDBQT with AutoDockTools script (
prepare_receptor4.py
). - Store copy of relaxed target in both PDB and PDBQT files (
target_prepped.pdb
andtarget_prepped.pdbqt
).
Data Type Conversions:
- SMILES → Graphs (for GNN machine learning).
- SMILES → ETKDG conformer in SDF files (for docking).
- SMILES → 1024-bit ECFP4 (for clustering, similarity searches, FNN pre-screen etc.)
- Load SMILES from training (and validation and testing if round 0)
.smi
files, convert tordkit.Mol
objects (Chem.MolFromSmiles(smi, sanitize=True)
). - Check if each SMILES are correct by checking that
rdkit.Mol
object is notNone
. - Add stereochemistry tags to each SMILES string via
rdkit.Chem.rdmolops.AssignStereochemistry(mol, cleanIt=True, force=True)
- Add explicit hydrogens via
rdkit.Chem.AddHs(mol)
. This is needed for 3-D embedding and for graphs that encode hydrogen counts. - Save cleaned SMILES as
cleaned_round0_full_train.smi.gz
,cleaned_full_val.smi.gz
, andcleaned_full_test.smi.gz
.
Because the GNN needs node/edge features, we need to build graph objects for each SMILES string.
- Load SMILES from
cleaned_round0_full_train.smi.gz
,cleaned_full_val.smi.gz
, andcleaned_full_test.smi.gz
. - For every molecule, build an atom-bond graph with:
- Element
- Degree
- Formal charge
- Aromatic flag
- Hybridization
- Ring flag
- Bond type
- Conjugated ring flag
- Recommended libraries include PyTorch‑Geometric, DGL or Chemprop (they all expose a “Mol ⇒ Data” helper).
- Each molecule graph needs to have its InChiKey assigned to it for identification, along with a label. (See below).
- Serialize the list with
torch.save
orjoblib
(~3 GB per 1 million molecules, compressible) tographs_round0_train.pt
,graphs_val.pt
, andgraphs_test.pt
(binary; tens of GB).
The graphs_*.pt
files are single PyTorch files that holds the following triplet for every ligand processed in round 0:
- InChiKey (attribute
inchi_key
): Immutable ID used throughout the pipeline. - Graph Object: Atoms/bonds and their features for the GNN.
- Label (attribute
y
): 1 = binder, 0 = non-binder, –1 = masked/unknown.
The graph object will be a PyG torch_geometric.data.Data
instance. The InChiKey will be the Data.inchi_key
attribute that is a Python str
data type. The Label will be the Data.y
attribute that is a Python int
data type that has been cast to a torch.long
data type.
For the initial labels, we want to set all the seed positives as binders (1), and all others as non-binders (0) initially:
seed_positives = set(train_pos.parquet.inchi_key)
- For every
mol_graph
, ifinchi_key
inseed_positives
:mol_graph.y = tensor([1])
- Else:
mol_graph.y = tensor([0])
We will also do the same for graphs_val.pt
, and graphs_test.pt
, but their labels will remain frozen for the entire training regime.
The graphs_master.pt
file is a single PyTorch file that holds the same triplet for every ligand processed so far.
For the initial graphs_master.pt
file we will simply make a copy of graphs_round0_train.pt
(N.B. we do not include any ligands from the validation or test datasets. However, this will be the cumulative training dataset and at the end of each round we will append all the new ligand graphs used in that round (i.e. graphs_round{r}_train.pt
), as well as update any newly calculated labels from that training round.
Because fast VSX docking needs initial 3-D coordinates, we must convert each ligand to an ETKDG conformer.
- Load SMILES from
cleaned_round0_full_train.smi.gz
. - Convert each SMILES string to an
rdkit.Mol
object viardkit.Chem.AllChem.EmbedMolecule
. - Energy-minimize with 200‑step MMFF or UFF minimization. This is a good enough minimization with minimal computational overhead.
- Batch-write ligands to 50 SDF files, with 10,000 ligands per SDF file, and
gzip
each file. This format is preferred by Rosetta’s “multi-ligand” mode. - E.g.
batch_0001.sdf.gz
,batch_0001.sdf.gz
, etc. (~20 MB each).
By caching each molecule as a 1024-bit ECFP4, we can do fast similarity searches, clustering, and an optional FNN pre-screen.
- Load SMILES from
cleaned_round0_full_train.smi.gz
. - Compute 1024-bit ECFP4 for each ligand and pack into NumPy
uint8
arrays viafps = [np.packbits(AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles) ,2,1024)) for smiles in round0_candidates]
. - Save as
round0_train.fp.npy
. - N.B. If we later want to compare chemical space coverage between train and test, we can also load
cleaned_full_val.smi.gz
andcleaned_full_test.smi.gz
and save toval.fp.npy
andtest.fp.npy
respectively.
We need to keep track of what training ligands the GNN will have seen after round 0. This includes all training molecules that have already been converted to graphs and ETKDG conformers. N.B. To avoid data leakage we do not include validation or test ligands as they are not to be used in the active learning loop.
- Read the Training Ligand InChiKeys Just Processed:
- Load InChiKeys from
cleaned_round0_full_train.smi.gz
.
- Load InChiKeys from
- Save InChiKeys to Disk:
- Insert every key into an in-memory
set()
and then serialize it to disk by saving toseen_inchikeys.pkl.gz
.
- Insert every key into an in-memory
At a high-level, our active learning technique can be broken down into:
- Surrogate GNN Model Training & Tuning
- The surrogate GNN model is continuously retrained with ~500,000 molecules/ligandsat the start of each full training iteration.
- Active Learning Loop (VSX-Derived Labels)
- The GNN model that just finished training is then used in the active learning loop to pick the next ~500,000 molecules → Dock them with VSX → Calculate each ligand’s ΔG → Turn the ΔG scores into new labels (binders=1, non-binders=0, masked=-1).
For the Surrogate GNN Model Training & Tuning, round 0 will differ from subsequent rounds:
- Round 0:
- We first train the GNN (GNN0) using the graphs from
graphs_round0_train.pt
. - As per step 2.2:
- The seed positives (i.e. the “high-fidelity actives” from Step 1.1) are given the label 1 as they are known binders.
- The centroid molecules are assumed to be non-binders and are given the label 0.
- We first train the GNN (GNN0) using the graphs from
- Round k (≥ 1):
- We retrain the GNN and fine-tune using everything that has been labelled from the previous round’s active learning loop.
- Can use Direct-MPNN from Chemprop, use 3-4 message-passing steps, and a hidden size of 300.
- The Pytorch-Geometric equivalent is
GINConv
orGATv2Conv
.
- Prepare a stratified 80/20 split of the labelled set
graphs_round0_train.pt
. - Define initial hyperparameters as:
hidden_size
= 300message_depth
= 3dropout
= 0.2learning_rate
= 3e-4weight_decay
= 1e-6
- Use a small Optuna or Ray Tune search to find the Area Under the Receiver Operating Characteristic Curve (AUROC) on the 20% validation slice (binary 1 v 0), with a TPE or BOHB sampler.
- Do ~25 trails, with each trial training for 10 epochs with early stopping.
- Pick best AUROC trial and dump its parameters (hidden size, message depth, dropout, learning rate, weight decay) to
best_hyperparams.json
. - Feed these
best_hyperparams.json
values into the training regime for round 0 and disable further hyperparameter searching. Only the weights are to evolve in later rounds, not the hyperparameters or model architecture.
In a molecule-focused GNN, the encoder is a stack of message-passing layers. Each layer:
- Collect information from every atom’s neighbours.
- Merges (e.g. sum/mean/attention) those messages.
- Updates the atom’s hidden vector.
After k
rounds every atom’s vector encodes information from atoms that are up to k
bonds away.
In its final layer, the encoder pools all atom vectors into one fixed-length molecule vector, usually by a simple sum or mean. The output head is the final classification or regression function that sits on top of this final encoder layer. It takes the single fixed-length molecule vector as input.
In most GNN libraries, this output head is just a two-layer MLP that turns this final vector into either:
- A single sigmoid score (in our case binder/non-binder).
- A single linear value (e.g. Predicted ΔG, Kd etc.).
The training dataset is graphs_round0_train.pt
and the general steps are:
- Initial Stage:
- Create a brand-new model with the hyperparameters chosen in Step 3.3.
- Epoch Loop Stage:
- Run for 20-30 epochs, with a minibatch shuffle.
- All weights (message-passing and output head) to be updated with AdamW.
- Early-stop when validation AUROC hasn’t improved for 5 epochs.
- External Validation:
- Run the freshly trained model on
graphs_val.pt
. - Compute AUROC, PR-AUC, accuracy, etc.
- Log to
val_metrics_round{r}.json
and TensorBoard.
- Run the freshly trained model on
- Save Stage:
- Save round 0 model full checkpoint (encoder + head) to
gnn_round0.pt
.
- Save round 0 model full checkpoint (encoder + head) to
As the encoder has already learned useful chemistry, we mainly want to let the output head adjust to the new labels produced by the latest VSX batch without overfitting.
The training dataset is graphs_round{r}_train.pt
and the general steps are:
- Load Previous Weights
- Freeze Encoder:
- If we have plenty of GPU and want a bit more flexibility, we could “thaw” the encoder at a lower LR (e.g. 0.1x) instead of a full freeze.
- Optimizer:
- Once we freeze the encoder, we will have two kinds of weights in the model:
- Frozen Encoder Weights: These have
param.requires_grad = False
, and so PyTorch will not calculate gradients for them. - Still Trainable Output Head Weights: The few dense layers (often a 2-layer MLP) that sit on top of the graph’s encoder/message layers. These have
param.requires_grad = True
.
- Frozen Encoder Weights: These have
- We must build the optimizer so that it only updates the weights that are still trainable. A common way to do this is
optim = torch.optim.AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr = 2 * BASE_LR)
. - Note that because the list is small (maybe 20 k weights instead of 1 M+) we can raise the learning rate a little—often double the rate used when the whole network was trainable. The higher LR helps the output head adapt quickly to the new labels without risking instability in the frozen encoder.
- Once we freeze the encoder, we will have two kinds of weights in the model:
- Epochs:
- 2-3 epochs are usually enough. We will need to monitor validation AUROC. If it jumps quickly then plateaus, stop.
- External Validation:
- Run the freshly trained model on
graphs_val.pt
. - Compute AUROC, PR-AUC, accuracy, etc.
- Log to
val_metrics_round{r}.json
and TensorBoard.
- Run the freshly trained model on
- External Test (If End of Training):
- Run the freshly trained model on
graphs_test.pt
. - Compute AUROC, PR-AUC, accuracy, etc.
- Log to
test_metrics.json
and TensorBoard.
- Run the freshly trained model on
- Save:
- Save round
r
model full checkpoint (encoder + head) tognn_round{r}.pt
.
- Save round
For each round, once the training stage finishes, we run production-quality scoring/inference to generate fresh scores that the active learning loop will use to pick the next 0.5M candidates.
This is the same routine that will be used once the active learning loop has converged and model is fully trained and is ready to be used to score a commercial-sized library (e.g. hundreds of millions of compounds).
The general steps are:
- Mini-Batch the Library:
- This will feed the huge ligand list to the GPU in bite-size chunks.
- Only use centroid ligands that have not been docked yet (or the whole set if is the first round). I.E. Use
round{r}_unsampled_centroid_pool.smi
. - Mini-batch sizes should be 4096 molecules. This is small enough to fit in GPU memory but large enough for speed.
- Forward Pass Without Gradients:
- Ask the trained model to predict binding scores; no training, so we disable gradient bookkeeping.
- Write Results to Disk:
- Combine each SMILES with their predicted scores and save to
scores_round{r}.csv
for roundr
.
- Combine each SMILES with their predicted scores and save to
It should be noted that inference will be running while VSX docking from the previous round is still running on CPU. Therefore, GPU and CPU work will overlap, and GPU inference is not the bottleneck; VSX docking is.
- For current round
r
, openscores_round{r}.csv
from current round inference. - Sort by GNN score (high → low) and extract top 250,000 ligands.
- Randomly sample 250,000 SMILES from
round{r}_unsampled_centroid_pool.smi.gz
and save unsampled SMILES toround{r+1}_unsampled_centroid_pool.smi.gz
. This will be used to sample the next 250,000 SMILES in the next active learning iteration. - Combine top GNN-scored ligands from 2. and sample ligands from 3. to give 500,000 ligands for the upcoming VSX run.
- Save to
round{r}_active_learning.smi.gz
for roundr
.
4.2. Featurise Previously Unseen Ligands (Canonicalize, Generate Graph Object, Create 3-D Conformers, Cache Fingerprints & Update “Seen” Hash-Set)
- Read
round{r}_active_learning.smi.gz
and compare each InChiKey to the “seen” hash- setseen_inchikeys.pkl.gz
. - Create
graphs_round{r}_train.pt
file. - For every new ligand:
- Clean/canonicalize SMILES as per Step 2.1.
- Build graph object as per Step 2.2., append to both
graphs_round{r+1}_train.pt
andgraphs_master.pt
. - Generate ETKDG conformer as per Step 2.3. and collect into the SDF batches.
- Compute and store ECFP4 fingerprint as per Step 2.4.
- Append InChiKey to
seen_inchikeys.pkl.gz
.
- Convert each SMILES in
round{r}_candidates.smi.gz
to a 3D SDF as per Step 2.3. - Bundle the ligands into 50 SDF files of 10,000 ligands each. This format is preferred by Rosetta’s “multi-ligand” mode.
- On the cluster: submit 50 array jobs, each job uses for example 20 CPU cores. Rosetta’s VSX protocol (vsx.xml) docks one ligand in about 2 ½ minutes on one core → ~25 ligands / core / hour.
- Rosetta writes a
.sc
score table for every SDF batch (one row per ligand, ΔG in kcal mol⁻¹). This will give manyscore_XXXXX.sc
tables.
- Merge the 50 score tables into one spreadsheet (ligand ID + ΔG).
- Sort by ΔG:
- Top 10% (50,000 ligands)** → Label 1 (“binder”).
- Bottom 40% (200,000 ligands) → Label 0 (“non-binder”).
- Middle 50% (250,000 ligands) → Label -1 (“ignore”).
- N.B. We could make the top 10% cutoff tunable or dynamic, like the Zhou et al. paper where the cutoff becomes stricter over time. For example, we could start at 10% and gradually decrease it to 5% or 1% in later rounds to focus the model on identifying only the most potent binders as the search progresses.
- BUT, if any ligand is in the original high fidelity Kd dataset (
train_pos.parquet
), ensure it stays label 1 even if its ΔG isn’t in the top 10%. - Save the table as CSV file
round{r}_active_learning_labels.csv
with four columns:- InChiKey
- SMILES
- ΔG
- Label
- For every ligand in
round{r}_active_learning_labels.csv
locate its graph object ingraphs_master.pt
. - Attach the integer label (1/0/-1) to that graph.
To check for convergence, we will be checking three measures:
- Surrogate GNN Best Percentile Score: This reflects how much the surrogate GNN still believes there are unexplored “good” regions after this round.
- VSX ΔG Best Percentile: This tells us whether the highly scored GNN compounds for this round survive the first physics filter.
- Round Limit: Predefined maximum round limit.
Metric / Check | How to Measure | Why It Matters (Rationale) | Action / Output (After Each Round) |
---|---|---|---|
Surrogate GNN Best Percentile Score (Early Enrichment) |
|
If the model keeps discovering higher-scoring ligands, the search space still has fertile regions. | Calculated from scores_round{r}.csv and used in the stopping rule below. |
Artefact Stopping Rule (Tunable in Config File) |
Compute Δ(mean-score) relative to the previous round.
Stop criterion: abs(Δ) < 0.01 (scores are 0-1) for two consecutive rounds.
|
Provides an automatic stopping rule to terminate the search when enrichment plateaus. | Append a results line to progress.tsv . |
Fast-dock (VSX) Best Percentile |
abs(Δ) < 0.1 kcal mol-1 for two rounds.
|
This is the first physics-based sanity check. If these numbers plateau, your "cheap" GNN search is no longer finding better physical binders. | Append to progress.tsv and write a human-readable line to train.log . |
Round Limit (Safety Brake) |
Count the number of completed rounds.
Hard cap: E.g., 10 rounds. |
Prevents endless runs if metrics fluctuate around the stopping threshold without converging. | Checked before starting a new round; logged in progress.tsv . |
An example of how the progress.tsv
file may look:
Round | best_gnn_mean |
best_gnn_delta |
best_dG_mean |
best_dG_delta |
n_ligands_scored |
Comment |
---|---|---|---|---|---|---|
0 | 0.934 | -- | -10.7 | -- | 5,000,000 | Initial |
1 | 0.941 | 0.007 | -11.1 | -0.4 | 5,000,000 | Ok |
2 | 0.942 | 0.001 | -11.18 | -0.08 | 5,000,000 | Plateau? |
3 | 0.943 | 0.001 | -11.22 | -0.04 | 5,000,000 | Stopping |
There are two things to consider when deciding how to curate the next rounds training dataset from the graph_master.pt
cumulative dataset:
- Why not train using the whole
graph_master.pt
dataset?- After a few rounds
graph_master.pt
will contain millions of graphs, therefore using this full cumulative training dataset would explode GPU hours with diminishing returns. - A lot of these graphs will be labelled –1 (masked) and therefore will only add noise, not signal.
- Active-learning theory assumes we focus on the newest informative labels while keeping a stable set of high-confidence positives/negatives.
- After a few rounds
- Why not train on a brand new 500k ligands?
- This would result in the GNN model “forgetting” what it has learned about seed positives and strong negatives from previous rounds.
- As a result, validation metrics would bounce wildly because the label distribution would change on each new round.
Therefore, to build the next rounds training dataset (graphs_round{r+1}_train.pt
), we will include:
- All Seed Positives: All ligand graphs in
graph_master.pt
that are in the high-fidelity datasettrain_pos.parquet
. - Latest VSX Labelled Batch: All ligand graphs in that are in
round{r}_active_learning.smi
- Replay Buffer: A random 50 – 100k graphs from earlier rounds whose labels are not –1. This prevents forgetting and stabilizes AUROC. In the example below, this is a fixed 2% random sample, but we could make this tunable in the config file.
The psuedocode for this would look like:
master = torch.load('graphs_master.pt')
seed_pos_inchi = pd.read_parquet(‘train_pos.parquet`, columns=['inchkey’])
latest_active_learning_inchikeys = []
with gzip.open(‘round{r}_candidates.smi.gz’, ‘rt’) as file:
for line in file:
if not line.strip() # Skip blank lines
continue
inchikey = line.split(‘\t’, 1)[1].rstrip() # Second column
latest_active_learning_inchikeys.append(inchikey)
train_set = []
for graph in master:
if graph.inchi_key in seed_pos_inchi or graph.inchi_key in latest_active_learning_inchikeys:
train_set.append(graph)
elif random.random() < 0.02: # 2% Replay buffer:
train_set.append(graph)torch.save(train_set, f' graphs_round{r+1}_train.pt )
- Gather union of top‑0.5 % GNN‑ranked compounds across all rounds (~50 k‑100 k) and save to
final_candidates.smi
.
- Redock these top
final_candidates.smi
on a slower, higher‑quality docking mode. - For example, RosettaVS‑VSH with flexible side chains and a finer search.
- Save these scores to
vsh_scores.csv
We can then run various cheap cheminformatics quality filters and to remove any ligands that have poor druglikeness and those with pharmacologically undesirable substructures or potential toxicity. For example:
- cLogP:
- For efficient transport, the drug must be hydrophobic enough to partition into the lipid bilayer, but not so hydrophobic, that once it is in the bilayer, it will not partition out again.
- Additionally, hydrophobicity plays a major role in determining where drugs are distributed within the body after absorption and, consequently, in how rapidly they are metabolized and excreted.
- We can calculate the cLogP of each of these top candidates using the
rdkit.Chem.Crippen
module on the parent (desalted) structure, and remove those under a certain value (e.g. ≤ 3.5)
- Buried unsatisfied hydrogen-bond donors / acceptors:
- A buried polar atom that cannot donate/accept H-bonds in the pocket is an enthalpic penalty and often predicts low potency.
- After docking we can therefore run PLIP (
plip -f complex.pdb -o …
) or Rosetta’s InterfaceAnalyzer mover to list buried unsats Parse per-ligand counts. - If any ligand is below a certain threshold (e.g. ≤ 1 buried unsat per ligand), we can therefore remove it.
- Torsion outliers (CSD Torsion Library):
- Unusual torsion/dihedral angles indicate strained conformations with high internal strain. This indicates instability and synthetic difficulty.
- We can therefore use the
csd.analysis.TorsionAnalyser.analyse_molecule
method from the Mogul library from the Cambridge Structural Database (CSD) software suite, accessible by the CSD Python API- Export the ligand from the docked pose as SDF and load the ligand using
ligand = sd.io.MoleculeReader('path/to/your/ligand.sdf')[0]
. - The core of this analysis is the
csd.analysis.TorsionAnalyser
. This tool uses data from the Mogul library, which is derived from the CSD, to assess the likelihood of observed torsion angles. - Running
analysed_torsions = csd.analysis.TorsionAnalyser.analyse_molecule
will give theanalysed_torsions
object containing a list of all the rotatable bond torsions analyzed asTorsion
objects, each having aclassification
attribute. - This
classification
attribute can have values such as “Common”, “Infrequent”, “Unusual” (or “Allowed”, “Outlier”, “Unknown”). - We can then count the number of “Unusual” or “Outlier” for a given ligand, and if it is above a certain threshold, we can drop it.
- Export the ligand from the docked pose as SDF and load the ligand using
- PAINS Filter:
- Pan-assay interference compounds (PAINS) are chemical compounds that often give false positive results in high-throughput screens.
- PAINS tend to react nonspecifically with numerous biological targets rather than specifically affecting one desired target, and several disruptive functional groups are shared by many PAINS.
- We can use RDKit’s built-in SMARTS catalogues to filter out any ligands with PAINS substructures:
- For each ligand SMILES, convert to
Mol
object viardkit.Chem.MolFromSmiles(ligand)
- Initialise RDKit PAINS filter catalogue via
params = rdkit.Chem.FilterCatalog.FilterCatalogParams()
andparams.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
- Then, create the
FilterCatalog
object viacatalog = FilterCatalog(params)
. - We then iterate through each of our ligand
Mol
objects, passing them tomatches = catalog.GetMatches(mol)
. Ifmatches
is notNone
, a PAINS alert is fired. - We can then drop that ligand, and log the details:
rejected_alerts = [] for match in matches: Alert_name = match.GetDescription() Smiles = Chem.MolToSmiles(mol) rejected_alerts.append({‘SMILES’: smiles, ‘Alert’: alert_name}) print(f"Alert fired for molecule {SMILES}: {alert_name}")
- For each ligand SMILES, convert to
- Brenk Filter:
- The Brenk filter comes from the paper "Lessons Learnt from Assembling Screening Libraries for Drug Discovery for Neglected Diseases" by Brenk et al., 2 and removes molecules containing substructures with undesirable pharmacokinetics or toxicity.
- To implement this, we can simply add the Brenk filter to the RDKit filter catalogue when implementing the PAINS filter:
params.AddCatalog(FilterCatalogParams.FilterCatalogs.BRENK
.
Final-Stage Clustering: From Exploration to Exploitation Throughout this pipeline, the strategic goal has evolved:
- Initial Stage (Exploration):
- When we created the Druglike centroid library, the goal was maximum exploration. We started with ~494 million molecules and clustered them to get ~13 million diverse representatives.
- The purpose was explicitly to ensure that the model was exposed to a wide range of chemical space.
- Final Stage (Exploitation & Efficiency):
- After the active learning loop converges and we have run high-precision VSH docking, we have a list of potentially 50,000-100,000 top-ranked compounds. It is financially and logistically impossible to synthesize and test all of them.
- Therefore, the goal shifts to exploitation. We now must select a manageable number (e.g., 500 – 2,000) that has the highest probability of success.
However, when reducing the top-ranked candidates we must balance potency with diversity. A method to do this is to apply Butina clustering at Tanimoto 0.6 and keep the ligand with best VSH ΔG at each cluster This approach intelligently balances the two competing priorities:
- Diversity (The Clustering):
- If we simply took the top 500 compounds by VSH score, we might end up with 500 molecules that are slight variations of the same 5 or 6 chemical scaffolds.
- This is an inefficient use of resources because they will likely have imilar biological activity and properties (a concept known as an "activity cliff").
- By clustering the compounds based on structural similarity (Tanimoto similarity), we group these variations together.
- Potency (The Selection):
- Within each structural cluster, we then select the single compound with the best VSH ΔG score.
- This member is considered the best representative of its chemical family.
By taking one representative from each cluster, we ensure that the final selection of 500 - 2,000 compounds is composed of structurally distinct molecules, each with the highest predicted potency in its class.