First DREAM Target 2035 Drug Discovery Challenge - UCT Prague Submission

In this repository, we provide the code and data used for our submission to the first DREAM Target 2035 Drug Discovery Challenge, which focused on the WDR91 target. The challenge involved selecting compounds from a screening collection based on literature and DEL (DNA-encoded library) data, combining feature engineering, data augmentation, and both automated and manual prioritization strategies.

Introduction

All data lives in the ./data/ directory. It is not included in the repository, so you will need to download it from the associated Zenodo entry. We were also not able to provide the ./data/ena.smi file, which is the catalogue of compounds screened in the last step of the challenge. You will need to download the screening collection from the Enamine website. Note that the downloaded data is not in the same format as the one used in the challenge, so you will need to convert it to the SMILES,ID (no header) format and save it as ./data/ena.smi. We did this using the notebook clean_smis.ipynb.
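The conversion itself is straightforward; the sketch below illustrates the kind of transformation clean_smis.ipynb performs, assuming the downloaded catalogue is a CSV file with SMILES and ID columns (the input file name and column names are assumptions, not necessarily what Enamine provides).

```python
# Minimal sketch of the catalogue conversion done in clean_smis.ipynb.
# The input file name and column names ("SMILES", "ID") are assumptions;
# adjust them to match the file downloaded from Enamine.
import pandas as pd
from rdkit import Chem

cat = pd.read_csv("data/enamine_catalogue.csv")  # hypothetical download name

rows = []
for smi, cid in zip(cat["SMILES"], cat["ID"]):
    mol = Chem.MolFromSmiles(smi)
    if mol is None:          # skip entries RDKit cannot parse
        continue
    rows.append(f"{Chem.MolToSmiles(mol)},{cid}")

# SMILES,ID format with no header, as expected by the rest of the pipeline
with open("data/ena.smi", "w") as fh:
    fh.write("\n".join(rows) + "\n")
```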

Before you begin, there is also some additional code for data exploration in explore.ipynb, which contains potentially interesting insights about the data and some of the generated models. It also mentions a few preliminary scripts that you should execute in order to prepare data for some of the later steps.

Running the code

This section walks you through the notebooks and scripts that need to be executed to reproduce the submission files for each step of the challenge.

Step 1

Submission 1

This submission is a simple XGBoost model trained on literature data only; it does not use the DEL data at all. We intended it as a baseline against which to evaluate all the other models, but in the end it turned out to be the best approach to use with XGBoost. Therefore, we further developed it in the second step instead of the other models. Both training and prediction are facilitated through this notebook:

The notebook also contains models based on DEL data thresholded on TARGET_VALUE (to reduce noise from low-activity compounds) and combined with the literature data. This was better than using the DEL data alone, but still resulted in worse anticipated performance based on our test sets.
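For illustration, a model of this kind can be trained on Morgan fingerprints roughly as follows. This is a minimal sketch: the file names, column names, label handling and the threshold value are assumptions rather than the exact settings used in the notebook.

```python
# Minimal sketch of an XGBoost classifier on fingerprint features, with DEL
# rows filtered by a TARGET_VALUE threshold to reduce noise. File names,
# column names, the positive-label assumption for DEL hits and the threshold
# are illustrative assumptions, not the notebook's exact settings.
import numpy as np
import pandas as pd
import xgboost as xgb
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

def featurize(smiles):
    mols = [Chem.MolFromSmiles(s) for s in smiles]
    return np.array([fpgen.GetFingerprintAsNumPy(m) for m in mols])

lit = pd.read_csv("data/literature.csv")         # SMILES + binary label (assumed)
del_data = pd.read_csv("data/del.csv")           # SMILES + TARGET_VALUE (assumed)
del_hits = del_data[del_data["TARGET_VALUE"] > 10.0]   # arbitrary threshold

X = featurize(pd.concat([lit["SMILES"], del_hits["SMILES"]]))
y = np.concatenate([lit["label"].to_numpy(), np.ones(len(del_hits))])

model = xgb.XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.05)
model.fit(X, y)
```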

Submission 2

This submission is a simple XGBoost model trained on the combined literature and DEL data. It uses the same data as the previous submission, but adds univariate filtering of features to keep the model as simple as possible. This was done to reduce the complexity of the model and make it less sensitive to noise in the DEL data. The training and generation of submission files are facilitated through this notebook:

In the end, we did not see much difference in performance on our test sets, but we still decided to include this model in the submission so that at least one model based on the DEL data was represented.
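The univariate filtering step can be expressed as a simple scikit-learn pipeline; the sketch below uses SelectKBest, and the scoring function and number of retained features are assumptions, not the notebook's exact settings.

```python
# Sketch of univariate feature filtering prior to XGBoost training.
# The scoring function and the number of retained features are assumptions.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
import xgboost as xgb

pipe = Pipeline([
    ("filter", SelectKBest(score_func=f_classif, k=256)),  # keep 256 best features
    ("xgb", xgb.XGBClassifier(n_estimators=500, max_depth=4)),
])
# pipe.fit(X, y)  # X, y as in the previous sketch
```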

Submission 3

This was a somewhat unusual submission in which we used the UMAP embeddings from explore.ipynb to train a KNN model in the reduced space. It was not very successful on our test data, but we still decided to include it in the submissions, since the composition of the official test data suggested that a simple clustering-based approach could work and we were aware of biases in our own test data. All details, including training and the generation of the submission file, can be found in the notebook:
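The core idea is sketched below: fit UMAP on fingerprint features, train a KNN classifier in the reduced space, and score the test set with it. X_train, y_train and X_test stand for fingerprint matrices and labels prepared elsewhere, and the hyperparameters are illustrative.

```python
# Sketch of the UMAP + KNN idea: fit UMAP on training fingerprints, train a
# KNN classifier in the reduced space, then score the test set.
# X_train, y_train, X_test are assumed inputs; hyperparameters are illustrative.
import umap
from sklearn.neighbors import KNeighborsClassifier

reducer = umap.UMAP(n_components=10, n_neighbors=15, metric="jaccard")
Z_train = reducer.fit_transform(X_train)      # X_train: binary fingerprint matrix
Z_test = reducer.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=25)
knn.fit(Z_train, y_train)
scores = knn.predict_proba(Z_test)[:, 1]      # probability of being active
```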

Step 2

In this step, the SMILES of the test set became available, so we could try additional methods based on molecular graphs or 3D representations.

Submission 1

This submission is based on RDKit alignment functionality. First, 50 conformers are generated using ETKDGv3 via embedder.py. Then, these conformers are compared with 7 PDB references (data/pdb_mols.sdf) extracted from crystal structures using Shape and Color Tanimoto scores. Finally, the ranking and ensembling based on the Color score are done in the notebook Align3D_ranking.ipynb.
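A simplified version of the shape part of this workflow is sketched below: embed conformers with ETKDGv3 and score each one against the references by shape Tanimoto. The actual pipeline (embedder.py + Align3D_ranking.ipynb) also computes a Color (feature-based) score, which is omitted here.

```python
# Simplified sketch of the 3D workflow: embed 50 conformers with ETKDGv3 and
# score them against reference ligands by shape Tanimoto after O3A alignment.
# This covers only the shape part; the Color score is not reproduced here.
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolAlign, rdShapeHelpers

refs = [m for m in Chem.SDMolSupplier("data/pdb_mols.sdf", removeHs=False) if m]

def best_shape_tanimoto(smiles, n_confs=50):
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    params = AllChem.ETKDGv3()
    params.randomSeed = 42
    cids = AllChem.EmbedMultipleConfs(mol, numConfs=n_confs, params=params)
    best = 0.0
    for ref in refs:
        for cid in cids:
            # align the conformer onto the reference, then score shape overlap
            rdMolAlign.GetO3A(mol, ref, prbCid=cid).Align()
            dist = rdShapeHelpers.ShapeTanimotoDist(mol, ref, confId1=cid)
            best = max(best, 1.0 - dist)   # convert distance to similarity
    return best
```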

Submission 2

We tried to refine and augment the literature XGB model from the first step, submission 1, by using bioisosteric replacements (manually extracted from the literature) to potentially enhance its ability to pick up more diverse compounds and compensate for the small data set. The training and generation of submission files are facilitated through this notebook:
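A minimal sketch of this kind of SMARTS-driven augmentation is shown below; the replacement table is a tiny illustrative example, not the literature-derived set used in the notebook.

```python
# Sketch of SMARTS-driven bioisosteric augmentation: each (pattern, replacement)
# pair rewrites a fragment of a training molecule to create a new analogue.
# The replacement table is a tiny illustrative example, not the actual set.
from rdkit import Chem
from rdkit.Chem import AllChem

REPLACEMENTS = [
    ("c1ccccc1", "c1ccncc1"),       # phenyl -> pyridyl (example)
    ("C(=O)O", "C(=O)N"),           # carboxylic acid -> amide (example)
]

def augment(smiles):
    mol = Chem.MolFromSmiles(smiles)
    out = set()
    for patt_sma, repl_sma in REPLACEMENTS:
        patt = Chem.MolFromSmarts(patt_sma)
        repl = Chem.MolFromSmiles(repl_sma)
        if not mol.HasSubstructMatch(patt):
            continue
        for prod in AllChem.ReplaceSubstructs(mol, patt, repl):
            try:
                Chem.SanitizeMol(prod)
                out.add(Chem.MolToSmiles(prod))
            except Exception:
                pass                # discard chemically invalid products
    return sorted(out)
```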

Submission 3

Here we tried to enhance the large XGB model built on the DEL data with new shape-based descriptors, obtained in a somewhat unorthodox way: we trained regression models on the shape-based similarity scores of the official test data to the PDB references from submission 1 of this step. The regressors were then used to predict those scores on the training set, yielding an additional feature set that could enable us to leverage extra 3D information for the DEL data without the need to unblind it. The training and generation of submission files are facilitated through this notebook:

The approach did not yield satisfactory results on the test set, but we decided to include it anyway because of its unorthodox nature.
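Conceptually, the extra descriptors were produced as sketched below: fit one regressor per PDB reference on the official test set, mapping fingerprints to the 3D shape scores from submission 1, and apply those regressors to the training set. Variable names such as X_test_fp, shape_scores and X_train_fp are assumptions for inputs prepared elsewhere.

```python
# Sketch of the "predicted shape score" descriptors: one regressor per PDB
# reference, trained on the official test set (fingerprints -> 3D shape score),
# then applied to the training molecules as additional features.
# X_test_fp, shape_scores and X_train_fp are assumed to be prepared elsewhere.
import numpy as np
import xgboost as xgb

# shape_scores: array of shape (n_test_mols, n_references) from the 3D pipeline
regressors = []
for j in range(shape_scores.shape[1]):
    reg = xgb.XGBRegressor(n_estimators=300, max_depth=5)
    reg.fit(X_test_fp, shape_scores[:, j])
    regressors.append(reg)

# predicted shape scores become extra columns of the training feature matrix
extra = np.column_stack([reg.predict(X_train_fp) for reg in regressors])
X_train_aug = np.hstack([X_train_fp, extra])
```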

Step 3

In this step, only one submission is made, which consists of a final pick of compounds from the Enamine catalogue. Based on the success of the previous methods, we decided to use a combination of shape-based alignment to the references, an even more extensive bioisosteric augmentation for the XGBoost model, and two small sets of manually selected compounds based on human intuition. Several submission files are generated, which are then filtered and combined into one final set of compounds to manually prioritize and purchase.

Shape-based alignment

The following script generates the submission input file for shape-based alignment:

Enhanced bioisosteric augmentation

  • get_step_3_train_set.ipynb - Use this notebook to apply the original augmentation strategy from step 2, submission 2, and generate a new training set from the literature compounds and the uncovered set of test molecules from step 2.
  • augment3.py - Performs additional augmentation using heterocyclic replacements and generates the final training set for step 3.
  • train_augmented3.ipynb - Trains the final XGBoost model on the augmented training set and generates the model file for prediction on the Enamine catalogue.
  • predict_augmented3.py - Predicts on the Enamine catalogue using the final XGBoost model and generates the submission file (a minimal prediction sketch follows this list).
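The prediction step boils down to featurizing the cleaned catalogue and scoring it with the trained model. The sketch below reuses the fingerprint helper from the earlier sketch; the model path, cutoff and output file name are hypothetical.

```python
# Sketch of the prediction step (cf. predict_augmented3.py): featurize the
# cleaned Enamine catalogue and score it with the trained XGBoost model.
# The model file name, the cutoff of 500 and the output path are assumptions.
import pandas as pd
import xgboost as xgb

model = xgb.XGBClassifier()
model.load_model("models/xgb_augmented3.json")    # hypothetical path

ena = pd.read_csv("data/ena.smi", header=None, names=["SMILES", "ID"])
X_ena = featurize(ena["SMILES"])                  # fingerprint helper from above

ena["score"] = model.predict_proba(X_ena)[:, 1]
ena.sort_values("score", ascending=False).head(500).to_csv(
    "submissions/step3_xgb_top500.csv", index=False)   # hypothetical output
```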

Manual selection

A combination of SMARTS matching and manual selection was used to pick an interesting set of compounds based on the analysis of crystal structures and our experience during the challenge. The SMARTS matching is done in the following notebook (a minimal sketch of this kind of search over the catalogue is shown after the list):

  • step3_handpicked.ipynb - This notebook runs the bespoke SMARTS pattern search on the Enamine catalogue.
  • step3_handpicked.csv - This file contains the compounds manually selected based on the analysis of crystal structures and our experience during the challenge.
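The sketch below shows the shape of such a SMARTS filter over the cleaned catalogue; the patterns are placeholders, not the bespoke queries derived from the crystal-structure analysis.

```python
# Minimal sketch of a SMARTS-based filter over the cleaned Enamine catalogue
# (cf. step3_handpicked.ipynb). The patterns shown are placeholders, not the
# bespoke queries used in the notebook.
from rdkit import Chem

PATTERNS = [Chem.MolFromSmarts(s) for s in (
    "c1ccc2[nH]ccc2c1",      # indole core (placeholder)
    "[NX3][CX3](=O)[#6]",    # amide (placeholder)
)]

hits = []
with open("data/ena.smi") as fh:
    for line in fh:
        smi, cid = line.strip().split(",")
        mol = Chem.MolFromSmiles(smi)
        if mol and all(mol.HasSubstructMatch(p) for p in PATTERNS):
            hits.append((cid, smi))

print(f"{len(hits)} catalogue compounds match all patterns")
```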

Final prioritization

The submission files above were combined, and a dimensionality reduction technique was used to project a chemical space map from which the final compounds were selected. The following notebooks were used:
