First DREAM Target 2035 Drug Discovery Challenge - UCT Prague Submission

In this repository, we provide the code and data used for our submission to the first DREAM Target 2035 Drug Discovery Challenge, which focused on the WDR91 target. The challenge involved selecting compounds from a screening collection based on literature and DEL (DNA-encoded library) data, combining feature engineering, data augmentation, and both automated and manual prioritization strategies.

Introduction

All data lives in the ./data/ directory. It is not included in the repository, so you will need to download it from the associated Zenodo entry. We were also not able to provide the ./data/ena.smi file, which is the catalogue of compounds screened in the last step of the challenge. You will need to download the screening collection from the Enamine website. Note that the downloaded data is not in the same format as the one used in the challenge, so you will need to convert it to the SMILES,ID (no header) format and save it as ./data/ena.smi. We did this using the notebook clean_smis.ipynb.
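The conversion itself is straightforward; the sketch below illustrates the kind of transformation clean_smis.ipynb performs, assuming the downloaded catalogue is a CSV file with SMILES and ID columns (the input file name and column names are assumptions, not necessarily what Enamine provides).

```python
# Minimal sketch of the catalogue conversion done in clean_smis.ipynb.
# The input file name and column names ("SMILES", "ID") are assumptions;
# adjust them to match the file downloaded from Enamine.
import pandas as pd
from rdkit import Chem

cat = pd.read_csv("data/enamine_catalogue.csv")  # hypothetical download name

rows = []
for smi, cid in zip(cat["SMILES"], cat["ID"]):
    mol = Chem.MolFromSmiles(smi)
    if mol is None:          # skip entries RDKit cannot parse
        continue
    rows.append(f"{Chem.MolToSmiles(mol)},{cid}")

# SMILES,ID format with no header, as expected by the rest of the pipeline
with open("data/ena.smi", "w") as fh:
    fh.write("\n".join(rows) + "\n")
```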

Before you begin, there is also some additional code for data exploration in explore.ipynb, which contains potentially interesting insights about the data and some of the generated models. It also mentions a few preliminary scripts that you should execute in order to prepare data for some of the later steps.

Running the code

This section walks you through the notebooks and scripts that need to be executed to reproduce the submission files for each step of the challenge.

Step 1

Submission 1

This submission is a simple XGBoost model trained on literature data only; it does not use the DEL data at all. We intended it as a baseline against which to evaluate all the other models, but in the end it turned out to be the best approach to use with XGBoost. Therefore, we further developed it in the second step instead of the other models. Both training and prediction are facilitated through this notebook:

The notebook also contains models based on DEL data thresholded on TARGET_VALUE (to reduce noise from low-activity compounds) and combined with the literature data. This was better than using the DEL data alone, but still resulted in worse anticipated performance based on our test sets.
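For illustration, a model of this kind can be trained on Morgan fingerprints roughly as follows. This is a minimal sketch: the file names, column names, label handling and the threshold value are assumptions rather than the exact settings used in the notebook.

```python
# Minimal sketch of an XGBoost classifier on fingerprint features, with DEL
# rows filtered by a TARGET_VALUE threshold to reduce noise. File names,
# column names, the positive-label assumption for DEL hits and the threshold
# are illustrative assumptions, not the notebook's exact settings.
import numpy as np
import pandas as pd
import xgboost as xgb
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

def featurize(smiles):
    mols = [Chem.MolFromSmiles(s) for s in smiles]
    return np.array([fpgen.GetFingerprintAsNumPy(m) for m in mols])

lit = pd.read_csv("data/literature.csv")         # SMILES + binary label (assumed)
del_data = pd.read_csv("data/del.csv")           # SMILES + TARGET_VALUE (assumed)
del_hits = del_data[del_data["TARGET_VALUE"] > 10.0]   # arbitrary threshold

X = featurize(pd.concat([lit["SMILES"], del_hits["SMILES"]]))
y = np.concatenate([lit["label"].to_numpy(), np.ones(len(del_hits))])

model = xgb.XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.05)
model.fit(X, y)
```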

Submission 2

This submission is a simple XGBoost model trained on the combined literature and DEL data. It uses the same data as the previous submission, but adds univariate filtering of features to keep the model as simple as possible. This was done to reduce the complexity of the model and make it less sensitive to noise in the DEL data. The training and generation of submission files are facilitated through this notebook:

In the end, we did not see much difference in performance on our test sets, but we still decided to include this model in the submission so that at least one model based on the DEL data was represented.
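The univariate filtering step can be expressed as a simple scikit-learn pipeline; the sketch below uses SelectKBest, and the scoring function and number of retained features are assumptions, not the notebook's exact settings.

```python
# Sketch of univariate feature filtering prior to XGBoost training.
# The scoring function and the number of retained features are assumptions.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
import xgboost as xgb

pipe = Pipeline([
    ("filter", SelectKBest(score_func=f_classif, k=256)),  # keep 256 best features
    ("xgb", xgb.XGBClassifier(n_estimators=500, max_depth=4)),
])
# pipe.fit(X, y)  # X, y as in the previous sketch
```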

Submission 3

This was a somewhat unusual submission in which we used the UMAP embeddings from explore.ipynb to train a KNN model in the reduced space. It was not very successful on our test data, but we still decided to include it in the submissions, since the composition of the official test data suggested that a simple clustering-based approach could work and we were aware of biases in our own test data. All details, including training and the generation of the submission file, can be found in the notebook:
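The core idea is sketched below: fit UMAP on fingerprint features, train a KNN classifier in the reduced space, and score the test set with it. X_train, y_train and X_test stand for fingerprint matrices and labels prepared elsewhere, and the hyperparameters are illustrative.

```python
# Sketch of the UMAP + KNN idea: fit UMAP on training fingerprints, train a
# KNN classifier in the reduced space, then score the test set.
# X_train, y_train, X_test are assumed inputs; hyperparameters are illustrative.
import umap
from sklearn.neighbors import KNeighborsClassifier

reducer = umap.UMAP(n_components=10, n_neighbors=15, metric="jaccard")
Z_train = reducer.fit_transform(X_train)      # X_train: binary fingerprint matrix
Z_test = reducer.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=25)
knn.fit(Z_train, y_train)
scores = knn.predict_proba(Z_test)[:, 1]      # probability of being active
```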

Step 2

In this step, the SMILES of the test set became available, so we could try additional methods based on molecular graphs or 3D representations.

Submission 1

This submission is based on RDKit alignment functionality. First, 50 conformers are generated using ETKDGv3 via embedder.py. Then, these conformers are compared with 7 PDB references (data/pdb_mols.sdf) extracted from crystal structures using Shape and Color Tanimoto scores. Finally, the ranking and ensembling based on the Color score are done in the notebook Align3D_ranking.ipynb.
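A simplified version of the shape part of this workflow is sketched below: embed conformers with ETKDGv3 and score each one against the references by shape Tanimoto. The actual pipeline (embedder.py + Align3D_ranking.ipynb) also computes a Color (feature-based) score, which is omitted here.

```python
# Simplified sketch of the 3D workflow: embed 50 conformers with ETKDGv3 and
# score them against reference ligands by shape Tanimoto after O3A alignment.
# This covers only the shape part; the Color score is not reproduced here.
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolAlign, rdShapeHelpers

refs = [m for m in Chem.SDMolSupplier("data/pdb_mols.sdf", removeHs=False) if m]

def best_shape_tanimoto(smiles, n_confs=50):
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    params = AllChem.ETKDGv3()
    params.randomSeed = 42
    cids = AllChem.EmbedMultipleConfs(mol, numConfs=n_confs, params=params)
    best = 0.0
    for ref in refs:
        for cid in cids:
            # align the conformer onto the reference, then score shape overlap
            rdMolAlign.GetO3A(mol, ref, prbCid=cid).Align()
            dist = rdShapeHelpers.ShapeTanimotoDist(mol, ref, confId1=cid)
            best = max(best, 1.0 - dist)   # convert distance to similarity
    return best
```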

Submission 2

We tried to refine and augment the literature XGB model from the first step, submission 1, by using bioisosteric replacements (manually extracted from the literature) to potentially enhance its ability to pick up more diverse compounds and compensate for the small data set. The training and generation of submission files are facilitated through this notebook:
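A minimal sketch of this kind of SMARTS-driven augmentation is shown below; the replacement table is a tiny illustrative example, not the literature-derived set used in the notebook.

```python
# Sketch of SMARTS-driven bioisosteric augmentation: each (pattern, replacement)
# pair rewrites a fragment of a training molecule to create a new analogue.
# The replacement table is a tiny illustrative example, not the actual set.
from rdkit import Chem
from rdkit.Chem import AllChem

REPLACEMENTS = [
    ("c1ccccc1", "c1ccncc1"),       # phenyl -> pyridyl (example)
    ("C(=O)O", "C(=O)N"),           # carboxylic acid -> amide (example)
]

def augment(smiles):
    mol = Chem.MolFromSmiles(smiles)
    out = set()
    for patt_sma, repl_sma in REPLACEMENTS:
        patt = Chem.MolFromSmarts(patt_sma)
        repl = Chem.MolFromSmiles(repl_sma)
        if not mol.HasSubstructMatch(patt):
            continue
        for prod in AllChem.ReplaceSubstructs(mol, patt, repl):
            try:
                Chem.SanitizeMol(prod)
                out.add(Chem.MolToSmiles(prod))
            except Exception:
                pass                # discard chemically invalid products
    return sorted(out)
```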

Submission 3

Here we tried to enhance the large XGB model built on the DEL data with new shape-based descriptors, obtained in a somewhat unorthodox way: we trained regression models on the shape-based similarity scores of the official test data to the PDB references from submission 1 of this step. The regressors were then used to predict those scores on the training set, yielding an additional feature set that could enable us to leverage extra 3D information for the DEL data without the need to unblind it. The training and generation of submission files are facilitated through this notebook:

The approach did not yield satisfactory results on the test set, but we decided to include it anyway because of its unorthodox nature.
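Conceptually, the extra descriptors were produced as sketched below: fit one regressor per PDB reference on the official test set, mapping fingerprints to the 3D shape scores from submission 1, and apply those regressors to the training set. Variable names such as X_test_fp, shape_scores and X_train_fp are assumptions for inputs prepared elsewhere.

```python
# Sketch of the "predicted shape score" descriptors: one regressor per PDB
# reference, trained on the official test set (fingerprints -> 3D shape score),
# then applied to the training molecules as additional features.
# X_test_fp, shape_scores and X_train_fp are assumed to be prepared elsewhere.
import numpy as np
import xgboost as xgb

# shape_scores: array of shape (n_test_mols, n_references) from the 3D pipeline
regressors = []
for j in range(shape_scores.shape[1]):
    reg = xgb.XGBRegressor(n_estimators=300, max_depth=5)
    reg.fit(X_test_fp, shape_scores[:, j])
    regressors.append(reg)

# predicted shape scores become extra columns of the training feature matrix
extra = np.column_stack([reg.predict(X_train_fp) for reg in regressors])
X_train_aug = np.hstack([X_train_fp, extra])
```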

Step 3

In this step, only one submission is made, which consists of a final pick of compounds from the Enamine catalogue. Based on the success of the previous methods, we decided to use a combination of shape-based alignment to the references, an even more extensive bioisosteric augmentation for the XGBoost model, and two small sets of manually selected compounds based on human intuition. Several submission files are generated, which are then filtered and combined into one final set of compounds to manually prioritize and purchase.

Shape-based alignment

The following script generates the submission input file for shape-based alignment:

Enhanced bioisosteric augmentation

  • get_step_3_train_set.ipynb - Use this notebook to apply the original augmentation strategy from step 2, submission 2, and generate a new training set from the literature compounds and the uncovered set of test molecules from step 2.
  • augment3.py - Performs additional augmentation using heterocyclic replacements and generates the final training set for step 3.
  • train_augmented3.ipynb - Trains the final XGBoost model on the augmented training set and generates the model file for prediction on the Enamine catalogue.
  • predict_augmented3.py - Predicts on the Enamine catalogue using the final XGBoost model and generates the submission file (a minimal prediction sketch follows this list).
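The prediction step boils down to featurizing the cleaned catalogue and scoring it with the trained model. The sketch below reuses the fingerprint helper from the earlier sketch; the model path, cutoff and output file name are hypothetical.

```python
# Sketch of the prediction step (cf. predict_augmented3.py): featurize the
# cleaned Enamine catalogue and score it with the trained XGBoost model.
# The model file name, the cutoff of 500 and the output path are assumptions.
import pandas as pd
import xgboost as xgb

model = xgb.XGBClassifier()
model.load_model("models/xgb_augmented3.json")    # hypothetical path

ena = pd.read_csv("data/ena.smi", header=None, names=["SMILES", "ID"])
X_ena = featurize(ena["SMILES"])                  # fingerprint helper from above

ena["score"] = model.predict_proba(X_ena)[:, 1]
ena.sort_values("score", ascending=False).head(500).to_csv(
    "submissions/step3_xgb_top500.csv", index=False)   # hypothetical output
```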

Manual selection

A combination of SMARTS matching and manual selection was used to pick an interesting set of compounds based on the analysis of crystal structures and our experience during the challenge. The SMARTS matching is done in the following notebook (a minimal sketch of this kind of search over the catalogue is shown after the list):

  • step3_handpicked.ipynb - This notebook runs the bespoke SMARTS pattern search on the Enamine catalogue.
  • step3_handpicked.csv - This file contains the compounds manually selected based on the analysis of crystal structures and our experience during the challenge.
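The sketch below shows the shape of such a SMARTS filter over the cleaned catalogue; the patterns are placeholders, not the bespoke queries derived from the crystal-structure analysis.

```python
# Minimal sketch of a SMARTS-based filter over the cleaned Enamine catalogue
# (cf. step3_handpicked.ipynb). The patterns shown are placeholders, not the
# bespoke queries used in the notebook.
from rdkit import Chem

PATTERNS = [Chem.MolFromSmarts(s) for s in (
    "c1ccc2[nH]ccc2c1",      # indole core (placeholder)
    "[NX3][CX3](=O)[#6]",    # amide (placeholder)
)]

hits = []
with open("data/ena.smi") as fh:
    for line in fh:
        smi, cid = line.strip().split(",")
        mol = Chem.MolFromSmiles(smi)
        if mol and all(mol.HasSubstructMatch(p) for p in PATTERNS):
            hits.append((cid, smi))

print(f"{len(hits)} catalogue compounds match all patterns")
```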

Final prioritization

The submission files above were combined, and a dimensionality reduction technique was used to project a chemical space map from which the final compounds were selected. The following notebooks were used:
