A machine learning model, built on the AlphaFold dataset of protein 3D structures, that predicts which parts of a protein can bind to pharmaceutical drugs.

w12l3-c/Drug-Binding-Protein-Prediction

Drug-Binding-Protein

In the notebook, I compared several models, including XGBoost, LightGBM, and K-Nearest Neighbors. Because the dataset is extremely imbalanced, capturing as many true positives as possible matters most, so a model that produces more false positives than false negatives is preferred.
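To make the false positive versus false negative trade-off concrete, here is a minimal sketch in plain Python. The labels and the two sets of predictions are made up for illustration and do not come from the notebook; the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the binding class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels (illustrative only): 2 positives among 10 samples.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A cautious model misses one positive: 1 false negative, 0 false positives.
cautious = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
# An eager model finds both positives: 0 false negatives, 2 false positives.
eager = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]

print("cautious:", precision_recall(y_true, cautious))
print("eager:   ", precision_recall(y_true, eager))
```

The eager predictions trade precision for recall. That is the direction preferred here, since a missed binding site costs more than a false alarm.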

There are several ways to handle the class imbalance. After extensive testing based on the classification report in sklearn (F1 score and precision), class weights performed better than oversampling or undersampling methods such as SMOTE, SVMSMOTE, and NearMiss. With class weights, the F1 score is 37% for the positive class and 98% for the negative class. The ROC curve is less informative here because the dataset is imbalanced.
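As a rough illustration of how class weights are commonly derived, the sketch below computes inverse-frequency weights in plain Python. The formula `n_samples / (n_classes * count)` is the one scikit-learn uses for `class_weight="balanced"`, and the negative-to-positive ratio is the value usually suggested for XGBoost's `scale_pos_weight`. Whether the notebook uses these exact settings is an assumption, and the 95/5 split is invented for the example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def f1_for_class(y_true, y_pred, cls):
    """F1 score when `cls` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical imbalanced labels: 95 non-binding (0) and 5 binding (1) samples.
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)

# Ratio of negatives to positives, as typically passed to scale_pos_weight.
pos_weight = labels.count(0) / labels.count(1)

print(weights)
print(pos_weight)
```

Reporting `f1_for_class` separately for class 1 and class 0 gives the same per-class view as the report quoted above, where the rare positive class scores much lower than the majority class.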

After training and evaluating the models, the final model is applied to a second, unlabeled dataset to predict which entries are compatible with drug binding.
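The prediction step could look roughly like the sketch below. It assumes the trained model exposes a scikit-learn-style `predict_proba` method; `StandInModel`, the feature values, and the adjustable threshold are all hypothetical placeholders and are not taken from the notebook.

```python
def predict_binding(model, features, threshold=0.5):
    """Label a row 1 (predicted binder) if P(class 1) >= threshold, else 0."""
    probabilities = model.predict_proba(features)
    return [1 if row[1] >= threshold else 0 for row in probabilities]


class StandInModel:
    """Hypothetical placeholder that mimics the predict_proba interface."""

    def predict_proba(self, features):
        # Treat the single feature as the positive-class probability.
        return [[1.0 - row[0], row[0]] for row in features]


# Made-up unlabeled rows with one feature each.
unlabeled = [[0.10], [0.35], [0.80]]
model = StandInModel()

default_labels = predict_binding(model, unlabeled)
lenient_labels = predict_binding(model, unlabeled, threshold=0.3)

print(default_labels)
print(lenient_labels)
```

Lowering the threshold is one way to favor catching more potential binders at the cost of extra false positives. It is shown here only as an option, not as what the notebook does.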

Dataset:
https://drive.google.com/file/d/1H6oqtp9buAjO8NKQEW_jDzRd-4-qgQPF/view?usp=sharing
https://drive.google.com/file/d/1pr2_xiH7gEOnPtg8yqSevZDF_l0ak387/view?usp=sharing
