SahilBarbade1203/Druggability_Research_Analysis

Druggability_Research_Analysis

R&D Collaboration with BSBE, IIT Bombay

Overview

This repository contains work carried out for an R&D project in collaboration with the Department of Biosciences and Bioengineering (BSBE) at IIT Bombay. The project aims to improve the identification of druggable proteins using machine learning and deep learning models, feature extraction strategies, and bioinformatics tools.

Project Structure

The repository is organized into the following folders, each covering one aspect of the research:

  • CD_HIT_Clustering: Contains results from clustering non-druggable proteins using the CD-HIT tool.
  • Data_Extraction: Scripts and tools used for scraping and extracting data relevant to the project.
  • Deep_Learning_Models: Implementation of deep learning models, including autoencoders for sequence embeddings.
  • Domains_Extraction: Data and scripts related to the extraction and analysis of protein domains.
  • Genetic_Algorithm_Analysis: Contains analysis scripts and results using genetic algorithms for feature selection.
  • Multi_Modal_Pipeline_Analysis: Analysis of a multi-modal pipeline that combines different techniques to improve druggability prediction.
  • NLP_Analysis: Natural Language Processing (NLP) scripts for analyzing protein sequences and related data.
  • PCA_Analysis: Principal Component Analysis (PCA) applied to various datasets to reduce dimensionality and improve model performance.
  • PCP_Features_JSON: JSON files containing feature extraction results from protein characteristics and properties (PCP).
  • PCP_Subcellular_Analysis: Analysis related to the subcellular localization of proteins.
  • Protein_Sequence_Encodings: Sequence encodings and embeddings for proteins using various encoding algorithms.
  • Result_Compilation: Scripts for compiling and summarizing results from different models and analyses.
  • UNIPROT_SwissProt_Files: Contains files extracted from the UniProt/Swiss-Prot database for protein analysis.
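
As a rough illustration of the kind of encoding the NLP_Analysis and Protein_Sequence_Encodings folders deal with, the sketch below treats overlapping k-mers as "words" and computes TF-IDF weights in plain Python. It is not taken from the repository: the function names, the choice of k = 3, the smoothed IDF variant, and the toy sequences are all assumptions made for this example.

```python
import math
from collections import Counter

def kmers(sequence, k=3):
    """Split a protein sequence into overlapping k-mers ("words")."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def tfidf(sequences, k=3):
    """Return (vocabulary, rows); each row is the TF-IDF vector of one sequence."""
    docs = [Counter(kmers(s, k)) for s in sequences]
    vocab = sorted(set().union(*docs))
    n = len(docs)
    # Document frequency: number of sequences that contain each k-mer.
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    # Smoothed IDF (one common variant, assumed here): ln((1 + n) / (1 + df)) + 1
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1.0 for w in vocab}
    rows = []
    for d in docs:
        total = sum(d.values()) or 1  # avoid division by zero for very short sequences
        rows.append([d[w] / total * idf[w] for w in vocab])
    return vocab, rows

# Toy sequences, made up for illustration only.
toy = ["MKTAYIAK", "MKTAYQRQ", "GGSGGSGG"]
vocab, rows = tfidf(toy, k=3)
```

Each row of `rows` can then serve as a fixed-length feature vector for a downstream classifier. K-mers shared by many sequences receive a lower weight than k-mers specific to a few.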

Key Features and Techniques

  • Genetic Algorithms: Used for feature selection and optimization in predicting druggability.
  • Deep Learning: Implementations of autoencoders and sequence models to learn representations of protein sequences.
  • Clustering Algorithms: CD-HIT tool for clustering and analyzing non-druggable proteins.
  • NLP Techniques: TF-IDF, Word2Vec embeddings, and other NLP methods applied to protein sequence data.
  • PCA and Dimensionality Reduction: Used to reduce the number of features, with the aim of improving model interpretability and performance.
  • Interpretable Machine Learning: Use of SHAP analysis for model interpretability.
  • Multi-Modal Pipeline: Integration of various modalities to improve the accuracy of druggability predictions.
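
To make the genetic-algorithm feature selection mentioned above more concrete, here is a minimal sketch of the general technique, not the repository's implementation. It evolves a boolean feature mask with tournament selection, single-point crossover, bit-flip mutation, and elitism. The function name, all hyperparameters, and the toy fitness function are hypothetical. In practice the fitness would more plausibly be a cross-validated score of a classifier trained on the selected features.

```python
import random

def genetic_feature_selection(fitness, n_features, pop_size=20, generations=30,
                              mutation_rate=0.05, seed=0):
    """Evolve a boolean feature mask that maximizes fitness(mask).

    Assumes n_features >= 2 (needed for single-point crossover).
    """
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def tournament():
        # Pick the fitter of two randomly chosen individuals.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randrange(1, n_features)  # single-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation: each gene flips with probability mutation_rate.
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best

# Hypothetical toy problem: features 0, 3 and 5 are "useful", the rest are noise.
INFORMATIVE = {0, 3, 5}

def toy_fitness(mask):
    hits = sum(1 for i, keep in enumerate(mask) if keep and i in INFORMATIVE)
    noise = sum(1 for i, keep in enumerate(mask) if keep and i not in INFORMATIVE)
    return hits - 0.5 * noise  # reward useful features, penalize noisy ones

best_mask = genetic_feature_selection(toy_fitness, n_features=10)
selected = [i for i, keep in enumerate(best_mask) if keep]
```

Because the best mask is copied into every new generation, the best fitness never decreases from one generation to the next. The fixed `seed` keeps runs reproducible.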

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For any inquiries or collaboration opportunities, please reach out to Sahil Dhanraj Barbade via GitHub.

Acknowledgments

This project is the result of a collaborative effort with the Department of Biosciences and Bioengineering (BSBE), IIT Bombay.


Tags: genetic-algorithm, partitioning-algorithms, feature-extraction, drug-discovery, tf-idf, feature-engineering, clustering-algorithm, sequence-models, ensemble-machine-learning, cd-hit, interpretable-machine-learning, word2vec-embeddings, druggability, autoencoder-neural-network, nlp-deep-learning, majority-voting, encoding-algorithms, shap-analysis