R&D Collaboration with BSBE, IIT Bombay
This repository contains the work conducted as part of an R&D project in collaboration with the Department of Biosciences and Bioengineering (BSBE) at IIT Bombay. The primary focus of this project is to advance the identification of druggable proteins by employing a variety of machine learning and deep learning techniques, feature extraction strategies, and bioinformatics tools.
The repository is organized into the following folders, each covering one aspect of the research:
- CD_HIT_Clustering: Contains results from clustering non-druggable proteins using the CD-HIT tool.
- Data_Extraction: Scripts and tools used for scraping and extracting data relevant to the project.
- Deep_Learning_Models: Implementation of deep learning models, including autoencoders for sequence embeddings.
- Domains_Extraction: Data and scripts related to the extraction and analysis of protein domains.
- Genetic_Algorithm_Analysis: Contains analysis scripts and results using genetic algorithms for feature selection.
- Multi_Modal_Pipeline_Analysis: Multi-modal pipeline analysis combining different techniques to enhance druggability prediction.
- NLP_Analysis: Natural Language Processing (NLP) scripts for analyzing protein sequences and related data.
- PCA_Analysis: Principal Component Analysis (PCA) applied to various datasets to reduce dimensionality and improve model performance.
- PCP_Features_JSON: JSON files containing feature extraction results from protein characteristics and properties (PCP).
- PCP_Subcellular_Analysis: Analysis related to the subcellular localization of proteins.
- Protein_Sequence_Encodings: Sequence encodings and embeddings for proteins using various encoding algorithms.
- Result_Compilation: Scripts for compiling and summarizing results from different models and analyses.
- UNIPROT_SwissProt_Files: Contains files extracted from the UniProt/Swiss-Prot database for protein analysis.
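To make the idea of a sequence encoding (as in Protein_Sequence_Encodings) concrete, here is a minimal sketch of one common scheme, amino acid composition, which maps a protein sequence to a fixed-length vector of residue frequencies. This is an illustration only and is not taken from the repository's scripts; the function name `amino_acid_composition` and the toy sequence are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> list[float]:
    """Return the fraction of each standard amino acid in the sequence."""
    seq = sequence.upper()
    # Ignore non-standard symbols (e.g. X, B, Z) rather than guessing at them.
    counts = Counter(residue for residue in seq if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy sequence, not taken from the project's data.
example = "MKTAYIAKQR"
vector = amino_acid_composition(example)
print(dict(zip(AMINO_ACIDS, vector)))
```

The resulting 20-dimensional vector has the same length for every protein, so it can be fed to a classifier or a dimensionality reduction step regardless of sequence length.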
The project draws on the following techniques:
- Genetic Algorithms: Used for feature selection and optimization in predicting druggability.
- Deep Learning: Implementations of autoencoders and sequence models to learn representations of protein sequences.
- Clustering Algorithms: CD-HIT tool for clustering and analyzing non-druggable proteins.
- NLP Techniques: TF-IDF, Word2Vec embeddings, and other NLP methods applied to protein sequence data.
- PCA and Dimensionality Reduction: Applied to reduce feature dimensionality, with the aim of improving model performance and interpretability.
- Interpretable Machine Learning: Use of SHAP analysis for model interpretability.
- Multi-Modal Pipeline: Integration of various modalities to improve the accuracy of druggability predictions.
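As a rough illustration of how an NLP method such as TF-IDF can be applied to protein sequences, the sketch below treats overlapping k-mers as "words" and each sequence as a "document". It is an assumption about the general approach, not the project's actual implementation: the helper names `kmers` and `tfidf`, the choice of k = 3, and the toy sequences are all hypothetical, and the weighting is the plain unsmoothed form tf × ln(N / df).

```python
import math
from collections import Counter


def kmers(sequence: str, k: int = 3) -> list[str]:
    """Split a sequence into overlapping k-mers ("words")."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def tfidf(sequences: list[str], k: int = 3) -> list[dict[str, float]]:
    """Compute per-sequence TF-IDF weights as tf * ln(N / df)."""
    docs = [Counter(kmers(seq, k)) for seq in sequences]
    n_docs = len(docs)
    # Document frequency: number of sequences that contain each k-mer.
    df = Counter(kmer for doc in docs for kmer in doc)
    weights = []
    for doc in docs:
        total = sum(doc.values())
        weights.append({
            kmer: (count / total) * math.log(n_docs / df[kmer])
            for kmer, count in doc.items()
        })
    return weights


# Hypothetical toy sequences, not taken from the project's data.
sequences = ["MKTAYI", "MKTQRL", "GGSAYI"]
weights = tfidf(sequences)
for seq, w in zip(sequences, weights):
    print(seq, {kmer: round(value, 3) for kmer, value in w.items()})
```

K-mers shared by several sequences (here `MKT` and `AYI`) receive lower weights than k-mers unique to one sequence. Library implementations often apply additional smoothing and normalization, so their values will generally differ from this plain variant.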
This project is licensed under the MIT License - see the LICENSE file for details.
For any inquiries or collaboration opportunities, please reach out to Sahil Dhanraj Barbade via GitHub.
This project is a result of a collaborative effort with the Department of Biosciences and Bioengineering (BSBE), IIT Bombay.
Tags: genetic-algorithm, partitioning-algorithms, feature-extraction, drug-discovery, tf-idf, feature-engineering, clustering-algorithm, sequence-models, ensemble-machine-learning, cd-hit, interpretable-machine-learning, word2vec-embeddings, druggability, autoencoder-neural-network, nlp-deep-learning, majority-voting, encoding-algorithms, shap-analysis