Druggability_Research_Analysis

R&D Collaboration with BSBE, IIT Bombay

Overview

This repository contains work from an R&D project carried out in collaboration with the Department of Biosciences and Bioengineering (BSBE) at IIT Bombay. The project aims to improve the identification of druggable proteins using machine learning and deep learning models, feature extraction strategies, and bioinformatics tools.

Project Structure

The repository is organized into folders, each covering one aspect of the research:

CD_HIT_Clustering: Results from clustering non-druggable proteins with the CD-HIT tool.
Data_Extraction: Scripts and tools used for scraping and extracting data relevant to the project.
Deep_Learning_Models: Implementation of deep learning models, including autoencoders for sequence embeddings.
Domains_Extraction: Data and scripts related to the extraction and analysis of protein domains.
Genetic_Algorithm_Analysis: Analysis scripts and results for feature selection with genetic algorithms.
Multi_Modal_Pipeline_Analysis: Analysis of a multi-modal pipeline that combines different techniques to improve druggability prediction.
NLP_Analysis: Natural Language Processing (NLP) scripts for analyzing protein sequences and related data.
PCA_Analysis: Principal Component Analysis (PCA) applied to various datasets to reduce dimensionality and improve model performance.
PCP_Features_JSON: JSON files containing feature extraction results from protein characteristics and properties (PCP).
PCP_Subcellular_Analysis: Analysis related to the subcellular localization of proteins.
Protein_Sequence_Encodings: Sequence encodings and embeddings for proteins using various encoding algorithms.
Result_Compilation: Scripts for compiling and summarizing results from different models and analyses.
UNIPROT_SwissProt_Files: Files extracted from the UniProt/Swiss-Prot database for protein analysis.
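
To make the sequence encoding and NLP folders more concrete, here is a minimal sketch of one way a protein sequence can be turned into TF-IDF features: split it into overlapping k-mers, treat each k-mer as a "word", and weight it by how rare it is across sequences. This is not the repository's code. The function names, the choice of k = 3, the TF-IDF variant, and the toy sequences are all assumptions made for illustration.

```python
import math
from collections import Counter


def kmers(sequence, k=3):
    """Split a protein sequence into overlapping k-mers ("words")."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def tfidf(sequences, k=3):
    """Return one {k-mer: weight} dict per sequence.

    Uses tf = count / total k-mers and idf = log(N / df), one common
    TF-IDF variant; the repository may use a different one.
    """
    docs = [Counter(kmers(seq, k)) for seq in sequences]
    n_docs = len(docs)
    # Document frequency: in how many sequences each k-mer appears.
    df = Counter(kmer for doc in docs for kmer in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            kmer: (count / total) * math.log(n_docs / df[kmer])
            for kmer, count in doc.items()
        })
    return vectors


# Toy sequences, invented for illustration only.
toy = ["MKTAYIAK", "MKTLLVAG", "GAVLIAKQ"]
weights = tfidf(toy, k=3)
```

In this toy set, a k-mer shared by two sequences (such as "MKT") receives a lower weight than one found in a single sequence, which is the behavior TF-IDF is meant to provide.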

Key Features and Techniques

Genetic Algorithms: Used for feature selection and optimization in predicting druggability.
Deep Learning: Implementations of autoencoders and sequence models to learn representations of protein sequences.
Clustering Algorithms: CD-HIT tool for clustering and analyzing non-druggable proteins.
NLP Techniques: TF-IDF, Word2Vec embeddings, and other NLP methods applied to protein sequence data.
PCA and Dimensionality Reduction: Used to reduce the number of features, with the aim of improving model performance and interpretability.
Interpretable Machine Learning: Use of SHAP analysis for model interpretability.
Multi-Modal Pipeline: Integration of various modalities to improve the accuracy of druggability predictions.
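
As an illustration of the genetic-algorithm feature selection listed above, the sketch below evolves a binary mask over features, where 1 means the feature is kept. It is a generic, simplified version and not the repository's implementation. The function `evolve`, its parameters, and `toy_fitness` are hypothetical. In practice the fitness function would more likely be a cross-validated model score on the selected features.

```python
import random


def evolve(fitness, n_features, pop_size=20, generations=30,
           mutation_rate=0.05, seed=0):
    """Search for a binary feature mask that maximizes `fitness`."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half as parents (truncation selection).
        population.sort(key=fitness, reverse=True)
        parents = population[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # single-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


# Hypothetical stand-in for a real score: reward features 0, 3 and 7,
# and charge a small cost per selected feature.
INFORMATIVE = {0, 3, 7}


def toy_fitness(mask):
    hits = sum(1 for i in INFORMATIVE if mask[i])
    return hits - 0.1 * sum(mask)


best = evolve(toy_fitness, n_features=10)
```

Because the parents are carried over unchanged into the next generation, the best fitness in the population never decreases from one generation to the next. The fixed seed makes the run reproducible.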

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For any inquiries or collaboration opportunities, please reach out to Sahil Dhanraj Barbade via GitHub.

Acknowledgments

This project is the result of a collaboration with the Department of Biosciences and Bioengineering (BSBE), IIT Bombay.

Tags: genetic-algorithm, partitioning-algorithms, feature-extraction, drug-discovery, tf-idf, feature-engineering, clustering-algorithm, sequence-models, ensemble-machine-learning, cd-hit, interpretable-machine-learning, word2vec-embeddings, druggability, autoencoder-neural-network, nlp-deep-learning, majority-voting, encoding-algorithms, shap-analysis

Repository: SahilBarbade1203/Druggability_Research_Analysis
