vaniamiguel13/SIB

SIB

Authors:

Repository Guide:

  • Files (folder) -> contains files generated during the development of this project (.txt and .csv); the following files could not be uploaded due to their large size:
    • data_train.csv -> training data (after feature extraction)
    • data_test.csv -> test data (after feature extraction)
    • X_train_sc.csv -> scaled training data (minmax scaler)
    • X_train_sc_z.csv -> scaled training data (standard scaler)
  • Experimental (folder) -> contains experimental procedures (.ipynb)
    • GBR_clust.ipynb (file) -> train distinct models based on a previous clustering analysis
    • MLP_hyper.ipynb (file) -> optimize MLP hyper-parameters
  • feature_extraction.py (file) -> contains functions for feature extraction
  • C_P_FE.ipynb (file) -> context, preprocessing and feature extraction
  • AE.ipynb (file) -> exploratory analysis
  • UL.ipynb (file) -> unsupervised learning
  • SL.ipynb (file) -> supervised learning
  • DL_MLP.ipynb (file) -> deep learning (multilayer perceptron)
  • DL_CNN.ipynb (file) -> deep learning (convolutional neural network)
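
The two scaled training files listed above differ only in the scaler applied. As a rough illustration of that difference, here is a minimal plain-Python sketch of min-max scaling and standard (z-score) scaling on a single column. The function names and the sample column are hypothetical and are not taken from the notebooks, which presumably rely on a library scaler instead.

```python
from statistics import mean, pstdev


def minmax_scale(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    span = hi - lo
    # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
    return [(v - lo) / span if span else 0.0 for v in values]


def standard_scale(values):
    """Centre values on 0 and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    # Same guard as above for a constant column.
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


# Made-up feature column, for illustration only.
column = [2.0, 4.0, 6.0, 8.0]
print(minmax_scale(column))    # smallest value maps to 0.0, largest to 1.0
print(standard_scale(column))  # result has mean 0 and unit variance
```

Min-max scaling keeps the shape of the distribution but is sensitive to extreme values, while standard scaling is usually the safer default for models that assume centred inputs.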

Content Guide:

  • Contextualization
  • Preprocessing
  • Feature Extraction
  • Exploratory Analysis
  • Unsupervised Learning
  • Supervised Learning (shallow learning)
  • Deep learning

The data used in this report was retrieved from Kaggle: https://www.kaggle.com/c/novozymes-enzyme-stability-prediction
Each data example consists of a protein sequence, a pH value, and a thermostability index. Predicting thermostability is fundamental in enzyme engineering for a wide variety of applications. ML techniques are valuable for this task because they can save time and money.
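
To make the shape of one data example concrete, the sketch below turns a protein sequence and a pH value into a numeric feature row using amino-acid composition, one of the simplest sequence descriptors. It is an illustration under stated assumptions: the function names and the example sequence are hypothetical, and the descriptors actually computed in feature_extraction.py with BioPython and ProPy are likely much richer.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a sequence."""
    counts = Counter(sequence.upper())
    # Only standard residues count towards the total, so the fractions sum to 1.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def featurize(sequence, ph):
    """Build one feature row: 20 composition fractions, sequence length, pH."""
    comp = amino_acid_composition(sequence)
    return [comp[aa] for aa in AMINO_ACIDS] + [len(sequence), ph]


# Made-up sequence and pH, not taken from the dataset.
example_row = featurize("MKVLAAGIVG", 7.0)
print(len(example_row))  # 20 fractions + length + pH
```

The thermostability index would then be the regression target paired with each such row.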

  • Data processing
  • Identification of the train and test datasets
  • Treatment of Missing values
  • Identification of the Descriptors with BioPython and ProPy
  • Outlier Treatment (dependent variable)
  • Standardization
  • Thermostability value distribution
  • Correlation Analysis (Pearson, Spearman, f_regression, mutual_information)
  • Multicollinearity Analysis
  • PCA
  • t-SNE
  • K-means
  • Cross-validation, Hyper-parameter tuning, and Model selection
  • Linear Regression
  • K-Nearest Neighbors
  • Support Vector Machine
  • Random Forest
  • Adaptive Boosting
  • Gradient Boosting
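
The correlation analysis step listed above compares Pearson and Spearman coefficients between descriptors and the thermostability value. As a hedged illustration of how the two differ, here is a minimal plain-Python sketch; the helper names and the sample data are hypothetical, and the notebooks may well use library routines instead.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up data: the target grows monotonically but not linearly.
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
target = [1.0, 4.0, 9.0, 16.0, 25.0]
print(pearson(feature, target), spearman(feature, target))
```

Because Spearman only looks at ordering, it reaches 1 for any strictly increasing relationship, whereas Pearson drops below 1 as soon as the relationship bends away from a straight line.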

Deep learning

Credits:
