- António Duarte | PG45464
- Roberto Bullitta | PG45474
- Vânia Miguel | PG45971
- Files (folder) -> contains files generated during the development of this project (.txt and .csv); the following files could not be uploaded due to their large size:
- data_train.csv -> training data (after feature extraction)
- data_test.csv -> test data (after feature extraction)
- X_train_sc.csv -> scaled training data (minmax scaler)
- X_train_sc_z.csv -> scaled training data (standard scaler)
- Experimental (folder) -> contains experimental procedures (.ipynb)
- GBR_clust.ipynb (file) -> trains distinct models based on a previous clustering analysis
- MLP_hyper.ipynb (file) -> optimizes MLP hyper-parameters
- feature_extraction.py (file) -> contains functions for feature extraction
- C_P_FE.ipynb (file) -> context, preprocessing and feature extraction
- AE.ipynb (file) -> exploratory analysis
- UL.ipynb (file) -> unsupervised learning
- SL.ipynb (file) -> supervised learning
- DL_MLP.ipynb (file) -> deep learning (multilayer perceptron)
- DL_CNN.ipynb (file) -> deep learning (convolutional neural network)
- Contextualization
- Preprocessing
- Feature Extraction
- Exploratory Analysis
- Unsupervised Learning
- Supervised Learning (shallow learning)
- Deep Learning
The data used in this report were retrieved from Kaggle: https://www.kaggle.com/c/novozymes-enzyme-stability-prediction
Each data example consists of a protein sequence, a pH value and a thermostability index. Predicting thermostability is fundamental in enzyme engineering for a wide variety of applications, and machine learning techniques are valuable here because they save the time and cost of experimental screening.
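The data can be loaded with pandas. The sketch below builds a toy frame instead of reading the real CSVs, so it is self-contained; the column names (`seq_id`, `protein_sequence`, `pH`, `tm`) are assumptions based on the competition page and should be verified against the local files.

```python
import pandas as pd

# Toy frame mimicking the Kaggle schema (column names are assumptions;
# check the headers of the downloaded train.csv).
train = pd.DataFrame({
    "seq_id": [0, 1, 2],
    "protein_sequence": ["MKTAYIAKQR", "MALWMRLLPL", "MVLSPADKTN"],
    "pH": [7.0, 5.5, 8.0],
    "tm": [48.2, 51.7, 39.5],  # thermostability index (melting temperature)
})

# Separate raw inputs from the regression target
X = train[["protein_sequence", "pH"]]
y = train["tm"]
print(X.shape, y.shape)
```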
- Data processing
- Identification of the train and test datasets
- Treatment of missing values
- Identification of the descriptors with BioPython and ProPy
- Outlier treatment (dependent variable)
- Standardization (min-max and standard scaling)
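A rough sketch of the processing steps above, assuming scikit-learn. The actual notebooks derive descriptors with BioPython and ProPy; the hand-rolled amino-acid composition here is only a stand-in, and the toy values (including the artificial outlier) are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq: str) -> dict:
    """Fraction of each amino acid (stand-in for BioPython/ProPy descriptors)."""
    n = len(seq)
    return {f"frac_{aa}": seq.count(aa) / n for aa in AMINO_ACIDS}

df = pd.DataFrame({
    "protein_sequence": ["MKTAYIAKQR", "MALWMRLLPL", "MVLSPADKTN", "GGSSGG"],
    "pH": [7.0, np.nan, 8.0, 6.5],
    "tm": [48.2, 51.7, 120.0, 39.5],  # 120.0 acts as an artificial outlier
})

# 1) Missing values: impute pH with the median
df["pH"] = df["pH"].fillna(df["pH"].median())

# 2) Descriptors: one composition vector per sequence, plus pH
feats = df["protein_sequence"].apply(aa_composition).apply(pd.Series)
X = pd.concat([feats, df[["pH"]]], axis=1)

# 3) Outlier treatment on the dependent variable (IQR clipping)
q1, q3 = df["tm"].quantile([0.25, 0.75])
iqr = q3 - q1
y = df["tm"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 4) Scaling: both variants kept in the repo (X_train_sc / X_train_sc_z)
X_minmax = MinMaxScaler().fit_transform(X)
X_standard = StandardScaler().fit_transform(X)
print(X_minmax.shape, X_standard.shape)
```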
- Thermostability value distribution
- Correlation Analysis (Pearson, Spearman, f_regression, mutual_information)
- Multicollinearity Analysis
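The four relevance measures listed above can be compared side by side. The data below is synthetic, built so that one feature is linear, one non-linear, and one irrelevant; the F-test only captures the linear effect, while mutual information also sees the non-linear one.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# feature 0: linear, feature 1: quadratic, feature 2: irrelevant
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# Linear (Pearson) and rank (Spearman) correlations per feature
for j in range(X.shape[1]):
    r, _ = pearsonr(X[:, j], y)
    rho, _ = spearmanr(X[:, j], y)
    print(f"feature {j}: pearson={r:+.2f} spearman={rho:+.2f}")

# F-test measures linear dependence only; mutual information
# also picks up the quadratic effect of feature 1
F, _ = f_regression(X, y)
mi = mutual_info_regression(X, y, random_state=0)
print("F:", F.round(1))
print("MI:", mi.round(2))
```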
- PCA
- tSNE
- K-means
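A minimal sketch of the three unsupervised steps above, run on synthetic two-cluster data as a stand-in for the real scaled descriptor matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two well-separated synthetic clusters in a 10-D feature space
X = np.vstack([rng.normal(0, 1, (100, 10)), rng.normal(4, 1, (100, 10))])
X_sc = StandardScaler().fit_transform(X)

# PCA: linear projection; explained variance guides how many components to keep
pca = PCA(n_components=2).fit(X_sc)
X_pca = pca.transform(X_sc)
print("explained variance ratio:", pca.explained_variance_ratio_.round(2))

# t-SNE: non-linear 2-D embedding, for visualization only
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_sc)

# K-means on the scaled features (k chosen here to match the synthetic data)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_sc)
print("cluster sizes:", np.bincount(labels))
```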
- Cross-validation, Hyper-parameter tuning, and Model selection
- Linear Regression
- K-Nearest Neighbors
- Support Vector Machine
- Random Forest
- Adaptive Boosting
- Gradient Boosting
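The protocol above can be sketched with scikit-learn: compare the candidate families under the same cross-validation scheme, then tune the best one. The toy regression data and the tiny parameter grid are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Model selection: same 5-fold CV protocol for every candidate
candidates = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(),
    "rf": RandomForestRegressor(n_estimators=50, random_state=0),
    "gbr": GradientBoostingRegressor(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: R2 = {scores.mean():.3f} +/- {scores.std():.3f}")

# Hyper-parameter tuning for one family (grid kept tiny for the sketch)
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]},
    cv=5, scoring="r2",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```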
- Curricular Unit Slides
- Deep Learning: https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/
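The reference above grid-searches Keras hyper-parameters; as a self-contained equivalent, the sketch below uses scikit-learn's MLPRegressor as a stand-in for the Keras model. The architectures and alpha values in the grid are arbitrary illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# Scale inside the pipeline so each CV fold fits its own scaler
pipe = make_pipeline(StandardScaler(), MLPRegressor(max_iter=2000, random_state=0))

# Grid over architecture and L2 regularization strength
param_grid = {
    "mlpregressor__hidden_layer_sizes": [(32,), (64, 32)],
    "mlpregressor__alpha": [1e-4, 1e-2],
}
search = GridSearchCV(pipe, param_grid, cv=3, scoring="r2")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```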