- António Duarte | PG45464
- Roberto Bullitta | PG45474
- Vânia Miguel | PG45971
- Files (folder) -> contains files generated during the development of this project (.txt and .csv); the following files could not be uploaded due to their large size:
- data_train.csv -> training data (after feature extraction)
- data_test.csv -> test data (after feature extraction)
- X_train_sc.csv -> scaled training data (minmax scaler)
- X_train_sc_z.csv -> scaled training data (standard scaler)
- Experimental (folder) -> contains experimental procedures (.ipynb)
- GBR_clust.ipynb (file) -> trains distinct models based on a previous clustering analysis
- MLP_hyper.ipynb (file) -> optimizes MLP hyper-parameters
- feature_extraction.py (file) -> contains functions for feature extraction
- C_P_FE.ipynb (file) -> context, preprocessing and feature extraction
- AE.ipynb (file) -> exploratory analysis
- UL.ipynb (file) -> unsupervised learning
- SL.ipynb (file) -> supervised learning
- DL_MLP.ipynb (file) -> deep learning (multilayer perceptron)
- DL_CNN.ipynb (file) -> deep learning (convolutional neural network)
- Contextualization
- Preprocessing
- Feature Extraction
- Exploratory Analysis
- Unsupervised Learning
- Supervised Learning (shallow learning)
- Deep Learning
The data used in this report were retrieved from Kaggle: https://www.kaggle.com/c/novozymes-enzyme-stability-prediction
Each data example consists of a protein sequence, a pH value and a thermostability index. Predicting thermostability is fundamental in enzyme engineering for a wide variety of applications, and machine learning techniques are valuable here because they save the time and cost of experimental screening.
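The data can be loaded with pandas. The sketch below builds a toy frame instead of reading the real CSVs, so it is self-contained; the column names (`seq_id`, `protein_sequence`, `pH`, `tm`) are assumptions based on the competition page and should be verified against the local files.

```python
import pandas as pd

# Toy frame mimicking the Kaggle schema (column names are assumptions;
# check the headers of the downloaded train.csv).
train = pd.DataFrame({
    "seq_id": [0, 1, 2],
    "protein_sequence": ["MKTAYIAKQR", "MALWMRLLPL", "MVLSPADKTN"],
    "pH": [7.0, 5.5, 8.0],
    "tm": [48.2, 51.7, 39.5],  # thermostability index (melting temperature)
})

# Separate raw inputs from the regression target
X = train[["protein_sequence", "pH"]]
y = train["tm"]
print(X.shape, y.shape)
```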
- Data processing
- Identification of the train and test datasets
- Treatment of missing values
- Identification of the descriptors with BioPython and ProPy
- Outlier treatment (dependent variable)
- Standardization (min-max and standard scaling)
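A rough sketch of the processing steps above, assuming scikit-learn. The actual notebooks derive descriptors with BioPython and ProPy; the hand-rolled amino-acid composition here is only a stand-in, and the toy values (including the artificial outlier) are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq: str) -> dict:
    """Fraction of each amino acid (stand-in for BioPython/ProPy descriptors)."""
    n = len(seq)
    return {f"frac_{aa}": seq.count(aa) / n for aa in AMINO_ACIDS}

df = pd.DataFrame({
    "protein_sequence": ["MKTAYIAKQR", "MALWMRLLPL", "MVLSPADKTN", "GGSSGG"],
    "pH": [7.0, np.nan, 8.0, 6.5],
    "tm": [48.2, 51.7, 120.0, 39.5],  # 120.0 acts as an artificial outlier
})

# 1) Missing values: impute pH with the median
df["pH"] = df["pH"].fillna(df["pH"].median())

# 2) Descriptors: one composition vector per sequence, plus pH
feats = df["protein_sequence"].apply(aa_composition).apply(pd.Series)
X = pd.concat([feats, df[["pH"]]], axis=1)

# 3) Outlier treatment on the dependent variable (IQR clipping)
q1, q3 = df["tm"].quantile([0.25, 0.75])
iqr = q3 - q1
y = df["tm"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 4) Scaling: both variants kept in the repo (X_train_sc / X_train_sc_z)
X_minmax = MinMaxScaler().fit_transform(X)
X_standard = StandardScaler().fit_transform(X)
print(X_minmax.shape, X_standard.shape)
```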
- Thermostability value distribution
- Correlation Analysis (Pearson, Spearman, f_regression, mutual_information)
- Multicollinearity Analysis
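The four relevance measures listed above can be compared side by side. The data below is synthetic, built so that one feature is linear, one non-linear, and one irrelevant; the F-test only captures the linear effect, while mutual information also sees the non-linear one.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# feature 0: linear, feature 1: quadratic, feature 2: irrelevant
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# Linear (Pearson) and rank (Spearman) correlations per feature
for j in range(X.shape[1]):
    r, _ = pearsonr(X[:, j], y)
    rho, _ = spearmanr(X[:, j], y)
    print(f"feature {j}: pearson={r:+.2f} spearman={rho:+.2f}")

# F-test measures linear dependence only; mutual information
# also picks up the quadratic effect of feature 1
F, _ = f_regression(X, y)
mi = mutual_info_regression(X, y, random_state=0)
print("F:", F.round(1))
print("MI:", mi.round(2))
```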
- PCA
- tSNE
- K-means
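A minimal sketch of the three unsupervised steps above, run on synthetic two-cluster data as a stand-in for the real scaled descriptor matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two well-separated synthetic clusters in a 10-D feature space
X = np.vstack([rng.normal(0, 1, (100, 10)), rng.normal(4, 1, (100, 10))])
X_sc = StandardScaler().fit_transform(X)

# PCA: linear projection; explained variance guides how many components to keep
pca = PCA(n_components=2).fit(X_sc)
X_pca = pca.transform(X_sc)
print("explained variance ratio:", pca.explained_variance_ratio_.round(2))

# t-SNE: non-linear 2-D embedding, for visualization only
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_sc)

# K-means on the scaled features (k chosen here to match the synthetic data)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_sc)
print("cluster sizes:", np.bincount(labels))
```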
- Cross-validation, Hyper-parameter tuning, and Model selection
- Linear Regression
- K-Nearest Neighbors
- Support Vector Machine
- Random Forest
- Adaptive Boosting
- Gradient Boosting
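The protocol above can be sketched with scikit-learn: compare the candidate families under the same cross-validation scheme, then tune the best one. The toy regression data and the tiny parameter grid are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Model selection: same 5-fold CV protocol for every candidate
candidates = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(),
    "rf": RandomForestRegressor(n_estimators=50, random_state=0),
    "gbr": GradientBoostingRegressor(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: R2 = {scores.mean():.3f} +/- {scores.std():.3f}")

# Hyper-parameter tuning for one family (grid kept tiny for the sketch)
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]},
    cv=5, scoring="r2",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```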
- Curricular Unit Slides
- Deep Learning: https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/
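The reference above grid-searches Keras hyper-parameters; as a self-contained equivalent, the sketch below uses scikit-learn's MLPRegressor as a stand-in for the Keras model. The architectures and alpha values in the grid are arbitrary illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# Scale inside the pipeline so each CV fold fits its own scaler
pipe = make_pipeline(StandardScaler(), MLPRegressor(max_iter=2000, random_state=0))

# Grid over architecture and L2 regularization strength
param_grid = {
    "mlpregressor__hidden_layer_sizes": [(32,), (64, 32)],
    "mlpregressor__alpha": [1e-4, 1e-2],
}
search = GridSearchCV(pipe, param_grid, cv=3, scoring="r2")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```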