Materials and problem sets for the course Machine Learning for Molecular Engineering taught at MIT.
Instructors: Prof. Connor Coley, Prof. Rafael Gomez-Bombarelli, Prof. Ernest Fraenkel, Prof. Joey Davis
Warning: These assignments are a work in progress and will change. Do not start working on an assignment until it has been released on Canvas.
All (3.C01/3.C51, 7.C01/7.C51, 10.C01/10.C51, 20.C01/20.C51)
Ungraded problem set to practice using Google Colab and NumPy.
Linear classification problem to get you started for the course. You will use logistic regression to diagnose cancer (data size: ~10^2), apply linear methods with L1 and L2 regularization, and examine how each penalty affects your regression results.
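To give a feel for how the two penalties differ, here is a minimal NumPy sketch of regularized logistic regression. It is not the problem set's code: the data are synthetic, and `fit_logistic`, `lam`, and `lr` are hypothetical names chosen for illustration. The L1 case uses a soft-thresholding step, one common way to handle the non-smooth penalty.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, penalty="l2", lam=0.1, lr=0.1, n_steps=2000):
    """Fit logistic regression by (proximal) gradient descent.

    penalty="l2" adds lam * w to the gradient of the weights.
    penalty="l1" soft-thresholds the weights after each step,
    which can set some of them exactly to zero.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_steps):
        p = sigmoid(X @ w + b)
        grad_w = X.T @ (p - y) / n   # gradient of the mean log loss
        grad_b = np.mean(p - y)
        if penalty == "l2":
            grad_w = grad_w + lam * w
        w = w - lr * grad_w
        b = b - lr * grad_b
        if penalty == "l1":
            w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w, b

# Synthetic data (an assumption): only the first two features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[:2] = [2.0, -2.0]
y = (sigmoid(X @ true_w) > rng.uniform(size=200)).astype(float)

w_l1, _ = fit_logistic(X, y, penalty="l1", lam=0.1)
w_l2, _ = fit_logistic(X, y, penalty="l2", lam=0.1)
print("nonzero weights with L1:", np.count_nonzero(w_l1))
print("nonzero weights with L2:", np.count_nonzero(w_l2))
```

In a setup like this, L1 tends to zero out uninformative features while L2 shrinks all weights without eliminating them.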
Perovskites (3.C01/3.C51, 10.C01/10.C51)
You will then apply an MLP regressor to predict properties of perovskites (data size: ~10^3) and compare differences between representations of perovskite crystal chemical compositions.
MHC (7.C01/7.C51, 20.C01/20.C51)
You will then apply an MLP regressor to predict MHC binding to peptides (data size: ~10^3) and compare differences between representations of peptide amino acid sequences.
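As an illustration of what "different representations" of a peptide can mean, the sketch below builds two feature vectors for the same sequence: a position-aware one-hot encoding and a position-free amino acid composition. The representations actually used in the problem set are not stated here, so treat both functions and the example peptide as assumptions.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_peptide(peptide):
    """Flattened (length x 20) one-hot encoding; keeps position information."""
    x = np.zeros((len(peptide), len(AMINO_ACIDS)))
    for pos, aa in enumerate(peptide):
        x[pos, AA_INDEX[aa]] = 1.0
    return x.ravel()

def composition(peptide):
    """Fraction of each amino acid; discards position information."""
    counts = np.zeros(len(AMINO_ACIDS))
    for aa in peptide:
        counts[AA_INDEX[aa]] += 1.0
    return counts / len(peptide)

peptide = "SIINFEKL"  # example 8-mer, chosen only for illustration
x_onehot = one_hot_peptide(peptide)
x_comp = composition(peptide)
print(x_onehot.shape, x_comp.shape)
```

Either vector could be fed to an MLP regressor; the one-hot form grows with peptide length, while the composition vector always has 20 entries.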
For this problem set, you will build a sequence-based model to predict DNA binding sequences (data size: ~10^4) generated with ChIP-seq.
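One common building block of sequence-based models is a filter that slides along a one-hot encoded DNA sequence and scores each window, as in a 1D convolution. The sketch below assumes that approach, which may differ from the architecture used in the problem set; `one_hot_dna`, `scan_filter`, and the TATA filter are hypothetical.

```python
import numpy as np

BASES = "ACGT"

def one_hot_dna(seq):
    """Encode a DNA string as a (len(seq), 4) one-hot array."""
    idx = np.array([BASES.index(b) for b in seq])
    return np.eye(4)[idx]

def scan_filter(x, filt):
    """Slide a (k, 4) filter along a (L, 4) sequence; return L - k + 1 scores."""
    k = filt.shape[0]
    n_windows = x.shape[0] - k + 1
    return np.array([np.sum(x[i:i + k] * filt) for i in range(n_windows)])

# Hypothetical filter that scores an exact match to "TATA" as 4.
filt = one_hot_dna("TATA")
scores = scan_filter(one_hot_dna("GGTATACC"), filt)
best = int(np.argmax(scores))
print(scores, best)
```

In a trained model the filter weights would be learned from the ChIP-seq data rather than fixed by hand.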
Bubbles (3.C01/3.C51, 10.C01/10.C51)
Next, you will train a model to perform segmentation on images of bubbles (data size: ~10^2) arising from the surface of a catalyst, with the goal of using these insights to improve the design of catalysts.
Cells (7.C01/7.C51, 20.C01/20.C51)
Next, you will train a model to perform segmentation on images of cells (data size: ~10^2), with the goal of using these insights to better understand cell morphology.
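Segmentation output is often scored by comparing a predicted mask with a reference mask. The sketch below computes intersection over union (IoU) for binary masks. The metric used in the problem sets is not specified here, so this is an assumption, and the two toy masks are made up.

```python
import numpy as np

def iou(pred, target):
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return np.logical_and(pred, target).sum() / union

# Toy 4x4 masks: two 2x2 squares offset by one column.
pred = np.zeros((4, 4), dtype=int)
pred[0:2, 0:2] = 1
target = np.zeros((4, 4), dtype=int)
target[0:2, 1:3] = 1

score = iou(pred, target)
print(score)
```

The same function applies to bubble or cell masks once the model output has been thresholded to 0/1.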
In this problem set, you will train a 2D Graph Neural Network to predict solubility from molecular features, and investigate its insufficiency in predicting chirally-aware properties, namely
Molecular Properties (3.C01/3.C51, 10.C01/10.C51)
In this problem set, you will train a Graph Neural Network (GNN) to predict synthetic lethal (SL) interactions between gene pairs.
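For intuition, a single graph convolution step can be written in a few lines of NumPy: each node averages features over itself and its neighbors, then applies a linear map and a nonlinearity. This is a generic GCN-style layer and a dot-product pair score, both assumed for illustration; the toy graph, weights, and function names are hypothetical and not taken from the problem set.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN-style layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))  # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

def pair_score(Z, i, j):
    """Score a node pair as the sigmoid of the embedding dot product."""
    return 1.0 / (1.0 + np.exp(-(Z[i] @ Z[j])))

# Toy graph: 3 nodes in a path (0 - 1 - 2), 4 input features per node.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2))

Z = gcn_layer(A, H, W)      # node embeddings, shape (3, 2)
s = pair_score(Z, 0, 2)     # score for the pair (node 0, node 2)
print(Z.shape, s)
```

In practice the weights would be trained against known interaction labels, typically with a deep learning framework rather than plain NumPy.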