Build two machine learning models: a regression model to predict the Black-Scholes option price and a classification model to predict whether the Black-Scholes model overestimates or underestimates the actual option price.
Tools: Python (NumPy, Pandas, Scikit-learn, ...), MS Excel, PowerPoint
Skills: Exploratory Data Analysis, Applied Statistics, Data Visualization, Machine Learning, Feature Engineering, Project Management, Team Collaboration, Business Communication
The dataset contains 1,680 records and 6 columns, as follows:

Note
The % Populated, Min, Max, and Mean statistics show that some fields contain missing values and outliers. We therefore considered dropping those records before the modeling step.
We decided to remove any record with a missing value or with a value more than 3 standard deviations from the mean in any field. Below is a detailed view of the records detected with missing values and/or outliers in any field:

Note
We dropped only 7 of the 1,680 records in the original data (about 0.4%), which is not a significant loss. 1,673 records remain after the data cleaning step.
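The cleaning rule described above (drop any record with a missing value or with a value more than 3 standard deviations from a field's mean) can be sketched in pandas as follows. This is a minimal illustration, not the project's actual code: the function name `drop_missing_and_outliers` is hypothetical, the data is synthetic, and only the column names `S` and `t` are taken from the field names mentioned in this write-up.

```python
import numpy as np
import pandas as pd


def drop_missing_and_outliers(df: pd.DataFrame, n_std: float = 3.0) -> pd.DataFrame:
    """Drop rows that have any missing value, or any numeric value more
    than n_std standard deviations from that column's mean."""
    numeric = df.select_dtypes(include="number")
    # Column-wise z-scores; pandas skips NaN when computing mean and std.
    z_scores = (numeric - numeric.mean()) / numeric.std()
    has_missing = df.isna().any(axis=1)
    is_outlier = (z_scores.abs() > n_std).any(axis=1)
    return df.loc[~(has_missing | is_outlier)].reset_index(drop=True)


# Synthetic stand-in data (NOT the project's dataset).
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "S": rng.normal(100.0, 5.0, size=50),   # stock price
    "t": rng.uniform(0.1, 1.0, size=50),    # time to maturity
})
raw.loc[3, "S"] = 10_000.0   # inject one extreme outlier
raw.loc[7, "t"] = np.nan     # inject one missing value

cleaned = drop_missing_and_outliers(raw)
print(f"dropped {len(raw) - len(cleaned)} of {len(raw)} records")
```

Note that the mean and standard deviation here are computed once on the raw data, so a single extreme value inflates the standard deviation used to judge the other records. Whether the original analysis did the same is not stated.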


Note
The distributions of the Stock Price (S) and Time to Maturity (t) fields become clearer after dropping outliers and missing values.

Our objective for model exploration was to experiment with different models to select the best model for both regression and classification problems.
- In the regression problem, we wanted to train a model that can accurately predict the option price.
- In the classification problem, we wanted to build a model that can accurately classify whether using the Black-Scholes algorithm would underestimate or overestimate the actual option price.
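For reference, the Black-Scholes benchmark behind both problems can be computed directly. The sketch below uses the standard closed-form price of a European call on a non-dividend-paying stock. The symbols `K` (strike), `r` (risk-free rate), and `sigma` (volatility), the function names, and the `"Over"`/`"Under"` labels are assumptions for illustration; this write-up only names `S` and `t` explicitly.

```python
from math import erf, exp, log, sqrt


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def black_scholes_call(S: float, K: float, r: float, sigma: float, t: float) -> float:
    """Black-Scholes price of a European call (no dividends).

    S: stock price, K: strike, r: annual risk-free rate,
    sigma: annual volatility, t: time to maturity in years.
    """
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * t) / (sigma * sqrt(t))
    d2 = d1 - sigma * sqrt(t)
    return S * norm_cdf(d1) - K * exp(-r * t) * norm_cdf(d2)


def label_estimate(bs_price: float, actual_price: float) -> str:
    """Hypothetical classification target: does Black-Scholes over- or
    under-estimate the observed price? Ties are labeled "Under" here,
    which is an arbitrary choice."""
    return "Over" if bs_price > actual_price else "Under"


price = black_scholes_call(S=100.0, K=100.0, r=0.05, sigma=0.2, t=1.0)
print(round(price, 4))  # 10.4506
print(label_estimate(price, actual_price=9.80))  # Over
```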
We tried different combinations of the following hyperparameters to find the best-performing models:

Below is our method to evaluate and select the best model for each problem:


Note
In general, the non-linear models significantly outperformed the baseline Linear Regression model. The Gradient Boosting Regression model performed best, with the highest R-squared score and the least variability across testing and cross-validation. This suggests that it is the most consistent and robust of the models we tried.
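A comparison of this kind (cross-validated R-squared, its spread across folds, and a held-out test R-squared) could be set up with scikit-learn roughly as follows. This is only a sketch under stated assumptions: the data is synthetic with a deliberately non-linear target, and the 5-fold split, 80/20 train/test split, and default model settings are illustrative choices, not the project's actual configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data with a non-linear target (NOT the project's dataset).
rng = np.random.default_rng(42)
n = 400
X = rng.uniform(0.0, 1.0, size=(n, 4))
y = np.sin(6 * X[:, 0]) + X[:, 1] * X[:, 2] + 0.05 * rng.normal(size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

results = {}
for name, model in models.items():
    # Mean and spread of R-squared across 5 cross-validation folds.
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    # Refit on the full training split, then score on the held-out test split.
    model.fit(X_train, y_train)
    results[name] = {
        "cv_mean": cv_scores.mean(),
        "cv_std": cv_scores.std(),
        "test_r2": model.score(X_test, y_test),
    }

for name, scores in results.items():
    print(name, {key: round(value, 3) for key, value in scores.items()})
```

A model with a high `cv_mean`, a small `cv_std`, and a `test_r2` close to `cv_mean` is the pattern the note above describes as consistent and robust.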

Note
In general, the non-linear models outperformed the baseline Logistic Regression model only slightly. Logistic Regression showed fewer signs of overfitting than the other models. The CatBoost model performed best, with the highest accuracy and the least variability across testing and cross-validation. This suggests that it is somewhat more accurate and robust than the other models.
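The hyperparameter search and accuracy evaluation for the classification problem could look roughly like the sketch below. To stay within scikit-learn, it swaps in `GradientBoostingClassifier` for CatBoost; the parameter grid, the 5-fold setting, and the synthetic data are all assumptions for illustration and do not reproduce the project's actual search space or results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (NOT the project's dataset): a binary label
# that depends on an interaction between two features, plus noise.
rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 4))
y = (X[:, 0] * X[:, 1] + 0.3 * rng.normal(size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Hypothetical search space; the original grid is not shown in this write-up.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# By default GridSearchCV refits the best configuration on the full
# training split, so score() evaluates that refit model.
test_accuracy = search.score(X_test, y_test)
print("best params:", search.best_params_)
print("cv accuracy:", round(search.best_score_, 3))
print("test accuracy:", round(test_accuracy, 3))
```

Comparing `search.best_score_` with `test_accuracy` gives a quick check for overfitting: a large gap between the two would suggest the tuned model does not generalize as well as cross-validation implies.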

Some business considerations to keep in mind when predicting option values:
- Accurately predicting European call option values is essential to achieving the best financial outcomes, but interpretability is also important for decision-making. Understanding the relationships between the predictor variables and the response variable can provide valuable insights to guide investment strategies, risk management, or policy decisions.
- Machine learning models can outperform the Black-Scholes model in predicting option prices due to their flexibility and adaptability. While the Black-Scholes model, primarily used in European option trading, relies on a fixed set of assumptions and features, machine learning models can capture complex patterns, nonlinear relationships, varying volatility, changing interest rates, and non-continuous trading scenarios. This enables machine learning models to achieve higher accuracy and greater practicality in real-world trading environments.
- Can this approach be applied to predict Tesla’s option price? Tesla is unusual compared with other S&P 500 stocks. Because of its high volatility, the influence of its CEO’s sentiments, emerging industry dynamics, and growth expectations, predicting Tesla’s call option price from the patterns learned here would be very challenging and would likely perform poorly.