School of Computer Science and Engineering
Nanyang Technological University
Lab: FCE1
Group: 3
Members:
- Isa Bin Mohamed Yamin (@boiwantlearncode)
- Brandon Chiu Ka Shen (@ShyIDE)
- Jayakumar Thirunithiyan (@Thiru-Nithiyan)
This repository contains all the Jupyter Notebooks, datasets, images, video presentations, and the source materials/references we have used and created as part of our Mini Project for SC1015.
Raw
Post-Processing
- /datasets/data_btc.csv
- /datasets/expanded2_usa_inflation_rate.csv
- How can we use volume and the USA interest rate to predict Bitcoin prices?
- Which of Bitcoin's other technical indicators can help us further predict its price trend?
- Which kinds of models most accurately predict the price of Bitcoin?
Initially, we sourced our datasets from the two sources mentioned in the Raw section. Afterwards, we cleaned up the datasets manually in an Excel spreadsheet. Here's what we did:
- Conversion of date format: From mm/dd/yyyy in 'raw/new_inflation_rate_cpi.xlsx' to dd/mm/yyyy
- Extrapolation of data points: The raw dataset includes only one data point per month. To cover a larger range of predictions, we expanded the dataset by extrapolating each monthly data point across all the days of its respective month.
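The extrapolation step above was done by hand in Excel. As a rough pandas equivalent, the sketch below assumes that "extrapolating" means repeating each monthly value on every day of that month; the column name `inflation_rate`, the helper `expand_monthly_to_daily`, and the sample values are hypothetical and are not taken from the project files.

```python
import pandas as pd

# Hypothetical monthly readings (one row per month); the values are made up.
monthly = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-02-01", "2021-03-01"]),
    "inflation_rate": [1.4, 1.7, 2.6],
})

def expand_monthly_to_daily(df, date_col="date"):
    """Repeat each monthly value on every day of its month (assumed behaviour)."""
    df = df.set_index(date_col).sort_index()
    # Daily index from the first observation to the end of the last observed month.
    end = df.index.max() + pd.offsets.MonthEnd(0)
    daily_index = pd.date_range(df.index.min(), end, freq="D")
    # Forward-fill carries each month's value onto the days that follow it.
    daily = df.reindex(daily_index).ffill()
    return daily.rename_axis(date_col).reset_index()

daily = expand_monthly_to_daily(monthly)
print(daily.head())
```

Forward-filling keeps the monthly value constant within a month, so it adds rows for merging with daily Bitcoin data without adding new information.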
Following that, we cleaned up the datasets using Python. Here's what we did:
- Preprocessing with `pd.to_datetime`: The `pd.to_datetime` function converts the 'date' column in both the `inflation_data` and `bitcoin_data` DataFrames to datetime objects. The parameter `dayfirst=True` specifies that the date format starts with the day, followed by the month and year.
- Merging DataFrames: The code merges the `bitcoin_data` and `inflation_data` DataFrames on the 'date' column using the `pd.merge` function, performing an inner join to retain only the rows with matching dates.
- Handling NaN values: Any missing values in the DataFrame (`merged_data`) are filled with the mean of their respective columns using the `fillna` method.
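The three Python cleaning steps described above (date parsing, inner merge, mean imputation) can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the in-memory CSV text stands in for the real files, and every column name other than `date` is an assumption.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the two CSV files; only the 'date' column name
# comes from the description above, the other columns are assumptions.
bitcoin_csv = io.StringIO(
    "date,close,volume\n"
    "01/02/2021,33500.0,1000\n"
    "02/02/2021,35500.0,\n"
    "03/02/2021,37500.0,3000\n"
)
inflation_csv = io.StringIO(
    "date,inflation_rate\n"
    "01/02/2021,1.7\n"
    "02/02/2021,1.7\n"
    "04/02/2021,1.7\n"
)
bitcoin_data = pd.read_csv(bitcoin_csv)
inflation_data = pd.read_csv(inflation_csv)

# Parse dd/mm/yyyy strings; dayfirst=True reads 01/02/2021 as 1 February.
bitcoin_data["date"] = pd.to_datetime(bitcoin_data["date"], dayfirst=True)
inflation_data["date"] = pd.to_datetime(inflation_data["date"], dayfirst=True)

# Inner join keeps only the dates present in both DataFrames.
merged_data = pd.merge(bitcoin_data, inflation_data, on="date", how="inner")

# Fill missing numeric values with the mean of their own column.
merged_data = merged_data.fillna(merged_data.mean(numeric_only=True))
print(merged_data)
```

In this toy input, 3 February and 4 February each appear in only one file, so the inner join drops both, and the missing volume is replaced by the mean of the remaining volume values.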
- Plotting Correlation Matrix Heatmap: Finally, a correlation matrix heatmap is plotted using seaborn (`sns`) and matplotlib (`plt`). This heatmap visualizes the correlation between different numerical variables in the dataset. The correlation coefficients are displayed as annotations on the heatmap.
- Statistical Overview: Statistical overview such as mean, median and mode of each variable.
- Plotting Statistical Overview: Boxplot, histogram and violin plots of each variable.
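The exploratory steps listed above could look roughly like the sketch below. The DataFrame is synthetic and its column names are assumptions; the plotting calls are wrapped in a hypothetical helper, `plot_overview`, so that the numeric part runs even where seaborn or matplotlib is not installed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for merged_data; column names and values are assumptions.
rng = np.random.default_rng(0)
n = 200
volume = rng.uniform(1e3, 5e3, n)
merged_data = pd.DataFrame({
    "close": 10 * volume + rng.normal(0, 500, n),
    "volume": volume,
    "inflation_rate": rng.uniform(1.0, 8.0, n),
})

numeric = merged_data.select_dtypes("number")

# Statistical overview: mean, median and (first) mode of each variable.
overview = pd.DataFrame({
    "mean": numeric.mean(),
    "median": numeric.median(),
    "mode": numeric.mode().iloc[0],
})

# Pairwise Pearson correlation between the numeric variables.
corr = numeric.corr()

def plot_overview(df, corr):
    """Annotated heatmap, then a boxplot, histogram and violin plot per variable."""
    # Imported here so the statistics above do not depend on plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(6, 5))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", ax=ax)
    ax.set_title("Correlation matrix")
    for col in df.columns:
        fig, axes = plt.subplots(1, 3, figsize=(12, 3))
        sns.boxplot(x=df[col], ax=axes[0])
        sns.histplot(df[col], ax=axes[1])
        sns.violinplot(x=df[col], ax=axes[2])
        fig.suptitle(col)
    plt.show()

print(overview)
print(corr.round(2))
```

Calling `plot_overview(numeric, corr)` in a notebook would render the figures.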
Model used: Linear Regression, evaluated with the K-Fold Cross-Validation technique
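As an illustration of linear regression scored with K-fold cross-validation, here is a minimal scikit-learn sketch. The data is synthetic, and the choice of five folds, shuffling, and RMSE as the score are assumptions rather than the project's recorded settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for the real predictors and Bitcoin price.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Five folds: each fold is held out once while the model trains on the other four.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn reports errors as negative scores, so flip the sign to get RMSE.
scores = cross_val_score(
    LinearRegression(), X, y, cv=kfold, scoring="neg_root_mean_squared_error"
)
rmse_per_fold = -scores

print("RMSE per fold:", rmse_per_fold.round(3))
print("Mean RMSE:", round(float(rmse_per_fold.mean()), 3))
```

Shuffled folds mix earlier and later observations. For time-ordered price data, scikit-learn's `TimeSeriesSplit` is an alternative that always validates on data later than the training fold.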
Detailed Discussion on Variable Selection:
- Code for Correlation Analysis: Include snippets that calculate and visualize the correlations between Bitcoin prices and other variables like volume and inflation rates.
- Feature Selection: Code showing how features were chosen based on correlation strength and practical relevance.
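One simple way to choose features by correlation strength is to keep every predictor whose absolute correlation with the target exceeds a cutoff. The sketch below assumes a cutoff of 0.3 and uses synthetic columns; the helper `select_features`, the cutoff, and the column names are all hypothetical, and practical relevance still has to be judged by hand.

```python
import numpy as np
import pandas as pd

def select_features(df, target, threshold=0.3):
    """Return predictors with |correlation to target| >= threshold, strongest first."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    strong = corr[corr.abs() >= threshold]
    return strong.abs().sort_values(ascending=False).index.tolist()

# Synthetic data: 'close' depends on volume and inflation_rate but not on 'unrelated'.
rng = np.random.default_rng(1)
n = 500
volume = rng.uniform(1e3, 5e3, n)
inflation_rate = rng.uniform(1.0, 8.0, n)
df = pd.DataFrame({
    "close": 2 * volume - 800 * inflation_rate + rng.normal(0, 300, n),
    "volume": volume,
    "inflation_rate": inflation_rate,
    "unrelated": rng.normal(size=n),
})

selected = select_features(df, target="close")
print(selected)
```

Ranking by absolute value treats a strong negative correlation as just as useful as a strong positive one.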
Machine Learning Techniques:
- Data Preprocessing for Modeling: Include code snippets for any additional data transformation or scaling.
Model Implementation:
- Linear Regression: Code to implement and fit a simple linear regression model as a baseline.
- Random Forest: Code for training a Random Forest model including parameter settings.
- Gradient Boosting: Setup and training of a Gradient Boosting model.
- XGBoost: Implementation of the XGBoost model, noting any special parameters or optimization techniques used.
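The four models listed above could be set up along these lines. This is a sketch on synthetic data, not the project's notebook code: the hyperparameters are placeholder assumptions, and XGBoost is added only if the `xgboost` package happens to be installed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic non-linear data standing in for the real features and price.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hyperparameters below are illustrative assumptions, not tuned values.
models = {
    "Linear Regression": LinearRegression(),  # baseline
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(
        n_estimators=200, learning_rate=0.05, random_state=0
    ),
}

# XGBoost is a separate package; include it only when it can be imported.
try:
    from xgboost import XGBRegressor
    models["XGBoost"] = XGBRegressor(
        n_estimators=200, learning_rate=0.05, max_depth=4, random_state=0
    )
except ImportError:
    pass

# Fit every model on the training split and predict on the held-out split.
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)

print({name: pred[:3].round(2) for name, pred in predictions.items()})
```

Keeping the models in one dictionary makes it easy to evaluate them all with the same metric afterwards.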
Presentation of Insights and RMSE Discussion:
- Model Evaluation: Code for calculating the RMSE (Root Mean Square Error) for each model.
- Comparison Plot: Code for generating plots that compare the RMSE of different models to visualize their performance.
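RMSE is the square root of the mean squared difference between the actual and predicted values, so it is expressed in the same units as the price. The sketch below computes it for two sets of made-up predictions and includes a bar chart for comparison; the model names, numbers, and the helper `plot_rmse` are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up actual prices and predictions from two hypothetical models.
y_true = [100.0, 110.0, 120.0, 130.0]
model_predictions = {
    "Model A": [102.0, 108.0, 123.0, 129.0],
    "Model B": [90.0, 115.0, 110.0, 140.0],
}

rmse_by_model = {
    name: rmse(y_true, pred) for name, pred in model_predictions.items()
}

def plot_rmse(rmse_by_model):
    """Bar chart of RMSE per model; lower bars indicate better fit."""
    # Imported here so the RMSE calculation does not depend on matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(list(rmse_by_model), list(rmse_by_model.values()))
    ax.set_ylabel("RMSE")
    ax.set_title("RMSE by model")
    plt.show()

for name, value in rmse_by_model.items():
    print(f"{name}: RMSE = {value:.3f}")
```

Calling `plot_rmse(rmse_by_model)` in a notebook would draw the comparison plot.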
Future Directions and Recommendations:
Suggestions for Model Enhancement:
- Code for LSTM Implementation: If available, include a basic setup for an LSTM model as a suggestion for future exploration.
- Code for Combining Models: Example snippets where models like XGBoost and LSTM might be combined for ensemble predictions.
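No LSTM was built in this project, so the sketch below covers only the two library-independent pieces such a follow-up would likely need: turning a price series into fixed-length windows (the three-dimensional input shape that LSTM layers commonly expect) and blending two models' predictions with a weighted average. The helpers `make_windows` and `average_predictions` and the equal default weighting are assumptions made for illustration.

```python
import numpy as np

def make_windows(series, window):
    """Split a 1-D series into (samples, timesteps, 1) inputs and next-step targets."""
    series = np.asarray(series, dtype=float)
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    # Add a trailing feature axis: (samples, timesteps, features).
    return X[..., np.newaxis], y

def average_predictions(pred_a, pred_b, weight_a=0.5):
    """Weighted average of two prediction arrays of the same length."""
    pred_a = np.asarray(pred_a, dtype=float)
    pred_b = np.asarray(pred_b, dtype=float)
    return weight_a * pred_a + (1.0 - weight_a) * pred_b

# Toy series 0..9 with a window of 3: the first sample is [0, 1, 2] -> target 3.
prices = np.arange(10.0)
X, y = make_windows(prices, window=3)
print(X.shape, y.shape)

# Blend two hypothetical prediction vectors (e.g. from XGBoost and an LSTM).
blended = average_predictions([1.0, 3.0], [3.0, 5.0])
print(blended)
```

A weighted average is the simplest form of ensemble; the weight could be chosen on a validation set rather than fixed at 0.5.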
In summary, this research not only sheds light on the predictive relationships between Bitcoin prices and their economic and technical indicators but also demonstrates the effectiveness of advanced modeling techniques. Thank you for your attention, and we look forward to any questions or further discussion on this topic.
- ML models for numeric data that are outside the scope of our course, such as Random Forest, Gradient Boosting and XGBoost
- Preventing overfitting in numeric models
- Extrapolating data points
- Predictors that we had not known to have a strong correlation with the Bitcoin price
- Manipulating data, such as merging DataFrames
- https://machinelearningmastery.com/k-fold-cross-validation/
- https://builtin.com/data-science/random-forest-algorithm
- https://www.analyticsvidhya.com/blog/2021/09/gradient-boosting-algorithm-a-complete-guide-for-beginners/
- https://www.analyticsvidhya.com/blog/2018/09/an-end-to-end-guide-to-understand-the-math-behind-xgboost/