This is a Mini-Project for SC1015 (Introduction to Data Science and Artificial Intelligence), Lab Group SC1, Project Group 5, which focuses on suicide rates from the WHO Country Suicide Database. For a detailed walkthrough, please view the source code in order, from:
- @syaz1affandi (Syazwan Affandi Bin Mohd Saleh U2122128C)- Data Extraction, Data Cleaning
- @adzikrafi (Muhammad Rafi Adzikra Sujai U2120731G)- Exploratory Data Analysis, Data Visualization
- @BernardLesley (Bernard Lesley Efendy U2120902J)- Machine Learning (KNN Regressor, Linear Regression, Random Forest Regressor, etc)
- Are we able to predict suicide rates based on specific attributes of a population group that we think might explain suicide rates?
- Which model would be the best to predict it?
- Multiple Linear Regression
- K Nearest Neighbors Regressor
- Random Forest Regressor
- Gradient Boosting Regressor
- Voting Regressor
- Each variable has a low correlation with suicide rates, so we cannot rely on a single variable to predict them
- However, when we include all the variables we have and fine-tune our models, accuracy increases dramatically
- We believe suicide arises from a complex interplay of variables that, taken individually, do not amount to suicide, much like the Swiss cheese model
- Our best model achieves a score of 0.85, which is strong performance in predicting suicide rates
- However, there must be something else that explains suicide; the models can still be improved given more data and variables to work with
- We believe this "something else" is the microscopic factor that varies from individual to individual. Our models only take into account the macroscopic factors of a country, but fail to consider the uniqueness of each individual
- It is important to perform scaling, for example with MinMaxScaler(), and hyperparameter tuning, for example with GridSearchCV(), for some regression models
- The Gradient Boosting Regressor is the best model for predicting suicide rates, achieving a score of 0.85, while Multiple Linear Regression is the least accurate
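The scaling-plus-tuning workflow above can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the project's actual features or parameter grid; the grid values are placeholders chosen for brevity.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the population-level features and suicide rates
rng = np.random.default_rng(42)
X = rng.random((200, 5))
y = X @ rng.random(5) + rng.normal(0, 0.1, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale the features, then fit the regressor; GridSearchCV tunes the
# hyperparameters by cross-validated R^2 on the training split
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("gbr", GradientBoostingRegressor(random_state=42)),
])
param_grid = {  # illustrative grid, not the project's actual values
    "gbr__n_estimators": [100, 200],
    "gbr__max_depth": [2, 3],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)
print(search.best_params_)
print(search.score(X_test, y_test))  # R^2 on held-out data
```

Putting the scaler inside the Pipeline ensures it is re-fit on each cross-validation fold, avoiding leakage from the validation data into the scaling step.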
- Merging two datasets with an imbalanced number of rows and finding the common subset of both datasets
- Scaling a dataset to achieve better machine learning results, for example with MinMaxScaler()
- Grid Search Cross-Validation to select the best hyperparameters for a machine learning model
- Feature Importance and Permutation Importance to rank the “importance” of features in a regression model
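The merge step above, keeping only the common subset of two unequally sized datasets, can be sketched with a pandas inner join. The country names and values here are hypothetical stand-ins, not the project's actual data.

```python
import pandas as pd

# Two toy datasets with different row counts (hypothetical values)
suicide = pd.DataFrame({
    "country": ["Albania", "Brazil", "Chile", "Denmark"],
    "suicides_per_100k": [3.2, 6.1, 9.4, 8.7],
})
happiness = pd.DataFrame({
    "country": ["Brazil", "Chile", "Denmark", "Egypt", "France"],
    "happiness_score": [6.3, 6.5, 7.6, 4.2, 6.7],
})

# how="inner" keeps only countries present in BOTH datasets
merged = suicide.merge(happiness, on="country", how="inner")
print(merged)  # 3 rows: Brazil, Chile, Denmark
```

Rows that appear in only one dataset (Albania, Egypt, France) are dropped, so the merged table is the common subset of both.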
- K Nearest Neighbors Regressor
- Random Forest Regressor
- Gradient Boosting Regressor
- Voting Regressor
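A Voting Regressor like the one listed above averages the predictions of several base regressors. A minimal sketch on synthetic data, assuming default-ish settings rather than the project's tuned hyperparameters:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import (
    RandomForestRegressor,
    GradientBoostingRegressor,
    VotingRegressor,
)

# Synthetic regression data standing in for the real features
rng = np.random.default_rng(0)
X = rng.random((150, 4))
y = X.sum(axis=1) + rng.normal(0, 0.05, 150)

# The ensemble averages the three base models' predictions
voter = VotingRegressor([
    ("knn", KNeighborsRegressor(n_neighbors=5)),
    ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("gbr", GradientBoostingRegressor(random_state=0)),
])
voter.fit(X, y)
print(voter.score(X, y))  # in-sample R^2
```

Averaging can smooth out the individual models' errors, though whether it beats the best single model (here, Gradient Boosting) depends on the data.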
- https://www.kaggle.com/datasets/russellyates88/suicide-rates-overview-1985-to-2016
- https://www.kaggle.com/code/lmorgan95/r-suicide-rates-in-depth-stats-insights
- https://towardsdatascience.com/the-suicide-crisis-in-data-7025f8551ca8
- https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html
- https://www.analyticsvidhya.com/blog/2018/08/k-nearest-neighbor-introduction-regression-python/
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
- https://vitalflux.com/gradient-boosting-regression-python-examples/
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingRegressor.html
- https://scikit-learn.org/stable/auto_examples/ensemble/plot_voting_regressor.html#sphx-glr-auto-examples-ensemble-plot-voting-regressor-py