This is a Mini-Project for SC1015 (Introduction to Data Science and Artificial Intelligence), Lab Group SC1, Project Group 5, which focuses on suicide rates from the WHO Country Suicide Database. For a detailed walkthrough, please view the source code in order, from:
- @syaz1affandi (Syazwan Affandi Bin Mohd Saleh U2122128C)- Data Extraction, Data Cleaning
- @adzikrafi (Muhammad Rafi Adzikra Sujai U2120731G)- Exploratory Data Analysis, Data Visualization
- @BernardLesley (Bernard Lesley Efendy U2120902J)- Machine Learning (KNN Regressor, Linear Regression, Random Forest Regressor, etc)
- Are we able to predict suicide rates based on specific attributes of a population group that we think might explain suicide rates?
- Which model would be the best to predict it?
- Multiple Linear Regression
- K Nearest Neighbors Regressor
- Random Forest Regressor
- Gradient Boosting Regressor
- Voting Regressor
- Each variable has a low correlation with suicide rates, so we cannot rely on a single variable to predict them
- However, when we include all the variables we have and fine-tune our models, accuracy increases dramatically
- We believe suicide arises from a complex interplay of variables that, taken individually, do not amount to suicide, much like the Swiss cheese model
- Our best model achieves a score of 0.85, which is strong performance in predicting suicide rates
- However, there must be something else that explains suicide; the models can still be improved given more data and variables to work with
- We believe this "something else" is the microscopic factor that varies from individual to individual. Our models only take into account the macroscopic factors of a country, but fail to consider the uniqueness of each individual
- It is important to perform scaling, for example with MinMaxScaler(), and hyperparameter tuning, for example with GridSearchCV(), for some regression models
- The Gradient Boosting Regressor is the best model for predicting suicide rates, achieving a score of 0.85, while Multiple Linear Regression is the least accurate
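The scaling-plus-tuning workflow above can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the project's actual features or parameter grid; the grid values are placeholders chosen for brevity.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the population-level features and suicide rates
rng = np.random.default_rng(42)
X = rng.random((200, 5))
y = X @ rng.random(5) + rng.normal(0, 0.1, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale the features, then fit the regressor; GridSearchCV tunes the
# hyperparameters by cross-validated R^2 on the training split
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("gbr", GradientBoostingRegressor(random_state=42)),
])
param_grid = {  # illustrative grid, not the project's actual values
    "gbr__n_estimators": [100, 200],
    "gbr__max_depth": [2, 3],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)
print(search.best_params_)
print(search.score(X_test, y_test))  # R^2 on held-out data
```

Putting the scaler inside the Pipeline ensures it is re-fit on each cross-validation fold, avoiding leakage from the validation data into the scaling step.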
- Merging two datasets with an imbalanced number of rows and finding the common subset of both datasets
- Scaling a dataset to achieve better machine learning results, for example with MinMaxScaler()
- Grid Search Cross-Validation to select the best hyperparameters for a machine learning model
- Feature Importance and Permutation Importance to rank the “importance” of features in a regression model
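The merge step above, keeping only the common subset of two unequally sized datasets, can be sketched with a pandas inner join. The country names and values here are hypothetical stand-ins, not the project's actual data.

```python
import pandas as pd

# Two toy datasets with different row counts (hypothetical values)
suicide = pd.DataFrame({
    "country": ["Albania", "Brazil", "Chile", "Denmark"],
    "suicides_per_100k": [3.2, 6.1, 9.4, 8.7],
})
happiness = pd.DataFrame({
    "country": ["Brazil", "Chile", "Denmark", "Egypt", "France"],
    "happiness_score": [6.3, 6.5, 7.6, 4.2, 6.7],
})

# how="inner" keeps only countries present in BOTH datasets
merged = suicide.merge(happiness, on="country", how="inner")
print(merged)  # 3 rows: Brazil, Chile, Denmark
```

Rows that appear in only one dataset (Albania, Egypt, France) are dropped, so the merged table is the common subset of both.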
- K Nearest Neighbors Regressor
- Random Forest Regressor
- Gradient Boosting Regressor
- Voting Regressor
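A Voting Regressor like the one listed above averages the predictions of several base regressors. A minimal sketch on synthetic data, assuming default-ish settings rather than the project's tuned hyperparameters:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import (
    RandomForestRegressor,
    GradientBoostingRegressor,
    VotingRegressor,
)

# Synthetic regression data standing in for the real features
rng = np.random.default_rng(0)
X = rng.random((150, 4))
y = X.sum(axis=1) + rng.normal(0, 0.05, 150)

# The ensemble averages the three base models' predictions
voter = VotingRegressor([
    ("knn", KNeighborsRegressor(n_neighbors=5)),
    ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("gbr", GradientBoostingRegressor(random_state=0)),
])
voter.fit(X, y)
print(voter.score(X, y))  # in-sample R^2
```

Averaging can smooth out the individual models' errors, though whether it beats the best single model (here, Gradient Boosting) depends on the data.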
- https://www.kaggle.com/datasets/russellyates88/suicide-rates-overview-1985-to-2016
- https://www.kaggle.com/code/lmorgan95/r-suicide-rates-in-depth-stats-insights
- https://towardsdatascience.com/the-suicide-crisis-in-data-7025f8551ca8
- https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html
- https://www.analyticsvidhya.com/blog/2018/08/k-nearest-neighbor-introduction-regression-python/
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
- https://vitalflux.com/gradient-boosting-regression-python-examples/
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingRegressor.html
- https://scikit-learn.org/stable/auto_examples/ensemble/plot_voting_regressor.html#sphx-glr-auto-examples-ensemble-plot-voting-regressor-py