These are the guided projects that I have worked on through the DataQuest platform. I'm still going through the curriculum of the Data Scientist career path and I will keep updating the repository with new projects.
- Analyzing NYC High School Data: Analyze New York City public school data to identify the factors that give certain demographic groups an unfair advantage in the US educational system.
- Clean and Analyze Employee Exit Surveys: The surveys were conducted by the Australian Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute. Their purpose was to assess the opinions and attitudes of departing employees and identify the main reasons they resigned.
- Exploring Ebay Car Sales: Analysis of used car listings from the German eBay website to determine the factors that influence car prices, and to identify the most and least popular cars among the listings.
- Finding Heavy Traffic Indicators: The goal of this project is to analyze traffic on the I-94 (from Minneapolis to Saint Paul, USA) and determine the indicators of heavy traffic.
- Hacker News: Analysis of Hacker News posts to determine which post types and posting times receive the most comments on average.
- Profitable App Recommendation: An app can do great on the App Store yet perform poorly on Google Play. This project helps mobile app developers understand the key elements that make an app profitable on both platforms.
- Star Wars Survey: The project analyzes a Star Wars survey to understand fans' attitudes toward the franchise, for example using demographic data to show which fan base is larger, which Star Wars movie is the favorite, who is the favorite actor, and more.
- Storytelling Data Visualization on Exchange Rates: The goal of this project is to explore the evolution of the exchange rates between EUR-USD and EUR-RUB during the Covid pandemic in 2020.
- Popular Data Science Questions: This project identifies the most sought-after topics in data science by examining popular questions and content on the Stack Exchange network. It is set in a business context where the goal is to create valuable data science content for the company; in the absence of rigid instructions, there is room for flexibility and creativity in defining what "best" means. The project is guided by a passion for helping people learn and improve their data science skills, and takes inspiration from my own experience learning programming and the popularity of certain programming topics on Stack Overflow.
- Investigating Fandango Movie Ratings: In this project, we analyze recent movie ratings data to determine if Fandango's rating system has changed since a 2015 analysis revealed it was biased. The previous analysis found that Fandango inflated movie ratings and displayed them differently than their actual value. Our goal is to see if this issue has been fixed and if Fandango's rating system is now more reliable.
- Finding Best Markets to Advertise: This project aimed to find the best two markets to advertise our programming courses. We analyzed survey data from new coders and concluded that the US would be a good market to target. However, choosing between India and Canada for the second market was not clear-cut. Therefore, we provided our results to the marketing team to make an informed decision based on their domain knowledge.
- Mobile App for Lottery Addiction: This project aims to help people make informed decisions about playing the lottery. The goal is to create a mobile app that calculates the probability of winning under different scenarios: playing one or several tickets, aiming for the big prize or a smaller one, and checking historical data to see whether a combination of numbers has ever won before. These probabilities, together with the cost of playing, help players make informed decisions about playing the lottery.
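  As a rough sketch of the app's core calculation (assuming a 6-of-49 lottery; the project's actual lottery format may differ), the big-prize probability for one ticket is one over the number of possible draws:

  ```python
  from math import comb

  # Hypothetical helper: probability of winning the big prize with a
  # single ticket in a 6-of-49 lottery (an assumed format).
  def one_ticket_probability(n_numbers=49, n_picked=6):
      total_outcomes = comb(n_numbers, n_picked)  # C(49, 6) equally likely draws
      return 1 / total_outcomes

  p = one_ticket_probability()
  print(f"1 in {comb(49, 6):,}")  # 1 in 13,983,816
  ```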
- Spam Filter using Naive Bayes: The spam filter project used the multinomial Naive Bayes algorithm and a labeled dataset of 5,572 SMS messages to create a filter that classifies new messages as spam or ham. The filter reached an accuracy of 98.74%, exceeding the initial goal by almost 20 percentage points.
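  The approach can be sketched with scikit-learn (the project itself implements the probabilities by hand; the toy messages below are stand-ins, not the real SMS dataset):

  ```python
  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.naive_bayes import MultinomialNB

  # Toy stand-in for the 5,572-message labeled SMS dataset.
  messages = [
      "WINNER!! Claim your free prize now",
      "Free entry in a weekly cash draw",
      "Are we still meeting for lunch today?",
      "Can you pick up milk on the way home",
  ]
  labels = ["spam", "spam", "ham", "ham"]

  # Bag-of-words counts feed the multinomial model.
  vectorizer = CountVectorizer()
  X = vectorizer.fit_transform(messages)

  clf = MultinomialNB()
  clf.fit(X, labels)

  new = vectorizer.transform(["Claim your free cash prize"])
  print(clf.predict(new))  # ['spam'] — every token overlaps the spam messages
  ```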
- Winning Jeopardy: This project analyzes a dataset of Jeopardy questions to identify trends and patterns in question topics, values, and difficulty levels. The analysis includes hypothesis testing using chi-squared tests. The project aims to uncover insights into the game's structure and offer recommendations for players.
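  A chi-squared goodness-of-fit test like the one used in this project can be sketched with SciPy (the counts below are hypothetical, not from the Jeopardy dataset):

  ```python
  from scipy.stats import chisquare

  # Hypothetical example: a term appears in 12 high-value and 8 low-value
  # questions. If ~40% of questions are high-value, the expected split
  # for 20 occurrences would be 8 high / 12 low.
  observed = [12, 8]
  expected = [8, 12]

  stat, p_value = chisquare(f_obs=observed, f_exp=expected)
  # stat = (12-8)**2/8 + (8-12)**2/12 ≈ 3.33; with p > 0.05 we would not
  # reject the hypothesis that the term is value-neutral.
  ```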
- CIA Factbook Data Analysis using SQL: The CIA World Factbook, an annual publication of the US Central Intelligence Agency, provides basic intelligence by summarizing worldwide demographic and geographic data. This project explores that data using SQL.
- Predicting Heart Disease: This project aimed to develop a predictive model for heart disease detection using a provided dataset. After data cleaning and feature selection through correlation analysis, a KNN classifier was trained on selected features, achieving an 81.88% accuracy on the test set. While promising, further evaluation and refinement of the model may be necessary before implementation in a real-world healthcare setting.
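  The KNN pipeline can be sketched as follows (with synthetic stand-in data rather than the heart-disease dataset, and skipping the correlation-based feature selection the project performs first):

  ```python
  from sklearn.datasets import make_classification
  from sklearn.model_selection import train_test_split
  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import StandardScaler

  # Synthetic stand-in for the selected heart-disease features.
  X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                             random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.3, random_state=0)

  # KNN is distance-based, so features are scaled before fitting.
  model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
  model.fit(X_train, y_train)
  accuracy = model.score(X_test, y_test)
  ```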
- Credit Card Customer Segmentation: This project is a customer segmentation analysis using K-means clustering algorithm. The aim is to identify different groups of customers based on their financial behavior and demographic characteristics. The analysis involves exploring the data, determining the optimal number of clusters, interpreting the cluster characteristics, and making business suggestions for each cluster.
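  The "optimal number of clusters" step usually means the elbow method: fit K-means for a range of k and look for the bend in the inertia curve. A minimal sketch with synthetic stand-in customers (not the real credit card data):

  ```python
  import numpy as np
  from sklearn.cluster import KMeans
  from sklearn.preprocessing import StandardScaler

  rng = np.random.default_rng(42)
  # Synthetic stand-in for customer features (e.g. income, spending, age)
  # with three well-separated groups.
  data = np.vstack([
      rng.normal(loc=[20, 80, 30], scale=3, size=(50, 3)),
      rng.normal(loc=[80, 20, 50], scale=3, size=(50, 3)),
      rng.normal(loc=[50, 50, 40], scale=3, size=(50, 3)),
  ])
  X = StandardScaler().fit_transform(data)

  # Elbow method: within-cluster sum of squares (inertia) per candidate k;
  # the curve flattens sharply after the true number of clusters.
  inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
              for k in range(1, 7)}
  ```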
- Predicting Medical Insurance Costs: This project involves building a linear regression model to predict medical insurance charges based on several predictors. The model was trained on a dataset of insurance charges and tested on a separate dataset to evaluate its predictive performance. The model's interpretability was assessed by examining the coefficients of the predictors. The mean squared error of the model on the test set was used to evaluate its predictive accuracy. Overall, the model demonstrated a decent level of predictive accuracy and could be useful for predicting medical insurance charges for new patients.
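  The workflow can be sketched with synthetic stand-in data (the predictor names below mirror common insurance features but are assumptions, not the project's exact columns):

  ```python
  import numpy as np
  from sklearn.linear_model import LinearRegression
  from sklearn.metrics import mean_squared_error

  rng = np.random.default_rng(0)
  # Synthetic stand-in: charges driven by age, BMI, and a smoker flag.
  n = 500
  age = rng.uniform(18, 64, n)
  bmi = rng.normal(30, 6, n)
  smoker = rng.integers(0, 2, n)
  charges = 250 * age + 300 * bmi + 20000 * smoker + rng.normal(0, 2000, n)

  X = np.column_stack([age, bmi, smoker])
  model = LinearRegression().fit(X, charges)

  # Each coefficient is directly interpretable as the cost added per
  # unit increase of that predictor (e.g. ~20,000 for being a smoker).
  mse = mean_squared_error(charges, model.predict(X))
  ```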
- Crowdedness at Campus Gym: This project involves the development of a model to predict the number of people visiting a university gym at different times. The model is built using a stochastic gradient descent regressor and evaluated using metrics such as mean squared error, mean absolute error, and R2 score. The project also includes data cleaning, feature selection, and normalization to improve model performance. The model's findings can be used to determine the optimal days and times to visit the gym and can be improved further by rearranging the data, including or discarding features, and changing the hyperparameters.
- Heart Disease Classification: The project aimed to develop a logistic regression model to predict the presence or absence of heart disease based on several predictor variables. The model achieved an accuracy of 0.846 on the test set, indicating that it is a decent predictor for this problem. The interpretation of the model's coefficients aligns with prior knowledge and research, though further analysis may be necessary to fully understand the relationship between some of the predictor variables and heart disease risk. Overall, the model can be useful as a screening tool to identify individuals who may be at a higher risk of heart disease.
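  The coefficient interpretation mentioned above rests on the fact that exponentiated logistic-regression coefficients are odds ratios. A minimal sketch with synthetic stand-in data (not the heart-disease dataset):

  ```python
  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split

  # Synthetic stand-in for the predictor variables.
  X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                             random_state=1)
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.25, random_state=1)

  model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
  accuracy = model.score(X_test, y_test)

  # exp(coefficient) is an odds ratio: the multiplicative change in the
  # odds of the positive class per one-unit increase in that predictor.
  odds_ratios = np.exp(model.coef_[0])
  ```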
- Employee Productivity Classification: This project involved using decision trees and random forests to analyze a dataset of garment production runs and determine the factors that most significantly affect productivity. The project included data cleaning and preprocessing, model training and testing, and comparing the performance of decision trees and random forests. The final models were used to provide insights to the leadership team of the garment company.
- Optimizing Model Prediction: In this project, we built a machine learning model to predict the log area of wildfires based on various features such as temperature, humidity, and wind speed. We used various techniques such as imputation, outlier detection, regularization, k-fold cross-validation, and non-linear models to create and refine our model. Our best-performing model was forward selection with four features, which had an average MSE of 1.8823.
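  Forward selection with cross-validation can be sketched via scikit-learn's `SequentialFeatureSelector` (synthetic stand-in data; the project's own feature names and MSE are not reproduced here):

  ```python
  from sklearn.datasets import make_regression
  from sklearn.feature_selection import SequentialFeatureSelector
  from sklearn.linear_model import LinearRegression
  from sklearn.model_selection import cross_val_score

  # Synthetic stand-in for the wildfire features (temperature, humidity, ...).
  X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                         noise=5.0, random_state=0)

  # Forward selection: greedily add the feature that most improves the
  # cross-validated score until four features are chosen.
  selector = SequentialFeatureSelector(
      LinearRegression(), n_features_to_select=4, direction="forward", cv=5)
  selector.fit(X, y)
  X_selected = X[:, selector.get_support()]

  scores = cross_val_score(LinearRegression(), X_selected, y, cv=5,
                           scoring="neg_mean_squared_error")
  avg_mse = -scores.mean()
  ```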
- Handwritten Digits Classification: In this project, we used scikit-learn and a dataset of handwritten digits to train various neural network models with different specifications such as the number of hidden layers and neurons per layer. We then evaluated the performance of these models using cross-validation and compared their accuracy scores to determine the most effective model. The results showed that the neural network model with two hidden layers and 256 neurons in each layer had the highest accuracy score of 0.9810.
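  The best configuration can be sketched with scikit-learn's `MLPClassifier` on its bundled digits dataset (a stand-in evaluation with a simple train/test split rather than the project's full cross-validation, so the exact accuracy will differ):

  ```python
  from sklearn.datasets import load_digits
  from sklearn.model_selection import train_test_split
  from sklearn.neural_network import MLPClassifier
  from sklearn.preprocessing import StandardScaler

  digits = load_digits()
  X_train, X_test, y_train, y_test = train_test_split(
      digits.data, digits.target, test_size=0.3, random_state=0)

  scaler = StandardScaler().fit(X_train)
  # Two hidden layers of 256 neurons each, matching the best model found.
  clf = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=500,
                      random_state=0)
  clf.fit(scaler.transform(X_train), y_train)
  accuracy = clf.score(scaler.transform(X_test), y_test)
  ```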