This project performs a complete data cleaning process in MySQL on the global layoffs dataset available on Kaggle. The goal is to transform the raw dataset into a clean, standardized, analytics-ready format that can later be used for data exploration, reporting, and visualization.
The original dataset contains records of tech company layoffs from around the world. However, the raw data includes duplicates, inconsistent formats, and null values.
This project focuses on:
- Creating a staging table to preserve the raw dataset
- Removing duplicate records
- Standardizing categorical and numerical data
- Fixing incorrect or inconsistent entries (e.g., dates, strings, formatting)
- Creating a clean analytics table for further analysis
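
The first two steps above can be sketched roughly as follows. This is a minimal illustration, assuming MySQL 8.0+ (for window functions). The table names `layoffs_raw`, `layoffs_staging`, and `layoffs_staging2`, the `company` and `date` column names, and the sample rows are placeholders, not the project's actual schema or data.

```sql
-- Hypothetical raw table; the real import will have more columns.
CREATE TABLE layoffs_raw (
    company             VARCHAR(255),
    industry            VARCHAR(255),
    country             VARCHAR(255),
    total_laid_off      VARCHAR(50),
    percentage_laid_off VARCHAR(50),
    `date`              VARCHAR(50)
);

-- Made-up sample rows; the second is an exact duplicate of the first.
INSERT INTO layoffs_raw VALUES
    ('Acme', 'Crypto', 'United States', '100', '0.1', '3/6/2023'),
    ('Acme', 'Crypto', 'United States', '100', '0.1', '3/6/2023');

-- 1. Staging copy, so the raw import is never modified.
CREATE TABLE layoffs_staging LIKE layoffs_raw;
INSERT INTO layoffs_staging SELECT * FROM layoffs_raw;

-- 2. Number the rows inside each group of identical values.
CREATE TABLE layoffs_staging2 AS
SELECT *,
       ROW_NUMBER() OVER (
           PARTITION BY company, industry, country,
                        total_laid_off, percentage_laid_off, `date`
       ) AS row_num
FROM layoffs_staging;

-- 3. Keep only the first row of each group, then drop the helper column.
DELETE FROM layoffs_staging2 WHERE row_num > 1;
ALTER TABLE layoffs_staging2 DROP COLUMN row_num;
```

If the client runs with `sql_safe_updates` enabled (MySQL Workbench does by default), the `DELETE` may be rejected because its `WHERE` clause does not use a key column; disabling that setting for the session is one way around it.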
I started this project to strengthen my SQL skills, particularly in data cleaning, a fundamental step in any data analysis workflow. Using a publicly available Kaggle dataset on global tech company layoffs, I aimed to simulate real-world data preprocessing scenarios. The objectives were:
- Practice SQL-based data cleaning: implement techniques such as removing duplicates, handling null values, and standardizing data formats.
- Understand data inconsistencies: gain insight into common data quality issues and how to address them effectively.
- Prepare data for analysis: transform raw data into a structured format suitable for exploratory data analysis and visualization.
- Build a portfolio project: demonstrate my ability to clean and prepare data using SQL for potential employers and collaborators.
- Identify and remove duplicate records
- Standardize text fields like industry and country names
- Convert and clean numeric fields like total_laid_off and percentage_laid_off
- Clean and convert date values into proper MySQL DATE format
- Create an analytics table with appropriate data types and indexes
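
A rough sketch of how the standardization, conversion, and analytics-table steps listed above might look in MySQL. Everything here is illustrative: the table names `layoffs_staging_demo` and `layoffs_analytics`, the `company`, `date`, and `layoff_date` columns, the index names, the sample rows, and the assumed `m/d/Y` input date format are my assumptions for the example, not a record of the project's actual scripts.

```sql
-- Hypothetical staging table with a few messy rows (everything stored as text).
CREATE TABLE layoffs_staging_demo (
    company             VARCHAR(255),
    industry            VARCHAR(255),
    country             VARCHAR(255),
    total_laid_off      VARCHAR(50),
    percentage_laid_off VARCHAR(50),
    `date`              VARCHAR(50)
);

-- Made-up sample rows.
INSERT INTO layoffs_staging_demo VALUES
    ('  Acme ', 'Crypto Currency', 'United States.', '100',  '0.10', '3/6/2023'),
    ('Globex',  'CryptoCurrency',  'United States',  'NULL', '',     '12/15/2022');

-- Text fields: trim whitespace and strip a trailing period from country names.
UPDATE layoffs_staging_demo
SET company = TRIM(company),
    country = TRIM(TRAILING '.' FROM country);

-- Text fields: collapse spelling variants of one category into a single label.
UPDATE layoffs_staging_demo
SET industry = 'Crypto'
WHERE industry LIKE 'Crypto%';

-- Numeric fields: turn empty strings and literal 'NULL' text into real NULLs.
UPDATE layoffs_staging_demo
SET total_laid_off      = NULLIF(NULLIF(TRIM(total_laid_off), ''), 'NULL'),
    percentage_laid_off = NULLIF(NULLIF(TRIM(percentage_laid_off), ''), 'NULL');

-- Dates: parse the assumed month/day/year text; the column now holds YYYY-MM-DD strings.
UPDATE layoffs_staging_demo
SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');

-- Analytics table with proper types and indexes on commonly filtered columns.
CREATE TABLE layoffs_analytics (
    company             VARCHAR(255) NOT NULL,
    industry            VARCHAR(255),
    country             VARCHAR(255),
    total_laid_off      INT UNSIGNED,
    percentage_laid_off DECIMAL(5,4),
    layoff_date         DATE,
    INDEX idx_industry (industry),
    INDEX idx_country (country),
    INDEX idx_layoff_date (layoff_date)
);

-- Load the cleaned rows, casting each text column to its target type.
INSERT INTO layoffs_analytics
SELECT company, industry, country,
       CAST(total_laid_off AS UNSIGNED),
       CAST(percentage_laid_off AS DECIMAL(5,4)),
       CAST(`date` AS DATE)
FROM layoffs_staging_demo;
```

Cleaning the values while they are still text and casting only on the final insert keeps type errors out of the analytics table. Which columns to index depends on the queries the analysis will actually run.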
- MySQL
- MySQL Workbench / any SQL client (DBeaver was used for this project)
- CSV File Source: Layoffs Dataset
- SQL for data transformation
- Git for version control
Completing this project provided several key takeaways:
- Enhanced SQL skills: Improved my ability to write efficient and advanced SQL queries for data cleaning tasks.
- Attention to data quality: Recognized the importance of clean data in deriving accurate insights.
- Problem-solving: Developed strategies to deal with common data issues such as duplicates and inconsistent formats.
- Preparedness for real-world data: Gained experience that is directly applicable to real-world data analysis projects.
This project solidified my understanding of data cleaning processes and underscored the critical role they play in the broader context of data analytics.