This project focuses on cleaning and standardizing a raw layoffs dataset using SQL in MySQL. The cleaned data is prepared for further analysis and visualization.
- Source File:
layoffs.csv
- Total Columns: 9
- Columns:
company
location
industry
total_laid_off
percentage_laid_off
date
stage
country
funds_raised_millions
- Preserve raw data using staging tables.
- Remove duplicates using
ROW_NUMBER()
and CTEs. - Standardize data fields (e.g., company, country, industry).
- Handle missing and null values.
- Convert the
date
field from text to properDATE
format. - Drop unnecessary or helper columns.
CREATE TABLE layoffs_staging LIKE layoffs;
INSERT INTO layoffs_staging SELECT * FROM layoffs;
WITH duplicate_cte AS (
SELECT *, ROW_NUMBER() OVER (
PARTITION BY company, location, industry, total_laid_off,
percentage_laid_off, date, stage, country, funds_raised_millions
) AS row_num
FROM layoffs_staging
)
DELETE FROM layoffs_staging2 WHERE row_num > 1;
- Trim whitespace from
company
- Normalize
industry
(e.g., changeCrypto/Web3
toCrypto
) - Clean up
country
values (e.g., remove trailing.
)
UPDATE layoffs_staging2
SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');
ALTER TABLE layoffs_staging2
MODIFY COLUMN `date` DATE;
- Set
''
(empty string) inindustry
toNULL
- Use self-joins to fill missing
industry
values from other rows - Delete rows where both
total_laid_off
andpercentage_laid_off
are null
ALTER TABLE layoffs_staging2
DROP COLUMN row_num;
The layoffs_staging2
table contains cleaned, deduplicated, and well-formatted data ready for querying and visualization.
- How to use SQL window functions (
ROW_NUMBER()
) - Data normalization and consistency practices
- Handling missing values using update joins
- Changing column data types safely in SQL
- MySQL
- SQL (Window Functions, CTEs, Joins)
- Dataset:
layoffs.csv
- Project by: Aaditya Aanand
You can import the dataset into MySQL using tools like MySQL Workbench or CLI:
LOAD DATA INFILE 'path/to/layoffs.csv'
INTO TABLE layoffs
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
Make sure secure_file_priv
is configured correctly, or use a GUI like DBeaver for easier import.