Apr 5, 2025
SQL Data Cleaning Project – Layoffs Dataset
This project showcases the process of cleaning and standardizing a real-world layoffs dataset using SQL. It covers key techniques such as removing duplicates, handling missing values, and standardizing inconsistent data formats. By the end of the project, the dataset is ready for analysis, providing a clean foundation for deriving valuable insights into layoffs across industries and regions. Check out the full code on GitHub.
In today's data-driven world, having clean and standardized data is crucial for making informed decisions. For this project, I tackled a real-world layoffs dataset to transform raw, inconsistent data into something usable and ready for analysis. The dataset, sourced from AlexTheAnalyst's GitHub repository, contains information on layoffs from different companies, industries, and locations. However, this data was riddled with issues such as duplicates, inconsistent formatting, and missing values.
The goal was simple: clean the dataset to remove inconsistencies, handle missing values, and standardize data for better insights. This project illustrates key data cleaning techniques like duplicate removal, data standardization, and handling null or blank values. Below, I'll break down the process I followed to achieve a clean and ready-to-analyze dataset.


Step 1: Database & Table Setup
To begin, I created a new database world_layoffs and imported the raw CSV data into a new table called layoffs.
Original columns: company, location, industry, total_laid_off, percentage_laid_off, date, stage, country, funds_raised_millions.
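A minimal setup sketch, assuming MySQL and the column layout of the public layoffs dataset (column types are my assumption; the CSV import itself was done through a GUI wizard rather than SQL):

```sql
-- Create the database and the raw table (assumed MySQL syntax and types).
CREATE DATABASE world_layoffs;
USE world_layoffs;

CREATE TABLE layoffs (
  company TEXT,
  location TEXT,
  industry TEXT,
  total_laid_off INT,
  percentage_laid_off TEXT,
  `date` TEXT,          -- imported as text; converted to DATE later
  stage TEXT,
  country TEXT,
  funds_raised_millions INT
);
-- The raw CSV was then imported into this table.
```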
Step 2: Data Cleaning Goals
The data needed significant cleaning. The main objectives were to:
Remove duplicates: Ensure that no repeated records remain.
Standardize the data: Ensure consistency in categories, dates, and formatting.
Handle null or blank values: Replace or remove entries with missing or invalid values.
Remove unnecessary columns and rows: Clean out irrelevant data that wouldn't add value.
Step 3: Create Staging Table
To safeguard the raw data, I created a staging table (layoffs_staging) that mirrored the original layoffs table. This allowed me to work on a copy of the data, ensuring no changes would be made to the original dataset.
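The staging table can be created as a structural copy of the original and then populated from it. A sketch, assuming MySQL:

```sql
-- Work on a copy so the raw layoffs table stays untouched.
CREATE TABLE layoffs_staging LIKE layoffs;

INSERT INTO layoffs_staging
SELECT *
FROM layoffs;
```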
Step 4: Remove Duplicates
Step 4.1: Create and Use Row Number
The first task was to identify duplicates based on multiple columns. I used a common technique in SQL: assigning a row number to each entry and deleting the duplicates.
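The row-number technique can be sketched as follows, assuming MySQL 8+ window functions and the column names listed earlier. Rows that are identical across every partitioning column get row_num values of 2, 3, and so on, which flags them as duplicates:

```sql
-- Flag duplicates: identical rows share a partition, so row_num > 1 marks a repeat.
SELECT *,
       ROW_NUMBER() OVER (
         PARTITION BY company, location, industry, total_laid_off,
                      percentage_laid_off, `date`, stage, country,
                      funds_raised_millions
       ) AS row_num
FROM layoffs_staging;
```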
Step 4.2: Create layoffs_staging2 for cleaning
Next, I created a new staging table (layoffs_staging2) and used the row number logic to clean duplicates. This step ensured I had only unique records left.
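Because MySQL does not allow deleting directly from a windowed CTE, a second staging table with a materialized row_num column makes the delete straightforward. A sketch under that assumption:

```sql
-- Second staging table carries row_num as a real column so duplicates can be deleted.
CREATE TABLE layoffs_staging2 (
  company TEXT,
  location TEXT,
  industry TEXT,
  total_laid_off INT,
  percentage_laid_off TEXT,
  `date` TEXT,
  stage TEXT,
  country TEXT,
  funds_raised_millions INT,
  row_num INT
);

INSERT INTO layoffs_staging2
SELECT *,
       ROW_NUMBER() OVER (
         PARTITION BY company, location, industry, total_laid_off,
                      percentage_laid_off, `date`, stage, country,
                      funds_raised_millions
       ) AS row_num
FROM layoffs_staging;

-- Keep only the first copy of each duplicate group.
DELETE FROM layoffs_staging2
WHERE row_num > 1;
```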
Step 5: Data Standardization
Trim Whitespace
Whitespace can often cause inconsistencies in data, so I ensured that all text fields were trimmed to remove leading or trailing spaces.
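Trimming is a one-line UPDATE per text column; here the company field serves as the example:

```sql
-- Remove leading and trailing spaces, e.g. ' Included' -> 'Included'.
UPDATE layoffs_staging2
SET company = TRIM(company);
```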
Unify Similar Categories
The industry field had variations like "Crypto", "Crypto Currency", and "CryptoCurrency". I standardized these entries to a single value: "Crypto".
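A pattern match collapses the variants into one canonical label:

```sql
-- Unify 'Crypto', 'Crypto Currency', and 'CryptoCurrency' under one value.
UPDATE layoffs_staging2
SET industry = 'Crypto'
WHERE industry LIKE 'Crypto%';
```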
Clean Trailing Characters
Inconsistent punctuation in the country field (e.g., trailing periods) was cleaned to ensure consistency.
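MySQL's TRIM supports stripping a specific trailing character, which handles entries like "United States.":

```sql
-- Strip trailing periods, e.g. 'United States.' -> 'United States'.
UPDATE layoffs_staging2
SET country = TRIM(TRAILING '.' FROM country)
WHERE country LIKE 'United States%';
```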
Format Date
The date field was stored as text, and some entries were invalid or inconsistently formatted. I parsed the valid entries using the %m/%d/%Y format and then converted the column to a proper DATE type.
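The conversion has two parts: parse the text values with STR_TO_DATE, then change the column type. A sketch, assuming MySQL:

```sql
-- Parse text dates written as m/d/Y (invalid strings become NULL).
UPDATE layoffs_staging2
SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');

-- Now that all values are valid date strings, switch the column to a DATE type.
ALTER TABLE layoffs_staging2
MODIFY COLUMN `date` DATE;
```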
Step 6: Handle Null or Blank Values
Fill Industry Values (via Self-Join)
Some rows had missing industry values. I used a self-join to fill these null values with the correct information from other rows with the same company and location.
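The self-join matches rows for the same company and copies a known industry onto rows where it is missing. Converting empty strings to NULL first keeps the join condition simple (joining on location as well is my assumption, per the matching criteria described above):

```sql
-- Treat blanks the same as NULLs before filling.
UPDATE layoffs_staging2
SET industry = NULL
WHERE industry = '';

-- Copy the industry from another row of the same company/location.
UPDATE layoffs_staging2 t1
JOIN layoffs_staging2 t2
  ON  t1.company  = t2.company
  AND t1.location = t2.location
SET t1.industry = t2.industry
WHERE t1.industry IS NULL
  AND t2.industry IS NOT NULL;
```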
Remove Completely Blank Rows
Any rows that had no useful data (both total_laid_off and percentage_laid_off were missing) were removed from the dataset.
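Rows missing both layoff measures carry no analytical value, so they can be deleted outright:

```sql
-- Drop rows with no layoff information at all.
DELETE FROM layoffs_staging2
WHERE total_laid_off IS NULL
  AND percentage_laid_off IS NULL;
```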
Step 7: Final Cleanup
Drop Temporary Column
The row_num column, which was used to help clean the duplicates, was dropped at the end of the process.
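With deduplication done, the helper column comes off:

```sql
-- row_num served only to identify duplicates; remove it from the final table.
ALTER TABLE layoffs_staging2
DROP COLUMN row_num;
```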

By the end of this project, the data was fully cleaned and standardized, ready for further analysis or visualization. Removing duplicates, standardizing categories and formatting, and handling missing values ensured that the dataset was both consistent and reliable. The clean dataset is now a perfect starting point for any deeper analysis, such as identifying trends in layoffs across industries or regions.
For the next steps, one could analyze the impact of layoffs by industry, or even build predictive models to forecast layoffs based on various factors like company size, funding, or market conditions. The cleaned dataset can also be used for business intelligence purposes, helping organizations understand the broader economic and workforce trends.
This project demonstrated the importance of data preprocessing in ensuring that insights drawn from data are accurate and actionable. For the full code, check out the GitHub repository.


