Apr 5, 2025
SQL Data Cleaning Project – Layoffs Dataset
This project showcases the process of cleaning and standardizing a real-world layoffs dataset using SQL. It covers key techniques such as removing duplicates, handling missing values, and standardizing inconsistent data formats. By the end of the project, the dataset is ready for analysis, providing a clean foundation for deriving valuable insights into layoffs across industries and regions. Check out the full code on GitHub.
In today's data-driven world, having clean and standardized data is crucial for making informed decisions. For this project, I tackled a real-world layoffs dataset to transform raw, inconsistent data into something usable and ready for analysis. The dataset, sourced from AlexTheAnalyst's GitHub repository, contains information on layoffs from different companies, industries, and locations. However, this data was riddled with issues such as duplicates, inconsistent formatting, and missing values.
The goal was simple: clean the dataset to remove inconsistencies, handle missing values, and standardize data for better insights. This project illustrates key data cleaning techniques like duplicate removal, data standardization, and handling null or blank values. Below, I'll break down the process I followed to achieve a clean and ready-to-analyze dataset.


Step 1: Database & Table Setup
To begin, I created a new database world_layoffs and imported the raw CSV data into a new table called layoffs.
Original columns: company, location, industry, total_laid_off, percentage_laid_off, date, stage, country, funds_raised_millions.
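A minimal setup sketch, assuming MySQL and the column layout of the public layoffs dataset (column types are my assumption; the CSV import itself was done through a GUI wizard rather than SQL):

```sql
-- Create the database and the raw table (assumed MySQL syntax and types).
CREATE DATABASE world_layoffs;
USE world_layoffs;

CREATE TABLE layoffs (
  company TEXT,
  location TEXT,
  industry TEXT,
  total_laid_off INT,
  percentage_laid_off TEXT,
  `date` TEXT,          -- imported as text; converted to DATE later
  stage TEXT,
  country TEXT,
  funds_raised_millions INT
);
-- The raw CSV was then imported into this table.
```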
Step 2: Data Cleaning Goals
The data needed significant cleaning. The main objectives were to:
Remove duplicates: Ensure that no repeated records remain.
Standardize the data: Ensure consistency in categories, dates, and formatting.
Handle null or blank values: Replace or remove entries with missing or invalid values.
Remove unnecessary columns and rows: Clean out irrelevant data that wouldn't add value.
Step 3: Create Staging Table
To safeguard the raw data, I created a staging table (layoffs_staging) that mirrored the original layoffs table. This allowed me to work on a copy of the data, ensuring no changes would be made to the original dataset.
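The staging table can be created as a structural copy of the original and then populated from it. A sketch, assuming MySQL:

```sql
-- Work on a copy so the raw layoffs table stays untouched.
CREATE TABLE layoffs_staging LIKE layoffs;

INSERT INTO layoffs_staging
SELECT *
FROM layoffs;
```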
Step 4: Remove Duplicates
Step 4.1: Create and Use Row Number
The first task was to identify duplicates based on multiple columns. I used a common technique in SQL: assigning a row number to each entry and deleting the duplicates.
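The row-number technique can be sketched as follows, assuming MySQL 8+ window functions and the column names listed earlier. Rows that are identical across every partitioning column get row_num values of 2, 3, and so on, which flags them as duplicates:

```sql
-- Flag duplicates: identical rows share a partition, so row_num > 1 marks a repeat.
SELECT *,
       ROW_NUMBER() OVER (
         PARTITION BY company, location, industry, total_laid_off,
                      percentage_laid_off, `date`, stage, country,
                      funds_raised_millions
       ) AS row_num
FROM layoffs_staging;
```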
Step 4.2: Create layoffs_staging2 for cleaning
Next, I created a new staging table (layoffs_staging2) and used the row number logic to clean duplicates. This step ensured I had only unique records left.
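Because MySQL does not allow deleting directly from a windowed CTE, a second staging table with a materialized row_num column makes the delete straightforward. A sketch under that assumption:

```sql
-- Second staging table carries row_num as a real column so duplicates can be deleted.
CREATE TABLE layoffs_staging2 (
  company TEXT,
  location TEXT,
  industry TEXT,
  total_laid_off INT,
  percentage_laid_off TEXT,
  `date` TEXT,
  stage TEXT,
  country TEXT,
  funds_raised_millions INT,
  row_num INT
);

INSERT INTO layoffs_staging2
SELECT *,
       ROW_NUMBER() OVER (
         PARTITION BY company, location, industry, total_laid_off,
                      percentage_laid_off, `date`, stage, country,
                      funds_raised_millions
       ) AS row_num
FROM layoffs_staging;

-- Keep only the first copy of each duplicate group.
DELETE FROM layoffs_staging2
WHERE row_num > 1;
```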
Step 5: Data Standardization
Trim Whitespace
Whitespace can often cause inconsistencies in data, so I ensured that all text fields were trimmed to remove leading or trailing spaces.
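Trimming is a one-line UPDATE per text column; here the company field serves as the example:

```sql
-- Remove leading and trailing spaces, e.g. ' Included' -> 'Included'.
UPDATE layoffs_staging2
SET company = TRIM(company);
```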
Unify Similar Categories
The industry field had variations like "Crypto", "Crypto Currency", and "CryptoCurrency". I standardized these entries to a single value: "Crypto".
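A pattern match collapses the variants into one canonical label:

```sql
-- Unify 'Crypto', 'Crypto Currency', and 'CryptoCurrency' under one value.
UPDATE layoffs_staging2
SET industry = 'Crypto'
WHERE industry LIKE 'Crypto%';
```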
Clean Trailing Characters
Inconsistent punctuation in the country field (e.g., trailing periods) was cleaned to ensure consistency.
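MySQL's TRIM supports stripping a specific trailing character, which handles entries like "United States.":

```sql
-- Strip trailing periods, e.g. 'United States.' -> 'United States'.
UPDATE layoffs_staging2
SET country = TRIM(TRAILING '.' FROM country)
WHERE country LIKE 'United States%';
```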
Format Date
The date field was stored as text, and some entries were invalid or inconsistently formatted. I parsed the valid entries using the %m/%d/%Y format and then converted the column to a proper DATE type.
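The conversion has two parts: parse the text values with STR_TO_DATE, then change the column type. A sketch, assuming MySQL:

```sql
-- Parse text dates written as m/d/Y (invalid strings become NULL).
UPDATE layoffs_staging2
SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');

-- Now that all values are valid date strings, switch the column to a DATE type.
ALTER TABLE layoffs_staging2
MODIFY COLUMN `date` DATE;
```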
Step 6: Handle Null or Blank Values
Fill Industry Values (via Self-Join)
Some rows had missing industry values. I used a self-join to fill these null values with the correct information from other rows with the same company and location.
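The self-join matches rows for the same company and copies a known industry onto rows where it is missing. Converting empty strings to NULL first keeps the join condition simple (joining on location as well is my assumption, per the matching criteria described above):

```sql
-- Treat blanks the same as NULLs before filling.
UPDATE layoffs_staging2
SET industry = NULL
WHERE industry = '';

-- Copy the industry from another row of the same company/location.
UPDATE layoffs_staging2 t1
JOIN layoffs_staging2 t2
  ON  t1.company  = t2.company
  AND t1.location = t2.location
SET t1.industry = t2.industry
WHERE t1.industry IS NULL
  AND t2.industry IS NOT NULL;
```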
Remove Completely Blank Rows
Any rows that had no useful data (both total_laid_off and percentage_laid_off were missing) were removed from the dataset.
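Rows missing both layoff measures carry no analytical value, so they can be deleted outright:

```sql
-- Drop rows with no layoff information at all.
DELETE FROM layoffs_staging2
WHERE total_laid_off IS NULL
  AND percentage_laid_off IS NULL;
```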
Step 7: Final Cleanup
Drop Temporary Column
The row_num column, which was used to help clean the duplicates, was dropped at the end of the process.
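With deduplication done, the helper column comes off:

```sql
-- row_num served only to identify duplicates; remove it from the final table.
ALTER TABLE layoffs_staging2
DROP COLUMN row_num;
```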

By the end of this project, the data was fully cleaned and standardized, ready for further analysis or visualization. Removing duplicates, standardizing categories and formatting, and handling missing values ensured that the dataset was both consistent and reliable. The clean dataset is now a perfect starting point for any deeper analysis, such as identifying trends in layoffs across industries or regions.
For the next steps, one could analyze the impact of layoffs by industry, or even build predictive models to forecast layoffs based on various factors like company size, funding, or market conditions. The cleaned dataset can also be used for business intelligence purposes, helping organizations understand the broader economic and workforce trends.
This project demonstrated the importance of data preprocessing in ensuring that insights drawn from data are accurate and actionable. For the full code, check out the GitHub repository.


