By now, you have probably integrated multi-modal AI tools, such as Google DeepMind’s Gemini and OpenAI’s ChatGPT, into your day-to-day work. You might have also heard the news about AI replacing data analysts and software developers.
Bonus: Read about the trending technology behind Generative AI here. 👈🏻
Without a doubt, the future of data analytics is characterized by increased automation and integration of AI, handling big data, and clear, data-driven communication through visualization. If you feel the same, drop it in the comments below.
That’s why it’s crucial for analysts to play a pivotal role in extracting meaningful insights from vast data while adapting to dynamic technological changes.
Why is it Important to Clean Messy Data?
One thing is for sure: raw data is almost always contaminated with duplicates, errors, or outliers. Cleaning messy data means maintaining data quality, whether in a spreadsheet or through a programming language (preferably Python), so that your insights are neither wrong nor skewed.
Now, before diving deep, let’s address the elephant in the room:
Why SQL?
In 2024, SQL is the most popular and most requested skill for data-related roles, so you have no choice but to master it. In this article, I’ve put together the top 5 commonly used SQL queries to clean your messy data.
If you find these tips helpful, do not forget to clap 👏🏻 and comment🖋️ down your thoughts!
Some of them are broad and beginner-friendly, while others are more technical and advanced. So, let’s get into them.
1. Handling `NULL` Values
While dealing with voluminous datasets, missing data poses a significant challenge for analysts. It is crucial to address this issue by handling null values effectively.
There are two primary approaches that you can use:
A. Data Reduction (Removing Rows with Null Values):
This method is optimal when you have only a few null values, approximately 0.5% of the entire dataset. Removing rows (or even columns) with null values helps maintain data integrity and reduces the impact of missing data.
For instance, you can use this query:
-- Beginner level : Filter out Null Rows
SELECT *
FROM your_table_name
WHERE col1 IS NOT NULL;
Basically, this query filters out the rows where col1 contains null values, returning only the rows that have non-null values in col1.
B. Data Imputation (Replacing Null Values):
While working with datasets having a substantial proportion of null values (more than 10%), it is recommended to replace these missing values with suitable statistical substitutes, such as mean, median, or mode for numerical data or commonly occurring words for categorical data.
Remember to be cautious while imputing with the mean, median, or mode, so that you don't drastically change the original distribution of the data.
Now, let's write a query that replaces null values with 'N/A'.
What is the COALESCE function? The COALESCE function is used to handle null values in SQL. It returns the first non-null value from its list of arguments; here, if col2 is not null its value is returned, otherwise it is substituted with the string 'N/A'.
-- Advanced level : Substituting Null Columns with "N/A"
SELECT col1, COALESCE(col2, 'N/A') AS col2
FROM your_table_name;
This ensures that even if col2 contains null values, it will be displayed as 'N/A' in the result, reducing data loss and ensuring a more comprehensive and reliable analysis.
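For numerical columns, the same idea works with a statistical substitute instead of a text placeholder. Here's a minimal sketch that fills nulls with the column mean; col3 is a hypothetical numeric column, not one from the examples above:
-- Bonus : Imputing nulls in a numeric column with the column mean (hypothetical col3)
SELECT col1,
       COALESCE(col3, (SELECT AVG(col3) FROM your_table_name)) AS col3
FROM your_table_name;
AVG ignores nulls by default, so the subquery returns the mean of the non-null values.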
By adopting either of these appropriate strategies, you can ensure data quality and derive accurate insights, leading to more informed decision-making.
2. Formatting and Standardizing Data
To ensure consistency and uniformity in the dataset, data formatting and standardization are essential steps. In SQL, this involves ensuring uniform date formats and data types, normalizing text values, and applying consistent naming conventions.
Let's capitalize our data by combining the UPPER, LEFT, LOWER, and SUBSTRING functions, so that the first character of the existing column data is uppercase and the rest is lowercase:
-- Intermediate level : Capitalizing Rows
SELECT CONCAT(UPPER(LEFT(col1, 1)), LOWER(SUBSTRING(col1, 2))) AS col1
FROM your_table_name;
Here, LEFT extracts the first character and UPPER capitalizes it, SUBSTRING extracts the rest of the string from the second character onward and LOWER converts it to lowercase, and CONCAT combines the modified parts back into col1.
Here is another example, using the INITCAP function (available in databases such as PostgreSQL and Oracle) to capitalize the first letter of each word in col1.
-- Advanced level : Updating Rows
UPDATE your_table_name
SET col1 = INITCAP(col1),
col2 = REPLACE(col2, 'old_value', 'new_value');
Also, the REPLACE function is applied to col2 to search for occurrences of the string 'old_value' and replace them with 'new_value'.
Standardizing data facilitates seamless comparison across different tables or databases.
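As a small illustration of date standardization, here's a hedged sketch that converts text dates into a proper DATE type. It assumes a hypothetical text column raw_date stored as 'DD/MM/YYYY' and a database with the TO_DATE function (PostgreSQL, Oracle):
-- Bonus : Converting text dates in a known format to the DATE type (hypothetical raw_date column)
SELECT TO_DATE(raw_date, 'DD/MM/YYYY') AS order_date
FROM your_table_name;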
Here’s the GitHub Repository for, “Feature Engineering to perform data standardization using Python”.
3. Removing `Unwanted` Characters
Trust me on this: working with text data is much more tedious. Removing unwanted characters (such as punctuation marks, special symbols, or irrelevant noise like extra white space) is crucial to enhance data quality.
Let's explore this query, where the REGEXP_REPLACE function performs a regular expression-based replacement on col1 values:
-- Intermediate level : Modifying Table
SELECT REGEXP_REPLACE(col1, '[^a-zA-Z0-9 ]', '') AS col1
FROM your_table_name;
How does the regular expression pattern [^a-zA-Z0-9 ] work? It matches any character that is not an uppercase letter, a lowercase letter, a digit, or a space, and replaces each match with an empty string.
Also, you can use this query to update your table once you're satisfied with the results:
-- Advanced level : Updating Table
UPDATE your_table_name
SET col1 = REGEXP_REPLACE(col1, '[^a-zA-Z0-9 ]', '');
This query will permanently modify the existing data in the table by updating col1 with the transformed values.
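For the white-space noise mentioned earlier, combining TRIM with REGEXP_REPLACE is usually enough. Here's a minimal sketch, assuming PostgreSQL-style syntax where the 'g' flag replaces every match:
-- Bonus : Removing leading/trailing spaces and collapsing repeated internal spaces
SELECT TRIM(REGEXP_REPLACE(col1, ' +', ' ', 'g')) AS col1
FROM your_table_name;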
Here’s the GitHub Repository for, “Data Extraction and Text Analysis using NLP”.
4. Removing `Duplicates`
Duplicates are another common issue every analyst faces, irrespective of dataset size. While Excel or Google Sheets are helpful for removing duplicates from smaller datasets, the true test lies in working with large data.
For this, you can use the DISTINCT keyword, which ensures that only unique rows are returned, eliminating any duplicates.
-- Beginner level : To find unique data
SELECT DISTINCT *
FROM your_table_name;
But a better way to identify duplicates is to use a CTE:
-- Advanced level : To delete duplicate data
WITH cte AS (
SELECT col1, col2, col3,
ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY col1) AS row_number
FROM your_table_name
)
DELETE FROM cte
WHERE row_number > 1;
In this query, the ROW_NUMBER() function assigns a unique number to each row within the partition defined by the columns specified in the PARTITION BY clause of the Common Table Expression (CTE). Finally, the DELETE statement removes the duplicate rows with a row number greater than 1 from the table.
Note: Exercise caution while using the DELETE statement, as it will permanently remove data. It is recommended to take a backup or get confirmation from your manager before executing such a query on critical data. In this code, I used a CTE, which is a temporary named result set, similar to a temporary table, that exists only for the duration of the query.
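One caveat: deleting rows through a CTE as shown above works in SQL Server, but not in every database. If yours doesn't support it (PostgreSQL, for example), a common alternative, assuming the table has a hypothetical unique id column, is:
-- Bonus : Keep the lowest id per duplicate group and delete the rest (assumes a unique id column)
DELETE FROM your_table_name
WHERE id NOT IN (
    SELECT MIN(id)
    FROM your_table_name
    GROUP BY col1, col2, col3
);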
5. Handling `Outliers`
Outliers are observations in a dataset that lie "far" from, that is, deviate significantly from, the rest of the observations. In other words, an outlier is a data point that is vastly larger or smaller than the remaining values in the set.
These extreme values have a significant impact on statistical analysis and modelling results. It is crucial to identify and handle outliers appropriately to ensure accurate data analysis.
In SQL, there are several statistical techniques for identifying outliers, such as Z-score and Interquartile Range (IQR).
Let’s see how to address them:
A. Removing Outliers
The below query removes the rows from the data table whose values fall outside the interquartile range (IQR).
I've used the PERCENTILE_CONT function to calculate Q1 and Q3, and then delete the rows whose values fall outside the IQR-based bounds.
-- Advanced level : To detect and delete outliers from your data
WITH quartiles AS (
SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) AS Q1,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) AS Q3
FROM
your_table_name
),
iqr_calc AS (
SELECT
Q1,
Q3,
(Q3 - Q1) AS IQR
FROM
quartiles
)
DELETE FROM your_table_name
WHERE value < (SELECT Q1 - 1.5 * IQR FROM iqr_calc)
OR value > (SELECT Q3 + 1.5 * IQR FROM iqr_calc);
This is the easiest way to handle outliers, but be cautious with the DELETE statement.
B. Replacing Outliers
Values that are identified as outliers are replaced with the median using this query.
-- Advanced level : To detect and replace outliers in your data
WITH quartiles AS (
SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) AS Q1,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) AS Q3
FROM
your_table_name
),
iqr_calc AS (
SELECT
Q1,
Q3,
(Q3 - Q1) AS IQR
FROM
quartiles
),
median_value AS (
SELECT
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) AS median
FROM
your_table_name
)
UPDATE your_table_name
SET value = (SELECT median FROM median_value)
WHERE
value < (SELECT Q1 - 1.5 * IQR FROM iqr_calc)
OR value > (SELECT Q3 + 1.5 * IQR FROM iqr_calc);
C. Flagging Outliers
Flagging means creating a new column where rows are set to 1 if the value is an outlier and 0 otherwise.
-- Advanced level : To detect and flag outliers
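-- Note: this assumes the is_outlier column already exists (e.g., ALTER TABLE your_table_name ADD is_outlier INT)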
WITH quartiles AS (
SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) AS Q1,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) AS Q3
FROM
your_table_name
),
iqr_calc AS (
SELECT
Q1,
Q3,
(Q3 - Q1) AS IQR
FROM
quartiles
)
UPDATE your_table_name
SET is_outlier = CASE
WHEN value < (SELECT Q1 - 1.5 * IQR FROM iqr_calc)
OR value > (SELECT Q3 + 1.5 * IQR FROM iqr_calc)
THEN 1
ELSE 0
END;
Why Flag Outliers?
Instead of removing or altering outliers, flagging allows you to keep the original data intact, while still identifying which data points are outliers. This way, you can separately analyze the outliers to understand the reasons behind their deviation.
Proper handling of outliers not only helps maintain data integrity but also improves the reliability of your models and the accuracy of insights generated from the data.
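Since the Z-score was mentioned above as an alternative to the IQR method, here's a hedged sketch that flags values more than three standard deviations from the mean. It assumes the same hypothetical value and is_outlier columns; note that STDDEV is spelled STDEV in SQL Server:
-- Bonus : Flagging outliers with a Z-score threshold of 3 (hypothetical value / is_outlier columns)
WITH stats AS (
    SELECT AVG(value) AS mean_value,
           STDDEV(value) AS stddev_value
    FROM your_table_name
)
UPDATE your_table_name
SET is_outlier = CASE
    WHEN ABS(value - (SELECT mean_value FROM stats)) > 3 * (SELECT stddev_value FROM stats)
        THEN 1
    ELSE 0
END;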
Here’s the GitHub Repository for, “How to detect and handle outlier using Python?”
Conclusion
In conclusion, while the order of implementation may vary as per your requirements, these are some really helpful methods that can reduce noise and enhance the accuracy of your analysis.
In case, you’re also interested, here’s the GitHub Repository for, “Step-by-step Feature Engineering Pipeline for Machine Learning using Python”.
And, if you enjoy this read, Clap 50 👏 times and do not forget to follow for future updates.
That’s it from me. Will talk soon! 🙋🏻♀️
— Nikita Prasad
The contents of external submissions are not necessarily reflective of the opinions or work of Maven Analytics or any of its team members.
We believe in fostering lifelong learning and our intent is to provide a platform for the data community to share their work and seek feedback from the Maven Analytics data fam.