
One often overlooked yet significant aspect of data analysis is data cleansing. In this article, we will explore some techniques using SQL in Databricks and demonstrate them with hands-on examples.
1. Removing duplicates
Duplicates are among the most common issues in data sets, and they can distort your analysis. SQL provides a straightforward way to find and remove them. Here is an example:
CREATE OR REPLACE TABLE table_name_deduped AS
WITH cte AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY column1, column2, column3
           ORDER BY column1
         ) AS row_num
  FROM table_name
)
SELECT column1, column2, column3  -- list every column you want to keep
FROM cte
WHERE row_num = 1;
This query assigns a number (row_num) to each row within every group of rows that share the same values in column1, column2, and column3, and keeps only the first row of each group, so duplicate records are removed. Because Databricks SQL does not support deleting rows through a common table expression, the deduplicated result is written to a new table (table_name_deduped), which you can then swap in for the original.
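To see how many duplicates exist before (or after) running the clean-up, you can count the repeated combinations. A quick sketch using the same placeholder table and column names:

-- List every (column1, column2, column3) combination that appears more than once
SELECT column1, column2, column3, COUNT(*) AS occurrences
FROM table_name
GROUP BY column1, column2, column3
HAVING COUNT(*) > 1;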
2. Formatting strings
Another common issue arises from inconsistent string formatting. You can use the SQL functions LOWER() and UPPER() to standardize the case of string data. Let’s look at an example:
UPDATE table_name
SET column_name = LOWER(column_name);
The above query updates the entire column, converting every string value to lower case. Similarly, you can use UPPER() to convert them to upper case, as shown below.
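Here is the equivalent statement with UPPER(), a minimal sketch using the same placeholder table and column names:

-- Convert every value in column_name to upper case
UPDATE table_name
SET column_name = UPPER(column_name);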
3. Replacing NULL values
Dealing with NULL values is a common necessity in data cleansing, and the COALESCE() function comes in handy here: it returns the first non-null value in its argument list. In the example below, we replace NULL values with a specific value:
UPDATE table_name
SET column_name = COALESCE(column_name, 'Unknown');
The above query replaces all NULL values in the specified column with the string ‘Unknown’.
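If you only need the substitution at query time rather than in the stored data, the same function works in a SELECT. A minimal sketch, assuming the same placeholder names (the alias column_name_clean is purely illustrative):

-- Return 'Unknown' in place of NULL without modifying the table
SELECT COALESCE(column_name, 'Unknown') AS column_name_clean
FROM table_name;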
Conclusion
Data cleansing may not be the most glamorous part of data analysis, but it is a crucial step towards ensuring data accuracy. Using SQL for data cleansing in Databricks can significantly improve the results of your data analysis tasks. The techniques shared in this guide aim to help you take the first steps in this essential process.