Technical, Data Analytics

Course 4: Process Data from Dirty to Clean, Module 3: SQL

Knowing a variety of ways to clean data can make a data analyst’s job much easier. 

Learning Objectives


  • Describe how SQL can be used to clean large datasets
  • Compare spreadsheet data-cleaning functions to those associated with SQL in databases
  • Develop basic SQL queries for use with databases
  • Apply basic SQL functions for use in cleaning string variables in a database
  • Apply basic SQL functions for transforming data variables

How a junior data analyst uses SQL

In this reading, you will learn more about how to decide when to use SQL, or Structured Query Language. As a data analyst, you will be tasked with handling a lot of data, and SQL is one of the tools that can help make your work a lot easier. SQL is the primary way data analysts extract data from databases. As a data analyst, you will work with databases all the time, which is why SQL is such a key skill. Let’s follow along as a junior data analyst uses SQL to solve a business task.  

The business task and context

The junior data analyst in this example works for a social media company. A new business model was implemented on February 15, 2020 and the company wants to understand how their user-growth compares to the previous year. Specifically, the data analyst was asked to find out how many users have joined since February 15, 2020. 

Spreadsheets functions and formulas or SQL queries?

Before they can address this question, this data analyst needs to choose what tool to use. First, they have to think about where the data lives. If it is stored in a database, then SQL is the best tool for the job. But if it is stored in a spreadsheet, then they will have to perform their analysis in that spreadsheet. In that scenario, they could create a pivot table of the data and then apply specific formulas and filters to their data until they were given the number of users that joined after February 15th. It isn’t a really complicated process, but it would involve a lot of steps. 

In this case, the data is stored in a database, so they will have to work with SQL. And this data analyst knows they could get the same results with a single SQL query:

SQL
SELECT
	COUNT(DISTINCT user_id) AS count_of_unique_users 
FROM 
	table 
WHERE 	
	join_date >= '2020-02-15'

Spreadsheets and SQL both have their advantages and disadvantages:

Features of SpreadsheetsFeatures of SQL Databases 
Smaller data setsLarger datasets
Enter data manuallyAccess tables across a database
Create graphs and visualizations in the same programPrepare data for further analysis in another software
Built-in spell check and other useful functionsFast and powerful functionality
Best when working solo on a projectGreat for collaborative work and tracking queries run by all users

When it comes down to it, where the data lives will decide which tool you use. If you are working with data that is already in a spreadsheet, that is most likely where you will perform your analysis. And if you are working with data stored in a database, SQL will be the best tool for you to use for your analysis. Nevertheless, it is a very important tool to learn to become a proper data analyst.

SQL dialects and their uses

In this reading, you will learn more about SQL dialects and some of their different uses. As a quick refresher, Structured Query Language, or SQL, is a language used to talk to databases. Learning SQL can be a lot like learning a new language—including the fact that languages usually have different dialects within them. Some database products have their own variant of SQL, and these different varieties of SQL dialects are what help you communicate with each database product.

These dialects will be different from company to company and might change over time if the company moves to another database system. So, a lot of analysts start with Standard SQL and then adjust the dialect they use based on what database they are working with. Standard SQL works with a majority of databases and requires a small number of syntax changes to adapt to other dialects.

As a junior data analyst, it is important to know that there are slight differences between dialects. But by mastering Standard SQL, which is the dialect you will be working with in this program, you will be prepared to use SQL in any database. 

More information / Reference Guides

You may not need to know every SQL dialect, but it is useful to know that these different dialects exist. If you are interested in learning more about SQL dialects and when they are used, you can check out these resources for more information:

Technical: Understand how data is measured under the hood

All information in a computer is represented as a binary number consisting solely of 0’s and 1’s. Each 0 or 1 in a number is a bit, which is the smallest unit of storage in computers. Data is measured by the number of bits it takes to represent it. This is then described in bytes, which are equal to 8 bits (1 bytes = 8 bits).

Take a moment to examine the table below to understand each data measurement and its size relative to the others..

UnitAbbreviationEquivalent toExample (with approximate size)
ByteB8 bits1 character in a string (1 byte)
KilobyteKB1024 bytesA page of text (4 kilobytes)
MegabyteMB1024 Kilobytes1 song in MP3 format (2-3 megabytes)
GigabyteGB1024 Megabytes300 songs in MP3 format (1 gigabyte)
TerabyteTB1024 Gigabytes500 hours of HD video (1 terabyte)
PetabytePB1024 Terabytes10 billion Facebook photos (1 petabyte)
ExabyteEB1024 Petabytes500 million hours of HD video (1 exabyte)
ZettabyteZB1024 ExabytesAll the data on the internet in 2019 (4.5 ZB)

Now that you’ve explored data measurements, think about the amount of data in the world. It’s growing at an incredible pace, largely due to the more than 5.3 billion people in the world connected to the internet (as of November 2023). Smartphones and other internet-connected devices generate a staggering amount of new data. Many experts believe that the amount of all the data on the internet will swell to 175 ZB by the end of 2025!

Terms and definitions for Course 4, Module 3

CAST: A SQL function that converts data from one data type to another

COALESCE: A SQL function that returns non-null values in a list

CONCAT: A SQL function that adds strings together to create new text strings that can be used as unique keys

DISTINCT: A keyword that is added to a SQL SELECT statement to retrieve only non-duplicate entries

Float: A number that contains a decimal

Substring: A subset of a text string

Typecasting: Converting data from one type to another

Series Navigation<< Course 4: Process Data from Dirty to Clean, Module 2: Clean it upCourse 4: Process Data from Dirty to Clean, Module 4: Verify and Report Results >>
Tagged ,

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.