- Foundation Course Module 1: Introduction of data analytics and analytical thinking
- Foundation Course Module 2 : The wonderful world of data
- Foundation Course Module 3 : Setup of data analytics toolbox
- Foundation Course Module 4: Becoming a fair and impactful data professional
- Foundation Course: Glossary
- Course 2: Ask questions to make data driven decisions, Module 1: Ask effective questions
- Course 2: Ask questions to make data driven decisions, Module 2: Make data-driven decisions
- Course 2: Ask questions to make data driven decisions, Module 3: Spreadsheet magic
- Course 2: Ask questions to make data driven decisions, Module 4: Always remember the stakeholder
- Course 3: Prepare Data For Exploration: Learning objectives and overviews
- Course 3: Prepare Data For Exploration, Module 1: Data types and structures
- Course 3: Prepare Data For Exploration, Module 2: Data responsibility
- Course 3: Prepare Data For Exploration, Module 3: Database Essentials
- Course 3: Prepare Data For Exploration, Module 4: Organise and Secure Data
- Course 4: Process Data from Dirty to Clean: Overview
- Course 4: Process Data from Dirty to Clean, Module 1: The importance of integrity
- Course 4: Process Data from Dirty to Clean, Module 2: Clean it up
- Course 4: Process Data from Dirty to Clean, Module 3: SQL
- Course 4: Process Data from Dirty to Clean, Module 4: Verify and Report Results
- Course 5: Analyse Data to Answer Questions, Module 1: Organise data for more effective analysis
- Course 5: Analyse Data to Answer Questions, Module 2: Format and adjust data
- Course 5: Analyse Data to Answer Questions, Module 3: Aggregate data for analysis
- Course 5: Analyse Data to Answer Questions, Module 4: Perform Data Calculations
- Course 6: Share Data Through the Art of Visualisation, Course Overview plus Module 1: Visualise Data
- Course 6: Share Data Through the Art of Visualisation, Course Overview plus Module 2: Create Data Visualisation with Tableau
Knowing a variety of ways to clean data can make a data analyst’s job much easier.
Learning Objectives
- Describe how SQL can be used to clean large datasets
- Compare spreadsheet data-cleaning functions to those associated with SQL in databases
- Develop basic SQL queries for use with databases
- Apply basic SQL functions for use in cleaning string variables in a database
- Apply basic SQL functions for transforming data variables
How a junior data analyst uses SQL
In this reading, you will learn more about how to decide when to use SQL, or Structured Query Language. As a data analyst, you will be tasked with handling a lot of data, and SQL is one of the tools that can help make your work a lot easier. SQL is the primary way data analysts extract data from databases. As a data analyst, you will work with databases all the time, which is why SQL is such a key skill. Let’s follow along as a junior data analyst uses SQL to solve a business task.
The business task and context
The junior data analyst in this example works for a social media company. A new business model was implemented on February 15, 2020 and the company wants to understand how their user-growth compares to the previous year. Specifically, the data analyst was asked to find out how many users have joined since February 15, 2020.
Spreadsheets functions and formulas or SQL queries?
Before they can address this question, this data analyst needs to choose what tool to use. First, they have to think about where the data lives. If it is stored in a database, then SQL is the best tool for the job. But if it is stored in a spreadsheet, then they will have to perform their analysis in that spreadsheet. In that scenario, they could create a pivot table of the data and then apply specific formulas and filters to their data until they were given the number of users that joined after February 15th. It isn’t a really complicated process, but it would involve a lot of steps.
In this case, the data is stored in a database, so they will have to work with SQL. And this data analyst knows they could get the same results with a single SQL query:
SELECT
COUNT(DISTINCT user_id) AS count_of_unique_users
FROM
table
WHERE
join_date >= '2020-02-15'
Spreadsheets and SQL both have their advantages and disadvantages:
Features of Spreadsheets | Features of SQL Databases |
---|---|
Smaller data sets | Larger datasets |
Enter data manually | Access tables across a database |
Create graphs and visualizations in the same program | Prepare data for further analysis in another software |
Built-in spell check and other useful functions | Fast and powerful functionality |
Best when working solo on a project | Great for collaborative work and tracking queries run by all users |
When it comes down to it, where the data lives will decide which tool you use. If you are working with data that is already in a spreadsheet, that is most likely where you will perform your analysis. And if you are working with data stored in a database, SQL will be the best tool for you to use for your analysis. Nevertheless, it is a very important tool to learn to become a proper data analyst.
SQL dialects and their uses
In this reading, you will learn more about SQL dialects and some of their different uses. As a quick refresher, Structured Query Language, or SQL, is a language used to talk to databases. Learning SQL can be a lot like learning a new language—including the fact that languages usually have different dialects within them. Some database products have their own variant of SQL, and these different varieties of SQL dialects are what help you communicate with each database product.
These dialects will be different from company to company and might change over time if the company moves to another database system. So, a lot of analysts start with Standard SQL and then adjust the dialect they use based on what database they are working with. Standard SQL works with a majority of databases and requires a small number of syntax changes to adapt to other dialects.
As a junior data analyst, it is important to know that there are slight differences between dialects. But by mastering Standard SQL, which is the dialect you will be working with in this program, you will be prepared to use SQL in any database.
More information / Reference Guides
You may not need to know every SQL dialect, but it is useful to know that these different dialects exist. If you are interested in learning more about SQL dialects and when they are used, you can check out these resources for more information:
- LearnSQL’s blog, What Is a SQL Dialect, and Which One Should You Learn?
- Software Testing Help’s article, Differences Between SQL Vs MySQL vs SQL Server
- Datacamp’s blog, SQL Server, PostgreSQL, MySQL… what’s the difference? Where do I start? Note that there is an error in this blog article. The comparison table incorrectly states that SQlite uses subqueries instead of window functions. Refer to the SQLite Window Functions documentation for proper clarification.
- SQL Tutorial’s tutorial, What is SQL
- And my personal favourite course/tutorial provided by W3 Schools.com SQL Tutorial
Technical: Understand how data is measured under the hood
All information in a computer is represented as a binary number consisting solely of 0’s and 1’s. Each 0 or 1 in a number is a bit, which is the smallest unit of storage in computers. Data is measured by the number of bits it takes to represent it. This is then described in bytes, which are equal to 8 bits (1 bytes = 8 bits).
Take a moment to examine the table below to understand each data measurement and its size relative to the others..
Unit | Abbreviation | Equivalent to | Example (with approximate size) |
---|---|---|---|
Byte | B | 8 bits | 1 character in a string (1 byte) |
Kilobyte | KB | 1024 bytes | A page of text (4 kilobytes) |
Megabyte | MB | 1024 Kilobytes | 1 song in MP3 format (2-3 megabytes) |
Gigabyte | GB | 1024 Megabytes | 300 songs in MP3 format (1 gigabyte) |
Terabyte | TB | 1024 Gigabytes | 500 hours of HD video (1 terabyte) |
Petabyte | PB | 1024 Terabytes | 10 billion Facebook photos (1 petabyte) |
Exabyte | EB | 1024 Petabytes | 500 million hours of HD video (1 exabyte) |
Zettabyte | ZB | 1024 Exabytes | All the data on the internet in 2019 (4.5 ZB) |
Now that you’ve explored data measurements, think about the amount of data in the world. It’s growing at an incredible pace, largely due to the more than 5.3 billion people in the world connected to the internet (as of November 2023). Smartphones and other internet-connected devices generate a staggering amount of new data. Many experts believe that the amount of all the data on the internet will swell to 175 ZB by the end of 2025!
Terms and definitions for Course 4, Module 3
CAST: A SQL function that converts data from one data type to another
COALESCE: A SQL function that returns non-null values in a list
CONCAT: A SQL function that adds strings together to create new text strings that can be used as unique keys
DISTINCT: A keyword that is added to a SQL SELECT statement to retrieve only non-duplicate entries
Float: A number that contains a decimal
Substring: A subset of a text string
Typecasting: Converting data from one type to another