Technical, Data Analytics

Course 4: Process Data from Dirty to Clean, Module 4: Verify and Report Results

When we clean data, you make changes to the original dataset. It’s important to verify the changes we make are accurate and to let your teammates know about the changes. In this part of the course, we’ll learn to verify that data is clean and report our data cleaning results. With verified clean data, we are ready to begin analysing!

Learning Objectives


  • Describe the process involved in verifying the results of cleaning data
  • Describe what is involved in manually cleaning data
  • Discuss the elements and importance of data-cleaning reports
  • Describe the benefits of documenting data cleaning process

Case Study: Verification of data cleaning

What you’ll need

If you’d like to follow along with the examples in this video, choose a spreadsheet tool. Google Sheets or Excel are recommended.

To access the spreadsheet the instructor uses in this video, click the link to the template to create a copy of the dataset. If you don’t have a Google account, download the data directly from the attachments below.

Link to dataset: Jeff’s Party Planet – Data for Cleaning

Example 1: Verify data with spreadsheets

Use spreadsheet tools such as Find and Replace and pivot tables to find, understand, and fix errors in your spreadsheet.

Use Find and Replace to replace all instances of a mistake

  1. Use the Jeff’s Party Planet – Data for Cleaning dataset.
  2. From the Edit menu, choose Find and Replace to open the Find and replace dialogue box.
  3. In the Find field, enter the misspelled word in the supplier name, Plos.
  4. In the Replace with field, enter Plus.
  5. Click Replace all to replace all instances of “Plos” with “Plus”. Click Done to close the Find and replace dialogue box.
  6. Select the Undo button to use a different method to correct this misspelling. This can also be done with Ctrl (Windows) or Command (Mac) Z.

Use a pivot table to understand errors in a spreadsheet

  1. Select the Suppliers column.
  2. Select Insert > Pivot Table. In the Create pivot table dialogue box, choose New Sheet then Create.
  3. This creates a new tab that is mostly blank.
  4. Additionally, the Pivot table editor pane is in the window.
  5. Next to Rows. Select Add, then the Suppliers column.
  6. Next to Values, select Add then select Suppliers. This adds a value for the Suppliers column.
  7. By default, Google Sheets sets the value to summarize by COUNTA (the total number of values in a range). This will show how many times each supplier name comes up. It’s a great way to check for misspellings and other anomalies. Note: Don’t use COUNT, because COUNT counts only numerical values.
  8. When there is only one instance of the misspelled name, manually change it to the correct spelling.
  9. To return to the original sheet, select the Sheet1 tab.

Example 2: Use a CASE statement to verify data in SQL (in BigQuery)

Use CASE statements to correct misspellings in SQL.

1. The SQL table used in this example is not available for download, but if you were performing a similar query, you’d first make sure to load the data in BigQuery.

2. Start your SQL query with the basic structure:

SELECT

FROM

WHERE

3. In the FROM clause, specify the table you’re pulling data from after FROM. For example, project-id.customer_data.customer_name

4. In the SELECT clause, specify the columns you want to return. In this example, you want customer_id and first_name.

5. However, there is a misspelling in a customer’s first name.

i. To correct the misspelled name “Tnoy” to “Tony”, use a CASE statement.

ii. Enter CASE. On the next line, enter WHEN first_name = ‘Tnoy’THEN ‘Tony’. This tells SQL to replace any instances of Tnoy in the first_name column with Tony.

iii. On the next line, add the statement ELSE first_name to keep other names as they are.

iv. End the statement with END AS cleaned_name. This creates a new column called cleaned_name that will contain the data cleaned with the CASE statement.

6. Delete the WHERE clause because you don’t want to filter the query.

7. The final statement should be:

SQL
SELECT
    Customer_id,
    CASE
	  WHEN first_name = 'Tnoy' THEN 'Tony'
	  ELSE first_name
	  END AS cleaned_name
FROM
   project-id.customer_data.customer_name

8. This SQL query will correct the misspelled name and leave other names unchanged in a new column called cleaned_name. Note that this query corrects only the display of the name; it does not update the table’s data.

Data-cleaning verification checklist

This reading will give you a checklist of common problems you can refer to when doing your data cleaning verification, no matter what tool you are using. When it comes to data cleaning verification, there is no one-size-fits-all approach or a single checklist that can be universally applied to all projects. Each project has its own organization and data requirements that lead to a unique list of things to run through for verification. 

Keep in mind, as you receive more data or a better understanding of the project goal(s), you might want to revisit some or all of these steps. 

Correct the most common problems

Make sure you identified the most common problems and corrected them, including:

  • Sources of errors: Did you use the right tools and functions to find the source of the errors in your dataset?
  • Null data: Did you search for NULLs using conditional formatting and filters?
  • Misspelled words: Did you locate all misspellings?
  • Mistyped numbers: Did you double-check that your numeric data has been entered correctly?
  • Extra spaces and characters: Did you remove any extra spaces or characters using the TRIM function?
  • Duplicates: Did you remove duplicates in spreadsheets using the Remove Duplicates function or DISTINCT in SQL?
  • Mismatched data types: Did you check that numeric, date, and string data are typecast correctly?
  • Messy (inconsistent) strings: Did you make sure that all of your strings are consistent and meaningful?
  • Messy (inconsistent) date formats: Did you format the dates consistently throughout your dataset?
  • Misleading variable labels (columns): Did you name your columns meaningfully?
  • Truncated data: Did you check for truncated or missing data that needs correction?
  • Business Logic: Did you check that the data makes sense given your knowledge of the business? 

Review the goal of your project

Once you have finished these data cleaning tasks, it is a good idea to review the goal of your project and confirm that your data is still aligned with that goal. This is a continuous process that you will do throughout your project– but here are three steps you can keep in mind while thinking about this: 

  • Confirm the business problem 
  • Confirm the goal of the project
  • Verify that data can solve the problem and is aligned to the goal

Embrace changelogs

What do engineers, writers, and data analysts have in common? Change.

Engineers use engineering change orders (ECOs) to keep track of new product design details and proposed changes to existing products. Writers use document revision histories to keep track of changes to document flow and edits. And data analysts use changelogs to keep track of data transformation and cleaning. Here are some examples of these:

Automated version control takes you most of the way

Most software applications have a kind of history tracking built in. For example, in Google sheets, you can check the version history of an entire sheet or an individual cell and go back to an earlier version. In Microsoft Excel, you can use a feature called Track Changes. And in BigQuery, you can view the history to check what has changed.

Here’s how it works:

Google Sheets1. Right-click the cell and select Show edit history. 2. Click the left-arrow < or right arrow > to move backward and forward in the history as needed.
Microsoft Excel1. If Track Changes has been enabled for the spreadsheet: click Review. 2. Under Track Changes, click the Accept/Reject Changes option to accept or reject any change made.
BigQueryBring up a previous version (without reverting to it) and figure out what changed by comparing it to the current version.

Changelogs take you down the last mile

A changelog can build on your automated version history by giving you an even more detailed record of your work. This is where data analysts record all the changes they make to the data. Here is another way of looking at it. Version histories record what was done in a data change for a project, but don’t tell us why. Changelogs are super useful for helping us understand the reasons changes have been made. Changelogs have no set format and you can even make your entries in a blank document. But if you are using a shared changelog, it is best to agree with other data analysts on the format of all your log entries.

Typically, a changelog records:  

  • Data, file, formula, query, or any other component that changed
  • Description of what changed
  • Date of the change
  • Person who made the change
  • Person who approved the change 
  • Version number 
  • Reason for the change

Let’s say you made a change to a formula in a spreadsheet because you observed it in another report and you wanted your data to match and be consistent. If you found out later that the report was actually using the wrong formula, an automated version history would help you undo the change. But if you also recorded the reason for the change in a changelog, you could go back to the creators of the report and let them know about the incorrect formula. If the change happened a while ago, you might not remember who to follow up with. Fortunately, your changelog would have that information ready for you! By following up, you would ensure data integrity outside your project. You would also be showing personal integrity as someone who can be trusted with data. That is the power of a changelog!

Finally, a changelog is important for when lots of changes to a spreadsheet or query have been made. Imagine an analyst made four changes and the change they want to revert to is change #2. Instead of clicking the undo feature three times to undo change #2 (and losing changes #3 and #4), the analyst can undo just change #2 and keep all the other changes. Now, our example was for just 4 changes, but try to think about how important that changelog would be if there were hundreds of changes to keep track of.

Pro Tip:

If an analyst is making changes to an existing SQL query that is shared across the company, the company most likely uses what is called a version control system. An example might be a query that pulls daily revenue to build a dashboard for senior management. 

Here’s how a version control system affects a change to a query:

  1. A company has official versions of important queries in their version control system.
  2. An analyst makes sure the most up-to-date version of the query is the one they will change. This is called syncing 
  3. The analyst makes a change to the query.
  4. The analyst might ask someone to review this change. This is called a code review and can be informally or formally done. An informal review could be as simple as asking a senior analyst to take a look at the change.
  5. After a reviewer approves the change, the analyst submits the updated version of the query to a repository in the company’s version control system. This is called a code commit. A best practice is to document exactly what the change was and why it was made in a comments area. Going back to our example of a query that pulls daily revenue, a comment might be: Updated revenue to include revenue coming from the new product, Calypso.
  6. After the change is submitted, everyone else in the company will be able to access and use this new query when they sync to the most up-to-date queries stored in the version control system.
  7. If the query has a problem or business needs change, the analyst can undo the change to the query using the version control system. The analyst can look at a chronological list of all changes made to the query and who made each change. Then, after finding their own change, the analyst can revert to the previous version.
  8. The query is back to what it was before the analyst made the change. And everyone at the company sees this reverted, original query, too.

So how to create a changelog?

A changelog is a document used to record the notable changes made to a project over its lifetime across all of its tasks. It is typically curated so that the changes it records are listed chronologically across all versions of the project.

The major benefit to using changelogs is that contributors and users connected with the project get a specific list of what important alterations have been made, when they were made, and sometimes, what version they were released for. It is an invaluable tool for communicating how the project has evolved over time to coworkers, management, and stakeholders.

A changelog for a personal project may take any form desired. However, in a professional setting and while collaborating with others, readability is important. These guiding principles help to make a changelog accessible to others: 

  • Changelogs are for humans, not machines, so write legibly.
  • Every version should have its own entry.
  • Each change should have its own line.
  • Group the same types of changes. For example, Fixed should be grouped separately from Added.
  • Versions should be ordered chronologically starting with the latest.
  • The release date of each version should be noted.

All the changes for each category should be grouped together. Types of changes usually fall into one of the following categories:

  • Added: new features introduced
  • Changed: changes in existing functionality
  • Deprecated: features about to be removed
  • Removed: features that have been removed
  • Fixed: bug fixes
  • Security: lowering vulnerabilities

Examine the figure below for an example of a changelog. Note that the following example is written in Markdown, as it is common to keep changelogs as a readme file in a code repository. 

Markdown
# Changelog
This file contains the notable changes to the project

Version 1.0.0 (02-23-2019)
## New
    - Added column classifiers (Date, Time, PerUnitCost, TotalCost, etc. )
    - Added Column "AveCost" to track average item cost

## Changes 
    - Changed date format to MM-DD-YYYY
    - Removal of whitespace (cosmetic)

## Fixes
    - Fixed misalignment in Column "TotalCost" where some rows did not match with     
      correct dates
    - Fixed SUM to run over entire column instead of partial

Now that you’re familiar with the example, consider what changes you need to record in a changelog. To start, you record the various changes, additions, and fixes that were discussed above. Arrange them using bullets or numbering with one change per line. Group similar changes together with a label describing the change immediately above them.

Use different version numbers for each milestone reached in your project. Within each version, place the logged changes that were made since the previous version (milestone). Dates are not generally necessary for each change, but they are recommended for each version.

You can do this using a simple text file or spreadsheet and include your changelog with the project write-up. It will help you stay organized and collaborate with others.

Advanced functions for speedy data cleaning

some advanced functions that can help you speed up the data cleaning process in spreadsheets. Below is a table summarizing three functions and what they do:

Keeping data clean and in sync with a source

The IMPORTRANGE function in Google Sheets and the Paste Link feature (a Paste Special option in Microsoft Excel) both allow you to insert data from one sheet to another. Using these on a large amount of data is more efficient than manual copying and pasting. They also reduce the chance of errors being introduced by copying and pasting the wrong data. They are also helpful for data cleaning because you can “cherry pick” the data you want to analyze and leave behind the data that isn’t relevant to your project. Basically, it is like canceling noise from your data so you can focus on what is most important to solve your problem. This functionality is also useful for day-to-day data monitoring; with it, you can build a tracking spreadsheet to share the relevant data with others. The data is synced with the data source so when the data is updated in the source file, the tracked data is also refreshed.

In Google Sheets, you can use the IMPORTRANGE function. It enables you to specify a range of cells in the other spreadsheet to duplicate in the spreadsheet you are working in. You must allow access to the spreadsheet containing the data the first time you import the data.

The URL shown below is for syntax purposes only. Don’t enter it in your own spreadsheet. Replace it with a URL to a spreadsheet you have created so you can control access to it by clicking the Allow access button.

Refer to the Google support page for IMPORTRANGE for the sample usage and syntax.

Example of using IMPORTRANGE

An analyst monitoring a fundraiser needs to track and ensure that matching funds are distributed. They use IMPORTRANGE to pull all the matching transactions into a spreadsheet containing all of the individual donations. This enables them to determine which donations eligible for matching funds still need to be processed. Because the total number of matching transactions increases daily, they simply need to change the range used by the function to import the most up-to-date data. 

On Tuesday, they use the following to import the donor names and matched amounts:

=IMPORTRANGE(“https://docs.google.com/spreadsheets/d/abcd123abcd123”, “sheet1!A1:C10”, “Matched Funds!A1:B4001”)

On Wednesday, another 500 transactions were processed. They increase the range used by 500 to easily include the latest transactions when importing the data to the individual donor spreadsheet:

=IMPORTRANGE(“https://docs.google.com/spreadsheets/d/abcd123abcd123”, “Matched Funds!A1:B4501”)

Note: The above examples are for illustrative purposes only. Don’t copy and paste them into your spreadsheet. To try it out yourself, you will need to substitute your own URL (and sheet name if you have multiple tabs) along with the range of cells in the spreadsheet that you have populated with data.

Pulling data from other data sources

The QUERY function is also useful when you want to pull data from another spreadsheet. The QUERY function’s SQL-like ability can extract specific data within a spreadsheet. For a large amount of data, using the QUERY function is faster than filtering data manually. This is especially true when repeated filtering is required. For example, you could generate a list of all customers who bought your company’s products in a particular month using manual filtering. But if you also want to figure out customer growth month over month, you have to copy the filtered data to a new spreadsheet, filter the data for sales during the following month, and then copy those results for the analysis. With the QUERY function, you can get all the data for both months without a need to change your original dataset or copy results.

The QUERY function syntax is similar to IMPORTRANGE. You enter the sheet by name and the range of data that you want to query from, and then use the SQL SELECT command to select the specific columns. You can also add specific criteria after the SELECT statement by including a WHERE statement. But remember, all of the SQL code you use has to be placed between the quotes!

Google Sheets run the Google Visualization API Query Language across the data. Excel spreadsheets use a query wizard to guide you through the steps to connect to a data source and select the tables. In either case, you are able to be sure that the data imported is verified and clean based on the criteria in the query.

Examples of using QUERY

Check out the Google support page for the QUERY function with sample usage, syntax, and examples you can download in a Google sheet.

Link to make a copy of the sheet: QUERY examples

The solution

Analysts can use SQL to pull a specific dataset into a spreadsheet. They can then use the QUERY function to create multiple tabs (views) of that dataset. For example, one tab could contain all the sales data for a particular month and another tab could contain all the sales data from a specific region. This solution illustrates how SQL and spreadsheets are used well together.

Filtering data to get what you want

The FILTER function is fully internal to a spreadsheet and doesn’t require the use of a query language. The FILTER function lets you view only the rows (or columns) in the source data that meet your specified conditions. It makes it possible to pre-filter data before you analyze it.

The FILTER function might run faster than the QUERY function. But keep in mind, the QUERY function can be combined with other functions for more complex calculations. For example, the QUERY function can be used with other functions like SUM and COUNT to summarize data, but the FILTER function can’t.

Example of using FILTER

Check out the Google support page for the FILTER function with sample usage, syntax, and examples you can download in a Google sheet.

Link to make a copy of the sheet: FILTER examples

Glossary terms from module 4

Terms and definitions for Course 4, Module 4

CASE: A SQL statement that returns records that meet conditions by including an if/then statement in a query

Changelog: A file containing a chronologically ordered list of modifications made to a project

COUNTA: A spreadsheet function that counts the total number of values within a specified range

Find and replace: A tool that finds a specified search term and replaces it with something else

Verification: A process to confirm that a data-cleaning effort was well executed and the resulting data is accurate and reliable

Series Navigation<< Course 4: Process Data from Dirty to Clean, Module 3: SQLCourse 5: Analyse Data to Answer Questions, Module 1: Organise data for more effective analysis >>
Tagged ,

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.