- Foundation Course Module 1: Introduction to data analytics and analytical thinking
- Foundation Course Module 2: The wonderful world of data
- Foundation Course Module 3: Setup of data analytics toolbox
- Foundation Course Module 4: Becoming a fair and impactful data professional
- Foundation Course: Glossary
- Course 2: Ask questions to make data-driven decisions, Module 1: Ask effective questions
- Course 2: Ask questions to make data-driven decisions, Module 2: Make data-driven decisions
- Course 2: Ask questions to make data-driven decisions, Module 3: Spreadsheet magic
- Course 2: Ask questions to make data-driven decisions, Module 4: Always remember the stakeholder
- Course 3: Prepare Data For Exploration: Learning objectives and overviews
- Course 3: Prepare Data For Exploration, Module 1: Data types and structures
- Course 3: Prepare Data For Exploration, Module 2: Data responsibility
- Course 3: Prepare Data For Exploration, Module 3: Database Essentials
- Course 3: Prepare Data For Exploration, Module 4: Organise and Secure Data
- Course 4: Process Data from Dirty to Clean: Overview
- Course 4: Process Data from Dirty to Clean, Module 1: The importance of integrity
- Course 4: Process Data from Dirty to Clean, Module 2: Clean it up
- Course 4: Process Data from Dirty to Clean, Module 3: SQL
- Course 4: Process Data from Dirty to Clean, Module 4: Verify and Report Results
- Course 5: Analyse Data to Answer Questions, Module 1: Organise data for more effective analysis
- Course 5: Analyse Data to Answer Questions, Module 2: Format and adjust data
- Course 5: Analyse Data to Answer Questions, Module 3: Aggregate data for analysis
- Course 5: Analyse Data to Answer Questions, Module 4: Perform Data Calculations
- Course 6: Share Data Through the Art of Visualisation, Course Overview plus Module 1: Visualise Data
- Course 6: Share Data Through the Art of Visualisation, Course Overview plus Module 2: Create Data Visualisation with Tableau
What is dirty data?
Earlier, we discussed that dirty data is data that is incomplete, incorrect, or irrelevant to the problem you are trying to solve. This section summarizes:
- Types of dirty data you may encounter
- What may have caused the data to become dirty
- How dirty data is harmful to businesses
Types of dirty data
Duplicate data
Description | Possible causes | Potential harm to businesses |
---|---|---|
Any data record that shows up more than once | Manual data entry, batch data imports, or data migration | Skewed metrics or analyses, inflated or inaccurate counts or predictions, or confusion during data retrieval |
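The duplicate check described above can be sketched in plain Python. The records below are hypothetical; a full record that appears more than once is flagged as a duplicate.

```python
from collections import Counter

# Hypothetical customer records; the last entry duplicates the first.
records = [
    ("C001", "John Smith", "john@example.com"),
    ("C002", "Maria Lopez", "maria@example.com"),
    ("C001", "John Smith", "john@example.com"),
]

# Count how often each full record appears; anything above 1 is a duplicate.
counts = Counter(records)
duplicates = [rec for rec, n in counts.items() if n > 1]
print(duplicates)  # [('C001', 'John Smith', 'john@example.com')]
```

In a spreadsheet, the same check is the Remove duplicates tool; in SQL, it is typically a `GROUP BY ... HAVING COUNT(*) > 1` query.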
Outdated data
Description | Possible causes | Potential harm to businesses |
---|---|---|
Any data that is old and should be replaced with newer, more accurate information | People changing roles or companies, or software and systems becoming obsolete | Inaccurate insights, decision-making, and analytics |
Incomplete data
Description | Possible causes | Potential harm to businesses |
---|---|---|
Any data that is missing important fields | Improper data collection or incorrect data entry | Decreased productivity, inaccurate insights, or inability to complete essential services |
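A quick way to surface incomplete records is to check every row against the list of fields it must contain. This is a minimal sketch; the rows and required fields are made up for illustration.

```python
# Hypothetical records where some fields are blank or absent.
rows = [
    {"name": "Ada", "email": "ada@example.com", "city": "London"},
    {"name": "Grace", "email": "", "city": "Arlington"},
    {"name": "Alan", "email": "alan@example.com", "city": None},
]

required = ["name", "email", "city"]

# A row is incomplete if any required field is missing, empty, or None.
incomplete = [
    row for row in rows
    if any(not row.get(field) for field in required)
]
print(len(incomplete))  # 2
```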
Incorrect/inaccurate data
Description | Possible causes | Potential harm to businesses |
---|---|---|
Any data that is complete but inaccurate | Human error inserted during data input, fake information, or mock data | Inaccurate insights or decision-making based on bad information, resulting in revenue loss |
Inconsistent data
Description | Possible causes | Potential harm to businesses |
---|---|---|
Any data that uses different formats to represent the same thing | Data stored incorrectly or errors inserted during data transfer | Contradictory data points leading to confusion or inability to classify or segment customers |
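Inconsistent formats often show up in date columns, where the same date is recorded several different ways. A minimal sketch of normalizing them to one format, assuming the set of formats in use is known; the sample dates are hypothetical.

```python
from datetime import datetime

# Hypothetical column where the same date appears in three formats.
raw_dates = ["2024-03-01", "03/01/2024", "March 1, 2024"]

# The formats you expect to encounter in this dataset.
formats = ["%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y"]

def normalize(value):
    """Try each known format; return an ISO date, or None for manual review."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None

print([normalize(d) for d in raw_dates])  # ['2024-03-01', '2024-03-01', '2024-03-01']
```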
Business impact of dirty data
For further reading on the business impact of dirty data, enter the term “dirty data” into your preferred browser’s search bar to bring up numerous articles on the topic. Here are a few impacts cited for certain industries from a previous search:
- Banking: Inaccuracies cost companies between 15% and 25% of revenue (source).
- Digital commerce: Up to 25% of B2B database contacts contain inaccuracies (source).
- Marketing and sales: 99% of companies are actively tackling data quality in some way (source).
- Healthcare: Duplicate records can account for 10%, and even up to 20%, of a hospital’s electronic health records (source).
Dirty data includes duplicate data, outdated data, incomplete data, incorrect or inaccurate data, and inconsistent data. Each type of dirty data can have a significant impact on analyses, leading to inaccurate insights, poor decision-making, and revenue loss. There are a number of causes of dirty data, including manual data entry errors, batch data imports, data migration, software obsolescence, improper data collection, and human errors during data input. As a data professional, you can take steps to mitigate the impact of dirty data by implementing effective data quality processes.
Common mistakes to avoid
- Not checking for spelling errors: Misspellings can be as simple as typing or input errors. Most of the time the wrong spelling or common grammatical errors can be detected, but it gets harder with things like names or addresses. For example, if you are working with a spreadsheet table of customer data, you might come across a customer named “John” whose name has been input incorrectly as “Jon” in some places. The spreadsheet’s spellcheck probably won’t flag this, so if you don’t double-check for spelling errors and catch this, your analysis will have mistakes in it.
- Forgetting to document errors: Documenting your errors can be a big time saver, as it helps you avoid those errors in the future by showing you how you resolved them. For example, you might find an error in a formula in your spreadsheet. You discover that some of the dates in one of your columns haven’t been formatted correctly. If you make a note of this fix, you can reference it the next time your formula is broken, and get a head start on troubleshooting. Documenting your errors also helps you keep track of changes in your work, so that you can backtrack if a fix didn’t work.
- Not checking for misfielded values: A misfielded value happens when the values are entered into the wrong field. These values might still be formatted correctly, which makes them harder to catch if you aren’t careful. For example, you might have a dataset with columns for cities and countries. These are the same type of data, so they are easy to mix up. But if you were trying to find all of the instances of Spain in the country column, and Spain had mistakenly been entered into the city column, you would miss key data points. Making sure your data has been entered correctly is key to accurate, complete analysis.
- Overlooking missing values: Missing values in your dataset can create errors and give you inaccurate conclusions. For example, if you were trying to get the total number of sales from the last three months, but a week of transactions were missing, your calculations would be inaccurate. As a best practice, try to keep your data as clean as possible by maintaining completeness and consistency.
- Only looking at a subset of the data: It is important to think about all of the relevant data when you are cleaning. This helps make sure you understand the whole story the data is telling, and that you are paying attention to all possible errors. For example, if you are working with data about bird migration patterns from different sources, but you only clean one source, you might not realize that some of the data is being repeated. This will cause problems in your analysis later on. If you want to avoid common errors like duplicates, each field of your data requires equal attention.
- Losing track of business objectives: When you are cleaning data, you might make new and interesting discoveries about your dataset, but you don’t want those discoveries to distract you from the task at hand. For example, if you were working with weather data to find the average number of rainy days in your city, you might notice some interesting patterns about snowfall, too. That is really interesting, but it isn’t related to the question you are trying to answer right now. Being curious is great, but try not to let it pull your focus from the question you set out to answer.
- Not fixing the source of the error: Fixing the error itself is important. But if that error is actually part of a bigger problem, you need to find the source of the issue. Otherwise, you will have to keep fixing that same error over and over again. For example, imagine you have a team spreadsheet that tracks everyone’s progress. The table keeps breaking because different people are entering different values. You can keep fixing all of these problems one by one, or you can set up your table to streamline data entry so everyone is on the same page. Addressing the source of the errors in your data will save you a lot of time in the long run.
- Not analysing the system prior to data cleaning: If you want to clean your data and avoid future errors, you need to understand the root cause of your dirty data. Imagine you are an auto mechanic. You would find the cause of the problem before you started fixing the car, right? The same goes for data. First, figure out where the errors come from. Maybe they come from data entry mistakes, a missing spell check, inconsistent formats, or duplicates. Then, once you understand where bad data comes from, you can control it and keep your data clean.
- Not backing up your data prior to data cleaning: It is always good to be proactive and create a backup before you start your data clean-up. If your program crashes, or if your changes cause a problem in your dataset, you can always go back to the saved version and restore it. The simple procedure of backing up your data can save you hours of work, and most importantly, a headache.
- Not accounting for data cleaning in your deadlines/process: All good things take time, and that includes data cleaning. It is important to keep that in mind when going through your process and looking at your deadlines. When you set aside time for data cleaning, it helps you get a more accurate estimate for ETAs for stakeholders, and can help you know when to request an adjusted ETA.
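The “John”/“Jon” spelling example in the list above is the kind of near-duplicate a fuzzy matcher can surface. A minimal sketch using Python’s `difflib`; the names, the canonical list, and the 0.8 similarity cutoff are all illustrative choices, not fixed rules.

```python
import difflib

# Hypothetical customer names; "Jon" is a likely misspelling of "John".
names = ["John", "Maria", "Jon", "Grace"]
canonical = ["John", "Maria", "Grace"]

# Flag names that are not canonical but closely resemble a canonical name.
suspects = {}
for name in names:
    if name not in canonical:
        matches = difflib.get_close_matches(name, canonical, n=1, cutoff=0.8)
        if matches:
            suspects[name] = matches[0]
print(suspects)  # {'Jon': 'John'}
```

Matches like these still need a human decision: “Jon” may be a typo, or a different customer entirely.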
Additional resources
Refer to these “top ten” lists for data cleaning in Microsoft Excel and Google Sheets to help you avoid the most common mistakes:
- Top ten ways to clean your data: Review an orderly guide to data cleaning in Microsoft Excel.
- 10 Google Workspace tips to clean up data: Learn best practices for data cleaning in Google Sheets.
Workflow automation
Workflow automation is the process of automating parts of your work. That could mean creating an event trigger that sends a notification when a system is updated. Or it could mean automating parts of the data cleaning process. As you can probably imagine, automating different parts of your work can save you tons of time, increase productivity, and give you more bandwidth to focus on other important aspects of the job.
What can be automated?
Automation sounds amazing, doesn’t it? But as convenient as it is, there are still some parts of the job that can’t be automated. Let’s take a look at some things we can automate and some things that we can’t.
Task | Can it be automated? | Why? |
---|---|---|
Communicating with your team and stakeholders | No | Communication is key to understanding the needs of your team and stakeholders as you complete the tasks you are working on. There is no replacement for person-to-person communications. |
Presenting your findings | No | Presenting your data is a big part of your job as a data analyst. Making data accessible and understandable to stakeholders and creating data visualizations can’t be automated for the same reasons that communications can’t be automated. |
Preparing and cleaning data | Partially | Some tasks in data preparation and cleaning can be automated by setting up specific processes, like using a programming script to automatically detect missing values. |
Data exploration | Partially | Sometimes the best way to understand data is to see it. Luckily, there are plenty of tools available that can help automate the process of visualizing data. These tools can speed up the process of visualizing and understanding the data, but the exploration itself still needs to be done by a data analyst. |
Modeling the data | Yes | Data modeling is a difficult process that involves lots of different factors; luckily there are tools that can completely automate the different stages. |
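The “partially automated” row for data preparation mentions a script that detects missing values. Here is a minimal sketch of what such a script might look like using only the Python standard library; the CSV content is made up for illustration.

```python
import csv
import io

# Hypothetical CSV export with a blank transaction amount in one row.
raw = """date,amount
2024-01-05,120.50
2024-01-06,
2024-01-07,89.99
"""

# Scan every row and record which (line, column) pairs are empty.
missing = []
reader = csv.DictReader(io.StringIO(raw))
for line_num, row in enumerate(reader, start=2):  # line 1 is the header
    for column, value in row.items():
        if value is None or value.strip() == "":
            missing.append((line_num, column))
print(missing)  # [(3, 'amount')]
```

Pointed at a real export instead of an in-memory string, a script like this can run automatically whenever new data lands.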
More about automating data cleaning
One of the most important ways you can streamline your data cleaning is to clean data where it lives. This benefits your whole team, and it also means you don’t have to repeat the process over and over. For example, you could create a programming script that counts the number of words in each spreadsheet file stored in a specific folder. Using tools that work where your data is stored means you don’t have to repeat your cleaning steps, saving you and your team time and energy.
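The word-count example above can be sketched as a short script. This version creates two small CSV files in a temporary folder just to demonstrate the idea; in practice you would point it at the folder where your data actually lives.

```python
import csv
from pathlib import Path
from tempfile import TemporaryDirectory

def word_counts(folder):
    """Count the words in every CSV file stored in a folder."""
    counts = {}
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="") as f:
            counts[path.name] = sum(
                len(cell.split()) for row in csv.reader(f) for cell in row
            )
    return counts

# Demonstrate on two small files in a temporary folder.
with TemporaryDirectory() as folder:
    Path(folder, "a.csv").write_text("name,notes\nAda,first program\n")
    Path(folder, "b.csv").write_text("name\nGrace\n")
    result = word_counts(folder)
print(result)  # {'a.csv': 5, 'b.csv': 2}
```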
More resources
There are a lot of tools out there that can help automate your processes, and those tools are improving all the time. Here are a few articles or blogs you can check out if you want to learn more about workflow automation and the different tools out there for you to use:
- Towards Data Science’s Automating Scientific Data Analysis
- MIT News’ Automating Big-Data Analysis
- TechnologyAdvice’s 10 of the Best Options for Workflow Automation Software
Develop your approach to cleaning data
As you continue on your data journey, you’re likely discovering that data is often messy—and you can expect raw, primary data to be imperfect. In this reading, you’ll consider how to develop your personal approach to cleaning data. You will explore the idea of a cleaning checklist, which you can use to guide your cleaning process. Then, you’ll define your preferred methods for cleaning data. By the time you complete this reading, you’ll have a better understanding of how to methodically approach the data cleaning process. This will save you time when cleaning data and help you ensure that your data is clean and usable.
Consider your approach to cleaning data
Data cleaning usually requires a lot of time, energy, and attention. But there are two steps you can take before you begin to help streamline your process: creating a cleaning checklist and deciding on your preferred methods. This will help ensure that you know exactly how you want to approach data cleaning and what you need to do to be confident in the integrity of your data.
Your cleaning checklist
Start developing your personal approach to cleaning data by creating a checklist to help you identify problems in your data efficiently and identify the scale and scope of your dataset. Think of this checklist as your default “what to search for” list.
Here are some examples of common data cleaning tasks you could include in your checklist:
- Determine the size of the dataset: Large datasets may have more data quality issues and take longer to process. This may impact your choice of data cleaning techniques and how much time to allocate to the project.
- Determine the number of categories or labels: By understanding the number and nature of categories and labels in a dataset, you can better understand the diversity of the dataset. This understanding also helps inform data merging and migration strategies.
- Identify missing data: Recognizing missing data helps you understand data quality so you can take appropriate steps to remediate the problem. Data integrity is important for accurate and unbiased analysis.
- Identify unformatted data: Identifying improperly or inconsistently formatted data helps analysts ensure data uniformity. This is essential for accurate analysis and visualization.
- Explore the different data types: Understanding the types of data in your dataset (for instance, numerical, categorical, text) helps you select appropriate cleaning methods and apply relevant data analysis techniques.
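The checklist items above translate directly into a first-pass profile of a dataset: its size, its categories, its missing values, and its data types. A minimal sketch on a hypothetical in-memory dataset; real datasets would be loaded from a file or database first.

```python
# Hypothetical dataset as a list of dicts; a quick first-pass profile
# covering the checklist: size, categories, missing data, and types.
data = [
    {"product": "chair", "price": "49.99", "in_stock": "yes"},
    {"product": "desk", "price": "", "in_stock": "no"},
    {"product": "lamp", "price": "19.99", "in_stock": "yes"},
]

size = len(data)                                          # dataset size
categories = sorted({row["in_stock"] for row in data})    # distinct labels
missing = sum(1 for row in data for v in row.values() if not v)
types = {key: type(value).__name__ for key, value in data[0].items()}

print(size, categories, missing)  # 3 ['no', 'yes'] 1
```

Running a profile like this before cleaning tells you where to spend your time, exactly as the checklist intends.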
There might be other data cleaning tasks you’ve been learning about that you also want to prioritize in your checklist. Your checklist is an opportunity for you to define exactly what you want to remember about cleaning your data; feel free to make it your own.
Your preferred cleaning methods
In addition to creating a checklist, identify which actions or tools you prefer using when cleaning data. You’ll use these tools and techniques with each new dataset—or whenever you encounter issues in a dataset—so this list should be compatible with your checklist.
For example, suppose you have a large dataset with missing data. You’ll want to know how to check for missing data in larger datasets, and how you plan to handle any missing data, before you start cleaning. Outlining your preferred methods can save you lots of time and energy.
Glossary terms from module 2
Terms and definitions for Course 4, Module 2
Clean data: Data that is complete, correct, and relevant to the problem being solved
Compatibility: How well two or more datasets are able to work together
CONCATENATE: A spreadsheet function that joins together two or more text strings
Conditional formatting: A spreadsheet tool that changes how cells appear when values meet specific conditions
Data engineer: A professional who transforms data into a useful format for analysis and gives it a reliable infrastructure
Data mapping: The process of matching fields from one data source to another
Data merging: The process of combining two or more datasets into a single dataset
Data validation: A tool for checking the accuracy and quality of data
Data warehousing specialist: A professional who develops processes and procedures to effectively store and organize data
Delimiter: A character that indicates the beginning or end of a data item
Dirty data: Data that is incomplete, incorrect, or irrelevant to the problem to be solved
Duplicate data: Any record that inadvertently shares data with another record
Field length: A tool for determining how many characters can be keyed into a spreadsheet field
Incomplete data: Data that is missing important fields
Inconsistent data: Data that uses different formats to represent the same thing
Incorrect/inaccurate data: Data that is complete but inaccurate
LEFT: A function that returns a set number of characters from the left side of a text string
LEN: A function that returns the length of a text string by counting the number of characters it contains
Length: The number of characters in a text string
Merger: An agreement that unites two organizations into a single new one
MID: A function that returns a segment from the middle of a text string
Null: An indication that a value does not exist in a dataset
Outdated data: Any data that has been superseded by newer and more accurate information
Remove duplicates: A spreadsheet tool that automatically searches for and eliminates duplicate entries from a spreadsheet
Split: A function that divides text around a specified character and puts each fragment into a new, separate cell
Substring: A smaller subset of a text string
Text string: A group of characters within a cell, most often composed of letters
TRIM: A function that removes leading, trailing, and repeated spaces in data
Unique: A value that can’t have a duplicate