
Course 4: Process Data from Dirty to Clean, Module 1: The importance of integrity

Scenario: calendar dates for a global company

Calendar dates are represented in a lot of different short forms. Depending on where you live, a different format might be used. 

  • In some countries, the standard is DD/MM/YY, so 12/10/20 stands for October 12, 2020.
  • In other countries, the national standard is YYYY-MM-DD, so October 12, 2020 becomes 2020-10-12.
  • In the United States, the accepted format is MM/DD/YY, so October 12, 2020 is written as 10/12/20.

Now, think about what would happen if you were working as a data analyst for a global company and didn’t check date formats. Well, your data integrity would probably be questionable. Any analysis of the data would be inaccurate. Imagine ordering extra inventory for December when it was actually needed in October!
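To see how the same string can resolve to two different dates, here is a minimal Python sketch (the `standardize_date` helper is hypothetical, not part of the course material) that parses a short-form date under an assumed regional format and rewrites it as ISO 8601 (YYYY-MM-DD):

```python
from datetime import datetime

def standardize_date(date_str: str, source_format: str) -> str:
    """Parse a short-form date in its regional format and return ISO 8601 (YYYY-MM-DD)."""
    parsed = datetime.strptime(date_str, source_format)
    return parsed.strftime("%Y-%m-%d")

# The same string means different dates depending on the regional format:
standardize_date("12/10/20", "%d/%m/%y")  # DD/MM/YY -> "2020-10-12"
standardize_date("10/12/20", "%m/%d/%y")  # MM/DD/YY -> "2020-10-12"
standardize_date("12/10/20", "%m/%d/%y")  # MM/DD/YY -> "2020-12-10" (December!)
```

The last call shows the inventory mix-up from the scenario: the very same string, read under the U.S. convention, lands in December instead of October.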

Here are some other things to watch out for:

  • Data replication compromising data integrity: Continuing with the example, imagine you ask your international counterparts to verify dates and stick to one format. One analyst copies a large dataset to check the dates. But because of memory issues, only part of the dataset is actually copied. The analyst would be verifying and standardizing incomplete data. That partial dataset would be certified as compliant but the full dataset would still contain dates that weren’t verified. Two versions of a dataset can introduce inconsistent results. A final audit of results would be essential to reveal what happened and correct all dates.
  • Data transfer compromising data integrity: Another analyst checks the dates in a spreadsheet and chooses to import the validated and standardized data back to the database. But suppose the date field from the spreadsheet was incorrectly classified as a text field during the data import (transfer) process. Now some of the dates in the database are stored as text strings. At this point, the data needs to be cleaned to restore its integrity. 
  • Data manipulation compromising data integrity: When checking dates, another analyst notices what appears to be a duplicate record in the database and removes it. But it turns out that the analyst removed a unique record for a company’s subsidiary and not a duplicate record for the company. Your dataset is now missing data and the data must be restored for completeness.

Reference: Data constraints and examples

As you progress in your data journey, you’ll come across many types of data constraints (or criteria that determine validity). The table below offers definitions and examples of data constraint terms you might come across.

  • Data type: Values must be of a certain type: date, number, percentage, Boolean, etc. Example: if the data type is a date, a single number like 30 would fail the constraint and be invalid.
  • Data range: Values must fall between predefined maximum and minimum values. Example: if the data range is 10-20, a value of 30 would fail the constraint and be invalid.
  • Mandatory: Values can’t be left blank or empty. Example: if age is mandatory, that value must be filled in.
  • Unique: Values can’t have a duplicate. Example: two people can’t have the same mobile phone number within the same service area.
  • Regular expression (regex) patterns: Values must match a prescribed pattern. Example: a phone number must match ###-###-#### (no other characters allowed).
  • Cross-field validation: Certain conditions for multiple fields must be satisfied. Example: values are percentages and the values from multiple fields must add up to 100%.
  • Primary-key (databases only): Each value must be unique within its column. Example: a database table can’t have two rows with the same primary key value. A primary key is an identifier in a database that references a column in which each value is unique. More information about primary and foreign keys is provided later in the program.
  • Set-membership (databases only): Values for a column must come from a set of discrete values. Example: the value for a column must be set to Yes, No, or Not Applicable.
  • Foreign-key (databases only): Values for a column must come from the values in a column in another table. Example: in a U.S. taxpayer database, the State column must be a valid state or territory, with the set of acceptable values defined in a separate States table.
  • Accuracy: The degree to which the data conforms to the actual entity being measured or described. Example: if values for postcodes are validated by street location, the accuracy of the data goes up.
  • Completeness: The degree to which the data contains all desired components or measures. Example: if data for personal profiles requires hair and eye colour, and both are collected, the data is complete.
  • Consistency: The degree to which the data is repeatable from different points of entry or collection. Example: if a customer has the same address in the sales and repair databases, the data is consistent.
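Several of these constraints can be checked mechanically. The sketch below (a hypothetical `validate_record` helper, assuming made-up fields named `age`, `name`, `phone`, and `status`) applies a data range, a mandatory field, a regex pattern, and a set-membership check in Python:

```python
import re

def validate_record(record: dict) -> list:
    """Return a list of constraint violations for one record (empty list = valid)."""
    errors = []
    # Data range: age must fall between predefined minimum and maximum values
    if not (10 <= record.get("age", -1) <= 120):
        errors.append("age out of range")
    # Mandatory: name can't be left blank or empty
    if not record.get("name"):
        errors.append("name is mandatory")
    # Regular expression: phone must match ###-###-#### (no other characters allowed)
    if not re.fullmatch(r"\d{3}-\d{3}-\d{4}", record.get("phone", "")):
        errors.append("phone does not match pattern")
    # Set-membership: status must come from a set of discrete values
    if record.get("status") not in {"Yes", "No", "Not Applicable"}:
        errors.append("status not in allowed set")
    return errors

validate_record({"age": 30, "name": "Ada", "phone": "555-123-4567", "status": "Yes"})  # -> []
```

A record that breaks every rule, such as `{"age": 5, "name": "", "phone": "123", "status": "Maybe"}`, would come back with all four violations listed.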

Sample size

Before you dig deeper into sample size, familiarize yourself with these terms and definitions:

  • Population: The entire group that you are interested in for your study. For example, if you are surveying people in your company, the population would be all the employees in your company.
  • Sample: A subset of your population. Just like a food sample, it is called a sample because it is only a taste. So if your company is too large to survey every individual, you can survey a representative sample of your population.
  • Margin of error: Since a sample is used to represent a population, the sample’s results are expected to differ from what the result would have been if you had surveyed the entire population. This difference is called the margin of error. The smaller the margin of error, the closer the results of the sample are to what the result would have been if you had surveyed the entire population.
  • Confidence level: How confident you are in the survey results. For example, a 95% confidence level means that if you were to run the same survey 100 times, you would get similar results 95 of those 100 times. The confidence level is targeted before you start your study because it will affect how big your margin of error is at the end of your study.
  • Confidence interval: The range of possible values that the population’s result would fall within at the confidence level of the study. This range is the sample result +/- the margin of error.
  • Statistical significance: The determination of whether your result could be due to random chance or not. The greater the significance, the less likely the result is due to chance.
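The relationship between sample result, margin of error, and confidence interval fits in a couple of lines of Python; this is an illustrative sketch, working in whole percentage points:

```python
def confidence_interval(sample_result, margin_of_error):
    """Confidence interval = sample result +/- margin of error."""
    return (sample_result - margin_of_error, sample_result + margin_of_error)

# A survey finds 60% support with a 4% margin of error:
confidence_interval(60, 4)  # -> (56, 64)
```

At the study's confidence level (say, 95%), the population's result is expected to fall somewhere between 56% and 64%.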

Things to remember when determining the size of your sample

When figuring out a sample size, here are things to keep in mind:

  • Don’t use a sample size smaller than 30. A widely accepted statistical rule of thumb holds that 30 is the smallest sample size at which the average result of a sample starts to represent the average result of a population.
  • The confidence level most commonly used is 95%, but 90% can work in some cases. 

Increase the sample size to meet specific needs of your project:

  • For a higher confidence level, use a larger sample size
  • To decrease the margin of error, use a larger sample size
  • For greater statistical significance, use a larger sample size

Note: Sample size calculators use statistical formulas to determine a sample size. More about these are coming up in the course! Stay tuned.
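As a preview of what those calculators do, one common formula for large populations is Cochran's sample size formula. The sketch below reflects that general formula, not any particular calculator's internals:

```python
import math

def sample_size(confidence_z: float, margin_of_error: float, p: float = 0.5) -> int:
    """Cochran's formula for a large population: n = z^2 * p * (1 - p) / e^2.
    p = 0.5 is the most conservative (largest) estimate of the population proportion."""
    n = (confidence_z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    return math.ceil(n)  # always round up to the next whole respondent

sample_size(1.96, 0.05)  # 95% confidence, 5% margin of error -> 385
```

At a 95% confidence level (z = 1.96) and a 5% margin of error, the formula gives the familiar minimum of 385 respondents; a tighter margin of error or a higher confidence level both push the number up.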

Why a minimum sample of 30?

This recommendation is based on the Central Limit Theorem (CLT) from the field of probability and statistics. As the sample size increases, the distribution of sample averages more closely resembles the normal (bell-shaped) distribution. A sample of 30 is generally considered the smallest sample size for which the CLT still holds. Researchers who rely on regression analysis (statistical methods used to determine the relationships between controlled and dependent variables) also prefer a minimum sample of 30.
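You can watch the CLT at work with a short simulation. This illustrative Python sketch draws many samples of size 30 from a distinctly non-normal (uniform) population and checks that the sample means cluster tightly around the population mean:

```python
import random
import statistics

random.seed(42)

# A clearly non-normal population: 100,000 values uniform on [0, 100]
population = [random.uniform(0, 100) for _ in range(100_000)]

# Take many samples of size 30 and record each sample's mean
sample_means = [statistics.mean(random.sample(population, 30)) for _ in range(2_000)]

# The sample means cluster around the population mean, and their spread shrinks
# roughly like population_stdev / sqrt(30), as the CLT predicts.
pop_mean = statistics.mean(population)
mean_of_means = statistics.mean(sample_means)
print(round(pop_mean, 1), round(mean_of_means, 1))  # both close to 50
```

Plotting `sample_means` as a histogram would show the bell shape emerging, even though the underlying population is flat.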

Still curious? Without getting too much into the math, check out these articles:

  • Central Limit Theorem (CLT): This article by Investopedia explains the Central Limit Theorem and briefly describes how it can apply to an analysis of a stock index.
  • Sample Size Formula: This article by Statistics Solutions provides a little more detail about why some researchers use 30 as a minimum sample size.

Sample sizes vary by business problem

Sample size will vary based on the type of business problem you are trying to solve. 

For example, if you live in a city with a population of 200,000 and get 180,000 people to respond to a survey, that is a large sample size. But without actually doing that, what would an acceptable, smaller sample size look like? 

Would 200 be alright if the people surveyed represented every district in the city?

Answer: It depends on the stakes. 

  • A sample size of 200 might be large enough if your business problem is to find out how residents felt about the new library
  • A sample size of 200 might not be large enough if your business problem is to determine how residents would vote to fund the library

You could probably accept a larger margin of error surveying how residents feel about the new library versus surveying residents about how they would vote to fund it. For that reason, you would most likely use a larger sample size for the voter survey.

Larger sample sizes have a higher cost

You also have to weigh the cost against the benefits of more accurate results with a larger sample size. Someone who is trying to understand consumer preferences for a new line of products wouldn’t need as large a sample size as someone who is trying to understand the effects of a new drug. For drug safety, the benefits outweigh the cost of using a larger sample size. But for consumer preferences, a smaller sample size at a lower cost could provide good enough results. 

Knowing the basics is helpful

Knowing the basics will help you make the right choices when it comes to sample size. You can always raise concerns if you come across a sample size that is too small. A sample size calculator is also a great tool for this. Sample size calculators let you enter a desired confidence level and margin of error for a given population size. They then calculate the sample size needed to statistically achieve those results. 

Refer to the Determine the Best Sample Size video for a demonstration of a sample size calculator, or refer to the Sample Size Calculator reading for additional information.

Pre-cleaning steps

Step 1: Review data integrity

Before data analysts can analyze data, they first need to think about and understand the data they’re working with. Assessing data integrity is a key step in this process. As you’ve learned in previous lessons, you should complete the following tasks before analyzing data: 

1. Determine data integrity by assessing the overall accuracy, consistency, and completeness of the data.

2. Connect objectives to data by understanding how your business objectives can be served by an investigation into the data.

3. Know when to stop collecting data.

Data analysts perform pre-cleaning activities to complete these steps. Pre-cleaning activities help you determine and maintain data integrity, which is essential to the role of a junior data analyst.

Step 2: Identify what makes data insufficient

One of the objectives of pre-cleaning activities is to address insufficient data. Recall from previous lessons that data can be insufficient for a number of reasons. Insufficient data has one or more of the following problems:

  • Comes from only one source
  • Continuously updates and is incomplete
  • Is outdated
  • Is geographically limited

Step 3: Deal with insufficient data

To deal with insufficient data, you can:

  • Identify trends within the available data.
  • Wait for more data if time allows.
  • Discuss with stakeholders and adjust your objective.
  • Search for a new dataset.

Good Read: A Gentle Introduction to Statistical Power and Power Analysis in Python

When data isn’t readily available

Earlier, you learned how you can still do an analysis using proxy data if you have no data. You might have some questions about proxy data, so this reading will give you a few more examples of the types of datasets that can serve as alternate data sources.

Proxy data examples

Sometimes the data to support a business objective isn’t readily available. This is when proxy data is useful. Take a look at the following scenarios and where proxy data comes in for each example:

  • A new car model was just launched a few days ago and the auto dealership can’t wait until the end of the month for sales data to come in. They want sales projections now. Proxy: the analyst uses the number of clicks on the car specifications on the dealership’s website as an estimate of potential sales at the dealership.
  • A brand new plant-based meat product was only recently stocked in grocery stores and the supplier needs to estimate the demand over the next four years. Proxy: the analyst uses the sales data for a turkey substitute made out of tofu that has been on the market for several years.
  • The Chamber of Commerce wants to know how a tourism campaign will impact travel to their city, but the results from the campaign aren’t publicly available yet. Proxy: the analyst uses the historical data for airline bookings to the city one to three months after a similar campaign was run six months earlier.

Open (public) datasets

If you are part of a large organisation, you might have access to lots of sources of data. But if you are looking for something specific or a little outside your line of business, you can also make use of open or public datasets. (You can refer to this Medium article for a brief explanation of the difference between open and public data.)

Here’s an example. A nasal version of a vaccine was recently made available. A clinic wants to know what to expect for contraindications, but just started collecting first-party data from its patients. A contraindication is a condition that may cause a patient not to take a vaccine due to the harm it would cause them if taken. To estimate the number of possible contraindications, a data analyst proxies an open dataset from a trial of the injection version of the vaccine. The analyst selects a subset of the data with patient profiles most closely matching the makeup of the patients at the clinic. 

There are plenty of ways to share and collaborate on data within a community. Kaggle (kaggle.com), which we previously introduced, has datasets in a variety of formats, including the most basic type: comma-separated values (CSV) files.

CSV, JSON, SQLite, and BigQuery datasets

  • CSV: Check out this Credit card customers dataset, which has information from 10,000 customers including age, salary, marital status, credit card limit, credit card category, etc. (CC0: Public Domain, Sakshi Goyal).
  • JSON: Check out this JSON dataset for trending YouTube videos (CC0: Public Domain, Mitchell J).
  • SQLite: Check out this SQLite dataset for 24 years’ worth of U.S. wildfire data (CC0: Public Domain, Rachael Tatman).
  • BigQuery: Check out this Google Analytics 360 sample dataset from the Google Merchandise Store (CC0 Public Domain, Google BigQuery).

Refer to the Kaggle documentation for datasets for more information and search for and explore datasets on your own at kaggle.com/datasets.

As with all other kinds of datasets, be on the lookout for duplicate data and Null values in open datasets. Null most often means that a data field was unassigned (left empty), but sometimes Null can be interpreted as the value 0. It is important to understand how Null was used before you start analysing a dataset that contains Null data.
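The Null-versus-zero distinction is easy to see when reading a CSV by hand. In this illustrative Python sketch (the field names are made up), an unassigned field arrives as an empty string, which is not the same thing as a recorded 0:

```python
import csv
import io

# A CSV where the "donations" field is sometimes empty (unassigned) and sometimes 0
raw = "name,donations\nAlice,5\nBob,\nCarol,0\n"

rows = list(csv.DictReader(io.StringIO(raw)))

for row in rows:
    value = row["donations"]
    if value == "":
        print(row["name"], "-> null: no value was ever recorded")
    else:
        print(row["name"], "->", int(value), "recorded donations")
```

Treating Bob's empty field as 0 would wrongly equate "never asked" with "asked and gave nothing," which is exactly the kind of interpretation you need to settle before analysis.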

Sample size calculator

In this section, you will learn the basics of sample size calculators, how to use them, and how to understand the results. A sample size calculator tells you how many people you need to interview (or things you need to test) to get results that represent the target population. Let’s review some terms you will come across when using a sample size calculator:

  • Confidence level: The probability that your sample size accurately reflects the greater population.
  • Margin of error: The maximum amount that the sample results are expected to differ from those of the actual population.
  • Population: This is the total number you hope to pull your sample from.
  • Sample: A part of a population that is representative of the population.
  • Estimated response rate: If you are running a survey of individuals, this is the percentage of people you expect will complete your survey out of those who received the survey.

How to use a sample size calculator

In order to use a sample size calculator, you need to have the population size, confidence level, and the acceptable margin of error already decided so you can input them into the tool. If this information is ready to go, check out these sample size calculators below:

What to do with the results

After you have plugged your information into one of these calculators, it will give you a recommended sample size. Keep in mind, the calculated sample size is the minimum number to achieve what you input for confidence level and margin of error. If you are working with a survey, you will also need to think about the estimated response rate to figure out how many surveys you will need to send out. For example, if you need a sample size of 100 individuals and your estimated response rate is 10%, you will need to send your survey to 1,000 individuals to get the 100 responses you need for your analysis.
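The response-rate adjustment from the example is a one-line calculation; here it is as a small Python sketch (the helper name is hypothetical):

```python
import math

def surveys_to_send(required_sample: int, estimated_response_rate: float) -> int:
    """Number of surveys to send so the expected responses meet the required sample size."""
    return math.ceil(required_sample / estimated_response_rate)

surveys_to_send(100, 0.10)  # -> 1000
```

Rounding up matters: with a required sample of 100 and a 15% response rate, the exact quotient is 666.7, so you would send 667 surveys, not 666.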

Now that you have the basics, try some calculations using the sample size calculators and refer back to this reading if you need a refresher on the definitions. 

All about margin of error

Margin of error is the maximum amount that the sample results are expected to differ from those of the actual population. More technically, the margin of error defines a range of values below and above the average result for the sample. The average result for the entire population is expected to be within that range. We can better understand margin of error by using some examples below.

Margin of error in baseball

Imagine you are playing baseball and you are up at bat. The crowd is roaring, and you are getting ready to try to hit the ball. The pitcher delivers a fastball traveling about 90 to 95 mph, which takes about 400 milliseconds (ms) to reach the catcher’s glove. You swing and miss the first pitch because your timing was a little off. You wonder whether you should have swung slightly earlier or slightly later to hit a home run. That timing difference can be considered the margin of error: it tells you how close to, or how far from, the average home-run swing your timing was.

Margin of error in marketing

The margin of error is also important in marketing. Let’s use A/B testing as an example. A/B testing (or split testing) tests two variations of the same web page to determine which page is more successful in attracting user traffic and generating revenue. The percentage of user traffic that gets monetized is known as the conversion rate. A/B testing allows marketers to test emails, ads, and landing pages to find the data behind what is working and what isn’t. Marketers use the confidence interval (determined by the conversion rate and the margin of error) to understand the results.

For example, suppose you are conducting an A/B test to compare the effectiveness of two different email subject lines to entice people to open the email. You find that subject line A: “Special offer just for you” resulted in a 5% open rate compared to subject line B: “Don’t miss this opportunity” at 3%. 

Does that mean subject line A is better than subject line B? It depends on your margin of error. If the margin of error is 2%, then subject line A’s actual open rate (its confidence interval) is somewhere between 3% and 7%. Since the lower end of that interval overlaps with subject line B’s result of 3%, you can’t conclude that there is a statistically significant difference between subject lines A and B. Examining the margin of error is important when drawing conclusions from your test results.
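The overlap check described above can be written out directly. This illustrative Python sketch works in whole percentage points and applies the same margin of error to both variants, as in the example:

```python
def intervals_overlap(rate_a, rate_b, margin_of_error):
    """Check whether two rates are distinguishable given a shared margin of error."""
    low_a, high_a = rate_a - margin_of_error, rate_a + margin_of_error
    low_b, high_b = rate_b - margin_of_error, rate_b + margin_of_error
    return low_a <= high_b and low_b <= high_a  # True -> overlap, result inconclusive

# Subject line A: 5% open rate; subject line B: 3%; margin of error 2%
intervals_overlap(5, 3, 2)    # -> True: can't call A the winner
intervals_overlap(5, 3, 0.5)  # -> False: the difference holds up at this margin
```

With a 2% margin of error, A's interval [3, 7] touches B's interval [1, 5], so the test is inconclusive; shrinking the margin of error (for example, by collecting a larger sample) separates the intervals.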

Want to calculate your margin of error? 

All you need is population size, confidence level, and sample size. In order to better understand this calculator, review these terms:

  • Confidence level: A percentage indicating how likely your sample accurately reflects the greater population 
  • Population: The total number you pull your sample from
  • Sample: A part of a population that is representative of the population
  • Margin of error: The maximum amount that the sample results are expected to differ from those of the actual population

In most cases, a 90% or 95% confidence level is used. But, depending on your industry, you might want to set a stricter confidence level. A 99% confidence level is reasonable in some industries, such as the pharmaceutical industry. 
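For the curious, a common formula behind such calculators is the margin of error for a proportion, optionally adjusted with a finite population correction when the population is small. This is a sketch of typical calculator math, not any specific tool's implementation:

```python
import math

def margin_of_error(z, n, p=0.5, population=None):
    """Margin of error for a proportion: z * sqrt(p * (1 - p) / n),
    with an optional finite population correction for small populations."""
    moe = z * math.sqrt(p * (1 - p) / n)
    if population is not None:
        moe *= math.sqrt((population - n) / (population - 1))  # finite population correction
    return moe

# 95% confidence (z = 1.96), sample of 385 from a very large population:
round(margin_of_error(1.96, 385), 3)  # -> 0.05, i.e. a 5% margin of error
```

Note that the correction only shrinks the margin of error: the same sample of 385 drawn from a population of 2,000 yields a smaller margin of error than one drawn from an effectively infinite population.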

After you have settled on your population size, sample size, and confidence level, plug the information into a margin of error calculator like the ones below: 

Glossary terms from module 1

Terms and definitions for Course 4, Module 1

Accuracy: The degree to which the data conforms to the actual entity being measured or described

Completeness: The degree to which the data contains all desired components or measures

Confidence interval: A range of values that conveys how likely it is that a statistical estimate reflects the population

Confidence level: The probability that a sample size accurately reflects the greater population

Consistency: The degree to which data is repeatable from different points of entry or collection

Cross-field validation: A process that ensures certain conditions for multiple data fields are satisfied

Data constraints: The criteria that determine whether a piece of a data is clean and valid

Data integrity: The accuracy, completeness, consistency, and trustworthiness of data throughout its life cycle

Data manipulation: The process of changing data to make it more organized and easier to read

Data range: Numerical values that fall between predefined maximum and minimum values

Data replication: The process of storing data in multiple locations

DATEDIF: A spreadsheet function that calculates the number of days, months, or years between two dates

Estimated response rate: The percentage of people expected to complete a survey out of those who receive it

Hypothesis testing: A process to determine if a survey or experiment has meaningful results

Mandatory: A data value that cannot be left blank or empty

Margin of error: The maximum amount that the sample results are expected to differ from those of the actual population

Random sampling: A way of selecting a sample from a population so that every possible type of the sample has an equal chance of being chosen

Regular expression (RegEx): A rule that says the values in a table must match a prescribed pattern
