- Foundation Course Module 1: Introduction to data analytics and analytical thinking
- Foundation Course Module 2: The wonderful world of data
- Foundation Course Module 3: Set up your data analytics toolbox
- Foundation Course Module 4: Becoming a fair and impactful data professional
- Foundation Course: Glossary
- Course 2: Ask Questions to Make Data-Driven Decisions, Module 1: Ask effective questions
- Course 2: Ask Questions to Make Data-Driven Decisions, Module 2: Make data-driven decisions
- Course 2: Ask Questions to Make Data-Driven Decisions, Module 3: Spreadsheet magic
- Course 2: Ask Questions to Make Data-Driven Decisions, Module 4: Always remember the stakeholder
- Course 3: Prepare Data For Exploration: Learning objectives and overviews
- Course 3: Prepare Data For Exploration, Module 1: Data types and structures
- Course 3: Prepare Data For Exploration, Module 2: Data responsibility
- Course 3: Prepare Data For Exploration, Module 3: Database Essentials
- Course 3: Prepare Data For Exploration, Module 4: Organise and Secure Data
- Course 4: Process Data from Dirty to Clean: Overview
- Course 4: Process Data from Dirty to Clean, Module 1: The importance of integrity
- Course 4: Process Data from Dirty to Clean, Module 2: Clean it up
- Course 4: Process Data from Dirty to Clean, Module 3: SQL
- Course 4: Process Data from Dirty to Clean, Module 4: Verify and Report Results
- Course 5: Analyse Data to Answer Questions, Module 1: Organise data for more effective analysis
- Course 5: Analyse Data to Answer Questions, Module 2: Format and adjust data
- Course 5: Analyse Data to Answer Questions, Module 3: Aggregate data for analysis
- Course 5: Analyse Data to Answer Questions, Module 4: Perform Data Calculations
- Course 6: Share Data Through the Art of Visualisation, Course Overview plus Module 1: Visualise Data
- Course 6: Share Data Through the Art of Visualisation, Module 2: Create Data Visualisation with Tableau
Scenario: calendar dates for a global company
Calendar dates are represented in a lot of different short forms. Depending on where you live, a different format might be used.
- In some countries, 12/10/20 (DD/MM/YY) stands for October 12, 2020.
- In other countries, the national standard is YYYY-MM-DD so October 12, 2020 becomes 2020-10-12.
- In the United States, (MM/DD/YY) is the accepted format, so October 12, 2020 is going to be 10/12/20.
Now, think about what would happen if you were working as a data analyst for a global company and didn’t check date formats. Well, your data integrity would probably be questionable. Any analysis of the data would be inaccurate. Imagine ordering extra inventory for December when it was actually needed in October!
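To make the ambiguity concrete, here is a minimal Python sketch (using only the standard library; the string and formats are just illustrative) showing how the same short form resolves to two different dates depending on the format you assume:

```python
from datetime import datetime

raw = "12/10/20"  # the ambiguous short form from the scenario above

# The same string parses to two different dates depending on the format code.
as_ddmmyy = datetime.strptime(raw, "%d/%m/%y")  # read as DD/MM/YY
as_mmddyy = datetime.strptime(raw, "%m/%d/%y")  # read as MM/DD/YY

print(as_ddmmyy.date())  # 2020-10-12 (October 12, 2020)
print(as_mmddyy.date())  # 2020-12-10 (December 10, 2020)

# Storing dates in the ISO 8601 format (YYYY-MM-DD) removes the ambiguity.
print(as_ddmmyy.strftime("%Y-%m-%d"))  # 2020-10-12
```

Note that the October/December mix-up here is exactly the inventory mistake described above.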
Here are some other things to watch out for:
- Data replication compromising data integrity: Continuing with the example, imagine you ask your international counterparts to verify dates and stick to one format. One analyst copies a large dataset to check the dates. But because of memory issues, only part of the dataset is actually copied. The analyst would be verifying and standardizing incomplete data. That partial dataset would be certified as compliant but the full dataset would still contain dates that weren’t verified. Two versions of a dataset can introduce inconsistent results. A final audit of results would be essential to reveal what happened and correct all dates.
- Data transfer compromising data integrity: Another analyst checks the dates in a spreadsheet and chooses to import the validated and standardized data back to the database. But suppose the date field from the spreadsheet was incorrectly classified as a text field during the data import (transfer) process. Now some of the dates in the database are stored as text strings. At this point, the data needs to be cleaned to restore its integrity.
- Data manipulation compromising data integrity: When checking dates, another analyst notices what appears to be a duplicate record in the database and removes it. But it turns out that the analyst removed a unique record for a company’s subsidiary and not a duplicate record for the company. Your dataset is now missing data and the data must be restored for completeness.
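As a rough illustration of the transfer and manipulation problems above, here is a small pandas sketch (the company names and records are hypothetical) that converts text-typed dates back to real dates and inspects apparent duplicates before dropping anything:

```python
import pandas as pd

# Hypothetical records: 'order_date' arrived as text after a bad import, and
# one apparent "duplicate" is really a distinct subsidiary record.
df = pd.DataFrame({
    "company": ["Acme Corp", "Acme Corp (Subsidiary)", "Acme Corp"],
    "order_date": ["2020-10-12", "2020-10-12", "2020-10-12"],
})

# Restore the date type; errors="coerce" marks unparseable values as NaT
# instead of failing silently, so they can be reviewed.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Inspect apparent duplicates before removing anything. Only rows identical
# across every column are flagged; the subsidiary row is left alone.
print(df[df.duplicated(keep=False)])

df = df.drop_duplicates()  # removes exact duplicates only
```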
Reference: Data constraints and examples
As you progress in your data journey, you’ll come across many types of data constraints (or criteria that determine validity). The table below offers definitions and examples of data constraint terms you might come across.
Data constraint | Definition | Examples |
---|---|---|
Data type | Values must be of a certain type: date, number, percentage, Boolean, etc. | If the data type is a date, a single number like 30 would fail the constraint and be invalid |
Data range | Values must fall between predefined maximum and minimum values | If the data range is 10-20, a value of 30 would fail the constraint and be invalid |
Mandatory | Values can’t be left blank or empty | If age is mandatory, that value must be filled in |
Unique | Values can’t have a duplicate | Two people can’t have the same mobile phone number within the same service area |
Regular expression (regex) patterns | Values must match a prescribed pattern | A phone number must match ###-###-#### (no other characters allowed) |
Cross-field validation | Certain conditions for multiple fields must be satisfied | Values are percentages and values from multiple fields must add up to 100% |
Primary-key | (Databases only) value must be unique per column | A database table can’t have two rows with the same primary key value. A primary key is an identifier in a database that references a column in which each value is unique. More information about primary and foreign keys is provided later in the program. |
Set-membership | (Databases only) values for a column must come from a set of discrete values | Value for a column must be set to Yes, No, or Not Applicable |
Foreign-key | (Databases only) values for a column must be unique values coming from a column in another table | In a U.S. taxpayer database, the State column must be a valid state or territory with the set of acceptable values defined in a separate States table |
Accuracy | The degree to which the data conforms to the actual entity being measured or described | If values for postcodes are validated by street location, the accuracy of the data goes up. |
Completeness | The degree to which the data contains all desired components or measures | If data for personal profiles required hair and eye colour, and both are collected, the data is complete. |
Consistency | The degree to which the data is repeatable from different points of entry or collection | If a customer has the same address in the sales and repair databases, the data is consistent. |
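Many of these constraints can be checked programmatically. Below is a minimal Python sketch (the record and field names are made up for illustration) that tests a few of the constraints from the table:

```python
import re

# A made-up record to validate against a few of the constraints above.
record = {"age": 34, "score": 15, "phone": "415-555-0123", "status": "Yes"}

checks = {
    # Data type: age must be a whole number
    "age is an integer": isinstance(record["age"], int),
    # Data range: score must fall between 10 and 20
    "score in range 10-20": 10 <= record["score"] <= 20,
    # Mandatory: phone can't be left blank or empty
    "phone is not blank": bool(record["phone"]),
    # Regular expression: phone must match ###-###-####
    "phone matches pattern": bool(re.fullmatch(r"\d{3}-\d{3}-\d{4}", record["phone"])),
    # Set membership: status must come from a set of discrete values
    "status in allowed set": record["status"] in {"Yes", "No", "Not Applicable"},
}

for name, passed in checks.items():
    print(f"{name}: {'valid' if passed else 'INVALID'}")
```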
Sample size
Before you dig deeper into sample size, familiarize yourself with these terms and definitions:
Terminology | Definitions |
---|---|
Population | The entire group that you are interested in for your study. For example, if you are surveying people in your company, the population would be all the employees in your company. |
Sample | A subset of your population. Just like a food sample, it is called a sample because it is only a taste. So if your company is too large to survey every individual, you can survey a representative sample of your population. |
Margin of error | Since a sample is used to represent a population, the sample’s results are expected to differ from what the result would have been if you had surveyed the entire population. This difference is called the margin of error. The smaller the margin of error, the closer the results of the sample are to what the result would have been if you had surveyed the entire population. |
Confidence level | How confident you are in the survey results. For example, a 95% confidence level means that if you were to run the same survey 100 times, you would get similar results 95 of those 100 times. Confidence level is targeted before you start your study because it will affect how big your margin of error is at the end of your study. |
Confidence interval | The range of possible values that the population’s result would be at the confidence level of the study. This range is the sample result +/- the margin of error. |
Statistical significance | The determination of whether your result could be due to random chance or not. The greater the significance, the less likely it is that the result is due to chance. |
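To see how margin of error and confidence interval fit together numerically, here is a minimal sketch for a sample mean, assuming a 95% confidence level and hypothetical survey numbers:

```python
import math

# Hypothetical survey: average satisfaction score from a sample of employees.
sample_mean = 7.2   # average score in the sample
sample_std = 1.5    # standard deviation of the sample
n = 200             # sample size
z = 1.96            # z-score for a 95% confidence level

# Margin of error for a sample mean, and the resulting confidence interval.
margin_of_error = z * sample_std / math.sqrt(n)
print(f"margin of error: ±{margin_of_error:.2f}")
print(f"95% confidence interval: "
      f"({sample_mean - margin_of_error:.2f}, {sample_mean + margin_of_error:.2f})")
```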
Things to remember when determining the size of your sample
When figuring out a sample size, here are things to keep in mind:
- Don’t use a sample size less than 30. A widely used rule of thumb, grounded in the Central Limit Theorem, holds that 30 is the smallest sample size at which the average result of a sample starts to represent the average result of a population.
- The confidence level most commonly used is 95%, but 90% can work in some cases.
Increase the sample size to meet specific needs of your project:
- For a higher confidence level, use a larger sample size
- To decrease the margin of error, use a larger sample size
- For greater statistical significance, use a larger sample size
Note: Sample size calculators use statistical formulas to determine a sample size. More about these are coming up in the course! Stay tuned.
Why a minimum sample of 30?
This recommendation is based on the Central Limit Theorem (CLT) from the field of probability and statistics. The CLT says that as sample size increases, the distribution of averages from a large number of samples more closely resembles the normal (bell-shaped) distribution, regardless of the shape of the underlying population. A sample of 30 is commonly cited as the smallest sample size for which the CLT holds well enough in practice. Researchers who rely on regression analysis – statistical methods to determine the relationships between independent and dependent variables – also prefer a minimum sample of 30.
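If you want to see the CLT in action, the following sketch simulates it: it repeatedly draws samples from a deliberately non-normal (exponential) distribution and shows the sample means tightening around the true mean as the sample size grows. The sample sizes and repetition counts are arbitrary choices:

```python
import random
import statistics

random.seed(42)

def sample_mean(n):
    # Average of n draws from an exponential distribution (true mean = 1.0),
    # which is strongly skewed and clearly not bell-shaped.
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

for n in (2, 10, 30, 100):
    means = [sample_mean(n) for _ in range(2000)]
    # As n grows, the sample means cluster more tightly (the spread shrinks
    # roughly by 1/sqrt(n)) and their distribution looks increasingly normal.
    print(f"n={n:>3}: mean of sample means={statistics.mean(means):.3f}, "
          f"spread={statistics.stdev(means):.3f}")
```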
Still curious? Without getting too much into the math, check out these articles:
- Central Limit Theorem (CLT): This article by Investopedia explains the Central Limit Theorem and briefly describes how it can apply to an analysis of a stock index.
- Sample Size Formula: This article by Statistics Solutions provides a little more detail about why some researchers use 30 as a minimum sample size.
Sample sizes vary by business problem
Sample size will vary based on the type of business problem you are trying to solve.
For example, if you live in a city with a population of 200,000 and get 180,000 people to respond to a survey, that is a large sample size. But without actually doing that, what would an acceptable, smaller sample size look like?
Would 200 be alright if the people surveyed represented every district in the city?
Answer: It depends on the stakes.
- A sample size of 200 might be large enough if your business problem is to find out how residents felt about the new library
- A sample size of 200 might not be large enough if your business problem is to determine how residents would vote to fund the library
You could probably accept a larger margin of error surveying how residents feel about the new library versus surveying residents about how they would vote to fund it. For that reason, you would most likely use a larger sample size for the voter survey.
Larger sample sizes have a higher cost
You also have to weigh the cost against the benefits of more accurate results with a larger sample size. Someone who is trying to understand consumer preferences for a new line of products wouldn’t need as large a sample size as someone who is trying to understand the effects of a new drug. For drug safety, the benefits outweigh the cost of using a larger sample size. But for consumer preferences, a smaller sample size at a lower cost could provide good enough results.
Knowing the basics is helpful
Knowing the basics will help you make the right choices when it comes to sample size. You can always raise concerns if you come across a sample size that is too small. A sample size calculator is also a great tool for this. Sample size calculators let you enter a desired confidence level and margin of error for a given population size. They then calculate the sample size needed to statistically achieve those results.
Refer to the Determine the Best Sample Size video for a demonstration of a sample size calculator, or refer to the Sample Size Calculator reading for additional information.
Pre-cleaning steps
Step 1: Review Data Integrity
Before data analysts can analyse data, they first need to think about and understand the data they’re working with. Assessing data integrity is a key step in this process. As you’ve learned in previous lessons, you should complete the following tasks before analysing data:
1. Determine data integrity by assessing the overall accuracy, consistency, and completeness of the data.
2. Connect objectives to data by understanding how your business objectives can be served by an investigation into the data.
3. Know when to stop collecting data.
Data analysts perform pre-cleaning activities to complete these steps. Pre-cleaning activities help you determine and maintain data integrity, which is essential to the role of a junior data analyst.
Step 2: Identify what makes data insufficient
One of the objectives of pre-cleaning activities is to address insufficient data. Recall from previous lessons that data can be insufficient for a number of reasons. Insufficient data has one or more of the following problems:
- Comes from only one source
- Continuously updates and is incomplete
- Is outdated
- Is geographically limited
Step 3: Deal with insufficient data
To deal with insufficient data, you can:
- Identify trends within the available data.
- Wait for more data if time allows.
- Discuss with stakeholders and adjust your objective.
- Search for a new dataset.
Good Read: A Gentle Introduction to Statistical Power and Power Analysis in Python
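In the spirit of that article, here is a minimal power-analysis sketch using the statsmodels library (assuming a two-sample t-test; the effect size, significance level, and power targets below are conventional defaults, not values from this course):

```python
# Requires the statsmodels package: pip install statsmodels
from statsmodels.stats.power import TTestIndPower

# Solve for the sample size per group needed to detect a medium effect
# (Cohen's d = 0.5) with 80% power at a 5% significance level.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"required sample size per group: {n_per_group:.0f}")  # about 64
```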
When data isn’t readily available
Earlier, you learned how you can still do an analysis using proxy data if you have no data. You might have some questions about proxy data, so this reading will give you a few more examples of the types of datasets that can serve as alternate data sources.
Proxy data examples
Sometimes the data to support a business objective isn’t readily available. This is when proxy data is useful. Take a look at the following scenarios and where proxy data comes in for each example:
Business scenario | How proxy data can be used |
---|---|
A new car model was just launched a few days ago and the auto dealership can’t wait until the end of the month for sales data to come in. They want sales projections now. | The analyst proxies the number of clicks to the car specifications on the dealership’s website as an estimate of potential sales at the dealership. |
A brand new plant-based meat product was only recently stocked in grocery stores and the supplier needs to estimate the demand over the next four years. | The analyst proxies the sales data for a turkey substitute made out of tofu that has been on the market for several years. |
The Chamber of Commerce wants to know how a tourism campaign is going to impact travel to their city, but the results from the campaign aren’t publicly available yet. | The analyst proxies the historical data for airline bookings to the city one to three months after a similar campaign was run six months earlier. |
Open (public) datasets
If you are part of a large organisation, you might have access to lots of sources of data. But if you are looking for something specific or a little outside your line of business, you can also make use of open or public datasets. (You can refer to this Medium article for a brief explanation of the difference between open and public data.)
Here’s an example. A nasal version of a vaccine was recently made available. A clinic wants to know what to expect for contraindications, but it has only just started collecting first-party data from its patients. A contraindication is a condition that may make a vaccine inadvisable for a particular patient because of the harm it could cause them. To estimate the number of possible contraindications, a data analyst proxies an open dataset from a trial of the injection version of the vaccine. The analyst selects a subset of the data with patient profiles most closely matching the makeup of the patients at the clinic.
There are plenty of ways to share and collaborate on data within a community. Kaggle (kaggle.com), which we previously introduced, has datasets in a variety of formats, including the most basic type, Comma-Separated Values (CSV) files.
CSV, JSON, SQLite, and BigQuery datasets
- CSV: Check out this Credit card customers dataset, which has information from 10,000 customers including age, salary, marital status, credit card limit, credit card category, etc. (CC0: Public Domain, Sakshi Goyal).
- JSON: Check out this JSON dataset for trending YouTube videos (CC0: Public Domain, Mitchell J).
- SQLite: Check out this SQLite dataset for 24 years’ worth of U.S. wildfire data (CC0: Public Domain, Rachael Tatman).
- BigQuery: Check out this Google Analytics 360 sample dataset from the Google Merchandise Store (CC0 Public Domain, Google BigQuery).
Refer to the Kaggle documentation for datasets for more information and search for and explore datasets on your own at kaggle.com/datasets.
As with all other kinds of datasets, be on the lookout for duplicate data and ‘Null’ values in open datasets. Null most often means that a data field was unassigned (left empty), but sometimes Null is interpreted as the value 0. It is important to understand how Null was used before you start analysing a dataset that contains Null data.
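For example, here is a small pandas sketch (with made-up sales figures) showing why the distinction between Null and 0 matters for an analysis:

```python
import pandas as pd

# Made-up sales column where some entries were left unassigned (Null/NaN).
sales = pd.Series([120.0, None, 0.0, 85.0, None])

print(sales.isna().sum())  # 2 -> genuinely missing values
print((sales == 0).sum())  # 1 -> an actual recorded value of zero

# The two interpretations give very different answers, so confirm which one
# matches how Null was used before analysing the data.
print(sales.dropna().mean())   # ignore missing values: (120 + 0 + 85) / 3 ≈ 68.3
print(sales.fillna(0).mean())  # treat missing as 0:    205 / 5 = 41.0
```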
Sample size calculator
In this section, you will learn the basics of sample size calculators, how to use them, and how to understand the results. A sample size calculator tells you how many people you need to interview (or things you need to test) to get results that represent the target population. Let’s review some terms you will come across when using a sample size calculator:
- Confidence level: The probability that your sample size accurately reflects the greater population.
- Margin of error: The maximum amount that the sample results are expected to differ from those of the actual population.
- Population: This is the total number you hope to pull your sample from.
- Sample: A part of a population that is representative of the population.
- Estimated response rate: If you are running a survey of individuals, this is the percentage of people you expect will complete your survey out of those who received the survey.
How to use a sample size calculator
In order to use a sample size calculator, you need to have the population size, confidence level, and the acceptable margin of error already decided so you can input them into the tool. Once that information is ready, plug it into one of the many free sample size calculators available online.
What to do with the results
After you have plugged your information into one of these calculators, it will give you a recommended sample size. Keep in mind, the calculated sample size is the minimum number to achieve what you input for confidence level and margin of error. If you are working with a survey, you will also need to think about the estimated response rate to figure out how many surveys you will need to send out. For example, if you need a sample size of 100 individuals and your estimated response rate is 10%, you will need to send your survey to 1,000 individuals to get the 100 responses you need for your analysis.
Now that you have the basics, try some calculations using the sample size calculators and refer back to this reading if you need a refresher on the definitions.
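If you are curious what a sample size calculator is doing under the hood, here is a minimal sketch of one common approach, Cochran’s formula for a proportion with a finite-population correction, plus the response rate adjustment described above (a 95% confidence level and ±5% margin of error are assumed):

```python
import math

def sample_size(population, z=1.96, margin_of_error=0.05, p=0.5):
    # Cochran's formula for a proportion, then a finite-population correction.
    # p = 0.5 is the most conservative (largest sample) assumption.
    n0 = (z**2 * p * (1 - p)) / margin_of_error**2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

n = sample_size(population=200_000)  # 95% confidence, ±5% margin of error
print(f"minimum sample size: {n}")   # about 384

# Adjust for an estimated 10% response rate, as described above.
response_rate = 0.10
print(f"surveys to send out: {math.ceil(n / response_rate)}")  # about 3,840
```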
All about margin of error
Margin of error is the maximum amount that the sample results are expected to differ from those of the actual population. More technically, the margin of error defines a range of values below and above the average result for the sample. The average result for the entire population is expected to be within that range. We can better understand margin of error by using some examples below.
Margin of error in baseball
Imagine you are playing baseball and that you are up at bat. The crowd is roaring, and you are getting ready to try to hit the ball. The pitcher delivers a fastball traveling about 90-95 mph, which takes about 400 milliseconds (ms) to reach the catcher’s glove. You swing and miss the first pitch because your timing was a little off. You wonder if you should have swung slightly earlier or slightly later to hit a home run. That time difference can be considered the margin of error, and it tells us how close or far your timing was from the average home run swing.
Margin of error in marketing
The margin of error is also important in marketing. Let’s use A/B testing as an example. A/B testing (or split testing) tests two variations of the same web page to determine which page is more successful in attracting user traffic and generating revenue. The percentage of user traffic that gets monetized is known as the conversion rate. A/B testing allows marketers to test emails, ads, and landing pages to find the data behind what is working and what isn’t working. Marketers use the confidence interval (determined by the conversion rate and the margin of error) to understand the results.
For example, suppose you are conducting an A/B test to compare the effectiveness of two different email subject lines to entice people to open the email. You find that subject line A: “Special offer just for you” resulted in a 5% open rate compared to subject line B: “Don’t miss this opportunity” at 3%.
Does that mean subject line A is better than subject line B? It depends on your margin of error. If the margin of error was 2%, then subject line A’s actual open rate or confidence interval is somewhere between 3% and 7%. Since the lower end of the interval overlaps with subject line B’s results at 3%, you can’t conclude that there is a statistically significant difference between subject line A and B. Examining the margin of error is important when making conclusions based on your test results.
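Here is a minimal sketch of the underlying calculation for a sample proportion (the number of emails per variant is a made-up assumption; the open rates come from the example above):

```python
import math

def proportion_margin_of_error(p, n, z=1.96):
    # Standard margin of error for a sample proportion at ~95% confidence.
    return z * math.sqrt(p * (1 - p) / n)

n = 1000  # hypothetical number of emails sent per subject line
for label, open_rate in [("A", 0.05), ("B", 0.03)]:
    moe = proportion_margin_of_error(open_rate, n)
    print(f"subject line {label}: {open_rate:.0%} ± {moe:.1%} -> "
          f"interval ({open_rate - moe:.1%}, {open_rate + moe:.1%})")
```

With these assumptions, the two intervals overlap, matching the conclusion above that the difference between the subject lines may not be statistically significant.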
Want to calculate your margin of error?
All you need is population size, confidence level, and sample size. To better understand the calculators linked below, review these terms:
- Confidence level: A percentage indicating how likely your sample accurately reflects the greater population
- Population: The total number you pull your sample from
- Sample: A part of a population that is representative of the population
- Margin of error: The maximum amount that the sample results are expected to differ from those of the actual population
In most cases, a 90% or 95% confidence level is used. But, depending on your industry, you might want to set a stricter confidence level. A 99% confidence level is reasonable in some industries, such as the pharmaceutical industry.
After you have settled on your population size, sample size, and confidence level, plug the information into a margin of error calculator like the ones below:
- Margin of error calculator by Good Calculators (free online calculators)
- Margin of error calculator by CheckMarket
Glossary terms from module 1
Terms and definitions for Course 4, Module 1
Accuracy: The degree to which the data conforms to the actual entity being measured or described
Completeness: The degree to which the data contains all desired components or measures
Confidence interval: A range of values that conveys how likely it is that a statistical estimate reflects the population
Confidence level: The probability that a sample size accurately reflects the greater population
Consistency: The degree to which data is repeatable from different points of entry or collection
Cross-field validation: A process that ensures certain conditions for multiple data fields are satisfied
Data constraints: The criteria that determine whether a piece of data is clean and valid
Data integrity: The accuracy, completeness, consistency, and trustworthiness of data throughout its life cycle
Data manipulation: The process of changing data to make it more organized and easier to read
Data range: Numerical values that fall between predefined maximum and minimum values
Data replication: The process of storing data in multiple locations
DATEDIF: A spreadsheet function that calculates the number of days, months, or years between two dates
Estimated response rate: The percentage of people who are expected to complete a survey out of those who receive it
Hypothesis testing: A process to determine if a survey or experiment has meaningful results
Mandatory: A data value that cannot be left blank or empty
Margin of error: The maximum amount that the sample results are expected to differ from those of the actual population
Random sampling: A way of selecting a sample from a population so that every possible type of the sample has an equal chance of being chosen
Regular expression (RegEx): A rule that says the values in a table must match a prescribed pattern