Data Responsibility Rundown
Key Learnings:
- Data Bias: how to identify different types of bias in data, including selection bias, measurement bias, and reporting bias. How to mitigate the effects of bias on your data analysis.
- Data Credibility: how to assess the credibility of data sources and how to verify the accuracy of data. The importance of data ethics and data privacy.
- Open Data: the concept of open data and how it can be used to improve data analysis. Benefits and challenges of using open data.
- Data Ethics and Privacy: the ethical considerations involved in data collection, use, and sharing. The importance of protecting data privacy.
Specific Topics Covered:
- Types of Bias: Selection bias, measurement bias, reporting bias, confirmation bias, and observer bias.
- Mitigating Bias: Techniques for identifying and mitigating bias in data analysis.
- Data Credibility: Assessing the credibility of data sources, verifying data accuracy, and ensuring data integrity.
- Open Data: Definition of open data, benefits and challenges of using open data, and examples of open data sources.
- Data Ethics: Ethical considerations in data collection, use, and sharing.
- Data Privacy: Importance of protecting data privacy, legal and regulatory frameworks for data privacy, and best practices for protecting data privacy.
Data anonymization
What is data anonymization?
We have been learning about the importance of privacy in data analytics. Now, it is time to talk about data anonymization and what types of data should be anonymized. Personally identifiable information, or PII, is information that can be used by itself or with other data to track down a person’s identity.
Data anonymization is the process of protecting people’s private or sensitive data by eliminating that kind of information. Typically, data anonymization involves blanking, hashing, or masking personal information, often by using fixed-length codes to represent data columns, or hiding data with altered values.
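The three techniques named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a production method: the field names, the `RULES` mapping, the salt value, and the sample record are all hypothetical. Note also that hashing short, guessable values such as phone numbers can often be reversed by trying every possible input, so on its own it may not be enough to anonymize them.

```python
import hashlib


def blank(value: str) -> str:
    """Blanking: replace the value with an empty string."""
    return ""


def hash_value(value: str, salt: str) -> str:
    """Hashing: map the value to a fixed-length code (64 hex characters)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def mask(value: str, visible: int = 4, fill: str = "*") -> str:
    """Masking: hide every character except the last `visible` ones."""
    visible = max(visible, 0)
    if len(value) <= visible:
        # Too short to show anything safely, so hide the whole value.
        return fill * len(value)
    hidden = len(value) - visible
    return fill * hidden + value[hidden:]


# Hypothetical rules: which technique to apply to which field.
RULES = {
    "name": blank,
    "phone": lambda v: hash_value(v, salt="example-salt"),
    "account_number": mask,
}


def anonymize_record(record: dict) -> dict:
    """Apply the matching rule to each field; leave other fields unchanged."""
    return {
        key: RULES[key](value) if key in RULES else value
        for key, value in record.items()
    }


# Made-up record for illustration only.
record = {
    "name": "Alex Example",
    "phone": "07700900123",
    "account_number": "12345678",
    "city": "Leeds",
}
anonymized = anonymize_record(record)
print(anonymized["account_number"])  # ****5678
```

Here the name is blanked, the phone number becomes a fixed-length code, the account number keeps only its last four digits, and the city is left as it is because no rule covers it.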
Your role in data anonymization
Organizations have a responsibility to protect their data and the personal information that data might contain. As a data analyst, you might be expected to understand what data needs to be anonymized, but you generally wouldn’t be responsible for the data anonymization itself. A rare exception might be if you work with a copy of the data for testing or development purposes. In this case, you could be required to anonymize the data before you work with it.
What types of data should be anonymized?
Healthcare and financial data are two of the most sensitive types of data, and the stakes are very high, so these industries rely heavily on data anonymization techniques. That’s why data in these two industries usually goes through de-identification, which is a process used to wipe data clean of all personally identifying information.
Data anonymization is used in just about every industry. That is why it is so important for data analysts to understand the basics. Here is a list of data that is often anonymized:
- Telephone numbers
- Names
- Licence plates and licence numbers
- Social security numbers
- IP addresses
- Medical records
- Email addresses
- Photographs
- Account numbers
For some people, it just makes sense that this type of data should be anonymized. For others, we have to be very specific about what needs to be anonymized. Imagine a world where we all had access to each other’s addresses, account numbers, and other identifiable information. That would invade a lot of people’s privacy and make the world less safe. Data anonymization is one of the ways we can keep data private and secure!
The open data debate
Just like data privacy, open data is a widely debated topic in today’s world. Data analysts think a lot about open data, and as a data analyst, you need to understand the basics to be successful in your new role.
What is open data?
In data analytics, open data is part of data ethics: the standards of right and wrong that dictate how data is collected, shared, and used. Openness refers to free access, usage, and sharing of data. But for data to be considered open, it has to:
- Be available and accessible to the public as a complete dataset
- Be provided under terms that allow it to be reused and redistributed
- Allow universal participation so that anyone can use, reuse, and redistribute the data
Data can only be considered open when it meets all three of these standards.
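The "all three standards" rule can be written as a simple check. The sketch below is only an illustration: the `DatasetTerms` class and its field names are hypothetical, and in practice each answer would come from reading the dataset's licence and access terms.

```python
from dataclasses import dataclass


@dataclass
class DatasetTerms:
    """Hypothetical summary of how a dataset is published."""
    available_as_complete_dataset: bool
    allows_reuse_and_redistribution: bool
    allows_universal_participation: bool


def is_open(terms: DatasetTerms) -> bool:
    """Data counts as open only if it meets all three standards."""
    return (
        terms.available_as_complete_dataset
        and terms.allows_reuse_and_redistribution
        and terms.allows_universal_participation
    )


# Public and reusable, but restricted to certain users: not open.
restricted = DatasetTerms(True, True, False)
print(is_open(restricted))  # False
```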
The open data debate: What data should be publicly available?
One of the biggest benefits of open data is that credible databases can be used more widely. This means that good data can be leveraged, shared, and combined with other data, which could have a huge impact on scientific collaboration, research advances, analytical capacity, and decision-making. But it is also important to think about the individuals represented in public, open data.
Third-party data is collected by an entity that doesn’t have a direct relationship with the data. You might remember learning about this type of data earlier. For example, third parties might collect information about visitors to a certain website. Doing this lets these third parties create audience profiles, which help them better understand user behaviour and target users with more effective advertising.
Personally identifiable information (PII) is data that is reasonably likely to identify a person and make information known about them. It is important to keep this data safe. PII can include a person’s address, credit card information, social security number, medical records, and more.
Everyone wants to keep personal information about themselves private. Because third-party data is readily available, it is important to balance the openness of data with the privacy of individuals.
Resources for open data
Luckily for data analysts, there are lots of trustworthy resources available for open data. It is important to remember that even reputable data needs to be constantly evaluated, but these websites are a useful starting point:
- UK government data site: Data.gov.uk is one of the most comprehensive government data sources in the UK. This resource gives users the data and tools that they need to do research, and even helps them develop web and mobile applications and design data visualizations.
- UK Data Service: Discover a wide array of key national and international datasets across various categories. The service is supported by the University of Essex, University of Manchester, Jisc, UCL and University of Edinburgh, and is funded by UKRI through the Economic and Social Research Council.
- UK government statistical data sets: List of statistical data sets published by the UK government.
- Kaggle: Kaggle has tens of thousands of datasets that are available for public use. Anyone can upload a dataset to Kaggle. If they choose to make it public, other Kagglers can use that dataset to create their own projects.
- Open Data Network: This data source has a really powerful search engine and advanced filters. Here, you can find data on topics like finance, public safety, infrastructure, and housing and development.
- Google Cloud Public Datasets: A selection of public datasets is available through the Google Cloud Public Dataset Program, already loaded into BigQuery.
- Dataset Search: Dataset Search is a search engine designed specifically for datasets; you can use it to find specific datasets.
Glossary terms from module 2
Terms and definitions for Course 3, Module 2
Bad data source: A data source that is not reliable, original, comprehensive, current, and cited (ROCCC)
Bias: A conscious or subconscious preference in favour of or against a person, group of people, or thing
Confirmation bias: The tendency to search for or interpret information in a way that confirms pre-existing beliefs
Consent: The aspect of data ethics that presumes an individual’s right to know how and why their personal data will be used before agreeing to provide it
Cookie: A small file stored on a computer that contains information about its users
Currency: The aspect of data ethics that presumes individuals should be aware of financial transactions resulting from the use of their personal data and the scale of those transactions
Data anonymization: The process of protecting people’s private or sensitive data by eliminating identifying information
Data bias: When a preference in favour of or against a person, group of people, or thing systematically skews data analysis results in a certain direction
Data ethics: Well-founded standards of right and wrong that dictate how data is collected, shared, and used
Data interoperability: The ability of data systems and services to openly connect and share data; a key factor leading to the successful use of open data among companies and governments
Data privacy: Preserving a data subject’s information any time a data transaction occurs
Ethics: Well-founded standards of right and wrong that prescribe what humans ought to do, usually in terms of rights, obligations, benefits to society, fairness, or specific virtues
Experimenter bias: The tendency for different people to observe things differently (also called observer bias)
Fairness: A quality of data analysis that does not create or reinforce bias
First-party data: Data collected by an individual or group using their own resources
General Data Protection Regulation of the European Union (GDPR): A regulation of the European Union created to help protect people and their data
Good data source: A data source that is reliable, original, comprehensive, current, and cited (ROCCC)
Interpretation bias: The tendency to interpret ambiguous situations in a positive or negative way
Observer bias: The tendency for different people to observe things differently (also called experimenter bias)
Open data: Data that is available to the public
Openness: The aspect of data ethics that promotes the free access, usage, and sharing of data
Sampling bias: Overrepresenting or underrepresenting certain members of a population as a result of working with a sample that is not representative of the population as a whole
Transaction transparency: The aspect of data ethics that presumes all data-processing activities and algorithms should be explainable and understood by the individual who provides the data
Unbiased sampling: When the sample of the population being measured is representative of the population as a whole
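The sampling bias and unbiased sampling entries above can be made concrete with a quick comparison of group shares. This is a minimal sketch under stated assumptions: the `representation_gaps` function and the urban/rural figures are made up for illustration, and a real check would also consider sample size and random variation.

```python
from collections import Counter


def representation_gaps(population: list, sample: list) -> dict:
    """Return each group's share in the sample minus its share in the population.

    A positive gap means the group is overrepresented in the sample;
    a negative gap means it is underrepresented.
    """
    if not population or not sample:
        raise ValueError("population and sample must both be non-empty")
    population_counts = Counter(population)
    sample_counts = Counter(sample)
    gaps = {}
    for group, count in population_counts.items():
        population_share = count / len(population)
        sample_share = sample_counts.get(group, 0) / len(sample)
        gaps[group] = sample_share - population_share
    return gaps


# Made-up example: the population is 60% urban, but the sample is 90% urban.
population = ["urban"] * 60 + ["rural"] * 40
sample = ["urban"] * 9 + ["rural"] * 1
gaps = representation_gaps(population, sample)
for group, gap in gaps.items():
    print(group, round(gap, 2))
```

Gaps close to zero for every group suggest the sample is representative on that attribute; large gaps, as in this example, point to sampling bias.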