Google Data Analytics Professional Certificate

Data Responsibility Rundown

Key Learnings:

Data Bias: how to identify different types of bias in data, including selection bias, measurement bias, and reporting bias. How to mitigate the effects of bias on your data analysis.
Data Credibility: how to assess the credibility of data sources and how to verify the accuracy of data. The importance of data ethics and data privacy.
Open Data: the concept of open data and how it can be used to improve data analysis. Benefits and challenges of using open data.
Data Ethics and Privacy: the ethical considerations involved in data collection, use, and sharing. The importance of protecting data privacy.

Specific Topics Covered:

Types of Bias: Selection bias, measurement bias, reporting bias, confirmation bias, and observer bias.
Mitigating Bias: Techniques for identifying and mitigating bias in data analysis.
Data Credibility: Assessing the credibility of data sources, verifying data accuracy, and ensuring data integrity.
Open Data: Definition of open data, benefits and challenges of using open data, and examples of open data sources.
Data Ethics: Ethical considerations in data collection, use, and sharing.
Data Privacy: Importance of protecting data privacy, legal and regulatory frameworks for data privacy, and best practices for protecting data privacy.

Data anonymization

What is data anonymization?

We have been learning about the importance of privacy in data analytics. Now, it is time to talk about data anonymization and what types of data should be anonymized. Personally identifiable information, or PII, is information that can be used by itself or with other data to track down a person’s identity.

Data anonymization is the process of protecting people’s private or sensitive data by eliminating that kind of information. Typically, data anonymization involves blanking, hashing, or masking personal information, often by using fixed-length codes to represent data columns, or hiding data with altered values.
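
As a minimal sketch of the three techniques named above, the Python below blanks, hashes, and masks the fields of a single made-up record. The record, the field names, the secret key, and the helper functions are all hypothetical. A keyed hash (HMAC-SHA-256) is used here as one possible hashing scheme; it is an assumption, not something the course prescribes.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored outside the code.
SECRET_KEY = b"example-key-not-for-real-use"

def blank(_value: str) -> str:
    """Blanking: drop the value entirely."""
    return ""

def hash_value(value: str) -> str:
    """Hashing: replace the value with a fixed-length code (64 hex characters)."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

def mask(value: str, visible: int = 4, fill: str = "*") -> str:
    """Masking: hide everything except the last `visible` characters."""
    if len(value) <= visible:
        return fill * len(value)
    return fill * (len(value) - visible) + value[-visible:]

# Made-up record for illustration only.
record = {
    "name": "Alex Example",
    "email": "alex@example.com",
    "account_number": "1234567890",
}

anonymized = {
    "name": blank(record["name"]),
    "email": hash_value(record["email"]),
    "account_number": mask(record["account_number"]),
}
print(anonymized)
```

Because the same input always produces the same code, a hashed column can still be counted or joined on without exposing the original value.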

Your role in data anonymization

Organizations have a responsibility to protect their data and the personal information that data might contain. As a data analyst, you might be expected to understand what data needs to be anonymized, but you generally wouldn’t be responsible for the data anonymization itself. A rare exception might be if you work with a copy of the data for testing or development purposes. In this case, you could be required to anonymize the data before you work with it.

What types of data should be anonymized?

Healthcare and financial data are two of the most sensitive types of data. These industries rely a lot on data anonymization techniques. After all, the stakes are very high. That’s why data in these two industries usually goes through de-identification, which is a process used to wipe data clean of all personally identifying information.

Data anonymization is used in just about every industry. That is why it is so important for data analysts to understand the basics. Here is a list of data that is often anonymized:

Telephone numbers
Names
Licence plates and licence numbers
Social security numbers
IP addresses
Medical records
Email addresses
Photographs
Account numbers

For some people, it just makes sense that this type of data should be anonymized. For others, we have to be very specific about what needs to be anonymized. Imagine a world where we all had access to each other’s addresses, account numbers, and other identifiable information. That would invade a lot of people’s privacy and make the world less safe. Data anonymization is one of the ways we can keep data private and secure!

The open data debate

Just like data privacy, open data is a widely debated topic in today’s world. Data analysts think a lot about open data, and as a data analyst, you need to understand the basics to be successful in your new role.

What is open data?

In data analytics, open data is part of data ethics, which has to do with using data ethically. Openness refers to free access, usage, and sharing of data. But for data to be considered open, it has to:

Be available and accessible to the public as a complete dataset
Be provided under terms that allow it to be reused and redistributed
Allow universal participation so that anyone can use, reuse, and redistribute the data

Data can only be considered open when it meets all three of these standards.
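
The all-three rule can be sketched as a simple checklist in Python. The class and field names below are hypothetical labels for the three standards listed above, not part of any official definition or tool.

```python
from dataclasses import dataclass

@dataclass
class DatasetTerms:
    """Hypothetical checklist mirroring the three open data standards."""
    available_as_complete_dataset: bool
    reusable_and_redistributable: bool
    universal_participation: bool

def is_open(terms: DatasetTerms) -> bool:
    """A dataset counts as open only if every standard is met."""
    return all((
        terms.available_as_complete_dataset,
        terms.reusable_and_redistributable,
        terms.universal_participation,
    ))

# Made-up examples: one dataset meets all three standards, one does not.
fully_open = DatasetTerms(True, True, True)
restricted = DatasetTerms(True, True, False)

print(is_open(fully_open))
print(is_open(restricted))
```

A dataset that is public but cannot be redistributed by everyone fails the check, which is the point of requiring all three standards together.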

The open data debate: What data should be publicly available?

One of the biggest benefits of open data is that credible databases can be used more widely. Basically, this means that all of that good data can be leveraged, shared, and combined with other data. This could have a huge impact on scientific collaboration, research advances, analytical capacity, and decision-making. But it is important to think about the individuals represented in public, open data, too.

Third-party data is collected by an entity that doesn’t have a direct relationship with the data. You might remember learning about this type of data earlier. For example, third parties might collect information about visitors to a certain website. Doing this lets them create audience profiles, which help them better understand user behaviour and target those users with more effective advertising.

Personally identifiable information (PII) is data that is reasonably likely to identify a person and make information known about them. It is important to keep this data safe. PII can include a person’s address, credit card information, social security number, medical records, and more.

Everyone wants to keep personal information about themselves private. Because third-party data is readily available, it is important to balance the openness of data with the privacy of individuals.

Resources for open data

Luckily for data analysts, there are lots of trustworthy resources available for open data. It is important to remember that even reputable data needs to be constantly evaluated, but these websites are a useful starting point:

UK government data site: Data.gov.uk is one of the most comprehensive government data sources in the UK. This resource gives users the data and tools that they need to do research, and even helps them develop web and mobile applications and design data visualizations.
UK Data Service: Discover a wide array of key national and international datasets across various categories. The service is supported by the University of Essex, University of Manchester, Jisc, UCL and University of Edinburgh, and is funded by UKRI through the Economic and Social Research Council.
UK government statistical data sets: List of statistical data sets published by the UK government.
Kaggle: Kaggle has tens of thousands of datasets that are available for public use. Anyone can upload a dataset to Kaggle. If they choose to make it public, other Kagglers can use that dataset to create their own projects.
Open Data Network: This data source has a really powerful search engine and advanced filters. Here, you can find data on topics like finance, public safety, infrastructure, and housing and development.
Google Cloud Public Datasets: A selection of public datasets is available through the Google Cloud Public Dataset Program, already loaded into BigQuery.
Dataset Search: Dataset Search is a search engine designed specifically for datasets; you can use it to find specific datasets.

Glossary terms from module 2

Terms and definitions for Course 3, Module 2

Bad data source: A data source that is not reliable, original, comprehensive, current, and cited (ROCCC)

Bias: A conscious or subconscious preference in favor of or against a person, group of people, or thing

Confirmation bias: The tendency to search for or interpret information in a way that confirms pre-existing beliefs

Consent: The aspect of data ethics that presumes an individual’s right to know how and why their personal data will be used before agreeing to provide it

Cookie: A small file stored on a computer that contains information about its users

Currency: The aspect of data ethics that presumes individuals should be aware of financial transactions resulting from the use of their personal data and the scale of those transactions

Data anonymization: The process of protecting people’s private or sensitive data by eliminating identifying information

Data bias: When a preference in favor of or against a person, group of people, or thing systematically skews data analysis results in a certain direction

Data ethics: Well-founded standards of right and wrong that dictate how data is collected, shared, and used

Data interoperability: The ability of different systems and organizations to exchange and use each other’s data; a key factor leading to the successful use of open data among companies and governments

Data privacy: Preserving a data subject’s information any time a data transaction occurs

Ethics: Well-founded standards of right and wrong that prescribe what humans ought to do, usually in terms of rights, obligations, benefits to society, fairness, or specific virtues

Experimenter bias: The tendency for different people to observe things differently (also called observer bias)

Fairness: A quality of data analysis that does not create or reinforce bias

First-party data: Data collected by an individual or group using their own resources

General Data Protection Regulation of the European Union (GDPR): A regulation of the European Union created to help protect people and their data

Good data source: A data source that is reliable, original, comprehensive, current, and cited (ROCCC)

Interpretation bias: The tendency to interpret ambiguous situations in a positive or negative way

Observer bias: The tendency for different people to observe things differently (also called experimenter bias)

Open data: Data that is available to the public

Openness: The aspect of data ethics that promotes the free access, usage, and sharing of data

Sampling bias: Overrepresenting or underrepresenting certain members of a population as a result of working with a sample that is not representative of the population as a whole

Transaction transparency: The aspect of data ethics that presumes all data-processing activities and algorithms should be explainable and understood by the individual who provides the data

Unbiased sampling: When the sample of the population being measured is representative of the population as a whole
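
To make the contrast between sampling bias and unbiased sampling concrete, here is a small Python sketch. The population, the group labels, and the sample size are made up for illustration, and a simple random sample is only one possible way to aim for a representative sample.

```python
import random

# Made-up population: 500 people in group "A" followed by 500 in group "B".
population = ["A"] * 500 + ["B"] * 500

def share_of_a(people):
    """Fraction of the given people who belong to group "A"."""
    return people.count("A") / len(people)

# Biased: taking the first 100 rows only ever reaches group "A".
biased_sample = population[:100]

# Closer to unbiased: a simple random sample gives everyone the same chance.
rng = random.Random(42)  # fixed seed so the run is repeatable
random_sample = rng.sample(population, 100)

print(f"population share of A:    {share_of_a(population):.2f}")
print(f"biased sample share of A: {share_of_a(biased_sample):.2f}")
print(f"random sample share of A: {share_of_a(random_sample):.2f}")
```

The biased sample overrepresents group A completely, while the random sample should land much closer to the population's even split.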

MindSpace

ad astra per aspera

Course 3: Prepare Data For Exploration, Module 2: Data responsibility

About J_Tusar
