Technical, Data Analytics

Course 3: Prepare Data For Exploration, Module 1: Data types and structures

Select the right data

Following are some data-collection considerations to keep in mind for your analysis:

How the data will be collected

Decide if you will collect the data using your own resources or receive (and possibly purchase it) from another party. Data that you collect yourself is called first-party data.

Data sources

If you don’t collect the data using your own resources, you might get data from second-party or third-party data providers. Second-party data is collected directly by another group and then sold. Third-party data is sold by a provider that didn’t collect the data themselves. Third-party data might come from a number of different sources.

Solving your business problem

Datasets can show a lot of interesting information. But be sure to choose data that can actually help solve your problem question. For example, if you are analyzing trends over time, make sure you use time series data — in other words, data that includes dates.

How much data to collect

If you are collecting your own data, make reasonable decisions about sample size. A random sample from existing data might be fine for some projects. Other projects might need more strategic data collection to focus on certain criteria. Each project has its own needs. 

Time frame

If you are collecting your own data, decide how long you will need to collect it, especially if you are tracking trends over a long period of time. If you need an immediate answer, you might not have time to collect new data. In this case, you would need to use historical data that already exists.

Use the flowchart below if data collection relies heavily on how much time you have:

Data formats in practice

When you think about the word “format,” a lot of things might come to mind. Think of an advertisement for your favourite store. You might find it in the form of a print ad, a billboard, or even a commercial. The information is presented in the format that works best for you to take it in. The format of a dataset is a lot like that, and choosing the right format will help you manage and use your data in the best way possible.

Data format examples

As with most things, it is easier for definitions to click when you can pair them with examples you might encounter on a daily basis. Review each data format’s definition first and then use the examples to lock in your understanding.

Primary versus secondary data

The following table highlights the differences between primary and secondary data and presents examples of each. 

Data format classificationDefinitionExamples
Primary dataCollected by a researcher from first-hand sourcesData from an interview you conducted – Data from a survey returned from 20 participants Data from questionnaires you got back from a group of workers
Secondary dataGathered by other people or from other researchData you bought from a local data analytics firm’s customer profiles Demographic data collected by a university Census data gathered by the federal government

Internal versus external data

The following table highlights the differences between internal and external data and presents examples of each. 

Data format classificationDefinitionExamples
Internal dataData that is stored inside a company’s own systemsWages of employees across different business units tracked by HR Sales data by store location Product inventory levels across distribution centres
External dataData that is stored outside of a company or organizationNational average wages for the various positions throughout your organization Credit reports for customers of an auto dealership

Continuous versus discrete data

The following table highlights the differences between continuous and discrete data and presents examples of each.

Data format classificationDefinitionExamples
Continuous dataData that is measured and can have almost any numeric valueHeight of kids in third grade classes (52.5 inches, 65.7 inches) Runtime markers in a videoTemperature
Discrete dataData that is counted and has a limited number of valuesNumber of people who visit a hospital on a daily basis (10, 20, 200)Maximum capacity allowed in a room Tickets sold in the current month

Qualitative versus quantitative data

The following table highlights the differences between qualitative and quantitative data and presents examples of each.

Data format classificationDefinitionExamples
QualitativeA subjective and explanatory measure of a quality or characteristicFavourite exercise activity Brand with the best customer service Fashion preferences of young adults
QuantitativeA specific and objective measure, such as a number, quantity, or rangePercentage of board certified doctors who are women Population size of elephants in Africa Distance from Earth to Mars at a particular time

Nominal versus ordinal data

The following table highlights the differences between nominal and ordinal data and presents examples of each.

Data format classificationDefinitionExamples
NominalA type of qualitative data that is categorized without a set orderFirst time customer, returning customer, regular customer New job applicant, existing applicant, internal applicant New listing, reduced price listing, foreclosure
OrdinalA type of qualitative data with a set order or scaleMovie ratings (number of stars: 1 star, 2 stars, 3 stars) Ranked-choice voting selections (1st, 2nd, 3rd) Satisfaction level measured in a survey (satisfied, neutral, dissatisfied)

Structured versus unstructured data

The following table highlights the differences between structured and unstructured data and presents examples of each.

Data format classificationDefinitionExamples
Structured dataData organized in a certain format, like rows and columnsExpense reports Tax returns Store inventory
Unstructured dataData that cannot be stored as columns and rows in a relational database. Social media posts Emails Videos

The effects of different structures

Data is everywhere and it can be stored in lots of ways. Two general categories of data are: 

  • Structured data: Organized in a certain format, such as rows and columns.
  • Unstructured data: Not organized in any easy-to-identify way.

For example, when you rate your favourite restaurant online, you’re creating structured data. But when you use Google Earth to check out a satellite image of a restaurant location, you’re using unstructured data. 

Here’s a refresher on the characteristics of structured and unstructured data:

Structured vs unstructured data

Structured data

As we described earlier, structured data is organized in a certain format. This makes it easier to store and query for business needs. If the data is exported, the structure goes along with the data.

Unstructured data

Unstructured data can’t be organized in any easily identifiable manner. And there is much more unstructured than structured data in the world. Video and audio files, text files, social media content, satellite imagery, presentations, PDF files, open-ended survey responses, and websites all qualify as types of unstructured data.

The fairness issue

The lack of structure makes unstructured data difficult to search, manage, and analyse. But recent advancements in artificial intelligence and machine learning algorithms are beginning to change that. Now, the new challenge facing data scientists is making sure these tools are inclusive and unbiased. Otherwise, certain elements of a dataset will be more heavily weighted and/or represented than others. And as you’re learning, an unfair dataset does not accurately represent the population, causing skewed outcomes, low accuracy levels, and unreliable analysis.

Data modelling levels and techniques

This reading introduces you to data modelling and different types of data models. Data models help keep data consistent and enable people to map out how data is organized. A basic understanding makes it easier for analysts and other stakeholders to make sense of their data and use it in the right ways.

Important note: As a junior data analyst, you won’t be asked to design a data model. But you might come across existing data models your organization already has in place.

What is data modelling?

Data modelling is the process of creating diagrams that visually represent how data is organized and structured.  These visual representations are called data models. You can think of data modelling as a blueprint of a house. At any point, there might be electricians, carpenters, and plumbers using that blueprint. Each one of these builders has a different relationship to the blueprint, but they all need it to understand the overall structure of the house. Data models are similar; different users might have different data needs, but the data model gives them an understanding of the structure as a whole. 

Levels of data modelling

Each level of data modelling has a different level of detail.

  1. Conceptual data modeling gives a high-level view of the data structure, such as how data interacts across an organization. For example, a conceptual data model may be used to define the business requirements for a new database. A conceptual data model doesn’t contain technical details.
  2. Logical data modeling focuses on the technical details of a database such as relationships, attributes, and entities. For example, a logical data model defines how individual records are uniquely identified in a database. But it doesn’t spell out actual names of database tables. That’s the job of a physical data model.
  3. Physical data modeling depicts how a database operates. A physical data model defines all entities and attributes used; for example, it includes table names, column names, and data types for the database.

More information can be found in this comparison of data models.

Data-modeling techniques

There are a lot of approaches when it comes to developing data models, but two common methods are the Entity Relationship Diagram (ERD) and the Unified Modeling Language (UML) diagram. ERDs are a visual way to understand the relationship between entities in the data model. UML diagrams are very detailed diagrams that describe the structure of a system by showing the system’s entities, attributes, operations, and their relationships. As a junior data analyst, you will need to understand that there are different data modeling techniques, but in practice, you will probably be using your organization’s existing technique. 

You can read more about ERD, UML, and data dictionaries in this data modeling techniques article.

Data analysis and data modeling

Data modeling can help you explore the high-level details of your data and how it is related across the organization’s information systems. Data modeling sometimes requires data analysis to understand how the data is put together; that way, you know how to map the data. And finally, data models make it easier for everyone in your organization to understand and collaborate with you on your data. This is important for you and everyone on your team!

Use Boolean logic

In this reading, you will explore the basics of Boolean logic and learn how to use single and multiple conditions in a Boolean statement. These conditions are created with Boolean operators, including AND, OR, and NOT. These operators are similar to mathematical operators and can be used to create logical statements that filter your results. Data analysts use Boolean statements to do a wide range of data analysis tasks, such as writing queries for searches and checking for conditions when writing programming code.

Boolean logic example

Imagine you are shopping for shoes, and are considering certain preferences:

  • You will buy the shoes only if they are any combination of pink and grey
  • You will buy the shoes if they are entirely pink, entirely grey, or if they are pink and grey
  • You will buy the shoes if they are grey, but not if they have any pink

These Venn diagrams illustrate your shoe preferences. AND is the center of the Venn diagram, where two conditions overlap. OR includes either condition. NOT includes only the part of the Venn diagram that doesn’t contain the exception.

Use Boolean logic in statements

In queries, Boolean logic is represented in a statement written with Boolean operators. An operator is a symbol that names the operation or calculation to be performed. Read on to discover how you can convert your shoe preferences into Boolean statements.

The AND operator

Your condition is “If the color of the shoe has any combination of grey and pink, you will buy them.” The Boolean statement would break down the logic of that statement to filter your results by both colors. It would say IF (Color=”Grey”) AND (Color=”Pink”) then buy them

The AND operator lets you stack both of your conditions.

Below is a simple truth table that outlines the Boolean logic at work in this statement. In the Color is Grey column, there are two pairs of shoes that meet the color condition. And in the Color is Pink column, there are two pairs that meet that condition. But in the If Grey AND Pink column, only one pair of shoes meets both conditions. So, according to the Boolean logic of the statement, there is only one pair marked true. In other words, there is one pair of shoes that you would buy.

Color is GreyColor is PinkIf Grey AND Pink, then BuyBoolean Logic
Grey/TruePink/TrueTrue/BuyTrue AND True = True
Grey/TrueBlack/FalseFalse/Don’t buyTrue AND False = False
Red/FalsePink/TrueFalse/Don’t buyFalse AND True = False
Red/FalseGreen/FalseFalse/Don’t buyFalse AND False = False 

The OR operator

The OR operator lets you move forward if either one of your two conditions is met. Your condition is “If the shoes are grey or pink, you will buy them.” The Boolean statement would be IF (Color=”Grey”) OR (Color=”Pink”) then buy them.

Notice that any shoe that meets either the Color is Grey or the Color is Pink condition is marked as true by the Boolean logic. According to the truth table below, there are three pairs of shoes that you can buy.

Color is GreyColor is PinkIf Grey OR Pink, then BuyBoolean Logic
Red/FalseBlack/FalseFalse/Don’t buyFalse OR False = False
Black/FalsePink/TrueTrue/BuyFalse OR True = True
Grey/TrueGreen/FalseTrue/BuyTrue OR False = True
Grey/TruePink/TrueTrue/BuyTrue OR True = True

The NOT operator

Finally, the NOT operator lets you filter by subtracting specific conditions from the results. Your condition is “You will buy any grey shoe except for those with any traces of pink in them.” Your Boolean statement would be IF (Color=”Grey”) AND (Color=NOT “Pink”) then buy them

Now, all of the grey shoes that aren’t pink are marked true by the Boolean logic for the NOT Pink condition. The pink shoes are marked false by the Boolean logic for the NOT Pink condition. Only one pair of shoes is excluded in the truth table below.

Color is GreyColor is PinkBoolean Logic for NOT PinkIf Grey AND (NOT Pink), then BuyBoolean Logic
Grey/TrueRed/FalseNot False = TrueTrue/BuyTrue AND True = True
Grey/TrueBlack/FalseNot False = TrueTrue/BuyTrue AND True = True
Grey/TrueGreen/FalseNot False = TrueTrue/BuyTrue AND True = True
Grey/TruePink/TrueNot True = FalseFalse/Don’t buyTrue AND False = False

The power of multiple conditions

For data analysts, the real power of Boolean logic comes from being able to combine multiple conditions in a single statement. For example, if you wanted to filter for shoes that were grey or pink, and waterproof, you could construct a Boolean statement such as: “IF ((Color = “Grey”) OR (Color = “Pink”)) AND (Waterproof=”True”)

Notice that you can use parentheses to group your conditions together.

Key takeaways

Operators are symbols that name the operation or calculation to be performed. The operators AND, OR, and NOT can be used to write Boolean statements in programming languages. Whether you are doing a search for new shoes or applying this logic to queries, Boolean logic lets you create multiple conditions to filter your results. Now that you know a little more about Boolean logic, you can start using it!

Resources for more information

Transforming data

What is data transformation?

Data transformation is the process of changing the data’s format, structure, or values. As a data analyst, there is a good chance you will need to transform data at some point to make it easier for you to analyse it. 

Data transformation usually involves:

  • Adding, copying, or replicating data 
  • Deleting fields or records 
  • Standardizing the names of variables
  • Renaming, moving, or combining columns in a database
  • Joining one set of data with another
  • Saving a file in a different format. For example, saving a spreadsheet as a comma separated values (.csv) file.

Why transform data?

Goals for data transformation might be: 

  • Data organization: better organized data is easier to use
  • Data compatibility: different applications or systems can then use the same data
  • Data migration: data with matching formats can be moved from one system to another
  • Data merging: data with the same organization can be merged together
  • Data enhancement: data can be displayed with more detailed fields 
  • Data comparison: apples-to-apples comparisons of the data can then be made

Data transformation example: data merging

Mario is a plumber who owns a plumbing company 🙂 After years in the business, he buys another plumbing company. Mario wants to merge the customer information from his newly acquired company with his own, but the other company uses a different database. So, Mario needs to make the data compatible. To do this, he has to transform the format of the acquired company’s data. Then, he must remove duplicate rows for customers they had in common. When the data is compatible and together, Mario’s plumbing company will have a complete and merged customer database.

Data transformation example: data organization (long to wide)

To make it easier to create charts, you may also need to transform long data to wide data. Consider the following example of transforming stock prices (collected as long data) to wide data.

Long data is data where each row contains a single data point for a particular item. In the long data example below, individual stock prices (data points) have been collected for Apple (AAPL), Amazon (AMZN), and Google (GOOGL) (particular items) on the given dates.

Long data example: Stock prices

Wide data is data where each row contains multiple data points for the particular items identified in the columns. 

Wide data example: Stock prices

With data transformed to wide data, you can create a chart comparing how each company’s stock changed over the same period of time.

You might notice that all the data included in the long format is also in the wide format. But wide data is easier to read and understand. That is why data analysts typically transform long data to wide data more often than they transform wide data to long data. The following table summarizes when each format is preferred:

Wide data is preferred when  Long data is preferred when 
Creating tables and charts with a few variables about each subjectStoring a lot of variables about each subject. For example, 60 years worth of interest rates for each bank
Comparing straightforward line graphsPerforming advanced statistical analysis or graphing 

Glossary terms from module 1

Terms and definitions for Course 3, Module 1

Agenda: A list of scheduled appointments

Audio file: Digitized audio storage usually in an MP3, AAC, or other compressed format

Boolean data: A data type with only two possible values, usually true or false

Continuous data: Data that is measured and can have almost any numeric value

Cookie: A small file stored on a computer that contains information about its users

Data element: A piece of information in a dataset 

Data model: A tool for organizing data elements and how they relate to one another

Digital photo: An electronic or computer-based image usually in BMP or JPG format

Discrete data: Data that is counted and has a limited number of values

External data: Data that lives, and is generated, outside of an organization

Field: A single piece of information from a row or column of a spreadsheet; in a data table, typically a column in the table

First-party data: Data collected by an individual or group using their own resources

Long data: A dataset in which each row is one time point per subject, so each subject has data in multiple rows

Nominal data: A type of qualitative data that is categorized without a set order

Ordinal data: Qualitative data with a set order or scale

Ownership: The aspect of data ethics that presumes individuals own the raw data they provide and have primary control over its usage, processing, and sharing 

Pixel: In digital imaging, a small area of illumination on a display screen that, when combined with other adjacent areas, forms a digital image 

Population: In data analytics, all possible data values in a dataset

Record: A collection of related data in a data table, usually synonymous with row

Sample: In data analytics, a segment of a population that is representative of the entire population

Second-party data: Data collected by a group directly from its audience and then sold 

Social media: Websites and applications through which users create and share content or participate in social networking

String data type: A sequence of characters and punctuation that contains textual information (Refer to Text data type)

Structured data: Data organized in a certain format such as rows and columns

Text data type: A sequence of characters and punctuation that contains textual information (also called string data type)

Unstructured data: Data that is not organized in any easily identifiable manner

Video file: A collection of images, audio files, and other data usually encoded in a compressed format such as MP4, MV4, MOV, AVI, or FLV

Wide data: A dataset in which every data subject has a single row with multiple columns to hold the values of various attributes of the subject

Series Navigation<< Course 3: Prepare Data For Exploration: Learning objectives and overviewsCourse 3: Prepare Data For Exploration, Module 2: Data responsibility >>
Tagged

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.