- Foundation Course Module 1: Introduction of data analytics and analytical thinking
- Foundation Course Module 2 : The wonderful world of data
- Foundation Course Module 3 : Setup of data analytics toolbox
- Foundation Course Module 4: Becoming a fair and impactful data professional
- Foundation Course: Glossary
- Course 2: Ask questions to make data driven decisions, Module 1: Ask effective questions
- Course 2: Ask questions to make data driven decisions, Module 2: Make data-driven decisions
- Course 2: Ask questions to make data driven decisions, Module 3: Spreadsheet magic
- Course 2: Ask questions to make data driven decisions, Module 4: Always remember the stakeholder
- Course 3: Prepare Data For Exploration: Learning objectives and overviews
- Course 3: Prepare Data For Exploration, Module 1: Data types and structures
- Course 3: Prepare Data For Exploration, Module 2: Data responsibility
- Course 3: Prepare Data For Exploration, Module 3: Database Essentials
- Course 3: Prepare Data For Exploration, Module 4: Organise and Secure Data
- Course 4: Process Data from Dirty to Clean: Overview
- Course 4: Process Data from Dirty to Clean, Module 1: The importance of integrity
- Course 4: Process Data from Dirty to Clean, Module 2: Clean it up
- Course 4: Process Data from Dirty to Clean, Module 3: SQL
- Course 4: Process Data from Dirty to Clean, Module 4: Verify and Report Results
- Course 5: Analyse Data to Answer Questions, Module 1: Organise data for more effective analysis
- Course 5: Analyse Data to Answer Questions, Module 2: Format and adjust data
- Course 5: Analyse Data to Answer Questions, Module 3: Aggregate data for analysis
- Course 5: Analyse Data to Answer Questions, Module 4: Perform Data Calculations
- Course 6: Share Data Through the Art of Visualisation, Course Overview plus Module 1: Visualise Data
- Course 6: Share Data Through the Art of Visualisation, Course Overview plus Module 2: Create Data Visualisation with Tableau
Maximise databases in data analytics
Databases enable analysts to manipulate, store, and process data. This helps them search through data a lot more efficiently to get the best insights.
Relational databases
A relational database is a database that contains a series of tables that can be connected to form relationships. Basically, they allow data analysts to organise and link data based on what the data has in common.
In a non-relational table, you will find all of the possible variables you might be interested in analysing all grouped together. This can make it really hard to sort through. This is one reason why relational databases are so common in data analysis: they simplify a lot of analysis processes and make data easier to find and use across an entire database.
Normalisation is a process of organising data in a relational database. For example, creating tables and establishing relationships between those tables. It is applied to eliminate data redundancy, increase data integrity, and reduce complexity in a database.
The key to relational databases
Tables in a relational database are connected by the fields they have in common. As a quick refresher, a primary key is an identifier that references a column in which each value is unique. In other words, it’s a column of a table that is used to uniquely identify each record within that table. The value assigned to the primary key in a particular row must be unique within the entire table. For example,
Consider a table named Students
:
StudentID | Name | Age | Major |
---|---|---|---|
1 | Alice | 20 | Computer Science |
2 | Bob | 22 | Mathematics |
3 | Charlie | 21 | Physics |
In this table, StudentID
is the primary key because it uniquely identifies each student.
By contrast, a foreign key is a field within a table that is a primary key in another table. A table can have only one primary key, but it can have multiple foreign keys. These keys are what create the relationships between tables in a relational database, which helps organise and connect data across multiple tables in the database. For example,
Consider another table named Enrollments
which records which students are enrolled in which courses:
EnrollmentID | StudentID | CourseID |
---|---|---|
1 | 1 | 101 |
2 | 2 | 102 |
3 | 3 | 103 |
4 | 1 | 104 |
In this Enrollments
table:
EnrollmentID
is the primary key.StudentID
is a foreign key that references theStudentID
primary key in theStudents
table.
This relationship indicates which students are enrolled in which courses, linking the Students
and Enrollments
tables.
A primary key may also be constructed using multiple columns of a table. This type of primary key is called a composite key.
Example: Composite Primary Key
Consider a many-to-many relationship between Students
and Courses
. A junction table, Enrollments
, is used to link students to courses they are enrolled in. In this case, the combination of StudentID
and CourseID
serves as a composite primary key for the Enrollments
table, and each of these fields is also a foreign key.
Tables and Relationships
Students Table
StudentID | Name | Age | Major |
---|---|---|---|
1 | Alice | 20 | Computer Science |
2 | Bob | 22 | Mathematics |
3 | Charlie | 21 | Physics |
Courses Table
CourseID | CourseName | Credits |
---|---|---|
101 | Intro to Programming | 3 |
102 | Calculus I | 4 |
103 | Physics I | 4 |
104 | Data Structures | 3 |
Enrollments Table
StudentID | CourseID |
---|---|
1 | 101 |
1 | 104 |
2 | 102 |
3 | 103 |
StudentID
and CourseID
serves as a composite primary key for the Enrollments
table, and each of these fields is also a foreign key “Metadata is as important as the data itself
Data analytics, by design, is a field that thrives on collecting and organizing data. In this reading, you’ll learn about metadata and the type of information it can provide. In addition, you’ll explore examples of metadata.
Explore a data file by opening any file on your computer or a document in your home or workplace. What is it? Where did it come from? Is it useful? How do you know? This is where metadata comes in to provide a deeper understanding of the data. To put it simply, metadata is data about data. In database management, metadata provides information about other data and helps data analysts interpret the contents of the data within a database.
Regardless of whether you’re working with a large or small quantity of data, metadata is the mark of a knowledgeable analytics team. Metadata helps people communicate about data across the business and makes it easier to reuse data. In essence, metadata tells the who, what, when, where, which, why, and how of data.
Elements of metadata
Before examining metadata examples, it’s important to understand what type of information metadata typically provides:
- File or document type: What type of file or document are you examining?
- Date, time, and creator: When was it created? Who created it? When was it last modified?
- Title and description: What is the name of the item you are examining? What type of content does it contain?
- Geolocation: If you’re examining a photo, where was it taken?
- Tags and categories: What is the general overview of the item that you have? Is it indexed or described in a specific way?
- Who last modified it and when: Were any changes made to the file? If yes, when were the most recent modifications made?
- Who can access or update it: If you’re examining a dataset, is it public? Are special permissions needed to customize or modify it?
Examples of metadata
In today’s digital world, metadata is everywhere! Here are some examples—with accompanying images—of where you might find metadata.
Photos
Whenever a photo is captured with a camera, metadata such as filename, date, time, geolocation, and the type of device on which it was taken are gathered and saved with it.
Emails
When an email is sent or received, it contains metadata such as subject line, sender, recipient, date sent, and time sent.
Emails also contain hidden metadata that includes server names, IP addresses, HTML format, and software details.
Spreadsheets and electronically created documents
Spreadsheets and documents are already filled with a considerable amount of data, so it’s no surprise that they also include metadata such as title, author, creation date, number of pages, and user comments. Additionally, spreadsheet metadata includes tab names, tables, and columns.
Books and audiobooks
Non-digital items can have metadata, too! Every book has standard metadata that will inform you of its title, author’s name, a table of contents, publisher information, copyright description, index, and a brief description of the book’s contents. Audiobook metadata also includes this data, as well as metadata specific to the audiobook, such as narrator and recording length.
Metadata and metadata repositories
As you’re learning, metadata is data about data. It clearly describes how and when data was collected and how it’s organised. Metadata puts data into context and makes the data more understandable. This helps data analysts use data to solve problems and make informed business decisions.
In this reading, you’ll learn more about the benefits of metadata, metadata repositories, and metadata of external databases.
The benefits of metadata
Reliability
Data analysts use reliable and high-quality data to identify the root causes of any problems that might occur during analysis and to improve their results. If the data being used to solve a problem or to make a data-driven decision is unreliable, there’s a good chance the results will be unreliable as well.
Metadata helps data analysts confirm their data is reliable by making sure it is:
- Accurate
- Precise
- Relevant
- Timely
It does this by helping analysts ensure that they’re working with the right data and that the data is described correctly. For example, a data analyst completing a project with data from 2022 can use metadata to easily determine if they should use data from a particular file.
Consistency
Data analysts thrive on consistency and aim for uniformity in their data and databases, and metadata helps make this possible. For example, to use survey data from two different sources, data analysts use metadata to make sure the same collection methods were applied in the survey so that both datasets can be compared reliably.
When a database is consistent, it’s easier to discover relationships between the data inside the database and data that exists elsewhere. When data is uniform, it is:
- Organised: Data analysts can easily find tables and files, monitor the creation and alteration of assets, and store metadata.
- Classified: Data analysts can categorize data when it follows a consistent format, which is beneficial in cleaning and processing data.
- Stored: Consistent and uniform data can be efficiently stored in various data repositories. This streamlines storage management tasks such as managing a database.
- Accessed: Users, applications, and systems can efficiently locate and use data.
Together, these benefits empower data analysts to effectively analyse and interpret their data.
Metadata repositories
Metadata repositories help data analysts ensure their data is reliable and consistent.
Metadata repositories are specialized databases specifically created to store and manage metadata. They can be kept in a physical location or a virtual environment—like data that exists in the cloud.
Metadata repositories describe where the metadata came from and store that data in an accessible form with a common structure. This provides data analysts with quick and easy access to the data. If data analysts didn’t use a metadata repository, they would have to select each file to look up its information and compare the data manually, which would waste a lot of time and effort.
Data analysts also use metadata repositories to bring together multiple sources for data analysis. Metadata repositories do this by describing the state and location of the data, the structure of the tables inside the data, and who accessed the user logs.
Metadata of external databases
Data analysts use both second-party and third-party data to gain valuable insights and make strategic, data-driven decisions. Second-party data is data that’s collected by a group directly from the group’s audience and then sold. Third-party data is provided by outside sources that didn’t collect it directly. The providers of this data are not its original collectors and do not have a direct relationship with any individuals to whom the data belongs. The outside providers get the data from websites or other programs that pull it from the various platforms where it was originally generated.
Data analysts should understand the metadata of external databases to confirm that it is consistent and reliable. In some cases, they should also contact the owner of the third-party data to confirm that it is accessible and available for purchase. Confirming that the data is reliable and that the proper permissions to use it have been obtained are best practices when using data that comes from another organization.