Technical, Data Analytics

Course 3: Prepare Data For Exploration, Module 3: Database Essentials

Maximise databases in data analytics

Databases enable analysts to manipulate, store, and process data. This helps them search through data a lot more efficiently to get the best insights. 

Relational databases

A relational database is a database that contains a series of tables that can be connected to form relationships. Basically, they allow data analysts to organise and link data based on what the data has in common. 

In a non-relational table, you will find all of the possible variables you might be interested in analysing all grouped together. This can make it really hard to sort through. This is one reason why relational databases are so common in data analysis: they simplify a lot of analysis processes and make data easier to find and use across an entire database. 

Normalisation is a process of organising data in a relational database. For example, creating tables and establishing relationships between those tables. It is applied to eliminate data redundancy, increase data integrity, and reduce complexity in a database.

The key to relational databases

Tables in a relational database are connected by the fields they have in common. As a quick refresher, a primary key is an identifier that references a column in which each value is unique. In other words, it’s a column of a table that is used to uniquely identify each record within that table. The value assigned to the primary key in a particular row must be unique within the entire table. For example,

Consider a table named Students:

StudentIDNameAgeMajor
1Alice20Computer Science
2Bob22Mathematics
3Charlie21Physics

In this table, StudentID is the primary key because it uniquely identifies each student.

By contrast, a foreign key is a field within a table that is a primary key in another table. A table can have only one primary key, but it can have multiple foreign keys. These keys are what create the relationships between tables in a relational database, which helps organise and connect data across multiple tables in the database. For example,

Consider another table named Enrollments which records which students are enrolled in which courses:

EnrollmentIDStudentIDCourseID
11101
22102
33103
41104

In this Enrollments table:

  • EnrollmentID is the primary key.
  • StudentID is a foreign key that references the StudentID primary key in the Students table.

This relationship indicates which students are enrolled in which courses, linking the Students and Enrollments tables.

A primary key may also be constructed using multiple columns of a table. This type of primary key is called a composite key.

Example: Composite Primary Key

Consider a many-to-many relationship between Students and Courses. A junction table, Enrollments, is used to link students to courses they are enrolled in. In this case, the combination of StudentID and CourseID serves as a composite primary key for the Enrollments table, and each of these fields is also a foreign key.

Tables and Relationships

Students Table

StudentIDNameAgeMajor
1Alice20Computer Science
2Bob22Mathematics
3Charlie21Physics

Courses Table

CourseIDCourseNameCredits
101Intro to Programming3
102Calculus I4
103Physics I4
104Data Structures3

Enrollments Table

StudentIDCourseID
1101
1104
2102
3103
” the combination of StudentID and CourseID serves as a composite primary key for the Enrollments table, and each of these fields is also a foreign key “

Metadata is as important as the data itself

Data analytics, by design, is a field that thrives on collecting and organizing data. In this reading, you’ll learn about metadata and the type of information it can provide. In addition, you’ll explore examples of metadata.

Explore a data file by opening any file on your computer or a document in your home or workplace. What is it? Where did it come from? Is it useful? How do you know? This is where metadata comes in to provide a deeper understanding of the data. To put it simply, metadata is data about data. In database management, metadata provides information about other data and helps data analysts interpret the contents of the data within a database.

Regardless of whether you’re working with a large or small quantity of data, metadata is the mark of a knowledgeable analytics team. Metadata helps people communicate about data across the business and makes it easier to reuse data. In essence, metadata tells the who, what, when, where, which, why, and how of data.

Elements of metadata

Before examining metadata examples, it’s important to understand what type of information metadata typically provides:

  • File or document type: What type of file or document are you examining?
  • Date, time, and creator: When was it created? Who created it? When was it last modified?
  • Title and description: What is the name of the item you are examining? What type of content does it contain?
  • Geolocation: If you’re examining a photo, where was it taken?
  • Tags and categories: What is the general overview of the item that you have? Is it indexed or described in a specific way?
  • Who last modified it and when: Were any changes made to the file? If yes, when were the most recent modifications made?
  • Who can access or update it: If you’re examining a dataset, is it public? Are special permissions needed to customize or modify it?

Examples of metadata

In today’s digital world, metadata is everywhere! Here are some examples—with accompanying images—of where you might find metadata.

Photos

Whenever a photo is captured with a camera, metadata such as filename, date, time, geolocation, and the type of device on which it was taken are gathered and saved with it.

Emails

When an email is sent or received, it contains metadata such as subject line, sender, recipient, date sent, and time sent.

Emails also contain hidden metadata that includes server names, IP addresses, HTML format, and software details.

Spreadsheets and electronically created documents

Spreadsheets and documents are already filled with a considerable amount of data, so it’s no surprise that they also include metadata such as title, author, creation date, number of pages, and user comments. Additionally, spreadsheet metadata includes tab names, tables, and columns.

Books and audiobooks

Non-digital items can have metadata, too! Every book has standard metadata that will inform you of its title, author’s name, a table of contents, publisher information, copyright description, index, and a brief description of the book’s contents. Audiobook metadata also includes this data, as well as metadata specific to the audiobook, such as narrator and recording length.

Metadata and metadata repositories

As you’re learning, metadata is data about data. It clearly describes how and when data was collected and how it’s organised. Metadata puts data into context and makes the data more understandable. This helps data analysts use data to solve problems and make informed business decisions. 

In this reading, you’ll learn more about the benefits of metadata, metadata repositories, and metadata of external databases.

The benefits of metadata

Reliability

Data analysts use reliable and high-quality data to identify the root causes of any problems that might occur during analysis and to improve their results. If the data being used to solve a problem or to make a data-driven decision is unreliable, there’s a good chance the results will be unreliable as well.

Metadata helps data analysts confirm their data is reliable by making sure it is:

  • Accurate
  • Precise
  • Relevant
  • Timely

It does this by helping analysts ensure that they’re working with the right data and that the data is described correctly. For example, a data analyst completing a project with data from 2022 can use metadata to easily determine if they should use data from a particular file.  

Consistency

Data analysts thrive on consistency and aim for uniformity in their data and databases,  and metadata helps make this possible. For example, to use survey data from two different sources, data analysts use metadata to make sure the same collection methods were applied in the survey so that both datasets can be compared reliably. 

When a database is consistent, it’s easier to discover relationships between the data inside the database and data that exists elsewhere. When data is uniform, it is:

  • Organised: Data analysts can easily find tables and files, monitor the creation and alteration of assets, and store metadata. 
  • Classified: Data analysts can categorize data when it follows a consistent format, which is beneficial in cleaning and processing data.
  • Stored: Consistent and uniform data can be efficiently stored in various data repositories. This streamlines storage management tasks such as managing a database.   
  • Accessed: Users, applications, and systems can efficiently locate and use data.

Together, these benefits empower data analysts to effectively analyse and interpret their data.

Metadata repositories

Metadata repositories help data analysts ensure their data is reliable and consistent.

Metadata repositories are specialized databases specifically created to store and manage metadata. They can be kept in a physical location or a virtual environment—like data that exists in the cloud.

Metadata repositories describe where the metadata came from and store that data in an accessible form with a common structure. This provides data analysts with quick and easy access to the data. If data analysts didn’t use a metadata repository, they would have to select each file to look up its information and compare the data manually, which would waste a lot of time and effort.

Data analysts also use metadata repositories to bring together multiple sources for data analysis. Metadata repositories do this by describing the state and location of the data, the structure of the tables inside the data, and who accessed the user logs.

Metadata of external databases

Data analysts use both second-party and third-party data to gain valuable insights and make strategic, data-driven decisions. Second-party data is data that’s collected by a group directly from the group’s audience and then sold. Third-party data is provided by outside sources that didn’t collect it directly. The providers of this data are not its original collectors and do not have a direct relationship with any individuals to whom the data belongs. The outside providers get the data from websites or other programs that pull it from the various platforms where it was originally generated. 

Data analysts should understand the metadata of external databases to confirm that it is consistent and reliable. In some cases, they should also contact the owner of the third-party data to confirm that it is accessible and available for purchase. Confirming that the data is reliable and that the proper permissions to use it have been obtained are best practices when using data that comes from another organization.

Series Navigation<< Course 3: Prepare Data For Exploration, Module 2: Data responsibilityCourse 3: Prepare Data For Exploration, Module 4: Organise and Secure Data >>
Tagged , ,

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.