Technical, Data Analytics

Course 5: Analyse Data to Answer Questions, Module 3: Aggregate data for analysis

graphical user interface

VLOOKUP core concepts

Spreadsheet functions can be used to quickly find information and perform calculations using specific values. VLOOKUP, or Vertical Lookup, is one such function that vertically searches for a certain value in a column to return a corresponding piece of information. In this reading, you’ll examine the intricacies of this extremely useful function so you understand how it works when you use it to analyse data.

VLOOKUP functionality

VLOOKUP searches for a search term, called a search_key, in one column of a spreadsheet. When the search_key is found, the function returns the data from another column of the row from which it was located. VLOOKUP returns only the value that corresponds to the first item it matches. So, if there are multiple matching values, the spreadsheet will return only data about the first one.

VLOOKUP use cases

Here are two common reasons why you might use VLOOKUP:

Merging data from one spreadsheet with data in another. If a teacher keeps one spreadsheet for student grades and another for attendance, they could use VLOOKUP to combine the spreadsheets. That way, they could search for a particular student in the attendance sheet, and VLOOKUP would pull the corresponding attendance record into the grades spreadsheet.

Populating data in a spreadsheet. Perhaps a store manager is tracking incoming shipments before a busy holiday. They could use VLOOKUP to look up product ID codes in a product spreadsheet and retrieve the corresponding product information from another spreadsheet. This would help the manager know how many stock clerks they need to schedule to work when the shipments arrive.

VLOOKUP syntax

VLOOKUP is available in both Microsoft Excel and Google Sheets. Here, you’ll explore its syntax in Google Sheets. Refer to the resources at the end of this reading for more information about VLOOKUP in Microsoft Excel.

VLOOKUP’s syntax is:

VLOOKUP(search_key, range, index, is_sorted)

The following sections explain each of the four parts of the syntax.

search_key

This is the value the function will search for. It can be a number, text string, or cell reference.

range

This is the range of cells over which the function will search and return information. The first column in the range is searched. When the search key is found, the index from that row is returned.

For example, if you search for the search_key in column B and return the data from column D, the range would need to include columns B through D, such as the range B2:D10. If you specified a range of A2:D10, the function would search for the search term in column A.

The search_key must be to the left of the information you want the function to return. This may require you to move columns around before you use VLOOKUP. For example, if you plan to search for the search_key column D, but the information you want the function to return is in column A, you must rearrange your columns before using VLOOKUP.

index

This is the position of the column that contains the data to be returned. The first column in the range is column number 1, and each column is numbered sequentially to the right.

For example, if the range is B2:D10 and you want to return a value from column D, the index number would be 3. If the index is not between 1 and the number of columns in range, the error message #VALUE! will be returned.

is_sorted

This indicates whether to return an approximate or exact match. For example, if you’re searching for Google, then google would not count as a match.

  • To return an exact match, set is_sorted to FALSE. This is recommended.
  • To return an approximate match, set is_sorted to TRUE. The nearest match (less than or equal to the search_key) is returned. To use this option to obtain accurate results, you must sort your data in ascending order. But, you could still find a value.
  • If neither TRUE nor FALSE are selected, the function will default to TRUE.

The #N/A error

#N/A indicates that a matching value can’t be returned because no matches were found.

Key takeaways

Use VLOOKUP to search for a value in a column and return a corresponding piece of information. It’s a very useful tool for data professionals, as it enables them to combine data from multiple sources and find information quickly. Keep in mind that the column that matches the search_key in a VLOOKUP formula should be on the left side of the data. The range must include both the column being searched and the column that contains the information being returned. TRUE means an approximate match, and FALSE means an exact match on the search_key.

VLOOKUP resources for Microsoft Excel

VLOOKUP may slightly differ in Microsoft Excel, but the overall concepts can still be generally applied. Refer to the following resources if you’re working with Excel:

Use JOINs effectively

In this reading, we will see how JOINs are used and will be introduced to some resources that you can use to learn more about them. A JOIN combines tables by using a primary or foreign key to align the information coming from both tables in the combination process. JOINs use these keys to identify relationships and corresponding values across tables. 

If you need a refresher on primary and foreign keys, go back here for this course

The general JOIN syntax

SQL
SELECT 
   -- table columns from tables are inserted here
   table_name1.column_name
   table_name2.column_name
FROM
   table_name1
JOIN
   table_name2
ON table_name1.column_name = table_name2.column_name

As you can see from the syntax, the JOIN statement is part of the FROM clause of the query. JOIN in SQL indicates that you are going to combine data from two tables. ON in SQL identifies how the tables are to be matched for the correct information to be combined from both. 

Type of JOINs

There are four general ways in which to conduct JOINs in SQL queries: INNER, LEFT, RIGHT, and FULL OUTER.

Here is what these different JOIN queries do.

INNER JOIN

INNER is optional in this SQL query because it is the default as well as the most commonly used JOIN operation. You may see this as JOIN only. INNER JOIN returns records if the data lives in both tables. For example, if you use INNER JOIN for the customers and orders tables and match the data using the customer_id key, you would combine the data for each customer_id that exists in both tables. If a customer_id exists in the customers table but not the orders table, data for that customer_id isn’t joined or returned by the query.

SQL
SELECT
   customers.customer_name,
   orders.product_id,
   orders.ship_date
FROM
   customers 
INNER JOIN 
   orders 
ON customers.customer_id = orders.customer_id

The results from the query might look like the following, where customer_name is from the customers table and product_id and ship_date are from the orders table:

customer_nameproduct_idship_date
Martin’s Ice Cream0439982021-02-23
Beachside Treats8720122021-02-25
Mona’s Natural Flavors7249562021-02-28
… etc.… etc.… etc.

The data from both tables was joined together by matching the customer_id common to both tables. Notice that customer_id doesn’t show up in the query results. It is simply used to establish the relationship between the data in the two tables so the data can be joined and returned. 

LEFT JOIN

You may see this as LEFT OUTER JOIN, but most users prefer LEFT JOIN. Both are correct syntax. LEFT JOIN returns all the records from the left table and only the matching records from the right table. Use LEFT JOIN whenever you need the data from the entire first table and values from the second table, if they exist. For example, in the query below, LEFT JOIN will return customer_name with the corresponding sales_rep, if it is available. If there is a customer who did not interact with a sales representative, that customer would still show up in the query results but with a NULL value for sales_rep.

SQL
SELECT
   customers.customer_name, 
   sales.sales_rep
FROM 
   customers 
LEFT JOIN 
   sales 
ON customers.customer_id = sales.customer_id

The results from the query might look like the following where customer_name is from the customers table and sales_rep is from the sales table. Again, the data from both tables was joined together by matching the customer_id common to both tables even though customer_id wasn’t returned in the query results.

customer_namesales_rep
Martin’s Ice CreamLuis Reyes
Beachside TreatsNULL
Mona’s Natural FlavorsGeri Hall
…etc.…etc.

RIGHT JOIN

You may see this as RIGHT OUTER JOIN or RIGHT JOIN. RIGHT JOIN returns all records from the right table and the corresponding records from the left table. Practically speaking, RIGHT JOIN is rarely used. Most people simply switch the tables and stick with LEFT JOIN. But using the previous example for LEFT JOIN, the query using RIGHT JOIN would look like the following:

SQL
SELECT
   sales.sales_rep,
   customers.customer_name
FROM 
  sales 
RIGHT JOIN 
  customers 
ON sales.customer_id = customers.customer_id

The query results are the same as the previous LEFT JOIN example.

customer_namesales_rep
Martin’s Ice CreamLuis Reyes
Beachside TreatsNULL
Mona’s Natural FlavorsGeri Hall
…etc.…etc.

FULL OUTER JOIN

You may sometimes see this as FULL JOIN. FULL OUTER JOIN returns all records from the specified tables. You can combine tables this way, but remember that this can potentially be a large data pull as a result. FULL OUTER JOIN returns all records from both tables even if data isn’t populated in one of the tables. For example, in the query below, you will get all customers and their products’ shipping dates. Because you are using a FULL OUTER JOIN, you may get customers returned without corresponding shipping dates or shipping dates without corresponding customers. A NULL value is returned if corresponding data doesn’t exist in either table.

SQL
SELECT
   customers.customer_name,
   orders.ship_date
FROM 
   customers 
FULL OUTER JOIN 
   orders
ON customers.customer_id = orders.customer_id

The results from the query might look like the following.

customer_nameship_date
Martin’s Ice Cream2021-02-23
Beachside Treats2021-02-25
NULL2021-02-25
The Daily ScoopNULL
Mountain Ice CreamNULL
Mona’s Natural Flavors2021-02-28
…etc.…etc.

For more information

JOINs are going to be useful for working with relational databases and SQL—and you will have plenty of opportunities to practice them on your own. Here are a few other resources that can give you more information about JOINs and how to use them:

  • SQL JOINs: This is a good basic explanation of JOINs with examples. If you need a quick reminder of what the different JOINs do, this is a great resource to bookmark and come back to later.  
  • Database JOINs – Introduction to JOIN Types and Concepts: This is a really thorough introduction to JOINs. Not only does this article explain what JOINs are and how to use them, but it also explains the various scenarios in more detail of when and why you would use the different JOINs. This is a great resource if you are interested in learning more about the logic behind JOINing.
  • SQL JOIN Types Explained in Visuals: This resource has a visual representation of the different JOINs. This is a really useful way to think about JOINs if you are a visual learner, and it can be a really useful way to remember the different JOINs. 
  • SQL JOINs: Bringing Data Together One Join at a Time: Not only does this resource have a detailed explanation of JOINs with examples, but it also provides example data that you can use to follow along with their step-by-step guide. This is a useful way to practice JOINs with some real data. 
  • SQL JOIN: This is another resource that provides a clear explanation of JOINs and uses examples to demonstrate how they work. The examples also combine JOINs with aliasing. This is a great opportunity to see how JOINs can be combined with other SQL concepts that you have been learning about in this course.  

Secret identities: The importance of aliases

We will learn about using aliasing to simplify your SQL queries. Aliases are used in SQL queries to create temporary names for a column or table. Aliases make referencing tables and columns in your SQL queries much simpler when you have table or column names that are too long or complex to make use of in queries. Imagine a table name like special_projects_customer_negotiation_mileages. That would be difficult to re-enter every time you use that table. With an alias, you can create a meaningful nickname that you can use for your analysis. In this case

special_projects_customer_negotiation_mileages

can be aliased to simply mileage. Instead of having to write out the long table name, you can use a meaningful nickname that you decide. 

Basic syntax for aliasing

Aliasing is the process of using aliases. In SQL queries, aliases are implemented by making use of the AS command. The basic syntax for the AS command can be seen in the following query for aliasing a table:

SQL
SELECT column_name(s)
FROM table_name AS alias_name;

Notice that AS is preceded by the table name and followed by the new nickname. It is a similar approach to aliasing a column:

SQL
SELECT column_name AS alias_name
FROM table_name;

In both cases, you now have a new name that you can use to refer to the column or table that was aliased.  

Alternate syntax for aliases

If using AS results in an error when running a query because the SQL database you are working with doesn’t support it, you can leave it out. In the previous examples, the alternate syntax for aliasing a table or column would be:

  • FROM table_name alias_name
  • SELECT column_name alias_name

The key takeaway is that queries can run with or without using AS for aliasing, but using AS has the benefit of making queries more readable. It helps to make aliases stand out more clearly.

Aliasing in action

Let’s check out an example of a SQL query that uses aliasing. Let’s say that you are working with two tables: one of them has employee data and the other one has department data. The FROM statement to alias those tables could be:

SQL
FROM work_day.employees AS employees

For more information

If you are interested in learning more about aliasing, here are some resources to help you get started: 

  • SQL Aliases: This tutorial on aliasing is a really useful resource to have when you start practising writing queries and aliasing tables on your own. It also demonstrates how aliasing works with real tables. 
  • SQL Alias: This detailed introduction to aliasing includes multiple examples. This is another great resource to reference if you need more examples. 
  • Using Column Aliasing: This is a guide that focuses on column aliasing specifically. Generally, you will be aliasing entire tables, but if you find yourself needing to alias just a column, this is a great resource to have bookmarked.  

SQL functions and subqueries: A functional friendship

As you’ve been learning, SQL functions are tools built into SQL to facilitate performing calculations. For example, you could use the AVG() function to calculate the average salary of employees in a table so management knows what to budget for next year. Another example might be using the COUNT() function to count the number of orders in a table to track daily order inventory.

A subquery, also called an inner or nested query, is a SQL query that is nested inside a larger query. Going back to the previous example, you could add a subquery to your average calculation to identify the names of employees who earn more or less than the average salary to include that information in performance reviews. Subqueries allow more complex questions to be answered in a single query, making data retrieval more efficient. In this reading, you will learn about SQL functions and how they might be used with subqueries.

How do SQL functions function?

SQL functions help make data aggregation possible. As a refresher, data aggregation is the process of gathering data from multiple sources in order to combine it into a single, summarized collection. Take a moment to review some of these functions to better understand how to run these queries:

  • HAVING: The HAVING clause filters the results of a SQL query based on conditions applied after the grouping. Check out W3School’s HAVING overview for a tutorial on this clause
  • CASE: CASE provides conditional logic in SQL queries, similar to an ‘if-else’ structure in programming languages. The W3School’s CASE overview explores the use of the CASE statement and how it works.
  • IF: IF performs a simple conditional test and returns a value depending on the outcome. Review W3School’s IF overview for a tutorial of the IF function and examples that you can practice with.
  • COUNT: COUNT performs a simple conditional test and returns a value depending on the outcome. Though it seems simple, the COUNT function is just as important as all the rest. The W3School’s COUNT overview provides a tutorial and examples.

Subqueries

Subqueries can make projects easier and more efficient by allowing complex operations to be performed in a single query, reducing the need for multiple trips to the database. Subqueries also make your code more readable and maintainable. Take the employee salary example mentioned before.:The original query was used to find the average employee salary. By adding a subquery, you can learn this plus identify employees who earn more than the average—all in a single query.

Usually, you will find subqueries nested in the SELECT, FROM, and/or WHERE clauses. There is no general syntax for subqueries, but the syntax for a basic subquery follows a similar pattern:

SQL
SELECT account_table.*
	FROM (
		SELECT *
		FROM transaction.sf_model_feature_2014_01
		WHERE day_of_week = 'Friday'
		) account_table
	WHERE account_table.availability = 'YES'

Basically, there’s another SELECT clause inside the first SELECT clause. The second SELECT clause marks the start of the subquery in this statement. There are many different ways in which you can use subqueries, but there are a few rules to follow:

  • Subqueries must be enclosed within parentheses.
  • A subquery can have one or more columns specified in the SELECT clause.
  • Subqueries that return more than one row can only be used with multiple value operators, such as the IN operator which allows you to specify multiple values in a WHERE clause.
  • A subquery can’t be nested in a SET command. The SET command is used with UPDATE to specify which columns (and values) are to be updated in a table.

Additional resources

The following resources offer more guidance into subqueries and their usage:

  • SQL subqueries: This detailed introduction includes the definition of a subquery, its purpose in SQL, when and how to use it, and what the results will be.
  • Writing subqueries in SQL: Explore the basics of subqueries in this interactive tutorial, including examples and practice problems that you can work through.

As you continue to learn more about using SQL, functions, and subqueries, you will realize how much time you can truly save when memorizing these tips and tricks.

Series Navigation<< Course 5: Analyse Data to Answer Questions, Module 2: Format and adjust dataCourse 5: Analyse Data to Answer Questions,  Module 4: Perform Data Calculations >>
Tagged ,

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.