Technical, Data Analytics

Course 3: Prepare Data For Exploration, Module 4: Organise and Secure Data

File organisation guidelines

Every data analyst’s goal is to conduct efficient data analysis. One way to increase the efficiency of your analyses is to streamline processes that help save time and energy in the long run. Meaningful, logical, and consistent file names help data analysts organise their data and automate their analysis process. When you use consistent guidelines to describe the content, date, or version of a file and its name, you’re using file naming conventions.

Best practices for naming files

File-naming conventions help you organise, access, process, and analyse data because they act as quick reference points to identify what’s in a file. One important practice is to decide on file naming conventions—as a team or company—early in a project. This will prevent you from spending time updating file names later, which can be a time-consuming process. In addition, you should align your project’s file names with your team’s or company’s existing file-naming conventions. You don’t want to spend time learning a new file-naming convention each time you look up a file in a new project!

It’s also critical to ensure that file names are meaningful, consistent, and easy-to-read. File names should include:

  • The project’s name
  • The file creation date
  • Revision version
  • Consistent style and order

Further, file-naming conventions should act as quick reference points to identify what is in the file. Because of this, they should be short and to the point.

In the following sections, you’ll explore each part of a sales report file name that follows an established naming convention, SalesReport_20231125_v02. This example will help you understand the key parts of a strong file name and why they’re important.

Name

Giving a file a meaningful name to describe its contents makes searching for it straightforward. It also makes it easy to understand the type of data the file contains.

In the example, the file name includes the text SalesReport, a succinct description of what the file contains: a sales report.

Knowing when a file was created can help you understand if it is relevant to your current analysis. For example, you might want to analyse only data from 2023.

Creation date

In the example, the year is described as 20231125. This reads as the sales report from November 25, 2023 following the year, month, and day (YYYYMMDD) format of the international date standard. Keep in mind that different countries follow different date conventions, so make sure you know the date standard your company follows.

Revision version

Including a revision version helps ensure you’re working with the correct file. You wouldn’t want to make edits to an old version of a file without realising it! When you include revision numbers in a file name, lead with a zero. This way, if your team reaches more than nine rounds of revisions, double digits are already built into your convention.

In the example, the version is described as v02. The v is short for the version of the file, and the number following the v indicates which round of revisions the file is currently in.

Consistent order and style

Make sure the information you include in a file name follows a consistent order. For example, you wouldn’t want version three of the sales report in the example to be titled 20231125_v03_SalesReport. It would be difficult to find and compare multiple documents.

When you use spaces and special characters in a file name, software may not be able to recognize them, which causes problems and errors in some applications. An alternative is to use hyphens, underscores, and capital letters. The example includes underscores between each piece of information, but your team could choose to use hyphens between year, month, and date, too: SalesReport_2023_11_25_v02.

Ensure team consistency

To ensure all team members use the agreed-upon file naming conventions, create a text file as a sample that includes all the naming conventions on a project. This can benefit new team members to help them quickly get up to speed or a current team member who just needs a refresher on the file naming conventions.

File organisation

To keep your files organised, create folders and subfolders—in a logical hierarchy—to ensure related files are stored together and can be found easily later. A hierarchy is a way of organising files and folders. Broader-topic folders are located at the top of the hierarchy, and more specific subfolders and files are contained within those folders. Each folder can contain other folders and files. This allows you to group related files together and makes it easier to find the files you need. In addition, it’s a best practice to store completed files separately from in-progress files so the files you need are easy to find. Archive older files in a separate folder or in an external storage location.

Data Security

Data security means protecting data from unauthorised access or corruption by putting safety measures in place. Usually, the purpose of data security is to keep unauthorised users from accessing or viewing sensitive data. Data analysts have to find a way to balance data security with their actual analysis needs. This can be tricky– we want to keep our data safe and secure, but we also want to use it as soon as possible so that we can make meaningful and timely observations. 

In order to do this, companies need to find ways to balance their data security measures with their data access needs.

Luckily, there are a few security measures that can help companies do just that. The two we will talk about here are encryption and tokenisation. 

Encryption uses a unique algorithm to alter data and make it unusable by users and applications that don’t know the algorithm. This algorithm is saved as a “key” which can be used to reverse the encryption; so if you have the key, you can still use the data in its original form.  

Tokenisation replaces the data elements you want to protect with randomly generated data referred to as a “token.” The original data is stored in a separate location and mapped to the tokens. To access the complete original data, the user or application needs to have permission to use the tokenised data and the token mapping. This means that even if the tokenised data is hacked, the original data is still safe and secure in a separate location. 

Encryption and tokenisation are just some of the data security options out there. There are a lot of others, like using authentication devices for AI technology. 

As a junior data analyst, you probably won’t be responsible for building out these systems. A lot of companies have entire teams dedicated to data security or hire third party companies that specialise in data security to create these systems. But it is important to know that all companies have a responsibility to keep their data secure, and to understand some of the potential systems your future employer might use.

However, one thing you absolutely can do to help strike the right balance is to use version control best practices. Version control enables all collaborators within a file to track changes over time. You can understand who made what changes to a file, when they were made, and why.

Here’s a simple example: Perhaps you’re working on a project with a team of other people. You are all collaborating within the same set of files, but each person is responsible for a different part of the project. Without version control, it would be very difficult to keep track of who made what changes to the files and when. This would lead to confusion and, even worse, people accidentally overwriting each other’s work! Version control is essential for data analytics professionals because it allows users to effectively collaborate with others and experiment with new ideas without fear of losing their work.

Please read here for more about version control (in Kaggle)

Series Navigation<< Course 3: Prepare Data For Exploration, Module 3: Database EssentialsCourse 4: Process Data from Dirty to Clean: Overview >>
Tagged ,

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.