Chapter 2 Technical preparations
Learning outcomes
At the end of this chapter, you will be able to recognize and avoid some of the most common mistakes in storing and handling data. You will be able to name your files in a proper way and to avoid potential future problems.
2.1 Overall recommendations, tips and tricks
The motivation to write this section comes from my experience with the MSc level course I teach on Management and analysis of high-density genomic data. During the lectures, I met students who, in the following weeks, became very proficient in simple analyses of genomic data. So this was great! Most of them, however, had some pretty awful habits when it came to data handling and storage. This is less a criticism and more an observation that the typical computer user, especially on a Windows machine, is just spoiled. Everything works! We do not even need to think about where our files are, as we get there in a few clicks, or in the worst case we use the built-in search function.
You can, and you have to, do better if you are even half serious about genomic data analysis. This part will answer the questions of what you can improve and what the "best" ways to do that are.
So let us consider a typical case: Student X is interested in genomic data analysis and jumps straight to the later chapters of this book. They create a new folder on the Windows desktop called "Genomic data analysis", where all data from all lectures gets copied, including the programs and scripts they write. This way they will know where things are, should they ever need them again. There are multiple problems with this approach. Let's see what these problems are and how they could be fixed:
Do not store anything on the desktop
- Although it seems to be so close visually, the files on the Windows desktop are deeper in the computer's file structure. This might then introduce unnecessary complexity in your scripts when it comes to the definition of the file PATH
- The easy point-and-click access also means that files are more easily deleted by accident, by you or by others (the danger is especially high during home office sessions if there are small children around)
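To see how much deeper the desktop sits in the file structure, here is a minimal Python sketch that counts the folders above a file. The user name StudentX, the D: drive, and both file locations are made up for illustration; the paths on your own computer will differ.

```python
from pathlib import PureWindowsPath

# Hypothetical locations, made up for illustration only.
desktop_file = PureWindowsPath(
    r"C:\Users\StudentX\Desktop\Genomic data analysis\input.txt"
)
project_file = PureWindowsPath(r"D:\2020_pcaAustrianLeonbergerDogs\input.txt")


def folder_depth(path):
    """Count the folders between the drive root and the file itself."""
    # path.parts looks like (drive root, folder, ..., file name);
    # drop the drive root and the file name to count only the folders.
    return len(path.parts) - 2


print(folder_depth(desktop_file))  # folders above the file on the desktop
print(folder_depth(project_file))  # folders above the file in a project folder
```

PureWindowsPath only parses the text of the path, so the sketch runs on any operating system and does not need the files to exist.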
Do not store anything of importance on the system drive (usually C:)
- In case of major problems this drive is the one that gets wiped first, so to avoid future hassle just avoid having anything there that you care about
In case your computer has only a single drive - Use Backup
- I cannot stress enough how crucial it is to back up all important files, and even the less important ones!
- The C: drive can go down in a system failure, virus, or similar, but your work is not safe on other drives either
- External hard drives and USB sticks are not a solution! These break down and get lost surprisingly easily, so arguably they are even worse than your laptop's drive
- My strong suggestion is to use cloud storage services - even the free options give you enough space to keep your work safe
Smart strategies to utilize cloud storage
- Let's say you go with the free version of your favorite service, which typically does not offer enough space to store large amounts of data
- You can use the cloud storage for your script files and other documents you write - these are typically very small files into which you invested a lot of your time. So they are at the top of your list when it comes to protection
- You can even store crucial genotype data in a binary ped format (more on this later), which takes very little disk space
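One way to act on this is to copy only the small, script-like files into a folder that your cloud service synchronizes. The following is a minimal sketch, not a complete backup tool: the list of extensions, the size limit, and the function name backup_scripts are my assumptions, and the backup folder should be located outside the project folder.

```python
import shutil
from pathlib import Path

# Assumption: these extensions mark script-like files; adjust to your own work.
SCRIPT_EXTENSIONS = {".r", ".py", ".sh", ".md"}


def backup_scripts(project_dir, backup_dir, max_bytes=1_000_000):
    """Copy small script files from project_dir to backup_dir, keeping subfolders."""
    project_dir, backup_dir = Path(project_dir), Path(backup_dir)
    copied = []
    for source in sorted(project_dir.rglob("*")):
        if not source.is_file():
            continue
        if source.suffix.lower() not in SCRIPT_EXTENSIONS:
            continue
        if source.stat().st_size > max_bytes:
            continue  # too large for a small free plan, so skip it
        target = backup_dir / source.relative_to(project_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copy2 also keeps the modification time
        copied.append(target)
    return copied
```

A call could look like backup_scripts("D:/2020_pcaAustrianLeonbergerDogs", "D:/CloudDrive/2020_pcaAustrianLeonbergerDogs"), where both folder names are hypothetical.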
One project - one folder
- Unlike Student X, do not dump everything into one folder! It seems intuitive if you have one, maybe two things to analyze, but you quickly lose track afterward
- As you will see in later analyses, the programs tend to produce a lot of temporary files, or just files that you do not need. You might even delete some of them. In this process, it is way too easy to delete your script files, or pieces of the input data as well. ...and Pooof! There goes your whole day's effort to put something meaningful together!
- Of course, this is less of a problem if your recycle bin is set not to permanently delete files right away, so make sure it is set this way.
- If it is, and you also followed my previous advice on cloud storage, you can use the "undelete" function of the Windows recycle bin to save the day
- But even these Get-out-of-jail-free cards do not solve our main problem of being lost and confused if you have 15 types of analyses in a folder, so just stick to one folder per project
Use folders within folders
- Yes! You can do even better by giving your project folders a standard internal structure
- There is a lot of room for experimentation and individual flavors here, but there are two things I would suggest:
** Folder for script files: These could be stored literally anywhere on your computer and still work with any other data via a correctly set PATH, so you might just store them in a secure (backed-up!) place.
** Folder for the original data: I like to keep the original, untouched data separate in its own folder. If anything happens during the analyses (and trust me, a lot of very unexpected things tend to happen), I can recreate everything with the original data and the saved script files.
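If you prefer to set up such a structure with a script rather than by clicking, a minimal Python sketch could look like this. The subfolder names scripts and data_original and the function name create_project are only my placeholders for the two folders suggested above; pick whatever names fit your own flavor.

```python
from pathlib import Path


def create_project(parent_dir, project_name, subfolders=("scripts", "data_original")):
    """Create project_name inside parent_dir together with its subfolders."""
    project_dir = Path(parent_dir) / project_name
    for subfolder in subfolders:
        # exist_ok=True makes it safe to run the function more than once
        (project_dir / subfolder).mkdir(parents=True, exist_ok=True)
    return project_dir
```

For example, create_project("D:/", "2020_pcaAustrianLeonbergerDogs") would create the project folder with both subfolders on a hypothetical D: drive.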
Use descriptive and appropriate names for your files and folders
- This is a topic on its own with a lot of issues to unpack, so it is described in detail in the File naming conventions part below.
- To keep it short here, Student X in our example case is setting themself up for future problems by using spaces in folder names, as this could backfire in unexpected ways. So don't use spaces in file or folder names.
- Also, the name of the folder does not tell anything about its contents. You can save yourself quite a bit of time in the future by taking a bit of time now to give the folder a very descriptive name with an obvious link to the content, e.g. "2020_pcaAustrianLeonbergerDogs".
So these were my tips and tricks that you could consider when starting out. They are based on my own experiences and the approaches I commonly use. If you have other similar tips to share, or want to discuss the ones presented here, let me know via my Twitter.
2.2 File naming conventions
In this part of the Genomics Boot Camp, I want to elaborate on what the PATH is, why it is useful for you to know about it, and what to keep in mind when analyzing any kind of data. So first things first. The PATH (written in all caps) is not the name of some sketchy religious organization, but the set of directories where your executable files or data are located. You can think of it as the address of the files on your computer.
In most of the programs you will work with, you will have to specify file locations on your computer. Therefore, it is good to know how this works and what conventions to follow, to avoid future problems.
There is a lot of freedom for self-expression and to include your own spins and flavors when it comes to naming anything on your computer, but there are a few basic rules you should abide by.
Select a good location for your files
As I suggested before in the Overall recommendations, tips and tricks, it is a good idea to store your files outside the system drive. In particular, the Desktop should be avoided, as its location includes your username, which can sometimes be tricky: the username might violate some of the rules established below.
No spaces in file and folder names
This is by far the most common violation of good practices when it comes to beginner learners. Some programs have become more forgiving in this respect, but avoiding spaces in names can spare you quite a few headaches on your data analysis journey.
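To illustrate why spaces cause trouble, here is a small Python sketch that splits a command line the way a POSIX-style shell would. The program analysis_tool and its --input option are hypothetical, and nothing is actually executed; other shells differ in the details, but they also treat an unquoted space as a separator.

```python
import shlex

# Hypothetical command lines; "analysis_tool" and "--input" are made up.
command_with_space = "analysis_tool --input Genomic data analysis/input.txt"
command_without_space = "analysis_tool --input GenomicDataAnalysis/input.txt"

# With spaces, the folder name falls apart into three separate arguments.
print(shlex.split(command_with_space))

# Without spaces, the file location arrives as one single argument.
print(shlex.split(command_without_space))

# Quoting rescues the name with spaces, but you must remember it every time.
command_quoted = 'analysis_tool --input "Genomic data analysis/input.txt"'
print(shlex.split(command_quoted))
```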
No special characters in filenames
The most common cases here are punctuation and brackets, but also characters specific to particular languages, such as accented letters. My rule is to use only letters available on the English keyboard.
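The two rules so far can be checked automatically. The following minimal sketch reports what is wrong with a given name. It rests on my assumption that, besides unaccented letters and digits, the underscore, the hyphen, and the dot (for file extensions) are acceptable; the function name name_problems is also just a placeholder.

```python
import string

# Assumption: letters, digits, underscore, hyphen and dot count as safe.
ALLOWED = set(string.ascii_letters + string.digits + "_-.")


def name_problems(name):
    """Return a list of rule violations found in a file or folder name."""
    problems = []
    if any(ch.isspace() for ch in name):
        problems.append("contains whitespace")
    if not name.isascii():
        problems.append("contains non-ASCII characters")
    # Remaining ASCII characters that are neither whitespace nor allowed
    special = sorted(
        {ch for ch in name if ch.isascii() and not ch.isspace() and ch not in ALLOWED}
    )
    if special:
        problems.append("contains special characters: " + "".join(special))
    return problems


for candidate in ["2020_pcaAustrianLeonbergerDogs", "final data", "results(v2).txt"]:
    print(candidate, name_problems(candidate))
```

An empty list means the name passes all three checks.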
Use descriptive names
At the first glance at the name of any file or folder, it should be obvious to you (and any other person) what it contains. In this respect, names like "final data" or "a.txt" are pretty good examples of what not to do. Try to give it a spin that falls into your flavor of naming things, but also keeps the required amount of clarity.
Naming conventions
As for me, I like to use a combined system that sort of evolved as I went along. For example, I have a folder for a project I work on called "2015_Appear_LocaBreed". There are several things to talk about here.
When it comes to folder names, I picked up the habit of starting with a number, which makes it easier to sort the folders and follow a certain established logic. In the case of projects (as in the example), papers, or analyses, I like to start with the year. This could also be a sequential number, e.g. a series of folders for paper submissions called "1_draft", "2_submission", "3_review".
You have surely noticed that none of these names contains a space, yet they are still fairly easy to read. This is due to the naming conventions used in these examples. And you can do it too!
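A short Python sketch shows how number prefixes behave under plain text sorting, which is what many scripts and command-line tools use. The folder "10_accepted" is a hypothetical addition of mine to show one caveat: once you pass nine folders, unpadded numbers no longer sort as intended, and a leading zero fixes this. Some file managers sort numbers more cleverly, so what you see on screen may differ.

```python
# The sequential prefixes from the text sort as intended.
folders = ["3_review", "1_draft", "2_submission"]
print(sorted(folders))

# Caveat: text sorting compares character by character,
# so "10_accepted" ends up before "1_draft" and "2_submission".
many_folders = ["2_submission", "10_accepted", "1_draft"]
print(sorted(many_folders))

# Padding with a leading zero restores the intended order.
padded_folders = ["02_submission", "10_accepted", "01_draft"]
print(sorted(padded_folders))
```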
Let's consider another example of a folder name:
sheepdiversityprojectlapampaargentina
The name is ok, but the readability is pretty horrible. We can improve it a ton just by capitalizing the first letter of each word.
SheepDiversityProjectLaPampaArgentina
Much better, I believe! Still, we could add a few improvements. If you have several similar folders in the same place, it might make sense to add the relevant year. Also, one might argue that this particular name is quite long, so we can break it up a little by adding underscores.
2022_SheepDiversityProject_LaPampaArgentina
Even better! This is of course from my own perspective. You are free to spin your own naming conventions around and experiment. There are also a lot of guidelines on this available on the web that give helpful suggestions. A nice and extensive summary can be found here.
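The steps of the sheep example (capitalize each word, join the words, add the year, separate the parts with underscores) can be sketched in a few lines of Python. The function names camel_case and folder_name are placeholders of mine, and this is only one possible flavor of such a convention.

```python
def camel_case(text):
    """Capitalize every word and join the words without spaces."""
    # Note: capitalize() also lowercases the rest of each word.
    return "".join(word.capitalize() for word in text.split())


def folder_name(year, *parts):
    """Build a name like Year_FirstPart_SecondPart from plain-text parts."""
    return "_".join([str(year)] + [camel_case(part) for part in parts])


print(folder_name(2022, "sheep diversity project", "la pampa argentina"))
# prints 2022_SheepDiversityProject_LaPampaArgentina
```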