Chapter 3 Basic software
Learning outcomes
At the end of this chapter, you will be able to recognize some of the software tools that do not come pre-installed with your operating system but are useful for handling and analyzing genomic data.
If you have not done any genomic data analysis on your computer before, you probably have only the default set of programs installed. My guess is MS Office or the OpenOffice suite for your general office needs, and Windows Notepad if you ever need to open a .txt file. You probably also use Windows Explorer to browse your files, or to search for them when you are not certain where that pesky little document you need to email to somebody is located.
There is nothing wrong with this setup if you use your PC for office work. If you want to work with genomic data, however, you can do better.
So before we jump into the specifics about genomic data, I want to talk about my recommendations on which programs you should have on your computer. These recommendations come from my own experience of working with genomic data, as well as data handling and manipulation.
There are many possibilities for the software you can use, and these are my personal favorites. I will talk about these below and briefly explain why I like them.
I also want to add that these recommendations are written for an average MS Windows user, as I assume most of you, dear learners, work in this OS. Sadly, I am not aware of all the possibilities for Mac and Linux OS, although versions or alternatives should exist.
3.1 Text Editors
The genotype files we will be dealing with are nothing but large text files. Huge text files, sometimes. So you inevitably need a text editor to open them. Yes, you will want to use programming tools and scripts to manipulate the files, but you also want to be able to open them and check that the contents match your expectations. For this, you need a text editor. And no, the default Windows Notepad is not a good tool for this. A few brave souls even try to use WordPad, which is even worse.
As I mentioned before, genotype text files are often much larger than the files you normally deal with. Instead of a few kilobytes, they are often a few hundred megabytes or even a few gigabytes. In my experience, the default tools struggle to open these.
You can do better...
My first go-to program for opening large text files is TextPad. This wonderful piece of software opens gigabyte-sized files in a matter of seconds. It can also do a lot of other things, such as file comparisons and some great text management moves, which I do not fully utilize (meaning: I use it just to open and look at the files).
The second text editor I frequently use is Notepad++. I particularly like it for looking at scripts, because it highlights the keywords of many programming languages through its color-coding system. Its "Find All in All Opened Documents" button has also saved me a lot of time: it searches multiple open files for an expression and gives a clear overview of the findings. I cannot recommend it enough... Also, somehow I like the visuals of this editor better.
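A script can also help with that quick look at a large file. The following is a minimal Python sketch, not part of any of the tools above, that reads only the first few lines of a text file instead of loading all of it. The function name `preview` and the demo file name are made up for illustration.

```python
from itertools import islice
from pathlib import Path
import tempfile


def preview(path, n_lines=5, encoding="utf-8"):
    """Return the first n_lines of a text file without reading the rest."""
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        # islice stops after n_lines, so only the start of the file is read
        return [line.rstrip("\n") for line in islice(handle, n_lines)]


if __name__ == "__main__":
    # Hypothetical demo file standing in for a real genotype file
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "demo_genotypes.txt"
        demo.write_text(
            "".join(f"sample{i}\tA\tG\n" for i in range(1000)),
            encoding="utf-8",
        )
        for line in preview(demo, n_lines=3):
            print(line)
```

Because the file is read line by line and the loop stops early, this should stay fast even when the file is several gigabytes in size.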
3.2 File management
We have to be very honest here...
Most people who use the computer for "ordinary" school or office work are not aware of the file structure or of where their files are stored on the computer. Normally this is not a problem, but it becomes an issue of sizeable proportions when you want to get even half-serious about genomic data analysis. You cannot even imagine how many times I have had to remind my students that the Desktop is not the place to store their files and that they need to be aware of the file structure and the full file names they are working with. I honestly think we are spoiled by the Windows (and probably also Mac) operating systems, where everything just works even if we just click Next > Next > Next and accept the default settings.
As someone aspiring to analyze genomic data, you can of course do better...
The first thing you can do is install Total Commander for your file management needs. This is a great tool that shows you the hard disk and file structure of your computer and allows you to copy and move files with ease. It also has a built-in file packing and extraction tool, so you do not need to install extra software for that either. Another crucial thing is that it allows you to see and edit file extensions! As you might or might not know, file names in Windows typically end with a three- or sometimes four-letter extension, which Windows Explorer conveniently hides by default. This is of course not a problem for common computer use, but it could be (read: frequently is) a source of errors in beginners' data analysis scripts. So again, I cannot recommend this enough...
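To see why hidden extensions matter for scripts, here is a small Python sketch that lists the full file names in a folder together with their extensions. The function name `list_with_extensions` is made up for this example. A file displayed as "data" in Windows Explorer may really be called data.txt, and a script has to use that full name.

```python
from pathlib import Path


def list_with_extensions(folder):
    """Return (full file name, extension) pairs for the files in a folder."""
    result = []
    for entry in sorted(Path(folder).iterdir()):
        if entry.is_file():
            # entry.name is the full name; entry.suffix is the last extension
            result.append((entry.name, entry.suffix))
    return result


if __name__ == "__main__":
    # List the files in the current working folder
    for name, ext in list_with_extensions("."):
        print(f"{name}  ->  {ext or '(no extension)'}")
```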
If you start to run low on hard disk space, WinDirStat is a great program to use. It visualizes disk use via color-coded boxes whose areas are proportional to the sizes of the files. This makes it very easy to spot large files whose removal would free up space.
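If you prefer a plain list over colored boxes, a short script can give a rough answer too. This is only a sketch of the idea in Python, unrelated to how WinDirStat works internally. The function name `largest_files` is made up for illustration.

```python
from pathlib import Path


def largest_files(folder, top=10):
    """Return the `top` largest files under `folder` as (size_in_bytes, path) pairs."""
    sizes = []
    for entry in Path(folder).rglob("*"):
        try:
            if entry.is_file():
                sizes.append((entry.stat().st_size, entry))
        except OSError:
            # Skip files that cannot be read (permissions, broken links)
            continue
    # Largest files first
    sizes.sort(key=lambda pair: pair[0], reverse=True)
    return sizes[:top]


if __name__ == "__main__":
    # Scan the current working folder and all of its sub-folders
    for size, path in largest_files(".", top=5):
        print(f"{size / 1_000_000:10.1f} MB  {path}")
```

Scanning a whole drive this way can take a while, so it is more practical to point it at a single project folder.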
Folder structure and backup
Before you start any kind of analysis, you need to think about the folder structure you will be using.
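Once you have decided on a structure, a few lines of Python can create it the same way for every new project. The sketch below is one possible illustration. The sub-folder names (raw_data, scripts, results) and the function name `create_project` are assumptions made for this example, not a prescribed layout.

```python
from pathlib import Path

# Hypothetical sub-folder names; adjust them to your own project
SUBFOLDERS = ("raw_data", "scripts", "results")


def create_project(root, subfolders=SUBFOLDERS):
    """Create a project folder with the given sub-folders and return their paths."""
    root = Path(root)
    created = []
    for name in subfolders:
        folder = root / name
        # parents=True creates the root if needed; exist_ok=True makes reruns safe
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling, for example, `create_project("C:/projects/genomic_analysis")` (a made-up path) would create the three sub-folders under that location. Running it a second time changes nothing.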
One more recommendation for file management is to use cloud storage for your crucial (or even all of your) files. Personally, I use Dropbox, but of course, there are plenty of other services with the same functionality. It is especially important to have a reliable backup for your script files and programs you will be developing during your work. Typically these script files are really small, so you have more than enough space to store them even in the free versions of cloud storage services. The amount of time you put into them, however, makes them extremely valuable. So you do not want to lose them.