The first part of today’s lab introduced you to Linux, files and the shell. Now we want to see why Linux is so popular (~90% market share) among bioinformatics developers and users.
There are several reasons, and here we’ll see how flexible the shell is when it comes to building analytical pipelines.
A few more commands…
The cat command prints (= displays) the content of a file to the terminal, provided that it’s a simple text file! Go to the home directory and try cat with the simple text file you produced with gedit. You should see it without problems. Then try it with the PDF. What can you see?
The point is that simple text files contain just a string of characters. Complex files (Word documents, Excel files, images, etc.) store information in a much more complex way.
The good thing is that 90% of bioinformatics formats are simple text files. You should know at least two formats: FASTA and SAM. Try cat with the file you downloaded with wget.
The wc command stands for “word count”. Used with “-l” (for lines), it returns the number of lines in a file. Try “wc -l your_file” on the simple text file you prepared and, again, on the file you downloaded with wget.
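To make cat and wc concrete, here is a minimal sketch you can paste into a terminal. The file name notes.txt and the temporary directory are made up for illustration; replace them with your own file.

```shell
# Create a small throwaway text file (hypothetical name) with three lines.
workdir=$(mktemp -d)
printf 'first line\nsecond line\nthird line\n' > "$workdir/notes.txt"

# cat prints the whole file to the terminal.
cat "$workdir/notes.txt"

# wc -l prints the number of lines followed by the file name.
wc -l "$workdir/notes.txt"

# When wc reads from standard input there is no file name to print,
# so only the number is returned.
line_count=$(wc -l < "$workdir/notes.txt")
echo "lines: $line_count"
```

Keep in mind that wc -l counts newline characters, so a file whose last line lacks a final newline will report one line fewer than you might expect.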
head / tail
When a file is very long, it is useful to read just its beginning or its end. With head you’ll print the first 10 lines, with tail the last 10. With “head -n 20 file_name” you’ll print the first 20 lines instead, and so on. Try it with the file you downloaded with wget.
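A quick way to see head and tail in action is to try them on a file whose content you already know. The sketch below assumes the seq command is available (it is on most Linux systems) and uses a made-up file name, numbers.txt, holding the numbers 1 to 100, one per line.

```shell
# Build a 100-line test file (hypothetical name): one number per line.
workdir=$(mktemp -d)
seq 1 100 > "$workdir/numbers.txt"

# Default behaviour: the first 10 and the last 10 lines.
head "$workdir/numbers.txt"
tail "$workdir/numbers.txt"

# -n sets how many lines to show.
first_three=$(head -n 3 "$workdir/numbers.txt")
last_two=$(tail -n 2 "$workdir/numbers.txt")
echo "$first_three"
echo "$last_two"
```

Because each line is just its own line number, it is easy to check that head and tail returned the lines you asked for.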
The grep command is an incredibly powerful tool that can search for a pattern in text files. The simplest way to use it is “grep string filename”, where string is the pattern you want to look for and filename is the file to be scanned for the pattern. It’s like “Find” in a word processor, but it can do much more 🙂
Piping means putting one program after another, so that the output of the first becomes the input of the second. Let’s try some examples:
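Here is a small sketch of grep on a toy file. The file name reads.txt and its content are invented for illustration; the options shown (-i to ignore case, -v to invert the match) are standard grep options.

```shell
# A toy file (hypothetical name) with four short sequences, one per line.
workdir=$(mktemp -d)
printf 'ATGCCGTA\nTTGACCAA\nGGATGCAT\nCCCCTTTT\n' > "$workdir/reads.txt"

# Print every line that contains the pattern ATG.
matches=$(grep "ATG" "$workdir/reads.txt")
echo "$matches"

# -i ignores case, so a lowercase pattern finds the same lines.
matches_nocase=$(grep -i "atg" "$workdir/reads.txt")

# -v inverts the search: print the lines that do NOT contain the pattern.
non_matches=$(grep -v "ATG" "$workdir/reads.txt")
echo "$non_matches"
```

Note that grep works line by line: it prints the whole line as soon as the pattern occurs anywhere in it.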
cat some_file | wc -l
This means: print the file (cat), then pass the output (i.e. all the lines) to wc and let it count them. A simple “wc -l some_file” would do the same job, but this version lets you see what piping is.
cat some_file | grep "some pattern" | wc -l
This means: print the file, then pass the lines to grep. grep will discard all the lines not containing the pattern and pass the ones containing it to wc. In one line: count how many lines contain the pattern.
Question: how would you count how many sequences are present in a multi-FASTA file?
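The two pipelines above can be tried on a toy file. In this sketch the file genes.txt and its content are made up: each line holds a chromosome name and a gene name.

```shell
# A toy table (hypothetical name and content): chromosome and gene per line.
workdir=$(mktemp -d)
printf 'chr1 geneA\nchr2 geneB\nchr1 geneC\nchr3 geneD\nchr1 geneE\n' > "$workdir/genes.txt"

# Two-step pipeline: print the file, then count its lines.
total=$(cat "$workdir/genes.txt" | wc -l)

# Three-step pipeline: print, keep only the lines containing chr1, count them.
on_chr1=$(cat "$workdir/genes.txt" | grep "chr1" | wc -l)

echo "genes in total: $total"
echo "genes on chr1:  $on_chr1"
```

Each program only sees what the previous one printed, which is why you can keep adding steps to the chain without changing the earlier ones.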
Please see this post to learn about output redirection (that is, saving the output of a program into a file). It’s so important to be able to save things!
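As a quick preview, and assuming the post covers the standard shell operators, redirection usually looks like the sketch below: > writes the output to a file (overwriting it if it exists) and >> appends to the end of a file. The file names fruits.txt and matches.txt are made up for illustration.

```shell
# A toy input file (hypothetical name).
workdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$workdir/fruits.txt"

# > saves the output of grep into a file instead of printing it.
# Careful: it overwrites the file if it already exists.
grep "an" "$workdir/fruits.txt" > "$workdir/matches.txt"

# >> appends to the end of the file instead of overwriting it.
grep "ch" "$workdir/fruits.txt" >> "$workdir/matches.txt"

# Check what was saved.
cat "$workdir/matches.txt"
saved_lines=$(wc -l < "$workdir/matches.txt")
```

Redirection also works at the end of a pipeline, so the result of a whole chain of commands can be saved in one go.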
Now you should be a master. Congrats!