Today you’re going to take a “self serve” lab. Skim this post and decide where to focus (no particular order suggested/required). We recommend to review thing you found difficult/no clear. For technical aspects feel always free to ask us. We suggest to dedicate at least some time, if not the whole afternoon, to riboclass data analysis.
If you need to write notes, create a day4.txt in the dir-01 directory 🙂
Camera, don’t forget about it
Yesterday you submitted some jobs to the CAMERA portal. Try seeing if they have been processed. If this is the case create a “camera” directory in your home, and save the results there (clicking on the “Download” link that appears near each completed job).
Have a look at the results and see if anything interesting appears. In particular: are the results expected? If no, why? Annotate a brief (~5 lines) report on a file called “notes.txt” in the “camera” directory.
No results? Don’t worry: you’ll be reminded to retrieve them next week 🙂
Alignments, and file formats (I)
You have two set of reads: the gDNA and rRNA ones. Consider you have to BLAST the two datasets: which database would you consider for each?
Now you know about the “-m 8” format of BLAST and the SAM format. Review the two using the “less” command:
- How reads are encoded in the FASTQ format (see the two subset of reads in your files directory).
- How new alignment programs store alignments in SAM format (see pass_output.sam you used yesterday)
- How BLAST store information in the tabular format (see as usual the files directory).
We used SAM as the input for “riboclass”, we are going to use BLAST -m 8 for Megan. But we produced the two files using different programs, and possibly different parameters.
- How would you transform BLAST to SAM considering that riboclass cares only about read name (why?) and reference name (obviously)?
- Try piping the cat command (with a BLAST output!) with the blast2pass command….
gDNA, the neglected reads
- Consider the gDNA run now. How much rRNA-related reads do you expect to find? (Try guessing a percentage).
- Now, should we discard all the rest? If not, how could we take into account them? (Suggest more than one strategy!)
To play around with the shell again consider the orfinder script that extracts the largest ORF from a sequence and can be “piped” as follows:
cat sequences.fasta | orfinder 32 > proteins.fasta
Where the “30” is the minimum length for the peptide to be extracted. Use gDNA reads you have in your home, and remember to save the output in the dir-01 directory. Now open the output with a text editor (gedit) and select a bunch (at least 30) of sequences to BLASTP.
See what i got with a random read:
Of course many reads will not align against a known match, but BLAST will highlight them with an asterisk:
Annotate interesting results keeping track of which protein you found and possibly the organism it belongs to.
- Final question: are there ORFs in the ribosomal RNAs? How can you answer? Try investigating this.
Ribosomal classification, looking at the results
Yesterday you used a program to classify aligned reads. Open the results with “gedit” and try to see if is fitting your expectations and if you can understand it. Remember that “riboclass” uses just two information: (1) the NCBI taxonomy tree and (2) the SAM file you provide. Then, with a “LCA” approach assign each aligned read to a node of the taxonomical tree.
The output is a classification of single reads: the numbers you see refer to reads aligned.
- Can a single read be aligned onto several 16S sequences? Why?
- If this happens, what will “riboclass” do?
- Look at the output. The organism you find mentioned are expected to be found in a canal?
Now try the riboclass program with two novel (and more comprehensive) alignments:
Execute the riboclass program on both and save the results.
Analyze them and try spotting out the difference among the two.