Dear all, if you wish to see the correct answers to last Friday’s test…

Metasolved Metagenomics Metatest

1. The file “appello.txt” is a list of the names of students of this course. Each line corresponds to one name.

(a) Write the shell command to print this list to the terminal

cat appello.txt

(b) What is the command to count the number of names in the list?

cat appello.txt | wc -l
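As a small aside, `wc -l` counts newline characters, so blank lines are counted as names too. The sketch below uses a made-up file, `demo_appello.txt` (not part of the test), to show the difference between counting all lines and counting only non-empty ones.

```shell
# Hypothetical demo list: three names and one blank line.
printf 'Anna\nBruno\n\nCarla\n' > demo_appello.txt

# Count every line, blank ones included.
total_lines=$(( $(wc -l < demo_appello.txt) ))

# Count only the lines that contain at least one character.
names=$(grep -c . demo_appello.txt)

echo "lines: $total_lines, names: $names"
```

If the list is known to have exactly one name per line and no blank lines, the two counts agree and the simple `wc -l` answer is enough.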

2. A service provider just sent you the results of the sequencing run of your sample. The DNA that you sent was that of a bacterium that grows in cucumbers, and you wanted to sequence its whole genome. Given that the results were delivered as a FASTQ file, how can you determine the number of sequences contained in that file?  

(a) Write down the solution that you found

In the FASTQ file each sequence is described by 4 lines:
  • line 1 begins with a ‘@’ character and is followed by a sequence identifier
  • line 2 is the sequence
  • line 3 begins with a ‘+’ character
  • line 4 encodes the quality values for the sequence in line 2

Therefore I will print the file to the terminal, count the lines and divide the result by 4.

(b) Report the command line

cat sequences.fastq | wc -l
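The division by 4 can also be done in the shell itself. The following is a minimal sketch, assuming a well-formed FASTQ file with exactly four lines per record; `demo.fastq` and its two reads are invented for illustration.

```shell
# Hypothetical two-read FASTQ file, four lines per record.
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n' > demo.fastq

# Count the lines, then divide by 4 to get the number of sequences.
lines=$(wc -l < demo.fastq)
n_reads=$((lines / 4))

echo "$n_reads"
```

Counting the lines that begin with ‘@’ would be less reliable, because ‘@’ is also a valid quality character and can appear at the start of a quality line.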

3. The sequencing service sent you the FASTQ file containing the results of a sequencing run in which they loaded some amplicons that you submitted. The file contains 20,309,291 sequences. You want to count how many of them contain the primer that you used for amplification (GTGCCAGCAGCCGCGGTAA). How can you do that using shell commands?

cat sequences.fastq | grep 'GTGCCAGCAGCCGCGGTAA' | wc -l
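A slightly stricter variant is sketched below. It assumes four-line FASTQ records and searches only the sequence lines (line 2 of each record), so that a header or quality line can never be counted by accident. The file `demo_amplicons.fastq` and its reads are hypothetical, and the sketch does not look for the reverse complement of the primer.

```shell
# Hypothetical demo file: read r1 contains the primer, read r2 does not.
printf '@r1\nAAGTGCCAGCAGCCGCGGTAATT\n+\nIIIIIIIIIIIIIIIIIIIIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n' > demo_amplicons.fastq

primer='GTGCCAGCAGCCGCGGTAA'

# Keep only the sequence line of each record, then count the lines that match.
n_hits=$(awk 'NR % 4 == 2' demo_amplicons.fastq | grep -c "$primer")

echo "$n_hits"
```

Here `grep -c` does the counting directly, which is equivalent to piping the output of `grep` into `wc -l`.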

4. You have the sequences of gDNA that we produced during the didactic laboratories.

(a) Which of the following blast commands is appropriate if you want to align your sequences against a 16S database?

[Image not reproduced: Screen Shot 2013-05-17 at 3.54.06 PM]

(b) Which of the following blast commands is appropriate if you want to align your sequences against a database of protein sequences?


5. Assign the correct tool to each of the following tasks. Type 1 for NCBI BLAST and 2 for local BLAST on the command line:

(a) I want to align the sequence of a plasmid of unknown source [1]

(b) I want to align a sequence against the genome of the bacterium that I just sequenced [2]

(c) I want to align all the sequences described in question 3 against a database of 16S [2]

6. You have a file “sequenze_gdna.fasta” that contains all the reads of gDNA produced during the didactic laboratories. You want to align those reads against the database of completely sequenced bacterial genomes (“bacterial_genomes.fasta”).

(a) List the parameters that you must pass to BLAST

The parameters that blastall needs in order to work are: the type of alignment that I want to perform (-p, e.g. blastp, blastn); the input file or query (-i); and the reference database (-d).

(b) Write the command line to launch blast and save the results in the file “aligned.txt”.

blastall -p blastn -i sequenze_gdna.fasta -d bacterial_genomes.fasta -o aligned.txt

7. Which of the following databases (from the list below) is appropriate in each of the following cases (more than one association allowed):


[a] gDNA sequences produced during the didactic laboratories [5; 1]

[b] rRNA sequences produced during the didactic laboratories [1] 

[c] the sequence of a protein known to be toxic to humans and found in some bacterial strains [2; 5]

[d] the sequences obtained after the amplification of the HLA locus of the students of this class [3]

Available Databases:

[1] 16S genes of known bacteria
[2] cucumber bacteria whole genome
[3] human chromosomes
[4] NCBI nr database
[5] all the available bacterial genomes

8. Look carefully at the data reported below and answer the questions:

Results of the classification of the sequences obtained from the sequencing of gDNA from sample “S1685”:

Thaumarcheota                            1,357
Arcobacter                                 999
Arcobacter butzleri                        505
Arcobacter cryaerophilus                   482
toluene-degrading bacterium UCR 021t       263
toluene-degrading bacterium UCR 021e       193
Epsilonproteobacteria                      188
Salmonella enterica                        150
Arcobacter sp. F79-6                       107
Escherichia coli                           106

Results of the classification of the sequences obtained from the sequencing of 16S amplicons from sample “S1685”. Primers were selective for eubacteria:

Arcobacter                             686,640
Arcobacter butzleri                     31,160
Arcobacter cryaerophilus                22,760
Epsilonproteobacteria                    5,187
Enterobacteriaceae                       2,441
Escherichia coli                           890

(a) Do you observe differences?

  1. I see a greater variety of classified organisms in the gDNA sample than in the 16S sample.
  2. Where one group has about twice as many reads as another in the gDNA sample (e.g. Arcobacter 999 vs. Arcobacter butzleri 505), the same two groups differ by roughly 20 times in the 16S sample (e.g. Arcobacter 686,640 vs. Arcobacter butzleri 31,160).
  3. The absolute number of alignments for each group of organisms is higher in the 16S sample than in the gDNA sample.
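The ratios behind observation 2 can be checked with a quick calculation. This is only a sketch of the arithmetic, using the read counts for Arcobacter and Arcobacter butzleri copied from the two result lists; the variable names are arbitrary.

```shell
# Ratio of Arcobacter to Arcobacter butzleri reads in the gDNA sample.
gdna_ratio=$(LC_ALL=C awk 'BEGIN { printf "%.1f", 999 / 505 }')

# The same ratio in the 16S amplicon sample.
amplicon_ratio=$(LC_ALL=C awk 'BEGIN { printf "%.1f", 686640 / 31160 }')

echo "gDNA: $gdna_ratio  16S amplicons: $amplicon_ratio"
```

The first ratio comes out at about 2 and the second at about 22, which is the change in proportions that the answer describes.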

(b) Could you explain the differences that you observe, or advance hypotheses about them?

  1. The primers are specific for eubacteria, therefore “Thaumarcheota”, being Archaea, were not amplified. Other organisms, despite being eubacteria, were not amplified efficiently either.
  2. The proportions of the reads were altered by the PCR amplification.

9. You purified the total nucleic acids (gDNA and total RNA) from an environmental sample. You want to perform PCRs on this sample and you want to know the exact amount of template DNA that you will put in each reaction tube. How would you quantify the purified DNA to achieve this goal (choose among the methods that we discussed during the labs)? Can you explain your choice?

I would use either the Qubit DNA kit, which quantifies the fluorescence emitted by a fluorophore that binds specifically to DNA and does not register the RNA signal, or gel electrophoresis, where I can look specifically at the fluorescence of the DNA band and compare it to a band of known concentration.

10. You want to identify the bacteria that are responsible for the gasification of organic waste into methane. You are given the possibility to collect samples from a production plant to perform your studies. In this plant, every Monday the bioreactors are filled with organic waste and sealed for anaerobiosis, and their methane emission is measured for one week. The plant consists of three bioreactors. In each of them the production of methane begins after a different period every time. This week bioreactor 3 started producing methane after 2 days, bioreactor 2 after 3 days and bioreactor 1 after 5 days. You aim to identify the bacteria responsible for this fermentation in order to inoculate them every Monday together with the waste and boost the plant productivity by bringing the methane emission forward to day 1 in all the bioreactors. When (which day) and where (which bioreactor) would you collect the samples to perform your studies? How would you store them if it is necessary to wait before running your metagenomic experiments?

I would collect samples from all 3 bioreactors at day 1, immediately after the waste is loaded. I would then collect one sample from each bioreactor immediately after the production of methane has started. With this experimental set-up I should be able to identify the species whose concentration increases specifically at the moment of methane production in comparison with the moment of waste loading. Comparison of the three samples would be crucial to obtain a statistically significant result; moreover, it may give some hints about the different time courses of the three inocula (e.g. they might have different starting concentrations of the useful bacteria or of competitor bacteria that grow on the same substrate but do not emit methane).
I would flash freeze the samples immediately after collection and keep them at -80°C until the moment of extraction. I would thaw them either directly in extraction buffer or in a buffer that protects nucleic acids from enzymatic degradation.

  • Elisa Corteggiani

    We have finished correcting your tests. I have just one observation and piece of advice for all of you:
    the question that had the worst answers was number 8. The most common problem was a failure to observe the data in all their details. There were also some problems with the interpretation of the data, with many people drawing conclusions that could not be drawn (given that you did not have enough data for those conclusions) and missing the conclusions for which there was enough evidence. Advice for everyone: train yourself in this kind of task, it will definitely be a good investment!
