4.3 Prelab - Taxonomy Profiling

4.3.1 Purpose

To use a variety of tools on Galaxy to perform Quality Control (QC), taxonomy profiling, and visualization of a metagenomics sequencing data.

4.3.2 Learning Objectives

Use tools on the Galaxy platform to:

  1. Perform Quality Control (QC) on raw data by checking the quality of your raw reads with tool NanoPlot.
  2. Assign taxonomy labels to your reads with tool Kraken 2.
  3. Visualize your metagenomic profiling using tool Krona to generate pie chart.

4.3.3 Introduction

To find out which microorganisms are present in the sample, it is important to have high-quality DNA sequences. To ensure high-quality sequence input, QC (and in many cases also read trimming and filtering) are routinely performed on raw sequences. The reads can then be used for taxonomic classification. To assign taxonomy, we can compare the reads of the sample to a reference database, i.e. sequences of known microorganisms stored in a database, using Kraken 2, which is a k-mer based taxonomic assignment tool. We can then use a visualization tool Krona to interactively visualize and explore the composition of a metagenome.

4.3.4 Activity 1 – QC Reads

Estimated time: 50 min

The sample used in this activity is the Zymo Gut Microbiome Standard, sequenced by Pacific Biosciences using PacBio Sequel II Instrument, and corresponds to sequencing read file SRR13128014. A subset of this data is used in this Activity.

4.3.4.1 Activity 1 - Part I - Import the dataset into Galaxy

Estimated time: 15 min

4.3.4.2 Instructions

  1. Import the dataset into Galaxy.
  1. Open the zymo-gut-standard public history https://usegalaxy.org/u/valerie-g/h/zymo-gut-standard-d6331-subset-1

  2. Click on Import this history, select copy only the active, non-deleted datasets and then Copy History.

  3. Confirm Zymo_Gut_Standard_D6331_subset exists in your history by clicking on the Home button “Galaxy” on top left ().

  4. Click on Zymo_Gut_Standard_D6331_subset to explore its contents.

4.3.4.3 Questions

1. What is the size of this downloaded dataset subset?

2. What is the format/extension of the downloaded file?

3. Click on the Display (eyeball) icon and describe what you see in the 4 lines of the fastq file:
Line 1:
Line 2:
Line 3:
Line 4:

4.3.4.4 Activity 1 - Part II - Run Nanoplot in Galaxy to assess sequence quality

Estimated time: 15 min

4.3.4.5 Instructions

  1. Run Nanoplot in Galaxy.
  1. Explore the NanoPlot tool parameters - click on the Tools icon on the left of the page. Then, in the search bar enter ‘NanoPlot’ and select the NanoPlot tool. Explore NanoPlot functionality by examining Tool Parameters. Answer related questions below.

  2. Run NanoPlot using default settings. Under Tool Parameters, check the following settings:

  • Under files there are 4 options to select a fastq dataset: Single dataset, Multiple datasets, Dataset collection or option ‘…’ which is Browse or Upload Datasets.
  • Browse to select your fastq dataset. Note: Galaxy tool may pre-select the correct dataset already for you so just make sure that the file is correct.
  • Click on Run Tool and wait ~5-10 minutes as the NanoPlot job is scheduled, run, and complete. Answer related questions below.

Image test

4.3.4.6 Questions

1. Under Type of file(s) to work on, check to see what input files are compatible with NanoPlot and name 2 file extension options listed.
File extension name 1:
File extension name 2:


2. Click to expand Options for filtering or transforming input prior to plotting and name 3 options you could use to filter your sequencing data.
1.
2.
3.


3. Run Nanoplot using the default tool settings and record how many output files you obtained after running NanoPlot and list their names.
# of output files
Names of output files


4.3.4.7 Activity 1 - Part III - View the NanoPlot Results in Galaxy

Estimated time: 15 min

4.3.4.8 Instructions

  1. View and examine the NanoPlot Results in Galaxy by clicking on the Display icon (eyeball) next to the NanoPlot output.

4.3.4.9 Questions

  1. Examine the NanoPlot output results.
  • Click on the Display icon (eyeball) next to the NanoPlot output files to view results.
A. How many bases were sequenced?


B. Why is the mean read length longer than the median read length? - Hint: think skewness https://wikipedia.org/wiki/Skewness


C. Record Reads >Q20 metric value. Given that Q20 quality (Phred) score corresponds to a read accuracy of 99% (or 1 in 100 errors), do you think this dataset is of a good sequence quality?


  1. Examine the NanoPlot output HTML report.
  • Click on the Display icon (eyeball) next to the NanoPlot output HTML report.
A. Scroll down to view the ‘Weighted histogram of read lengths’ histogram. From this plot estimate the range of read lengths obtained.


B. Scroll down to view the ‘Yield by length’ cumulative plot which shows sequencing yield based on read length. From this plot do shorter (10kb or less) or longer sequences produce more data?


4.3.5 Activity 2 – Taxonomy Profiling in Galaxy

Estimated time: 50 min

4.3.5.1 Activity 2 - Part I - Run ‘Taxonomy Profiling’ workflow in Galaxy

Estimated time: 15 min

4.3.5.2 Instructions

  1. Run ‘Taxonomy Profiling’ public workflow.
  1. Open the taxonomy-profiling public workflow https://usegalaxy.org/u/cutsort/w/taxonomy-profiling and click on Run.

Image test

  1. Browse to select your fastq dataset by clicking on the ‘…’ tab.

  2. Under kraken_database select Prebuilt Refseq indexes: PlusPF(Standard plus protozoa and fungi)(Version:2022-06-07 - Downloaded: 2022-09-04T165121Z).

  3. Click Run Workflow.

  4. Wait ~30 minutes as the Kraken 2, KrakenTools, and Krona jobs are scheduled, run, and complete. While you are waiting, continue with the next section of this prelab.

  • Note, that it can take longer than 30 minutes to run the scheduled jobs in part, because Galaxy servers are public shared resources used concurrently by numerous users, so that when many users demand resources like CPU, memory, or disk space, it can create bottlenecks and delays, affecting the speed of your particular job.

Image test

  1. Examine select aspects of Kraken 2 tool.
  1. While the Taxonomy Workflow is running, click on the Tools icon on the left of the page.

  2. Then, in the search bar enter ‘Kraken 2’’ and select the Kraken2 tool.

4.3.5.3 Questions

A. How does Galaxy describe the Kraken 2 tool in its descriptor on top of the page?


B. Scroll down to the Help/What it does section, and in your own words paraphrase the paragraph describing how Kraken 2 works.


C. Record how many output files you obtained from the Taxonomy Workflow and list their names.


4.3.5.4 Activity 2 - Part II - View Kraken 2 results

Estimated time: 15 min

4.3.5.5 Instructions

  1. Open and examine converted_kraken_report.
  • Click on the Display icon (eyeball) next to the output file with converted_kraken_report. This report should look familiar from the week 1 taxonomy-profiling-spreadsheet activity.
  • Scroll through to explore.
  1. Open and examine the kraken2_with_pluspf_database_output_report.
  • Click on the Display icon (eyeball) next to the output file with kraken2_with_pluspf_database_output_report.
  • This output report is an extended version of the converted_kraken_report.
  • The output contains 6 columns. See info for select column headers below:
    Column 1: Percentage (%) of reads that map to a given taxon and its descendants
    Column 2: # of reads assigned to a given taxon and its descendants
    Column 3: # of reads assigned directly to a given taxon (not its descendants)
    Column 4: A rank code (see explanation below)
    Column 6: Identified taxa/scientific name
 
**Rank code:**
    - (U)nclassified, 
    - (R)oot, 
    - (D)omain, 
    - (K)ingdom, 
    - (P)hylum, 
    - (C)lass, 
    - (O)rder, 
    - (F)amily, 
    - (G)enus, or 
    - (S)pecies. 
    
*Note, some rank codes will have numbers associated with them; Ignore them for the moment.*

4.3.5.6 Questions

  1. Answer the following questions about the converted_kraken_report.
1A. How many Unclassified reads are there?


1B. How many Kingdoms are there and what are they?


1C. How many Phyla are there and what are they?


1D. Using the total number_of_reads you obtained from the NanoStats (NanoPlot) metrics, and the value of Unclassified reads from the converted_kraken_report, calculate the % unclassified and % classified taxa.


1E. Click on the converted_kraken_report entry and look just below the ‘Add Tags’ to see the number of lines and columns in the file. The number of lines corresponds to the number of taxa detected. Excluding the Unclassified taxa, how many taxa were identified?


  1. Answer the following questions about the kraken2_with_pluspf_database_output_report.
2A. What is the percentage of Unclassified taxa listed? Does it match what you calculated in section 2-2.1?


2B. What is the percentage of Classified taxa? Does it match what you calculated in section 2-2.1?


  • Note, the document lists (U)nclassified taxa only, so you’ll need to get the percentage of Classified taxa based on that.
2C. Find and record the 3 most abundant Phyla (p_) by percentages. As rows are not sorted by abundance, you may find it helpful to search using for “P” using -F to highlight rows corresponding to phyla.


  1. View Krona Results and answer the following questions.

Krona pie chart is one of the outputs of the Taxonomy workflow, and it is an interactive visualization tool for exploring the composition of metagenomes. Click on Display icon (eyeball) next to the Krona_pie_chart dataset to explore the results as a Krona pie chart.

Image test

3A. What are the 2 main phyla you observe?


3B. What appears to be the more diverse phyla of the two and why?


3C. Examine how many reads and percent of reads were classified as phylum Firmicutes.


3D. What is one of the most abundant Orders in phylum FIrmicutes based on number and % of reads?


3E. Within the most abundant Order from your answer above, record one Family member, one genus member and one species member; also include their percent abundances.


3F. Some potentially pathogenic bacterial species are often present in a healthy gut, but in relatively low amounts. What % of bacteria is represented by C. difficile (Clostridiodes difficile) in this gut standard sample? How does it compare to the expected proportion in the ZymoBIOMICS® Gut Microbiome Standard?


4.3.6 Grading Criteria

  • Download as Microsoft Word (.docx) and upload on Canvas.

4.3.7 Footnotes

Resources

Contributions and Affiliations

  • Valeriya Gaysinskaya, Johns Hopkins University
  • Frederick Tan, Johns Hopkins University

Last Revised: May 2025