4.5 Project - Taxonomy Profiling | CURE: Microbial Mysteries

4.5.1 Purpose

To use a variety of Galaxy tools to perform Quality Control (QC), sequence quality filtering, taxonomy profiling, and visualization of a metagenomics soil sample sequenced with long-read Nanopore technology.

4.5.2 Learning Objectives

In this exercise, using tools on Galaxy you will be:

Performing QC and quality filtering of your soil metagenomics data with the NanoPlot and fastp tools.
Running a workflow to perform taxonomy profiling and visualization of a soil metagenome.

Throughout these objectives you will be comparing soil and gut metagenomes.

4.5.3 Introduction

Note that the total time for a Galaxy step to complete depends on several factors, such as the input file size, the length of the queue when many other people are analyzing data, the complexity of the job itself, and errors. See the table below for the minimum time each step will take for this assignment. Be sure to start early: when Galaxy is busy, each step can take 2 to 10 times longer to complete.

Note that you can save time by:

Submitting multiple jobs that use the same input (NanoPlot and fastp) at the same time.
Submitting a job that uses the fastp output as input, like the taxonomy workflow, as soon as that output appears in your history, even before the fastp job finishes.

Table of approximate minimum times for a job to be completed on Galaxy using specified tools.

NanoPlot	fastp	Taxonomy workflow
15 min	15 min	30 min
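
To see how much the two time-saving tips can help, here is a rough sketch of the arithmetic using the minimum times from the table. It assumes that NanoPlot and fastp truly run side by side and that there is no queue delay, which will not always hold on a busy server; the function names are made up for this illustration.

```python
# Approximate minimum run times (minutes), copied from the table above.
MIN_TIMES = {"NanoPlot": 15, "fastp": 15, "Taxonomy workflow": 30}


def sequential_minutes(times):
    """Total time if each job is submitted only after the previous one finishes."""
    return sum(times.values())


def overlapped_minutes(times):
    """Total time if NanoPlot and fastp run side by side and the workflow is
    queued on the fastp output right away (it still has to wait for fastp)."""
    first_stage = max(times["NanoPlot"], times["fastp"])
    return first_stage + times["Taxonomy workflow"]


best_case = overlapped_minutes(MIN_TIMES)
print("Sequential minimum:", sequential_minutes(MIN_TIMES), "min")
print("Overlapped minimum:", best_case, "min")
# The note above says a busy server can take 2 to 10 times longer.
print("Busy-server range:", 2 * best_case, "to", 10 * best_case, "min")
```

These numbers are lower bounds only; plan for the busy-server range rather than the minimum.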

4.5.4 Activity 1 – QC

Estimated time: 50 min

4.5.4.1 Activity 1 - Part I - Import data and run NanoPlot

4.5.4.2 Instructions

Import the dataset into Galaxy.

Open the nanopore-soil-pilot public history https://usegalaxy.org/u/valerie-g/h/nanopore-soil-pilot-1
Click on Import this history, select Copy only the active, non-deleted datasets and then Copy History.
Confirm that Nanopore-soil-pilot-subset exists in your history by clicking on the Home button at the top left.

Run the NanoPlot tool in Galaxy to assess sequence quality using the default settings.

Click on the Tools icon. Then, in the search bar enter ‘NanoPlot’ and select the NanoPlot tool.
Under files, browse to select your Nanopore-soil-pilot-subset fastq dataset.
Click on Run Tool and wait ~10 minutes as the NanoPlot job is scheduled, run, and completed.

4.5.4.3 Questions

1. Click on the Display icon (eyeball) next to the NanoPlot output NanoStats report and record:
Mean read length (mean read length):
Mean read quality (mean_qual):
Proportion of reads with quality > Q20 (Reads > Q20):

2. Compare the NanoPlot results with the gut data.

	Nanopore soil pilot (this activity)	Zymo gut standard (taxonomy profiling pre-lab)
mean read length:
mean_qual:
Reads > Q20:

3. Which dataset has better sequence quality, the Zymo-gut-standard (taxonomy profiling pre-lab) or the Nanopore-soil pilot (taxonomy profiling project)? Why?

4.5.4.4 Activity 1 - Part II - Quality filtering with fastp

4.5.4.5 Instructions

Although the majority of bases in the soil dataset are of high quality (>Q20, i.e., an expected error rate below 1 in 100 bases), we can filter out very low quality reads to further improve dataset quality. In this activity, run the fastp tool with its default settings to filter out reads that have a high proportion of low quality bases (<Q15).

Click on the Tools icon. Then, in the search bar enter fastp and select the fastp tool.
Click on Run Tool and wait ~10 minutes as the fastp job is scheduled, run, and completed.
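
The Q15 and Q20 thresholds mentioned above are Phred quality scores, which relate to the probability of a wrong base call by P = 10^(-Q/10). The short sketch below converts between the two; the function names are chosen for this illustration and are not part of fastp or NanoPlot.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score that corresponds to an error probability p."""
    return -10 * math.log10(p)


# Print the expected error rate for the thresholds used in this activity.
for q in (15, 20, 30):
    p = phred_to_error_probability(q)
    print(f"Q{q}: about 1 error in {round(1 / p)} bases")
```

A higher Q means a lower chance of error, so a base below Q15 is more likely to be wrong than a base at Q20.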

4.5.4.6 Questions

1. Compare your dataset BEFORE and AFTER filtering, using the fastp HTML report output.

	Before	After
Mean Length
total reads
total bases
Q20 bases (%):
Q30 bases (%):

2. Compare your BEFORE and AFTER quality plots from the activity above. Use the charts in the section titled Filtering Statistics. What key quality improvement can you observe after fastp quality filtering? `Hint 1`: Look at the 5’ end, the 3’ end, and the middle. `Hint 2`: Pay attention to the y-axis.

4.5.5 Activity 2 – Taxonomy Profiling

Estimated time: 50 min (~35 min computing)

4.5.5.1 Activity 2 - Part I - Run Taxonomy Profiling Workflow

4.5.5.2 Instructions

Run the ‘Taxonomy Profiling’ workflow on your fastp-filtered data from Activity 1 and view the results.

Open the taxonomy-profiling public workflow https://usegalaxy.org/u/cutsort/w/taxonomy-profiling.
Click on Run.
Browse to select your fastp-filtered fastq dataset “fastp on data1:Read 1 output” by clicking on the ‘...’ tab.
Under kraken_database select Prebuilt Refseq indexes: PlusPF(Standard plus protozoa and fungi)(Version:2022-06-07 - Downloaded: 2022-09-04T165121Z).
Click Run Workflow.
Wait ~30 minutes as the Kraken 2, KrakenTools, and Krona jobs are scheduled, run, and completed.

Click on the Display icon (eyeball) next to the output file with converted_kraken_report. Explore the metagenomic diversity of the soil sample by performing the taxonomy profiling spreadsheet activity you did during week 1.

Click on converted_kraken_report, find the download button and download the report.
Change the extension of your taxonomy file from .tabular to .tsv.
Upload your taxonomy .tsv file to Google Drive and open it with Google Sheets.
Create a header row and enter the following column information.

Col A = Counts.
Cols B-H correspond to taxonomic ranks k(Kingdom), p(Phylum), c(Class), o(Order), f(Family), g(Genus) and s(Species).
Each row corresponds to a different taxon.
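
If you prefer to prepare the file outside of Google Sheets, the header step can be sketched in Python as below. This is only an illustration: it assumes each row of the report is tab-separated with the count first and one rank per following column, and the two example rows are made up, not taken from the soil data.

```python
# Column names matching the layout described above (Col A = Counts, Cols B-H = ranks).
HEADER = ["Counts", "Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]


def add_header(tabular_text):
    """Return the report text with a header row and every row padded to 8 columns."""
    lines = ["\t".join(HEADER)]
    for row in tabular_text.splitlines():
        if not row.strip():
            continue  # skip blank lines
        fields = row.split("\t")
        # Rows that stop at a higher rank get empty cells for the lower ranks.
        fields += [""] * (len(HEADER) - len(fields))
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


# Made-up rows for illustration only; real counts and taxa will differ.
example = "120\tk__Bacteria\tp__Proteobacteria\n30\tk__Archaea\n"
print(add_header(example))
```

Saving the returned text to a file ending in .tsv would give a table that opens directly in a spreadsheet program.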

Evaluate what proportion of data was taxonomically classified.

Insert a new column A; we will use this temporary column for calculations, so you can name this column “Calculations”.
In a cell such as A2, calculate the sum of all reads observed in the soil sample.
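
The same calculation can be sketched in Python as a way to check your spreadsheet. The sketch assumes that the count is the first tab-separated field of each row and that unclassified reads appear on a row labeled Unclassified; both are assumptions about the file layout, and the example rows are invented.

```python
def read_counts(tabular_text):
    """Return (total_reads, unclassified_reads) from count-first tab-separated rows."""
    total = 0
    unclassified = 0
    for row in tabular_text.splitlines():
        fields = row.split("\t")
        if not fields[0].strip().isdigit():
            continue  # skip a header row or a blank line
        count = int(fields[0])
        total += count
        # Assumption: the unclassified row carries this label in its second field.
        if len(fields) > 1 and fields[1].strip().lower() == "unclassified":
            unclassified += count
    return total, unclassified


# Invented rows for illustration only.
example = "600\tUnclassified\n300\tk__Bacteria\tp__Proteobacteria\n100\tk__Archaea\n"
total, unclassified = read_counts(example)
pct_unclassified = 100 * unclassified / total
print(f"Total reads: {total}")
print(f"Unclassified: {pct_unclassified:.1f}%")
print(f"Classified: {100 - pct_unclassified:.1f}%")
```

The classified and unclassified percentages should always add up to 100, which is a quick sanity check for your own numbers.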

4.5.5.3 Questions

1. How many total read counts are there?

2. What percentage of reads are unclassified?

3. What percentage of reads are classified?

4. Identify the most abundant taxa (those at >0.1%).

Remember, soil is one of the most diverse microbial environments, with many more microbial species than the gut. Therefore, even the most abundant species can make up only a small percentage of the reads.

Select columns B through I.
In the Data menu, select “Sort range by column B (Z to A)”.
Insert a new column C; we will use this temporary column for calculations; you can name this column “% abundance”.
In new column C, calculate % abundance for each row by dividing each count value by the total number of reads and multiplying by 100.
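
The sorting and % abundance steps above can be sketched as follows. The function name, the 0.1 default, and the example rows are chosen for this illustration; as in the spreadsheet, each count is divided by the total number of reads, so an unclassified row is part of that total.

```python
def abundant_taxa(rows, threshold_pct=0.1):
    """rows is a list of (count, lineage) pairs.

    Returns (percent, count, lineage) tuples for rows above the threshold,
    sorted from the highest count to the lowest.
    """
    total = sum(count for count, _ in rows)
    result = []
    for count, lineage in sorted(rows, key=lambda r: r[0], reverse=True):
        pct = 100 * count / total  # % abundance = count / total reads * 100
        if pct > threshold_pct:
            result.append((pct, count, lineage))
    return result


# Invented counts for illustration only.
example_rows = [
    (45, "k__Archaea"),
    (9000, "Unclassified"),
    (5, "k__Viruses"),
    (950, "k__Bacteria; p__Proteobacteria"),
]
for pct, count, lineage in abundant_taxa(example_rows):
    print(f"{pct:6.2f}%  {count:6d}  {lineage}")
```

In this made-up example the viral row falls below 0.1% and is left out, while the other three rows are kept.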

4A. How many ‘abundant’ taxa (at > 0.1%) do you observe?

4B. What are the taxonomic ranks of the most abundant taxa?

4C. What is the most abundant eukaryote observed and its read count?

4D. What is the most abundant archaeon observed and its read count?

4E. What is the most abundant virus observed and its read count?

4.5.5.4 Activity 2 - Part II - Analyze Kraken 2 results

4.5.5.5 Instructions

Click on the Display icon (eyeball) next to the output file with kraken2_with_pluspf_database_output_report. This output report is an extended version of the converted_kraken_report. The output contains 6 columns. See info for select column headers below:

Column 1: Percentage (%) of reads assigned to a given taxon, including all taxa beneath it
Column 2: Number of reads assigned to a given taxon, including all taxa beneath it
Column 4: A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. Note that in this extended file, some rank codes will have numbers associated with them; ignore this aspect of the document for the moment.
Column 6: Identified taxa/scientific name.

Of note: The benefit of kraken2_with_pluspf_database_output_report is that it summarizes converted_kraken_report and calculates summary percentages for taxonomic ranks. For example, your converted_kraken_report has hundreds of lines for phylum Proteobacteria, while kraken2_with_pluspf_database_output_report has 1 line summarizing the percent abundance of all Proteobacteria.
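
If you would like to explore this report programmatically, the sketch below reads the 6 tab-separated columns and picks the most abundant taxon for a given rank code. It is a minimal illustration, not part of the workflow: the function names are hypothetical and the example lines use made-up percentages and counts.

```python
def parse_kraken_report(text):
    """Parse 6-column, tab-separated Kraken 2 report lines into dictionaries."""
    records = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        records.append({
            "percent": float(fields[0]),    # Column 1
            "clade_reads": int(fields[1]),  # Column 2
            "rank": fields[3].strip(),      # Column 4
            "name": fields[5].strip(),      # Column 6 (indentation removed)
        })
    return records


def top_taxon(records, rank_code):
    """Return the record with the highest percentage for a rank code, or None."""
    candidates = [r for r in records if r["rank"] == rank_code]
    return max(candidates, key=lambda r: r["percent"]) if candidates else None


# Made-up report lines for illustration only.
example = (
    " 60.00\t600\t600\tU\t0\tunclassified\n"
    " 40.00\t400\t5\tR\t1\troot\n"
    " 39.00\t390\t10\tD\t2\t  Bacteria\n"
    " 25.00\t250\t20\tP\t1224\t    Proteobacteria\n"
    " 10.00\t100\t15\tP\t201174\t    Actinobacteria\n"
)
records = parse_kraken_report(example)
top_phylum = top_taxon(records, "P")
print(top_phylum["name"], top_phylum["percent"])
```

Because the percentage on each line already includes everything beneath that taxon, a single line per phylum is enough to rank the phyla.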

4.5.5.6 Questions

1. What is the percentage of Unclassified taxa? Does it match your calculations in Activity 2 - Part I?

2. What percentage of bacteria is Proteobacteria, the most abundant phylum observed?

3. What is the most abundant class observed and at what percentage?

4.5.5.7 Activity 2 - Part III - Krona Pie Chart

4.5.5.8 Instructions

The Krona pie chart is one of the outputs of the taxonomy workflow. It is an interactive visualization tool for exploring the composition of metagenomes.

View the Krona results: Click on the Display icon (eyeball) next to the output file named krona_pie_chart.
Double-click on the Bacteria kingdom (k_Bacteria) to explore further.
Answer the questions below.

4.5.5.9 Questions

1. What are the 2 main phyla you observe?

2. Which of the two phyla appears to be more diverse?

3. Use the Krona and/or Kraken 2 outputs to compare your taxonomy from this soil sample to the gut taxonomy results from your taxonomy pre-lab (Zymo-gut-standard, ZymoBIOMICS® Gut Microbiome Standard).

3A. Fill out the comparison table below

	Nanopore soil pilot	Zymo gut standard
What are the 2 most abundant phyla
What are the 2 most abundant species
% Classified taxa
% Unclassified taxa

3B. Discuss taxonomy diversity between soil and gut, providing 3 points:
1.
2.
3.

4.5.6 Grading Criteria

Download as Microsoft Word (.docx) and upload on Canvas.

4.5.7 Footnotes

Resources

Google Doc
Species composition in the Gut Microbiome Standard dataset: ZymoBIOMICS® Gut Microbiome Standard

Contributions and Affiliations

Valeriya Gaysinskaya, Johns Hopkins University
Frederick Tan, Johns Hopkins University

Last Revised: May 2025