Merge pull request #5059 from EngyNasr/PathogenTrainingJune2024Update2

Pathogen detection from (direct Nanopore) sequencing data using Galaxy - Foodborne Edition
galaxyproject · Jun 20, 2024 · 374441d
2 parents 61e7b9b + fea16c6

Showing 3 changed files with 28 additions and 16 deletions.
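
The totals in the summary line can be checked by counting `+` and `-` lines across the hunks while skipping the `+++`/`---` file headers. Below is a minimal Python sketch of that count, assuming plain unified-diff text as input. The function name `diffstat` is illustrative, and the sample hunk is abbreviated from the test-file change in this commit.

```python
def diffstat(diff_text):
    """Count added and removed lines in unified-diff text.

    Simplification: any line starting with '+++' or '---' is treated as a
    file header, so a changed line whose own content begins with '++' or
    '--' would be missed by this sketch.
    """
    additions = deletions = 0
    for line in diff_text.splitlines():
        if line.startswith("+++") or line.startswith("---"):
            continue  # file header, not a content change
        if line.startswith("+"):
            additions += 1
        elif line.startswith("-"):
            deletions += 1
    return additions, deletions


# Abbreviated hunk; the removed line starts with "-- " (diff marker plus a
# YAML list dash), which is not mistaken for a '---' header.
sample = "\n".join([
    "@@ -1,4 +1,4 @@",
    "-- doc: Test outline for nanopore_allele_based_pathogen_identification",
    "+- doc: Test outline for allele_based_pathogen_identification",
    "   job:",
])

additions, deletions = diffstat(sample)
print(f"{additions} addition(s), {deletions} deletion(s)")
```

Applied to the full diff of this commit, the same count should reproduce the 28 additions and 16 deletions reported above.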
diff --git a/...icrobiome/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.md b/...icrobiome/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.md
@@ -171,7 +171,7 @@ We will run all these steps using a single workflow, then discuss each step and
 >    {% snippet faqs/galaxy/workflows_import.md %}
 >
 > 2. Run **Workflow 1: Nanopore Preprocessing** {% icon workflow %} using the following parameters
->    - *"Samples Profile"*: `PacBio/Oxford Nanopore read to reference mapping`
+>    - *"Samples Profile"*: `PacBio/Oxford Nanopore read to reference mapping`, which matches the sequencing technology used for the samples.
 >
 >    - {% icon param-files %} *"Collection of all samples"*: `Samples` collection created from the imported Fastq.qz files
 >
@@ -206,9 +206,17 @@ In this tutorial we use similar tools as described in the tutorial ["Quality con
     >            - {% icon param-files %} *"Data input files"*: `Samples` collection created from the imported Fastq.qz files
     >
     >    > <comment-title></comment-title>
-    >    > This step, as it does not require the results of FastQC to run, can be launched even if FastQC is not ready
+    >    > Because the `NanoPlot` step does not require the results of FastQC, it can be launched before FastQC has finished
     >    {: .comment}
     >
+    > 3. {% tool [MultiQC](toolshed.g2.bx.psu.edu/repos/iuc/multiqc/multiqc/1.11+galaxy0) %} with the following parameters:
+    >    - In *"Results"*:
+    >        - {% icon param-repeat %} *"Insert Results"*
+    >            - *"Which tool was used generate logs?"*: `FastQC`
+    >                - In *"FastQC output"*:
+    >                    - {% icon param-repeat %} *"Insert FastQC output"*
+    >                        - *"Type of FastQC output?"*: `Raw data`
+    >                        - {% icon param-files %} *"FastQC output"*: collection of `Raw data` outputs of **FastQC** {% icon tool %}
     {: .hands_on}
 
     </div>
@@ -226,7 +234,7 @@ In this tutorial we use similar tools as described in the tutorial ["Quality con
     >
     > 2. {% tool [fastp](toolshed.g2.bx.psu.edu/repos/iuc/fastp/fastp/0.20.1+galaxy0) %} with the following parameters:
     >    - *"Single-end or paired reads"*: `Single-end`
-    >        - {% icon param-files %} *"Input 1"*: outputs of **Porechop** {% icon tool %}
+    >        - {% icon param-files %} *"Input 1"*: output collection of **Porechop** {% icon tool %}
     >    - In *Output Options*
     >        - *"Output JSON report"*: `Yes`
     >
@@ -243,12 +251,12 @@ In this tutorial we use similar tools as described in the tutorial ["Quality con
 
     > <hands-on-title> Final quality checks </hands-on-title>
     > 1. {% tool [FastQC](toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.73+galaxy0) %} with the following parameters:
-    >    - {% icon param-files %} *"Raw read data from your current history"*: outputs of **fastp** {% icon tool %}
+    >    - {% icon param-files %} *"Raw read data from your current history"*: output collection of **fastp** {% icon tool %}
     >
     > 2. {% tool [NanoPlot](toolshed.g2.bx.psu.edu/repos/iuc/nanoplot/nanoplot/1.28.2+galaxy1) %} with the following parameters:
     >    - *"Select multifile mode"*: `batch`
     >        - *"Type of the file(s) to work on"*: `fastq`
-    >            - {% icon param-files %} *"files"*: outputs of **fastp** {% icon tool %}
+    >            - {% icon param-files %} *"files"*: output collection of **fastp** {% icon tool %}
     >
     > 3. {% tool [MultiQC](toolshed.g2.bx.psu.edu/repos/iuc/multiqc/multiqc/1.11+galaxy0) %} with the following parameters:
     >    - In *"Results"*:
@@ -257,17 +265,17 @@ In this tutorial we use similar tools as described in the tutorial ["Quality con
     >                - In *"FastQC output"*:
     >                    - {% icon param-repeat %} *"Insert FastQC output"*
     >                        - *"Type of FastQC output?"*: `Raw data`
-    >                        - {% icon param-files %} *"FastQC output"*: 4 `Raw data` outputs of **FastQC** {% icon tool %}
+    >                        - {% icon param-files %} *"FastQC output"*: collection of `Raw data` outputs of **FastQC** {% icon tool %} run after **fastp**
     >        - {% icon param-repeat %} *"Insert Results"*
     >            - *"Which tool was used generate logs?"*: `fastp`
-    >                - {% icon param-files %} *"Output of fastp"*: `JSON report` outputs of **fastp** {% icon tool %}
+    >                - {% icon param-files %} *"Output of fastp"*: `JSON report` output of **fastp** {% icon tool %}
     {: .hands_on}
 
     </div>
 
 > <question-title></question-title>
 >
-> Inspect the HTML output of **MultiQC** for `Barcode10`
+> Inspect the two HTML outputs of **MultiQC** for `Barcode10`, before and after preprocessing, tagged `MultiQC_Before_Preprocessing` and `MultiQC_After_Preprocessing`
 >
 > 1. How many sequences does `Barcode10` contain before and after trimming?
 > 2. What is the quality score over the reads before and after trimming? And the mean score?
@@ -278,8 +286,13 @@ In this tutorial we use similar tools as described in the tutorial ["Quality con
 > > 1. Before trimming, the file has 114,986 sequences; after trimming, it has 91,434 sequences.
 > > 2. The "Per base sequence quality" is globally medium: the quality score stays above 20 over the entire length of the reads after trimming, while quality below 20 can be seen before trimming, especially at the beginning and the end of the reads.
 > >
+> > Sequence quality of Barcode 10 and Barcode 11 before preprocessing:
+> >
 > >    ![Sequence Quality of Barcode 10 and Barcode 11 Before Trimming](./images/multiqc_per_base_sequence_quality_plot_barcode10_barcode11_before_trimming.png)
 > >
+> >
+> > Sequence quality of Barcode 10 and Barcode 11 after preprocessing:
+> >
 > >    ![Sequence Quality of Barcode 10 and Barcode 11 After Trimming](./images/multiqc_per_base_sequence_quality_plot_barcode10_barcode11_after_trimming.png)
 > >
 > > 3. After checking what is wrong, e.g. before trimming, we should think about the errors reported by **FastQC**: they may come from the type of sequencing or what we sequenced (check the ["Quality control" training]({% link topics/sequence-analysis/tutorials/quality-control/tutorial.md %}): [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) for more details). However, despite these challenges, we can already see sequences getting slightly better after the trimming and filtering, so now we can proceed with our analyses.
@@ -294,7 +307,7 @@ In this tutorial we use similar tools as described in the tutorial ["Quality con
 
 Generally, we are not interested in the food (host) sequences, only in those originating from the pathogen itself. It is important to get rid of all host sequences and to retain only sequences that might include a pathogen, both to speed up further steps and to avoid host sequences compromising the analysis.
 
-In this tutorial, we know the samples come from __chicken__ meat spiked with **_Salmonella_**, so we already know what we will get as the host and the main pathogen.
+In this tutorial, we know the samples come from __chicken__ meat spiked with **_Salmonella_**, so we already know what we will get as the host and the main pathogen. If the host is not known, **Kraken2** with the **Kalamari** database can be used to detect it.
 
 In this tutorial we use:
 1. Map reads to __chicken__ reference genome using **Map with minimap2** and **Chicken (Gallus gallus): galGal6** built in reference genome of __chicken__, and we move forward with the unmapped ones.
@@ -308,7 +321,7 @@ In this tutorial we use:
     >        - *"Using reference genome"*: `Chicken (Gallus gallus): galGal6`
     >    - *"Single or Paired-end reads"*: `Single`
     >        - {% icon param-file %} *"Select fastq dataset"*: `out1` (output of **fastp** {% icon tool %})
-    >        - *"Select a profile of preset options"*: `PacBio/Oxford Nanopore read to reference mapping (-Hk19) (map-pb)`
+    >        - *"Select a profile of preset options"*: `PacBio/Oxford Nanopore read to reference mapping (-Hk19) (map-pb)`, which matches the sequencing technology used for the samples.
     >    - In *"Alignment options"*:
     >        - *"Customize spliced alignment mode?"*: `No, use profile setting or leave turned off`
     >
@@ -322,7 +335,7 @@ In this tutorial we use:
     >
     {: .hands_on}
 
-2. Assign filtered reads, after mapping (non __chicken__ reads), to taxa using **Kraken2** ({% cite Wood2014 %}) and **Kalamari**, a database of completed assemblies for metagenomics-related tasks used widely in contamination and host filtering
+2. Assign filtered reads, after mapping (non __chicken__ reads), to taxa using **Kraken2** ({% cite Wood2014 %}) with the **Kalamari** database as a further contamination-detection step. The **Kalamari** database includes mitochondrial sequences of various known hosts, including food hosts.
 
     <div class="Long-Version" markdown="1">
 
@@ -898,7 +911,7 @@ In this training, we are testing _Salmonella enterica_, with different strains o
 > <hands-on-title>Allele based Pathogenic Identification</hands-on-title>
 >
 > 1. **Import the workflow** into Galaxy
->    - Copy the URL (e.g. via right-click) of [this workflow]({{ site.baseurl }}{{ page.dir }}workflows/nanopore_allele_based_pathogen_identification.ga) or download it to your computer.
+>    - Copy the URL (e.g. via right-click) of [this workflow]({{ site.baseurl }}{{ page.dir }}workflows/allele_based_pathogen_identification.ga) or download it to your computer.
 >    - Import the workflow into Galaxy
 >
 >    {% snippet faqs/galaxy/workflows_import.md %}

diff --git a/...le_based_pathogen_identification-test.yml → ...le_based_pathogen_identification-test.yml b/...le_based_pathogen_identification-test.yml → ...le_based_pathogen_identification-test.yml
@@ -1,4 +1,4 @@
-- doc: Test outline for nanopore_allele_based_pathogen_identification
+- doc: Test outline for allele_based_pathogen_identification
   job:
     Reference Genome of Tested Strain:
       class: File

diff --git a/...e_allele_based_pathogen_identification.ga → ...s/allele_based_pathogen_identification.ga b/...e_allele_based_pathogen_identification.ga → ...s/allele_based_pathogen_identification.ga
@@ -101,9 +101,9 @@
     "format-version": "0.1",
     "license": "MIT",
     "release": "0.1",
-    "name": "Nanopore Allele-based Pathogen Identification",
+    "name": "Allele-based Pathogen Identification",
     "report": {
-        "markdown": "# Nanopore - Allele based Pathogen Identification Workflow Report\nBelow are the results for the Allele based Pathogenic Identification Workflow\n\nThis workflow was run on:\n\n```galaxy\ngenerate_time()\n```\n\nWith Galaxy version:\n\n```galaxy\ngenerate_galaxy_version()\n```\n\n## Workflow Inputs\nThe Perprocessing workflow main output (Collection of all samples reads after quality retaining and hosts filtering), and a FASTA file of the reference genome of the main Pathogen identified in the Gene based Pathogen Identification workflow, or per-known to the user.\n\n## Workflow Output: \n\n### All variants found per sample against the reference genome\n\n```galaxy\nhistory_dataset_display(output=\"extracted_fields_from_the_vcf_output\")\n```\n\n### Number of variants per sample\n\n```galaxy\nhistory_dataset_display(output=\"number_of_variants_per_sample\")\n```\n\n### Mapping mean depth per sample\n\n```galaxy\nhistory_dataset_display(output=\"mapping_mean_depth_per_sample\")\n```\n\n### Mapping coverage per sample\n\n```galaxy\nhistory_dataset_display(output=\"mapping_coverage_percentage_per_sample\")\n```\n"
+        "markdown": "# Allele based Pathogen Identification Workflow Report\nBelow are the results for the Allele based Pathogenic Identification Workflow\n\nThis workflow was run on:\n\n```galaxy\ngenerate_time()\n```\n\nWith Galaxy version:\n\n```galaxy\ngenerate_galaxy_version()\n```\n\n## Workflow Inputs\nThe Perprocessing workflow main output (Collection of all samples reads after quality retaining and hosts filtering), and a FASTA file of the reference genome of the main Pathogen identified in the Gene based Pathogen Identification workflow, or per-known to the user.\n\n## Workflow Output: \n\n### All variants found per sample against the reference genome\n\n```galaxy\nhistory_dataset_display(output=\"extracted_fields_from_the_vcf_output\")\n```\n\n### Number of variants per sample\n\n```galaxy\nhistory_dataset_display(output=\"number_of_variants_per_sample\")\n```\n\n### Mapping mean depth per sample\n\n```galaxy\nhistory_dataset_display(output=\"mapping_mean_depth_per_sample\")\n```\n\n### Mapping coverage per sample\n\n```galaxy\nhistory_dataset_display(output=\"mapping_coverage_percentage_per_sample\")\n```\n"
     },
     "steps": {
         "0": {
@@ -1320,7 +1320,6 @@
         "name:Collection",
         "name:microGalaxy",
         "name:PathoGFAIR",
-        "name:Nanopore",
         "name:IWC"
     ],
     "uuid": "deb94861-ed4d-41fe-881a-8565c6b8fa82",