
Commit

Merge pull request #26 from ChristyPeterson/main
updated docs
mattheww95 authored Feb 13, 2024
2 parents 6fdb770 + 0cfa940 commit f24fe6f
Showing 23 changed files with 1,373 additions and 1,493 deletions.
182 changes: 91 additions & 91 deletions docs/images/20230630_Mikrokondo-logo_v4.svg
31 changes: 18 additions & 13 deletions docs/index.md
@@ -1,13 +1,18 @@

![Pipeline](images/20230630_Mikrokondo-logo_v4.svg "Logo")
# Welcome to mikrokondo!

## What is mikrokondo?
Mikrokondo is a tidy workflow for performing routine bioinformatic tasks like read pre-processing, assessing contamination, assembly and quality assessment of assemblies. It is easily configurable, provides dynamic dispatch of species-specific workflows and produces common outputs.

## Is mikrokondo right for me?
Mikrokondo takes in either Illumina, Nanopore or Pacbio data (Pacbio data only partially tested). You can also use mikrokondo for hybrid assemblies or even pass it pre-assembled genomes. Additionally, mikrokondo requires minimal upfront knowledge of your sample.

## Workflow Schematics (Subject to change)

![Pipeline](images/20230921_Mikrokondo-worflow2.png "Workflow")

![Pipeline](images/20230630_Mikrokondo-logo_v4.svg "Logo")
# Welcome to mikrokondo!

## What is mikrokondo?
Mikrokondo is a tidy workflow for performing routine bioinformatic assessment of sequencing reads and assemblies, such as read pre-processing, assessing contamination, assembly, quality assessment of assemblies, and pathogen-specific typing. It is easily configurable, provides dynamic dispatch of species-specific workflows and produces common outputs.

## What is the target audience?
This workflow can be used in sequencing and reference laboratories as a part of an automated quality and initial bioinformatics assessment protocol.

## Is mikrokondo right for me?
Mikrokondo is purpose-built to provide sequencing and clinical laboratories with an all-encompassing, standardized workflow for the initial quality assessment of sequencing reads and assemblies and for initial pathogen-specific typing. It has been designed to be configurable so that new tools and quality metrics can be easily incorporated, allowing these routine tasks to be automated regardless of the pathogen of interest. It currently accepts Illumina, Nanopore or Pacbio sequencing data (Pacbio data only partially tested), and it is capable of hybrid assembly or accepting pre-assembled genomes.

This workflow will detect what pathogen(s) are present and apply the applicable metrics and genotypic typing where appropriate, generating easy-to-read reports. If your group regularly sequences or analyzes genomic sequences, implementing this workflow will reduce the hands-on time usually required for these common bioinformatic tasks.
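
A minimal launch command is sketched below. This is a hedged sketch only: the `--input`/`--outdir` parameter names, the samplesheet file and the `docker` profile are assumptions for illustration and should be checked against the usage documentation for your installed version.

```bash
# Minimal sketch, not an authoritative command: --input/--outdir and the
# samplesheet format are assumed here, not taken from this page.
nextflow run phac-nml/mikrokondo \
    -profile docker \
    --input samplesheet.csv \
    --outdir results
```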

## Workflow Schematics (Subject to change)

![Pipeline](images/20230921_Mikrokondo-worflow2.png "Workflow")
22 changes: 0 additions & 22 deletions docs/subworkflows/annotate_genomes.md

This file was deleted.

51 changes: 26 additions & 25 deletions docs/subworkflows/assemble_reads.md
@@ -1,25 +1,26 @@
# Assembly

## subworkflows/local/assemble_reads

## Steps

1. **Assembly** proceeds differently depending on whether short paired-end or long reads are provided. **If the samples are marked as metagenomic, then metagenomic assembly flags will be added** to the corresponding assembler.
- **Paired end assembly** is performed using [Spades](https://github.com/ablab/spades) (spades_assemble.nf)
- **Long read assembly** is performed using [Flye](https://github.com/fenderglass/Flye) (flye_assemble.nf)

2. **Bandage plots** are generated using [Bandage](https://rrwick.github.io/Bandage/); these may not be useful for every user, but they can be informative of assembly quality in some situations (bandage_image.nf).

>NOTE:
>Hybrid assembly of long and short reads uses a different workflow that can be found [here](hybrid_assembly.md)
3. **Polishing** (OPTIONAL) can be performed on either short or long/hybrid assemblies. [Minimap2](https://github.com/lh3/minimap2) is used to create a contig index (minimap2_index.nf) and then maps reads to that index (minimap2_map.nf). Lastly, [Racon](https://github.com/isovic/racon) uses this output to perform contig polishing (racon_polish.nf). To turn off polishing add the following to your command line parameters `--skip_polishing`.

## Input
- cleaned reads and metadata

## Outputs
- contigs
- assembly graphs
- polished contigs
- software versions
# Assembly

## subworkflows/local/assemble_reads

>**NOTE:**
>Hybrid assembly of long and short reads uses a different workflow that can be found [here](/subworkflows/hybrid_assembly)
## Steps

1. **Assembly** proceeds differently depending on whether paired-end short reads or long reads are provided. If the samples are marked as metagenomic, then metagenomic assembly flags will be added to the corresponding assembler.
- **Paired end assembly** is performed using [Spades](https://github.com/ablab/spades) (for more information see the module [spades_assemble.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/spades_assemble.nf))
- **Long read assembly** is performed using [Flye](https://github.com/fenderglass/Flye) (for more information see the module [flye_assemble.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/flye_assemble.nf))

2. **Bandage plots** are generated using [Bandage](https://rrwick.github.io/Bandage/); these images are included because they can be informative of assembly quality in some situations (see [bandage_image.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/bandage_image.nf)).

3. **Polishing** (OPTIONAL) can be performed on either short or long/hybrid assemblies. [Minimap2](https://github.com/lh3/minimap2) is used to create a contig index ([minimap2_index.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/minimap2_index.nf)) and then maps reads to that index ([minimap2_map.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/minimap2_map.nf)). Lastly, [Racon](https://github.com/isovic/racon) uses this output to perform contig polishing ([racon_polish.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/racon_polish.nf)). To turn off polishing, add `--skip_polishing` to your command-line parameters (see the example below).
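
As a hedged illustration of disabling polishing (step 3), the sketch below appends `--skip_polishing` to a launch command; the `--input`/`--outdir` parameters and the `docker` profile are placeholder assumptions, not taken from this page.

```bash
# Sketch only: --input/--outdir are assumed placeholders.
# --skip_polishing disables the optional Minimap2/Racon polishing step.
nextflow run phac-nml/mikrokondo \
    -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --skip_polishing
```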

## Input
- cleaned reads
- metadata

## Outputs
- contigs
- assembly graphs
- polished contigs
- software versions
30 changes: 18 additions & 12 deletions docs/subworkflows/bin_contigs.md
@@ -1,12 +1,18 @@
# Bin Contigs

## subworkflows/local/split_metagenomic.nf
## Steps

1. **Kraken2** is run to generate output reports and separate classified contigs from unclassified.
2. **A Python script** is run that separates each classified group of contigs into separate files at a specified taxonomic level (the default level is genus). Quite a few outputs can be generated from this process, as each file ID is updated to be labeled as {Sample Name}_{Genus}

## Input
- contigs, reads and metadata
## Outputs
- metadata, binned contigs
# Bin Contigs

## subworkflows/local/split_metagenomic.nf
## Steps

1. **[Kraken2](https://github.com/DerrickWood/kraken2/wiki)** is run to generate output reports and separate classified contigs from unclassified.
2. **[A custom script](https://github.com/phac-nml/mikrokondo/blob/main/bin/kraken2_bin.py)** separates each classified group of contigs into separate files at a specified taxonomic level (default level: genus). Output files are labeled as `[Sample Name]_[Genus]` to allow for easy post-processing (see the illustrative sketch below).
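
Purely as a conceptual sketch of the binning idea (the pipeline uses the Python script linked above, not this), splitting a contig FASTA by genus from a hypothetical two-column `contig_id<TAB>genus` table could look like the following; the input file names and the `sample1_<Genus>.fasta` output pattern are illustrative assumptions.

```bash
# Conceptual sketch only -- not the pipeline's kraken2_bin.py.
# Assumes contig_genus.tsv maps each contig ID to a genus (one per line)
# and contigs.fasta holds the assembled contigs.
awk 'NR==FNR { genus[$1] = $2; next }                 # load contig -> genus map
     /^>/    { name = substr($1, 2)                   # strip ">" from the header
               out  = "sample1_" (name in genus ? genus[name] : "unclassified") ".fasta" }
             { print >> out }                         # write header and sequence lines
' contig_genus.tsv contigs.fasta
```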

## Input

- contigs
- reads
- metadata

## Outputs

- metadata
- binned contigs
60 changes: 32 additions & 28 deletions docs/subworkflows/clean_reads.md
@@ -1,28 +1,32 @@
# Read Quality Control

## subworkflows/local/clean_reads

## Steps
1. **Reads are decontaminated** using **minimap2** against a 'sequencing off-target' index. This index contains:
- Reads associated with Humans (de-hosting)
- Known sequencing controls (phiX)
2. **FastQC** is run on reads to create summary outputs; **FastQC may not be retained** in later versions of MikroKondo.
3. **Read quality filtering and trimming** is performed using [FastP](https://github.com/OpenGene/fastp)
- Currently no adapters are specified within FastP when it is run and auto-detection is used.
- FastP parameters can be altered within the nextflow.config file. <!-- ADD LINK TO CHANGING PARAMETERS PAGE -->
- Long read data is also run through FastP for gathering of summary data; however, long read (un-paired read) trimming is not performed and only summary metrics are generated. **Chopper** is currently integrated in MikroKondo but it has been removed from this workflow due to a lack of interest in quality trimming of long read data. It may be reintroduced in the future upon request.
4. **Genome size estimation** is performed using [Mash](https://github.com/marbl/Mash) Sketch of reads and estimated genome size is output.
5. **Read downsampling** (OPTIONAL) if toggled on, an estimated depth threshold can be specified to down sample large read sets. This step can be used to improve genome assembly quality, and is something that can be found in other assembly pipelines such as [Shovill](https://github.com/tseemann/shovill). To disable down sampling add `--skip_depth_sampling true` to your command line.
- Depth is estimated by using the estimated genome size output from [Mash](https://github.com/marbl/Mash)
- Total basepairs are taken from [FastP](https://github.com/OpenGene/fastp)
- Read downsampling is then performed using [Seqtk](https://github.com/lh3/seqtk)
6. **Metagenomic assessment** using a custom [Mash](https://github.com/marbl/Mash) 'sketch' file generated from the Genome Taxonomy Database [GTDB](https://gtdb.ecogenomic.org/) and the mash_screen module, the workflow will assess how many bacterial genera are present in a sample (e.g. a contaminated or metagenomic sample may have more than one genus of bacteria present) with greater than 90% identity (according to Mash). When more than one taxon is present, the metagenomic tag is set, turning on metagenomic assembly in later steps. Additionally, Kraken2 will be run on metagenomic assemblies later on and contigs will be binned at a defined taxonomic level (default is genus level).
7. **Nanopore ID screening** duplicate Nanopore read IDs have been known to cause issues in the pipeline downstream. In order to bypass this issue, an option can be toggled where a script will read in Nanopore reads and append a unique ID to the header; this process can be slow, so it can be easily skipped by enabling the `--skip_ont_header_cleaning true` option from the command line.

## Input
- reads and metadata

## Outputs
- quality-trimmed and decontaminated reads
- estimated genome size
- software versions
# Read Quality Control

## subworkflows/local/clean_reads

## Steps
1. **Reads are decontaminated** using [minimap2](https://github.com/lh3/minimap2) against a 'sequencing off-target' index. This index contains:
- Reads associated with Humans (de-hosting)
- Known sequencing controls (phiX)

2. **Read quality filtering and trimming** is performed using [FastP](https://github.com/OpenGene/fastp)
- Currently no adapters are specified within FastP when it is run and auto-detection is used.
- FastP parameters can be altered within the [nextflow.config](https://github.com/phac-nml/mikrokondo/blob/main/nextflow.config) file.
- Long read data is also run through FastP for gathering of summary data; however, long read (un-paired read) trimming is not performed and only summary metrics are generated. [Chopper](https://github.com/wdecoster/chopper) is currently integrated in MikroKondo but it has been removed from this workflow due to a lack of interest in quality trimming of long read data. It may be reintroduced in the future upon request.

3. **Genome size estimation** is performed using a [Mash](https://github.com/marbl/Mash) sketch of the reads, and the estimated genome size is output.

4. **Read downsampling** (OPTIONAL) an estimated depth threshold can be specified to downsample large read sets. This step can be used to improve genome assembly quality, and is something that can be found in other assembly pipelines such as [Shovill](https://github.com/tseemann/shovill). To disable downsampling, add `--skip_depth_sampling true` to your command line (see the example below).
- Depth is estimated by using the estimated genome size output from [Mash](https://github.com/marbl/Mash)
- Total basepairs are taken from [FastP](https://github.com/OpenGene/fastp)
- Read downsampling is then performed using [Seqtk](https://github.com/lh3/seqtk)

5. **Metagenomic assessment** is performed with the mash_screen module using a custom [Mash](https://github.com/marbl/Mash) 'sketch' file generated from the Genome Taxonomy Database ([GTDB](https://gtdb.ecogenomic.org/)). This step assesses how many bacterial genera are present in a sample with greater than 90% identity according to Mash (e.g. a contaminated or metagenomic sample may have more than one genus of bacteria present). When more than one taxon is present, the metagenomic tag is set, turning on metagenomic assembly in later steps. Additionally, Kraken2 will be run on metagenomic assemblies and contigs will be binned at a defined taxonomic level (default level: genus).

6. **Nanopore ID screening** duplicate Nanopore read IDs have been known to cause issues downstream in the pipeline. To bypass this issue, an option can be toggled where a script reads in the Nanopore reads and appends a unique ID to each header; because this process can be slow, the default setting is `--skip_ont_header_cleaning true`.
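
As a hedged example for the optional steps above, the sketch below disables read downsampling (step 4); the `--input`/`--outdir` parameters and the `docker` profile are placeholder assumptions, not taken from this page.

```bash
# Sketch only: --input/--outdir are assumed placeholders.
# --skip_depth_sampling true turns off the optional Seqtk downsampling step.
nextflow run phac-nml/mikrokondo \
    -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --skip_depth_sampling true
```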

## Input
- reads and metadata

## Outputs
- quality-trimmed and decontaminated reads
- estimated genome size
- software versions
33 changes: 17 additions & 16 deletions docs/subworkflows/determine_species.md
@@ -1,16 +1,17 @@
# Determine Species

## subworkflows/local/determine_species

## Steps
1. **Taxonomic classification** is completed using [Mash](https://github.com/marbl/Mash) (DEFAULT), (mash_screen.nf), or [Kraken2](https://github.com/DerrickWood/kraken2) (OPTIONAL, or when samples are flagged metagenomic), (kraken.nf). Species classification and subsequent subtyping can be skipped by passing `--skip_species_classification true` on the command line. To select Kraken2 for speciation rather than mash you can add `--run_kraken true` to your command line arguments.

>NOTE:
>If species specific subtyping tools are to be executed by the pipeline, **Mash must be the chosen classifier**
## Input
- metadata contigs <!-- isn't it reads? Or do you input fasta? -->

## Output
- Mash/Kraken2 report
- software versions
# Determine Species

## subworkflows/local/determine_species

## Steps
1. **Taxonomic classification** is completed using [Mash](https://github.com/marbl/Mash) (DEFAULT), [mash_screen.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/mash_screen.nf), or [Kraken2](https://github.com/DerrickWood/kraken2) (OPTIONAL, or when samples are flagged metagenomic), [kraken.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/kraken.nf). Species classification and subsequent subtyping can be skipped by passing `--skip_species_classification true` on the command line. To select Kraken2 for speciation rather than Mash, add `--run_kraken true` to your command line arguments (see the example below).

>NOTE:
>If species specific subtyping tools are to be executed by the pipeline, **Mash must be the chosen classifier**
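
A hedged example of selecting Kraken2 instead of the default Mash classifier is sketched below; the `--input`/`--outdir` parameters and the `docker` profile are placeholder assumptions. Keep in mind the note above: species-specific subtyping requires Mash.

```bash
# Sketch only: --input/--outdir are assumed placeholders.
# --run_kraken true switches classification from Mash (default) to Kraken2.
nextflow run phac-nml/mikrokondo \
    -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --run_kraken true
```
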
## Input
- metadata
- assembled contigs

## Output
- Mash/Kraken2 report
- software versions
37 changes: 37 additions & 0 deletions docs/subworkflows/genomes_annotate.md
@@ -0,0 +1,37 @@
# Genome Annotation

## subworkflows/local/annotate_genomes

## Steps
1. **Genome annotation** is performed using [Bakta](https://github.com/oschwengers/bakta) within [bakta_annotate.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/bakta_annotate.nf)

- You must download a Bakta database and add its path to the [nextflow.config](https://github.com/phac-nml/mikrokondo/blob/main/nextflow.config) file or add its path as a command line option
- To skip running Bakta add `--skip_bakta true` to your command line options.

2. **Screening for antimicrobial resistance** [Abricate](https://github.com/tseemann/abricate) is used with the default options and default database; however, you can specify a different database by updating the `args` for Abricate in the [nextflow.config](https://github.com/phac-nml/mikrokondo/blob/main/nextflow.config).

- You can skip running Abricate by adding `--skip_abricate true` to your command line options.

3. **Screening for plasmids** is performed using [Mob-suite](https://github.com/phac-nml/mob-suite) with default options.

4. **Selection of PointFinder database**. This step only runs when [StarAMR](https://github.com/phac-nml/staramr) is run. It will try to select the correct database based on the species identified earlier in the pipeline. If a database cannot be determined, PointFinder will simply not be run.

5. **Exporting of StarAMR databases used**. To provide a method of user validation for automatic database selection, the database info from StarAMR will be exported from the pipeline into the file `StarAMRDBVersions.txt` and placed in the StarAMR directory.

6. **Screening for antimicrobial resistance** with **StarAMR**. [StarAMR](https://github.com/phac-nml/staramr) is provided as an additional option to screen for antimicrobial resistance in ResFinder, PointFinder and PlasmidFinder databases. Passing in a database is optional as the one within the container will be used by default.
- You can skip running StarAMR by adding the `--skip_staramr` flag

>NOTE:
>A custom database for Bakta can be downloaded via the command line using `bakta_download_db.nf`.
>The `bakta_db` setting can be changed in the `nextflow.config` file; see [bakta](/usage/tool_params/#bakta)
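
A hedged example combining the options above is sketched below; the `--input`/`--outdir` parameters and the `docker` profile are placeholder assumptions, and passing the database path as `--bakta_db` on the command line is an assumption based on the note above (the `bakta_db` setting can also be changed in `nextflow.config`).

```bash
# Sketch only: --input/--outdir are assumed placeholders, and --bakta_db as a
# command-line option is an assumption; --skip_abricate true is documented above.
nextflow run phac-nml/mikrokondo \
    -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --bakta_db /path/to/bakta/db \
    --skip_abricate true
```
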
## Input
- contigs
- metadata

## Output
- Bakta outputs
- abricate outputs
- mob-suite outputs
- starAMR outputs
- software versions
