From 0cfa940d8a03d76aa33bf5fc9e437b754a59319e Mon Sep 17 00:00:00 2001 From: ChristyPeterson Date: Tue, 13 Feb 2024 09:34:55 -0600 Subject: [PATCH] updated docs --- docs/images/20230630_Mikrokondo-logo_v4.svg | 182 +-- docs/index.md | 31 +- docs/subworkflows/annotate_genomes.md | 22 - docs/subworkflows/assemble_reads.md | 51 +- docs/subworkflows/bin_contigs.md | 30 +- docs/subworkflows/clean_reads.md | 60 +- docs/subworkflows/determine_species.md | 33 +- docs/subworkflows/genomes_annotate.md | 37 + docs/subworkflows/hybrid_assembly.md | 44 +- docs/subworkflows/input_check.md | 32 +- docs/subworkflows/polish_assemblies.md | 35 +- docs/subworkflows/qc_assembly.md | 45 +- docs/subworkflows/subtype_genome.md | 35 +- docs/troubleshooting/FAQ.md | 159 ++- docs/usage/Utilities.md | 18 +- docs/usage/configuration.md | 1131 +++---------------- docs/usage/examples.md | 66 +- docs/usage/installation.md | 124 +- docs/usage/tool_params.md | 466 ++++++++ docs/usage/useage.md | 120 ++ docs/workflows/CleanAssemble.md | 77 +- docs/workflows/PostAssembly.md | 61 +- mkdocs.yml | 7 +- 23 files changed, 1373 insertions(+), 1493 deletions(-) delete mode 100644 docs/subworkflows/annotate_genomes.md create mode 100644 docs/subworkflows/genomes_annotate.md create mode 100644 docs/usage/tool_params.md create mode 100644 docs/usage/useage.md diff --git a/docs/images/20230630_Mikrokondo-logo_v4.svg b/docs/images/20230630_Mikrokondo-logo_v4.svg index 5ae503dc..21db5281 100644 --- a/docs/images/20230630_Mikrokondo-logo_v4.svg +++ b/docs/images/20230630_Mikrokondo-logo_v4.svg @@ -1,91 +1,91 @@ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - kond - - - - - o - - - - - - - - - - - mikro - - - - - - - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + kond + + + + + o + + + + + + + + + + + mikro + + + + + + + diff --git 
a/docs/index.md b/docs/index.md index 0c9efa70..30a49434 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,13 +1,18 @@ - -![Pipeline](images/20230630_Mikrokondo-logo_v4.svg "Logo") -# Welcome to mikrokondo! - -## What is mikrokondo? -Mikrokondo is a tidy workflow for performing routine bioinformatic tasks like, read pre-processing, assessing contamination, assembly and quality assessment of assemblies. It is easily configurable, provides dynamic dispatch of species specific workflows and produces common outputs. - -## Is mikrokondo right for me? -Mikrokondo takes in either, Illumina, Nanopore or Pacbio data (Pacbio data only partially tested). You can also use mikrokondo for hybrid assemblies or even pass it pre-assembled assembled genomes. Additionally, mikrokondo required minimal upfront knowledge of your sample. - -## Workflow Schematics (Subject to change) - -![Pipeline](images/20230921_Mikrokondo-worflow2.png "Workflow") + +![Pipeline](images/20230630_Mikrokondo-logo_v4.svg "Logo") +# Welcome to mikrokondo! + +## What is mikrokondo? +Mikrokondo is a tidy workflow for performing routine bioinformatic assessment of sequencing reads and assemblies, such as: read pre-processing, assessing contamination, assembly, quality assessment of assemblies, and pathogen-specific typing. It is easily configurable, provides dynamic dispatch of species-specific workflows and produces common outputs. + +## What is the target audience? +This workflow can be used in sequencing and reference laboratories as a part of an automated quality and initial bioinformatics assessment protocol. + +## Is mikrokondo right for me? +Mikrokondo is purpose-built to provide sequencing and clinical laboratories with an all-encompassing, standardized workflow for the initial quality assessment of sequencing reads and assemblies, and initial pathogen-specific typing. 
It has been designed to be configurable so that new tools and quality metrics can be easily incorporated into the workflow to allow for automation of these routine tasks regardless of pathogen of interest. It currently accepts Illumina, Nanopore or Pacbio (Pacbio data only partially tested) sequencing data. It is capable of hybrid assembly or accepting pre-assembled genomes. + +This workflow will detect which pathogen(s) are present and apply the applicable metrics and genotypic typing where appropriate, generating reports that are easy to read and understand. If your group is regularly sequencing or analyzing genomic sequences, implementation of this workflow will reduce the hands-on time usually required for these common bioinformatic tasks. + +## Workflow Schematics (Subject to change) + +![Pipeline](images/20230921_Mikrokondo-worflow2.png "Workflow") diff --git a/docs/subworkflows/annotate_genomes.md b/docs/subworkflows/annotate_genomes.md deleted file mode 100644 index 6e5d05f3..00000000 --- a/docs/subworkflows/annotate_genomes.md +++ /dev/null @@ -1,22 +0,0 @@ -# Genome Annotation - -## subworflows/local/annotate_genomes - -## Steps -1. **Genome annotation** is performed using [Bakta](https://github.com/oschwengers/bakta) [Bakta](bakta_annotate.nf), you must download a Bakta database and add its path to the `nextflow.config` file or add its path as a command line option. To skip running Bakta add `--skip_bakta true` to your command line options. -2. **Screening for antimicrobial resistance** with **Abricate**. [Abricate](https://github.com/tseemann/abricate) is used with the default options and default database, however you can specify a database by updating the `args` in the `nextflow.config` for Abricate. You can also skip running Abricate by adding `--skip_abricate true` to your command line options. -3. **Screening for plasmids** is performed using [Mob-suite](https://github.com/phac-nml/mob-suite) with default options. -2. 
**Selection of Pointfindr database**. This step is only ran if running [StarAMR](https://github.com/phac-nml/staramr). It will try and select the correct database based on the species identified earlier in the pipeline. If a database cannot be determined pointfinder will simply not be run. -3. **Exporting of StarAMR databases used**. The database info from StarAMR will be exported from the pipeline into a file and copied into the StarAMR directory **so that you can** validate the correct database has been used. -4. **Screening for antimicrobial resistance** with **StarAMR**. [StarAMR](https://github.com/phac-nml/staramr) is provided as an additional option to screen for antimicrobial resistance in ResFinder, PointFinder and PlasmidFinder databases. Passing in a database is optional as the one within the container will be used by default. - ->NOTE: ->A custom database for Bakta can be downloaded via the commandline using `bakta_download_db.nf`. ->The `bakta_db` setting can be changed in the `nextflow.config` file, see 'Changing Pipeline settings' - -## Input -- contigs and metadata - -## Output -- All associated Bakta outputs -- software versions diff --git a/docs/subworkflows/assemble_reads.md b/docs/subworkflows/assemble_reads.md index a4c1adc7..9ef62e83 100644 --- a/docs/subworkflows/assemble_reads.md +++ b/docs/subworkflows/assemble_reads.md @@ -1,25 +1,26 @@ -# Assembly - -## subworkflows/local/assemble_reads - -## Steps - -1. **Assembly** proceeds differently depending whether short paired-end or long reads. **If the samples are marked as metagenomic, then metagenomic assembly flags will be added** to the corresponding assembler. - - **Paired end assembly** is performed using [Spades](https://github.com/ablab/spades) (spades_assemble.nf) - - **Long read assembly** is performed using [Flye](https://github.com/fenderglass/Flye) (flye_assemble.nf) - -2. 
**Bandage plots** are generated using [Bandage](https://rrwick.github.io/Bandage/), these may not be useful for every user, but the can be informative of assembly quality in some situations (bandage_image.nf). - ->NOTE: ->Hybrid assembly of long and short reads uses a different workflow that can be found [here](hybrid_assembly.md) - -3. **Polishing** (OPTIONAL) can be performed on either short or long/hybrid assemblies. [Minimap2](https://github.com/lh3/minimap2) is used to create a contig index (minimap2_index.nf) and then maps reads to that index (minimap2_map.nf). Lastly, [Racon](https://github.com/isovic/racon) uses this output to perform contig polishing (racon_polish.nf). To turn off polishing add the following to your command line parameters `--skip_polishing`. - -## Input -- cleaned reads and metadata - -## Outputs -- contigs -- assembly graphs -- polished contigs -- software versions +# Assembly + +## subworkflows/local/assemble_reads + +>**NOTE:** +>Hybrid assembly of long and short reads uses a different workflow that can be found [here](/subworkflows/hybrid_assembly) + +## Steps + +1. **Assembly** proceeds differently depending on whether paired-end short reads or long reads are provided. If the samples are marked as metagenomic, then metagenomic assembly flags will be added to the corresponding assembler. + - **Paired end assembly** is performed using [Spades](https://github.com/ablab/spades) (for more information see the module [spades_assemble.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/spades_assemble.nf)) + - **Long read assembly** is performed using [Flye](https://github.com/fenderglass/Flye) (for more information see the module [flye_assemble.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/flye_assemble.nf)) + +2. 
**Bandage plots** are generated using [Bandage](https://rrwick.github.io/Bandage/); these images were included as they can be informative of assembly quality in some situations ([bandage_image.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/bandage_image.nf)). + +3. **Polishing** (OPTIONAL) can be performed on either short or long/hybrid assemblies. [Minimap2](https://github.com/lh3/minimap2) is used to create a contig index [minimap2_index.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/minimap2_index.nf) and then maps reads to that index [minimap2_map.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/minimap2_map.nf). Lastly, [Racon](https://github.com/isovic/racon) uses this output to perform contig polishing [racon_polish.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/racon_polish.nf). To turn off polishing, add `--skip_polishing` to your command line parameters. + +## Input +- cleaned reads +- metadata + +## Outputs +- contigs +- assembly graphs +- polished contigs +- software versions diff --git a/docs/subworkflows/bin_contigs.md b/docs/subworkflows/bin_contigs.md index 0000a399..a8149c59 100644 --- a/docs/subworkflows/bin_contigs.md +++ b/docs/subworkflows/bin_contigs.md @@ -1,12 +1,18 @@ -# Bin Contigs - -## subworkflows/local/split_metagenomic.nf -## Steps - -1. **Kraken2** is run to generate output reports and separate classified contigs from unclassified. -2. **A Python script** is run that separates each classified group of contigs into separate files at a specified taxonomic level (the default level is genus). Quite a few outputs can be generated from the process as each file is each file id is updated to be labeled as {Sample Name}_{Genus} - -## Input -- contigs, reads and meta data -## Outputs -- metadata, binned contigs +# Bin Contigs + +## subworkflows/local/split_metagenomic.nf +## Steps + +1. 
**[Kraken2](https://github.com/DerrickWood/kraken2/wiki)** is run to generate output reports and separate classified contigs from unclassified. +2. **[A custom script](https://github.com/phac-nml/mikrokondo/blob/main/bin/kraken2_bin.py)** separates each classified group of contigs into separate files at a specified taxonomic level (default level: genus). Output files are labeled as `[Sample Name]_[Genus]` to allow for easy post processing. + +## Input + +- contigs +- reads +- metadata + +## Outputs + +- metadata +- binned contigs diff --git a/docs/subworkflows/clean_reads.md b/docs/subworkflows/clean_reads.md index f21d1f67..13403528 100644 --- a/docs/subworkflows/clean_reads.md +++ b/docs/subworkflows/clean_reads.md @@ -1,28 +1,32 @@ -# Read Quality Control - -## subworkflows/local/clean_reads - -## Steps -1. **Reads are decontaminated** using **minimap2**, against an 'sequencing off-target' index. This index contains: - - Reads associated with Humans (de-hosting) - - Known sequencing controls (phiX) -2. **FastQC** is run on reads to create summary outputs, **FastQC may not be retained** in later versions of MikroKondo. -3. **Read quality filtering and trimming** is performed using [FastP](https://github.com/OpenGene/fastp) - - Currently no adapters are specified within FastP when it is run and auto-detection is used. - - FastP parameters can be altered within the nextflow.config file. - - Long read data is also run through FastP for gathering of summary data, however long read (un-paired reads) trimming is not performed and only summary metrics are generated. **Chopper** is currently integrated in MikroKondo but it has been removed from this workflow due to a lack of interest in quality trimming of long read data. It may be reintroduced in the future upon request. -4. **Genome size estimation** is performed using [Mash](https://github.com/marbl/Mash) Sketch of reads and estimated genome size is output. -5. 
**Read downsampling** (OPTIONAL) if toggled on, an estimated depth threshold can be specified to down sample large read sets. This step can be used to improve genome assembly quality, and is something that can be found in other assembly pipelines such as [Shovill](https://github.com/tseemann/shovill). To disable down sampling add `--skip_depth_sampling true` to your command line. - - Depth is estimated by using the estimated genome size output from [Mash](https://github.com/marbl/Mash) - - Total basepairs are taken from [FastP](https://github.com/OpenGene/fastp) - - Read downsampling is then performed using [Seqtk](https://github.com/lh3/seqtk) -6. **Metagenomic assesment** using a custom [Mash](https://github.com/marbl/Mash) 'sketch' file generated from the Genome Taxonomy Database [GTDB](https://gtdb.ecogenomic.org/) and the mash_screen module, the workflow will assess how many bacterial genera are present in a sample (e.g. a contaminated or metagenomic sample may have more than one genus of bacteria present) with greater than 90% identity (according to Mash). When more than 1 taxa are present, the metagenomic tag is set, turning on metagenomic assembly in later steps. Additionally Kraken2 will be run on metagenomic assemblis later on and contigs will be binned at a defined taxonomic level (default is genus level). -7. **Nanopore ID screening** duplicate Nanopore read ID's have been known to cause issues in the pipeline downstream. In order to bypass this issue, an option can be toggled where a script will read in Nanopore reads and append a unique ID to the header, this process can be slow so it can be easily skipped by enabling the `--skip_ont_header_cleaning true` option from the command line. - -## Input -- reads and metadata - -## Outputs -- quality trimmed and deconned reads -- estimated genome size -- software versions +# Read Quality Control + +## subworkflows/local/clean_reads + +## Steps +1. 
**Reads are decontaminated** using [minimap2](https://github.com/lh3/minimap2) against a 'sequencing off-target' index. This index contains: + - Reads associated with Humans (de-hosting) + - Known sequencing controls (phiX) + +2. **Read quality filtering and trimming** is performed using [FastP](https://github.com/OpenGene/fastp) + - Currently no adapters are specified within FastP when it is run and auto-detection is used. + - FastP parameters can be altered within the [nextflow.config](https://github.com/phac-nml/mikrokondo/blob/main/nextflow.config) file. + - Long read data is also run through FastP for gathering of summary data; however, long read (un-paired read) trimming is not performed and only summary metrics are generated. [Chopper](https://github.com/wdecoster/chopper) is currently integrated in MikroKondo but it has been removed from this workflow due to a lack of interest in quality trimming of long read data. It may be reintroduced in the future upon request. + +3. **Genome size estimation** is performed using [Mash](https://github.com/marbl/Mash): a sketch of the reads is generated and an estimated genome size is output. + +4. **Read downsampling** (OPTIONAL): an estimated depth threshold can be specified to down sample large read sets. This step can be used to improve genome assembly quality, and is something that can be found in other assembly pipelines such as [Shovill](https://github.com/tseemann/shovill). To disable down sampling add `--skip_depth_sampling true` to your command line. + - Depth is estimated by using the estimated genome size output from [Mash](https://github.com/marbl/Mash) + - Total basepairs are taken from [FastP](https://github.com/OpenGene/fastp) + - Read downsampling is then performed using [Seqtk](https://github.com/lh3/seqtk) + +5. 
**Metagenomic assessment** using a custom [Mash](https://github.com/marbl/Mash) 'sketch' file generated from the Genome Taxonomy Database [GTDB](https://gtdb.ecogenomic.org/) and the mash_screen module, this step assesses how many bacterial genera are present in a sample (e.g. a contaminated or metagenomic sample may have more than one genus of bacteria present) with greater than 90% identity (according to Mash). When more than one taxon is present, the metagenomic tag is set, turning on metagenomic assembly in later steps. Additionally, Kraken2 will be run on metagenomic assemblies and contigs will be binned at a defined taxonomic level (default level: genus). + +6. **Nanopore ID screening**: duplicate Nanopore read IDs have been known to cause issues in the pipeline downstream. In order to bypass this issue, an option can be toggled where a script will read in Nanopore reads and append a unique ID to the header. This process can be slow, so the default setting is `--skip_ont_header_cleaning true`. + +## Input
- reads and metadata + +## Outputs +- quality trimmed and deconned reads +- estimated genome size +- software versions diff --git a/docs/subworkflows/determine_species.md b/docs/subworkflows/determine_species.md index 06c64ad5..c77ebfda 100644 --- a/docs/subworkflows/determine_species.md +++ b/docs/subworkflows/determine_species.md @@ -1,16 +1,17 @@ -# Determine Species - -## subworkflows/local/determine_species - -## Steps -1. **Taxonomic classification** is completed using [Mash](https://github.com/marbl/Mash) (DEFAULT), (mash_screen.nf), or [Kraken2](https://github.com/DerrickWood/kraken2) (OPTIONAL, or when samples are flagged metagenomic), (kraken.nf). Species classification and subsequent subtyping can be skipped by passing `--skip_species_classification true` on the command line. To select Kraken2 for speciation rather than mash you can add `--run_kraken true` to your command line arguments. 
- ->NOTE: ->If species specific subtyping tools are to be executed by the pipeline, **Mash must be the chosen classifier** - -## Input -- metadata contigs - -## Output -- Mash/Kraken2 report -- software versions +# Determine Species + +## subworkflows/local/determine_species + +## Steps +1. **Taxonomic classification** is completed using [Mash](https://github.com/marbl/Mash) (DEFAULT), [mash_screen.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/mash_screen.nf), or [Kraken2](https://github.com/DerrickWood/kraken2) (OPTIONAL, or when samples are flagged metagenomic), [kraken.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/kraken.nf). Species classification and subsequent subtyping can be skipped by passing `--skip_species_classification true` on the command line. To select Kraken2 for speciation rather than Mash, add `--run_kraken true` to your command line arguments. + +>NOTE: +>If species specific subtyping tools are to be executed by the pipeline, **Mash must be the chosen classifier** + +## Input +- metadata +- assembled contigs + +## Output +- Mash/Kraken2 report +- software versions diff --git a/docs/subworkflows/genomes_annotate.md b/docs/subworkflows/genomes_annotate.md new file mode 100644 index 00000000..1ba31ca8 --- /dev/null +++ b/docs/subworkflows/genomes_annotate.md @@ -0,0 +1,37 @@ +# Genome Annotation + +## subworkflows/local/annotate_genomes + +## Steps +1. **Genome annotation** is performed using [Bakta](https://github.com/oschwengers/bakta) within [bakta_annotate.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/bakta_annotate.nf) + + - You must download a Bakta database and add its path to the [nextflow.config](https://github.com/phac-nml/mikrokondo/blob/main/nextflow.config) file or add its path as a command line option + - To skip running Bakta add `--skip_bakta true` to your command line options. + +2. 
**Screening for antimicrobial resistance** [Abricate](https://github.com/tseemann/abricate) is used with the default options and default database; however, you can specify a database by updating the `args` in the [nextflow.config](https://github.com/phac-nml/mikrokondo/blob/main/nextflow.config) for Abricate. + + - You can skip running Abricate by adding `--skip_abricate true` to your command line options. + +3. **Screening for plasmids** is performed using [Mob-suite](https://github.com/phac-nml/mob-suite) with default options. + +4. **Selection of PointFinder database**. This step is only run when running [StarAMR](https://github.com/phac-nml/staramr). It will try to select the correct database based on the species identified earlier in the pipeline. If a database cannot be determined, PointFinder will simply not be run. + +5. **Exporting of StarAMR databases used**. To provide a method of user validation for automatic database selection, the database info from StarAMR will be exported from the pipeline into the file `StarAMRDBVersions.txt` and placed in the StarAMR directory. + +6. **Screening for antimicrobial resistance** with **StarAMR**. [StarAMR](https://github.com/phac-nml/staramr) is provided as an additional option to screen for antimicrobial resistance in ResFinder, PointFinder and PlasmidFinder databases. Passing in a database is optional as the one within the container will be used by default. + - You can skip running StarAMR by adding the following flag `--skip_staramr` + +>NOTE: +>A custom database for Bakta can be downloaded via the command line using `bakta_download_db.nf`. 
+>The `bakta_db` setting can be changed in the `nextflow.config` file, see [bakta](/usage/tool_params/#bakta) + +## Input +- contigs +- metadata + +## Output +- Bakta outputs +- abricate outputs +- mob-suite outputs +- starAMR outputs +- software versions diff --git a/docs/subworkflows/hybrid_assembly.md b/docs/subworkflows/hybrid_assembly.md index 94068461..e8c7c8a1 100644 --- a/docs/subworkflows/hybrid_assembly.md +++ b/docs/subworkflows/hybrid_assembly.md @@ -1,21 +1,23 @@ -# Hybrid Assembly - -## subworkflows/local/hybrid_assembly - -## Choice of 2 workflows -1. **DEFAULT** - A. [Flye](https://github.com/fenderglass/Flye) assembly (flye_assembly.nf) - B. [Bandage](https://rrwick.github.io/Bandage/) creates a bandage plot of the assembly (bandage_image.nf) - C. [Minimap2](https://github.com/lh3/minimap2) creates an index of the contigs (minimap2_index.nf), then maps long reads to this index (minimap2_map.nf) - D. [Racon](https://github.com/isovic/racon) uses the short reads to iteratively polish contigs (pilon_iter.nf) -2. **OPTIONAL** - A. [Unicycler](https://github.com/rrwick/Unicycler) assembly (unicycler_assemble.nf) - B. [Bandage](https://rrwick.github.io/Bandage/) creates a bandage plot of the assembly (bandage_image.nf) - -## Input -- metadata, short reads and long reads - -## Output -- contigs (pilon, unicycler) -- vcf data (pilon) -- software versions +# Hybrid Assembly + +## subworkflows/local/hybrid_assembly + +## Choice of 2 workflows +1. **DEFAULT** + A. [Flye](https://github.com/fenderglass/Flye) assembly [flye_assembly.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/flye_assemble.nf) + B. [Bandage](https://rrwick.github.io/Bandage/) creates a bandage plot of the assembly [bandage_image.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/bandage_image.nf) + C. 
[Minimap2](https://github.com/lh3/minimap2) creates an index of the contigs [minimap2_index.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/minimap2_index.nf), then maps long reads to this index [minimap2_map.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/minimap2_map.nf) + D. [Racon](https://github.com/isovic/racon) uses the short reads to iteratively polish contigs [pilon_iter.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/pilon_polisher.nf) +2. **OPTIONAL** + A. [Unicycler](https://github.com/rrwick/Unicycler) assembly [unicycler_assemble.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/unicycler_assemble.nf) + B. [Bandage](https://rrwick.github.io/Bandage/) creates a bandage plot of the assembly [bandage_image.nf](https://github.com/phac-nml/mikrokondo/blob/main/modules/local/bandage_image.nf) + +## Input +- metadata +- short reads +- long reads + +## Output +- contigs (pilon, unicycler) +- vcf data (pilon) +- software versions diff --git a/docs/subworkflows/input_check.md b/docs/subworkflows/input_check.md index 6de6342f..3e607c37 100644 --- a/docs/subworkflows/input_check.md +++ b/docs/subworkflows/input_check.md @@ -1,15 +1,17 @@ -# Input Verification - -## subworkflows/local/input_check.nf - - -## Steps -1. Intake Sample sheet CSV and group samples with same ID. Sample metadata specific to the pipeline is added. A metadata field will additionally be created for samples containing the read data and sample information such as the samples name, and if the sample contains paired reads (Illumina) or long reads (Nanopore or Pacbio). -2. If there are samples that contain duplicate ID's the **samples will be combined**. - - -## Input -- CSV formatted sample sheet - -## Outputs -- A channel of reads and their associated metadata +# Input Verification + +## subworkflows/local/input_check.nf + + +## Steps +1. Reads in the sample sheet and groups samples with shared IDs. + +2. 
Pipeline-specific tags are added to each sample. + +3. If there are samples that have duplicate IDs, the **samples will be combined**. + +## Input +- CSV formatted sample sheet + +## Outputs +- A channel of reads and their associated tags diff --git a/docs/subworkflows/polish_assemblies.md b/docs/subworkflows/polish_assemblies.md index 7905bc48..c0e1f714 100644 --- a/docs/subworkflows/polish_assemblies.md +++ b/docs/subworkflows/polish_assemblies.md @@ -1,16 +1,19 @@ -# Assembly Polishing - -## subworkflows/local/polish_assemblies - -## Steps -1. Final polishing proceeds differently depending on whether the sample is Illumina or Pacbio. - - **Illumina** A custom script is implemented which iteratively polishes assemblies with reads based on a set amount of iterations specified by the user. polishing uses Pilon and minimap2, with reads being mapped back to the polished assembly each time. - - **Nanopore** Medaka consensus is used to polish reads, a model must be specified by the user for polishing - - **Pacbio** No addtional polishing is performed, outputs of Pacbio data still need to be tested. - -## Input -- cleaned reads -- Assembly - -## Outputs -- Polished assemblies and the reads used to polish them +# Assembly Polishing + +## subworkflows/local/polish_assemblies + +## Steps +1. Final polishing proceeds differently depending on whether the sample is Illumina, Nanopore or Pacbio. + + - **Illumina** A custom script is implemented to iteratively polish assemblies with reads based on a set number of iterations (DEFAULT 3). + - Polishing uses [Pilon](https://github.com/broadinstitute/pilon) and [minimap2](https://github.com/lh3/minimap2), with reads being mapped back to the polished assembly each time. + - **Nanopore** [Medaka](https://github.com/nanoporetech/medaka) consensus is used to polish assemblies; a model must be specified by the user for polishing. + - **Pacbio** No additional polishing is performed; outputs of Pacbio data still need to be tested. 
+ +## Input +- cleaned reads +- Assembly + +## Outputs +- Polished assemblies +- Reads used to polish diff --git a/docs/subworkflows/qc_assembly.md b/docs/subworkflows/qc_assembly.md index a58d0706..7bea75bf 100644 --- a/docs/subworkflows/qc_assembly.md +++ b/docs/subworkflows/qc_assembly.md @@ -1,18 +1,27 @@ -# Assembly Quality Control - -## subworkflows/local/qc_assembly - -## Steps -1. **Generate assembly quality metrics** using **QUAST**. QUAST is used to generate summary assembly metrics such as: N50 value, number of contigs,average depth of coverage and genome size. -2. **Assembly filtering** a script implemented using the nextflow DSL (Groovy) then filters assemblies that meet quality thresholds, so that only assemblies meeting some given set of criteria are used in down stream processing. -3. **Contamination detection** using CheckM, CheckM is run to identify a percent contamination score and build up evidence for signs of contamination in a sample. CheckM can be skipped by adding `--skip_checkm` to you command-line options as the data it generates may not be needed, and it can have a long run time. -4. **Classic seven gene MLST** using **mlst**. (mlst)[https://github.com/tseemann/mlst] is run and its outputs are contained within the final report. This step can be skipped by adding `--skip_mlst` to the commmand line options. - - -## Input -- cleaned reads and metadata -- polished contigs and metadata - -## Outputs -- filtered contigs -- software versions +# Assembly Quality Control + +## subworkflows/local/qc_assembly + +## Steps +1. **Generate assembly quality metrics** [QUAST](https://github.com/ablab/quast) is used to generate summary assembly metrics such as: N50 value, number of contigs, average depth of coverage and genome size. + +2. **Assembly filtering** Using a custom Nextflow DSL (Groovy) script, assemblies are filtered to meet quality thresholds. 
- See [nextflow.config](https://github.com/phac-nml/mikrokondo/blob/main/nextflow.config) in the `quast_filter` section to see what defaults are currently implemented, or to set your own. + +3. **Contamination detection** [CheckM](https://github.com/Ecogenomics/CheckM) is run to identify a percent contamination score and build up evidence for signs of contamination in a sample. + + - CheckM can be skipped by adding `--skip_checkm` to the command-line options as the data it generates may not be needed, and it can have a long run time. + +4. **Classic seven gene MLST** [mlst](https://github.com/tseemann/mlst) is run and its outputs are contained within the final report. + + - This step can be skipped by adding `--skip_mlst` to the command line options. + + +## Input +- cleaned reads with tags +- polished contigs with tags + +## Outputs +- filtered contigs +- software versions diff --git a/docs/subworkflows/subtype_genome.md b/docs/subworkflows/subtype_genome.md index 7b97c1b4..118a7f1a 100644 --- a/docs/subworkflows/subtype_genome.md +++ b/docs/subworkflows/subtype_genome.md @@ -1,16 +1,19 @@ -# Genome Sub-typing - -## subworkflows/local/subtype_genome - -## Steps -1. **Species specific subtyping** tools are launched requiring the pipelines outputted **Mash** screen report. Currently subtyping tools for *E.coli*, *Salmonella*, *Listeria spp.*, *Staphylococcus spp.*, *Klebsiella spp.* and *Shigella spp.* are supported. Subtyping can be disabled from the command line by passing `--skip_subtyping true` on the command line. - -## Note of importance -If a sample cannot be subtyped, it merely passes through the pipeline and is not typed. A log message will instead be displayed notifying the user the sample cannot be typed however. - -## Input -- contigs and meta data -- Mash report - -## Output -- software versions +# Genome Sub-typing + +## subworkflows/local/subtype_genome + +## Steps +1. 
**Species specific subtyping** tools are launched according to the pipeline's **Mash** screen report. + + - Currently subtyping tools for *E. coli*, *Salmonella*, *Listeria spp.*, *Staphylococcus spp.*, *Klebsiella spp.* and *Shigella spp.* are supported. + - Subtyping can be disabled by passing `--skip_subtyping true` on the command line. + +> **NOTE** +> If a sample cannot be subtyped, it merely passes through the pipeline and is not typed. A log message will be displayed notifying the user that the sample cannot be typed. + +## Input +- contigs and associated tags +- Mash report + +## Output +- software versions diff --git a/docs/troubleshooting/FAQ.md b/docs/troubleshooting/FAQ.md index c23b1a99..3003cb6a 100644 --- a/docs/troubleshooting/FAQ.md +++ b/docs/troubleshooting/FAQ.md @@ -1,80 +1,79 @@ -# FAQ - -## How is variable type determined from command line parameters? - -This may be a weird thing to but in the docs, but if you are developing the pipeline or somehow finding that a parameter passed on the command line is not working properly. For example you want a sample to have at least 1000 reads before going for assembly (`--min_reads 1000`) and samples with only one read are being assembled this may the source of your issue. - -The way a variable type is determined from the command line can be found in the following [groovy code](https://github.com/nextflow-io/nextflow/blob/8c0566fc3a35c8d3a4e01a508a0667e471bab297/modules/nextflow/src/main/groovy/nextflow/cli/CmdRun.groovy#L506-L518). 
The snippet is also pasted below and is up to date as of 2023-10-16: - -``` - static protected parseParamValue(String str ) { - - if ( str == null ) return null - - if ( str.toLowerCase() == 'true') return Boolean.TRUE - if ( str.toLowerCase() == 'false' ) return Boolean.FALSE - - if ( str==~/\d+(\.\d+)?/ && str.isInteger() ) return str.toInteger() - if ( str==~/\d+(\.\d+)?/ && str.isLong() ) return str.toLong() - if ( str==~/\d+(\.\d+)?/ && str.isDouble() ) return str.toDouble() - - return str - } -``` - -## Common errors and how to (maybe) fix them - -### Troubleshooting - -Common errors and potential fixes for modules will be detailed here. - -### null errors, or report generation failing on line 701 - -Currently there seems to be some compatibility issues between version 22 of nextflow and version 23.10.0 with regards to parsing the `nextflow.config` file. I am currently working on addressing them now. if you happen to encounter issues please downgrade your nextflow install to 22.10.1 - -### Permission denied on a python script (`bin/some_script.py`) - -There may be an issue on certain installs where the python scripts included alongside mikrokondo do not work due to lack of permissions. The easiest way to solve this issue is to execute `chmod +x bin/*.py` in the mikrokondo installation directory. This will add execution permissions to all of the scripts, if this solution does not work then please submit an issue. - -### Random issues containing on resume `org.iq80.leveldb.impl.Version.retain()` - -Sometimes the resume features of Nextflow don't always work completely. The above error string typically implies that some output could not be gathered from a process and on subsequent resumes you will get an error. You can find out what process (and its work directory location) caused the error in the `nextflow.log` (normally it will be at the top of some long traceback in the log), and a work directory will be specified listing the directory causing the error. 
Delete this directory and resume the pipeline. **If you hate logs and you don't care about resuming** other processes you can simply delete the work directory entirely. - - -### StarAMR - -- Exit code 1, and an error involving ` stderr=FASTA-Reader: Ignoring invalid residues at position(s):` - - This is likely not a problem with your data but with your databases, following the instructions listed here: https://github.com/phac-nml/staramr/issues/200#issuecomment-1741082733 seems to have fixed the issue. - - The command to download the proper databases mentioned in the issue is listed here: `staramr db build --dir staramr_databases --resfinder-commit fa32d9a3cf0c12ec70ca4e90c45c0d590ee810bd --pointfinder-commit 8c694b9f336153e6d618b897b3b4930961521eb8 --plasmidfinder-commit c18e08c17a5988d4f075fc1171636e47546a323d` - - **Passing in a database is optional as the one within the container will be used by default.** - - If you continue to have problems with StarAMR you can skip it using `--skip_staramr` - - -### Common mash estimates - -- Mash exit code 139 or 255, you may see `org.iq80.leveldb.impl.Version.retain()` appearing on screen as well. - - This indicates a segmentation fault, due to mash failing or alternatively some resource not being available. If you see that mash has run properly in the work directory output but Nextflow is saying the process failed and the `versions.yml` file is missing you likely have encountered some resource limit on your system. A simple solution is likely to reduce the number of `maxForks` available to mash the different Mash processes in the `conf/modules.config` file. Alternatively you may need to alter the number some environment variables Nextflow e.g. `OMP_NUM_THREADS`, `USE_SIMPLE_THREADED_LEVEL3` and `OPENBLAS_NUM_THREADS`. - -### Common spades issues - -- Spades exit code 21 - - One potential cause of this issue (requires looking at the log files) is due to not enough reads being present. 
You can avoid samples with too few reads going to assembly by adjusting the `min_reads` parameter in the `nextflow.config`. It can also be adjusted from the command line like so `--min_reads 1000` - -- Spades exit code 245 - - This could be due to multiple issues and typically results from a segmentation fault (OS Code 11). Try increasing the amount of memory spades (`conf/base.config`) if the issue persists try using a different Spades container/ create an issue. - -### Common Kraken2 issues - -- Kraken2 exit code 2 - - It is still a good idea to look at the output logs to verify your issue as they may say something like: `kraken2: database ("./kraken2_database") does not contain necessary file taxo.k2d` despite the taxo.k2d file being present. This is potentially caused by symlink issues, and one possible fix is too to provide the absolute path to your Kraken2 database in the `nextflow.config` or from the command line `--kraken.db /PATH/TO/DB` - - -### Common Docker issues - -- Exit code 137: - - Exit code 137, likely means your docker container used to much memory. You can adjust how much memory each process gets in the `conf/base.config` file, however there may be some underlying configuration you need to perform for Docker to solve this issue. - -### CheckM fails - -- CheckM exit code 1, could not find concatenated.tree or concatentated.pplacer.json - - This is a sign that CheckM has run out of memory, make sure you are using your desired executor. You may need to adjust configuration settings. +# FAQ + +## How is variable type determined from command line parameters? + +This section may help if you are developing the pipeline, or find that a parameter passed on the command line is not working as expected. For example: the user wants a sample to have at least 1000 reads before going for assembly (`--min_reads 1000`), yet samples with fewer than 1000 reads are passing on to the assembly step. 
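Nextflow infers each parameter's type from the raw string it receives on the command line. Purely for illustration, the coercion behaves roughly like the following Python sketch (a rough re-creation, not Nextflow's actual Groovy code):

```python
import re

# Full-match numeric pattern, mirroring Groovy's ==~ comparison.
NUMERIC = re.compile(r"\d+(\.\d+)?")

def parse_param_value(value):
    """Rough re-creation of how Nextflow coerces a --param string to a type."""
    if value is None:
        return None
    if value.lower() == "true":
        return True
    if value.lower() == "false":
        return False
    if NUMERIC.fullmatch(value):
        # Groovy tries toInteger()/toLong() first, then toDouble()
        return int(value) if "." not in value else float(value)
    return value  # anything else stays a plain string

# "--min_reads 1000" arrives as the string "1000" and becomes an integer,
# while values like "1e3" or "1,000" fail the numeric pattern and stay
# strings, which can silently defeat a numeric comparison downstream.
```

This is why writing `--min_reads 1000` behaves differently from `--min_reads 1e3`: the latter never becomes a number.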
+ +The way a variable type is determined from the command line can be found in this [groovy code](https://github.com/nextflow-io/nextflow/blob/8c0566fc3a35c8d3a4e01a508a0667e471bab297/modules/nextflow/src/main/groovy/nextflow/cli/CmdRun.groovy#L506-L518). The snippet is also pasted below and is up to date as of 2023-10-16: + +``` + static protected parseParamValue(String str ) { + + if ( str == null ) return null + + if ( str.toLowerCase() == 'true') return Boolean.TRUE + if ( str.toLowerCase() == 'false' ) return Boolean.FALSE + + if ( str==~/\d+(\.\d+)?/ && str.isInteger() ) return str.toInteger() + if ( str==~/\d+(\.\d+)?/ && str.isLong() ) return str.toLong() + if ( str==~/\d+(\.\d+)?/ && str.isDouble() ) return str.toDouble() + + return str + } +``` + +# Troubleshooting + +## Common errors and how to (maybe) fix them + +### null errors, or report generation failing on line 701 + +Currently there are compatibility issues between versions 22 and 23.10.0 of Nextflow with regards to parsing the `nextflow.config` file. I am currently working on addressing them now; if you happen to encounter issues, please downgrade your Nextflow install to 22.10.1. + +### Permission denied on a python script (`bin/some_script.py`) + +On some installs, a lack of permissions on the python scripts causes this error to occur. The easiest way to solve this issue is to execute `chmod +x bin/*.py` in the mikrokondo installation directory. This will add execution permissions to all of the scripts; if this solution does not work, please submit an issue. + +### Random issues on resume containing `org.iq80.leveldb.impl.Version.retain()` + +Sometimes the resume features of Nextflow don't work completely. The above error string typically implies that some output could not be gathered from a process and on subsequent resumes you will get an error. 
You can find out what process (and its work directory location) caused the error in the `nextflow.log` (normally it will be at the top of some long traceback in the log), and a work directory will be specified listing the directory causing the error. Delete this directory and resume the pipeline. **If you hate logs and you don't care about resuming** you can simply delete the work directory entirely. + + +### StarAMR + +- Exit code 1, and an error involving ` stderr=FASTA-Reader: Ignoring invalid residues at position(s):` + - This is likely not a problem with your data but with your databases; following the instructions listed [here](https://github.com/phac-nml/staramr/issues/200#issuecomment-1741082733) should fix the issue. + - The command to download the proper databases mentioned in the issue is listed here: + `staramr db build --dir staramr_databases --resfinder-commit fa32d9a3cf0c12ec70ca4e90c45c0d590ee810bd --pointfinder-commit 8c694b9f336153e6d618b897b3b4930961521eb8 --plasmidfinder-commit c18e08c17a5988d4f075fc1171636e47546a323d` + - **Passing in a database is optional as the one within the container will be used by default.** + - If you continue to have problems with StarAMR you can skip it using `--skip_staramr` + + +### Common mash errors + +- Mash exit code 139 or 255, you may see `org.iq80.leveldb.impl.Version.retain()` appearing on screen as well. + - This indicates a segmentation fault, caused by mash failing or by some resource not being available. If you see that mash has run properly in the work directory output but Nextflow is saying the process failed and the `versions.yml` file is missing, you likely have encountered some resource limit on your system. A simple solution is likely to reduce the number of `maxForks` available to the different Mash processes in the `conf/modules.config` file. Alternatively you may need to alter the values of some environment variables used by Nextflow, e.g. 
`OMP_NUM_THREADS`, `USE_SIMPLE_THREADED_LEVEL3` and `OPENBLAS_NUM_THREADS`. + +### Common spades issues + +- Spades exit code 21 + - One potential cause of this issue (requires looking at the log files) is not enough reads being present. You can avoid samples with too few reads going to assembly by adjusting the `min_reads` parameter in the `nextflow.config`. It can also be adjusted from the command line with the flag `--min_reads 1000` + +- Spades exit code 245 + - This could be due to multiple issues and typically results from a segmentation fault (OS Code 11). Try increasing the amount of memory spades is allotted ([base.config](https://github.com/phac-nml/mikrokondo/blob/main/conf/base.config)). If the issue persists, try using a different Spades container or create an issue. + +### Common Kraken2 issues + +- Kraken2 exit code 2 + - It is still a good idea to look at the output logs to verify your issue as they may say something like: `kraken2: database ("./kraken2_database") does not contain necessary file taxo.k2d` despite the taxo.k2d file being present. This is potentially caused by symlink issues, and one possible fix is to provide the absolute path to your Kraken2 database in the [nextflow.config](https://github.com/phac-nml/mikrokondo/blob/main/nextflow.config) or from the command line `--kraken.db /PATH/TO/DB` + + +### Common Docker issues + +- Exit code 137: + - Exit code 137 likely means your docker container used too much memory. You can adjust how much memory each process gets in the [base.config](https://github.com/phac-nml/mikrokondo/blob/main/conf/base.config) file; however, there may be some underlying configuration you need to perform for Docker to solve this issue. + +### CheckM fails + +- CheckM exit code 1, could not find concatenated.tree or concatenated.pplacer.json + - This is a sign that CheckM has run out of memory; make sure you are using your desired executor. You may need to adjust configuration settings. 
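Several of the fixes above come down to granting a process more memory or time in `conf/base.config`. As a sketch only, using the `process_high_memory` label the pipeline's configuration documentation describes (the exact label or process names, and sensible values, should be verified against your own `conf/base.config` and cluster):

```groovy
// Sketch: raise the ceiling for high-memory processes such as CheckM.
// Verify the label name against your own conf/base.config before using.
process {
    withLabel: process_high_memory {
        memory = 200.GB   // hypothetical value; size to your hardware
        time   = 12.h
    }
}
```

The same `withLabel:` pattern applies to the other resource labels (`process_medium`, `process_high`, etc.) when a tool hits an exit code 137 or a CheckM/Spades memory failure.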
diff --git a/docs/usage/Utilities.md b/docs/usage/Utilities.md index 8903f09b..2e949a0c 100644 --- a/docs/usage/Utilities.md +++ b/docs/usage/Utilities.md @@ -1,9 +1,9 @@ -# Utilities - -## Run script -The command line interface for Nextflow can become lengthy and tedious to type each time, due to the customization associated with routine pipeline runs, the lack of short form variable flags in Nextflow e.g. typing `--nanopore_chemisty` each time can be tedious. - -A run script skeleton has been provided in the `utils` folder of mikrokondo (`utils/mk_run.sh`). Please customize the script to make it fit your usage, if you have issues running your modified script make sure it is executable by adding running `chmod +x mk_run.sh`. - -## Parameter file -params-file -You can also add a params file to the launch of Nextflow from the command line. More information is provided [here](https://www.nextflow.io/blog/2020/cli-docs-release.html) +# Utilities + +## Run script +The command line interface for Nextflow can become lengthy due to the customization associated with routine pipeline runs, and the lack of short-form flags in Nextflow means typing e.g. `--nanopore_chemistry` each time can be tedious. + +A run script skeleton has been provided in the `utils` folder of mikrokondo (`utils/mk_run.sh`). Please customize the script to fit your usage; if you have issues running your modified script, make sure it is executable by running `chmod +x mk_run.sh`. + +## Parameter file (`-params-file`) +You can also add a params file to the launch of Nextflow from the command line. 
More information is provided [here](https://www.nextflow.io/blog/2020/cli-docs-release.html) diff --git a/docs/usage/configuration.md b/docs/usage/configuration.md index b1b6df71..a7cf2095 100644 --- a/docs/usage/configuration.md +++ b/docs/usage/configuration.md @@ -1,949 +1,182 @@ -# Command line and Configuration Usage - - -## Configuration files - -When cloning the pipeline from github a directory labeled `conf/` will be present. Within the `conf/` folder two configuration files are of interest to the user: - -- base.config: Where cpu, memory and time parameters can be set for the different workflow processes. **You will likely need to adjust parameters within this file for you computing environment**. - -- modules.config: Where process specific parameters are set. It is unadvised to touch this configuration file unless performing pipeline development. - -### Base configuration (conf/base.config) -Within this file computing resources can be configured for each process. Different labels are listed defining different resources for different Nextflow processes. The defined labels you are encouraged to modify are: - -- `process_single`: Resource definitions for processes requiring only a single core and low memory (listing of directories). -- `process_low`: Resource definitions for processes that would typically run easily on a small laptop (Staging of data in a Python script). -- `process_medium`: Resource definitions for processes that would typically run on a desktop computer equipped for playing newer video games (Memory or computationally intensive applications that can be parallelized, rendering, processing large files in memory or running BLAST). -- `process_high`: Resource definition for processes that would typically run on a high performance desktop computer (Memory our computationally intensive application like running performing *de novo* assembly or performing BLAST searches on large databases). 
-- `process_long`: Modifies/overwrites the amount of time allowed for any of the above processes. Allows for certain jobs to take longer (Performing *de novo* assembly with less computational resources or performing global alignments on divergent sequences). -- `process_high_memory`: Modifies/overwrites the amount of memory given to any process. Grants significantly more memory to any process (Aids in metagenomic assembly or clustering of large datasets). - -## Containers - -Different container services can be specified from the command line when running mikrokondo in the `-profile` option. This option is specified at the end of your command line argument. Examples of different container services are specified below: - -- For Docker: `nextflow run main.nf MY_OPTIONS -profile docker` -- For Singularity: `nextflow run main.nf MY_OPTIONS -profile singularity` -- For Apptainer: `nextflow run main.nf MY_OPTIONS -profile apptainer` -- For Shifter: `nextflow run main.nf MY_OPTIONS -profile shifter` -- For Charliecloud: `nextflow run main.nf MY_OPTIONS -profile charliecloud` -- For Gitpod: `nextflow run main.nf MY_OPTIONS -profile gitpod` -- For Podman: `nextflow run main.nf MY_OPTIONS -profile podman` - -## Requirements - -1. Running MikroKondo requires have Nextflow installed, a Python interpreter 3.10=> and singularity or docker installed (only singularity is supported currently). - - Note Nextflow only runs on linux - - The easiest way to install Nextflow is using conda simply enter: - `conda create -n nextflow nextflow -c bioconda -c conda-forge -c default` - -## Downloading the pipeline -1. To download MikroKondo simply clone the repository - -## Running MikroKondo -### Samplesheet -Mikrokondo requires a sample sheet to be run A.K.A FOFN (file of file names). Having The sample sheet contains the samples name and allows a user to combine read-sets based on a sample name if provided. 
The sample-sheet utilizes the following header fields however: sample, fastq_1, fastq_2, long_reads and assembly. **The sample sheet must be in csv format and sample files must be gzipped on input** - -- The example layouts for different sample-sheets include: - - **Illumina paired-end data** - - |sample|fastq_1|fastq_2| - |------|-------|-------| - |sample_name|path_to_forward_reads|path_to_reversed_reads| - - **Nanopore** - - |sample|long_reads| - |------|----------| - |sample_name|path_to_reads| - - **Hybrid Assembly** - - |sample|fastq_1|fastq_2|long_reads| - |-------|-------|------|----------| - |sample_name|path_to_forward_reads|path_to_reversed_reads|path_to_long_reads| - - **Starting with assembly only** - - |sample|assembly| - |------|--------| - |sample_name|path_to_assembly| - -## Boiler plate options -Boiler plate options that are provided thanks to nf-core are listed below: - -- publish_dir_mode: Method used to save pipeline results to output directory -- email: Email address for completion summary. -- email_on_fail: An email address to send a summary email to when the pipeline is completed - ONLY sent if the pipeline does not exit successfully. -- plaintext_email: Send plain-text email instead of HTML. -- monochrome_logs: Do not use coloured log outputs. -- hook_url: Incoming hook URL for messaging service. Currently, MS Teams and Slack are supported. -- help: Display help text. -- version: Display version and exit. -- validate_params: Boolean whether to validate parameters against the schema at runtime. -- show_hidden_params: By default, parameters set as _hidden_ in the schema are not shown on the command line when a user runs with `--help`. Specifying this option will tell the pipeline to show all parameters. 
- -## Configuration/Command line Arguments - -Within the nextflow.config file the pipeline uses, the params section can be altered by users within the nextflow.config file or from the commandline -**TODO usage needs to be better written, just putting in basic info** - -MikroKondo can be run like most other nextflow pipelines. The most basic usage is as follows: - `nextflow run main.nf --input {USER INPUT SHEET HERE} --outdir {Output directory} --nanopore_chemistry {specify medaka model for polishing with ONT data} --platform {illumina, nanopore, pacbio, hybrid} {Optional parameters} -profile {singularity or docker} {optional -resume}` - -Mentioned above is an optional parameters section, many parameters can be altered or accessed from the command line. For a full list of parameters to be altered please refer to the `nextflow.config` file in the repo. - -## Platform specification -Mikrokondo allows for sequencing data from three platforms (must be FastQ): Illumina (paired end only), Nanopore and Pacbio (Pacbio path needs better testing). To specify which platform you are using from the command line you can enter: -- `--platform illumina` for Illumina. -- `--platform nanopore` for Nanopore. -- `--platform pacbio` for Pacbio -- `--platform hybrid` for hybrid assemblies. - - **If you pick denote your run as using a hybrid platform you must also add in the long_read_opt parameter the defualt value is nanopore**. `--long_read_opt nanopore` for nanopore or `--long_read_opt pacbio` for pacbio. - -## Nanopore Data -If you are using nanopore you must specify a model to use in Medaka for polishing (unless you turned on the skip_polishing option). 
A list of allowed models can be found here: [Medaka models python script](https://github.com/nanoporetech/medaka/blob/master/medaka/options.py) or [Medaka models available for download](https://github.com/nanoporetech/medaka/tree/master/medaka/data) - -A model can be specified like so `--nanopore_chemistry YOUR_MODEL_HERE`. A real example would look like this -> `--nanopore_chemistry r1041_e82_400bps_hac_v4.2.0` - -No default model is specified to prevent tiny errors that may affect data, but if your lab is using the same setting every time you can update the value in the `nextflow.config` labeled `nanopore_chemistry`. -An example of an update would be: -``` -nanopore_chemistry = "r1041_e82_400bps_hac_v4.2.0" // Note the quotes around the value -``` - -## Assembly with Flye (Nanopore and Pacbio) -As Flye provides different assembly optoins for Nanopore or Pacbio reads of varying qualities the `--fly_read_type` parameter can be altered from the command line. The default value is set too `hq` (High quality for Nanopore reads, and HiFi for bacterial reads). User options include `hq`, `corr` and `raw`, and a default value can be specified in the `nextflow.config` file. - -## Running Kraken2 instead of Mash -If you really like Kraken2 for speciation, you can enable it over Mash at the command line by specifying: -`--run_kraken true` - -If you wish to update this value for every run in the `nextflow.config` file you can update it to say: -``` -run_kraken = true // Note the lack of quotes -``` - -## Run Unicycler instead of Flye->Racon->Pilon -To use Unicycler specify on the command line `--hybrid_unicycler true`. If you would like to update this value so that the pipeline always uses Unicycler you can adjust the `nextflow.config` file like so: - -### Potential error with Unicycler -You may need to check the `conf/base.config` `process_high_memory` declaration and provide it upwards of 1000GB of memory if you get errors mentioning `tputs`. 
This error is not very clear sadly but increasing resources available to the process will help. - -``` -hybrid_unicycler = true // Note the lack of quotes -``` - -## Minimum number of reads for assembly -The `min_reads` option in the `nextflow.config` file is set to a limit of 1000. This means that after FastP has run 1000 reads must be present for the sample to proceed to assembly. If this values is not met, the sample does not proceed for assembly or pipeline steps. You can lower or raise this value from the command line like so: `--min_reads 100` - -If you wish to update this value in your `nextflow.config` file you can alter the value like so. -``` -min_reads = 1000 // Note the lack of quotes -``` - -## Target Depth -If you have not opted to skip down sampling you reads (elaborated on in the next section). You can set a target depth for sampling, e.g. if you set your `target_depth` value to 100 and sample has an estimated depth of 200bp your reads would be sampled to try and achieve a target depth of 100. If your sample has an estimated depth of 80 and your `target_depth` is 100, no downsampling will occur. - -To set you target depth from the command line enter `--target_depth 100` (or whatever number you want it to be). To update this value in your `nextflow.config` you can update it like so. -``` -target_depth = 100 // Note the lack of quotes -``` - -## Specify all samples as metagenomic -You can specify your samples are metagenomic directly which will affect the report, the pipeline will skip running mash to see if your samples are metagenomic and proceed with metagenomic assembly. - -To toggle on a metagenomic assembly simply enter `--metagenomic_run true`. To update the `nextflow.config` file to always run your samples as metagenomic simply change the `metagenomic_run` variable like so. 
-``` -metagenomic_run = true // Note the lack of quotes -``` - -## Skip Options -Numerous steps within mikrokondo can be turned off without compromising the stability of the pipeline. This skip options can reduce run-time of the pipeline or allow for completion of the pipeline despite errors. Currently these include - -- `skip_report` - - This option can be toggled on to prevent the generation of the final summary report, containing a condensed output the different tools run within the pipeline. -- `skip_version_gathering` - - Version information of each tool ran within the pipeline is collated as the pipeline runs before a final report is generated. While this is nice, it can be a time consuming process (a few minutes at worse) but when performing developing the pipeline turning this step off can make recurrent runs of the pipeline easier. -- `skip_subtyping` - - Subtyping tools such as: ECTyper, SISTR etc. are automatically triggered in the pipeline but if subtyping information is of no interest to you e.g. your target organism does not have a subtyping tool installed within mikrokondo you can turn this step off. -- `skip_abricate` - - Abricate (AMR detection) is included within the pipeline. It can be turned off if this step gives you trouble, however it is quite fast to run. -- `skip_bakta` - - Bakta is a full annotation pipeline that outputs a lot of very useful information. However it can be quite slow to run and requires a database to be specified. If this information is of no further interest to you, it is better to disable this process. -- `skip_checkm` - - CheckM is solely used as part of contamination detection within mikrokondo, its run time and resource usage can be quite lengthy. -- `skip_depth_sampling` - - The genome size of the reads is estimated using mash, and reads can be down-sampled to a target depth in order to get a better assembly. If this is of no interest to you, this step can be skipped entirely. 
**If you have specified that your run is metagenomic, down sampling is turned off**. -- `skip_ont_header_cleaning` - - In rare situations you may have Nanopore data fail in the pipeline due to duplicate headers, while rare it can be quite annoying for an assembly that has already been going on for more than 5 hours to fail. So there is a process in mikrokondo to make each Nanopore read header unique and avoid this issue, it is slow and usually unneeded so it is **best to run the pipeline with this step disabled**. -- `skip_polishing` - - If you are running a metagenomic assembly or you are having issues with any of the polishing steps you can disable polishing and retrieve your assembly directly from Spades or Flye with no additional polishing. **This does not apply to hybrid assemblies**. -- `skip_species_classification` - - This step prevents Mash or Kraken2 from being run on your assembled genome, this also **prevents the subtyping workflow** from triggering -- `skip_mobrecon` - - This step allows you to skip running mob-suite recon on your data. -- `skip_starmar` - - This step allows you to skip running StarAMR on your data. - -** All of the above options can be turned on by entering `--{skip_option} true` in the command line arguments to the pipeline (where optional parameters can be added)** e.g. To skip read sub-sampling add to the command line arguments `--skip_depth_sampling true` - -### Slurm options -- `slurm_p` - - if set to true, the slurm executor will be used. -- `slurm_profile` - - A string to allowing the user to specify which slurm partition to use - -### Max Resources -TODO this are the nextflow defaults -TODO make it known what should stay and what can be routinely changed - - -## Tool Specific Parameters -**NOTE:** to access tool specific parameters from the command line you must use the dot operator. e.g. 
In order to set the min contig length you would like Quast to use for generating report metrics from the command line you would specify `--quast.min_contig_length 500` - -All parameters below are nested, to denote the path to each parameter the options will be denoted nested so that at the top-level is the first level of the parameter, then sub bullets will be nested within that parameter. - -The example below is to show how parameters are denoted. - -``` -- tool - - tool_param1 - - tool_param2 -``` - -If as an example if, the `--quast.min_contig_lenth` parameter would be written as: - -``` -- quast - - min_contig_length -``` - -Note: Parameters that are bolded are ones that can be freely changed. Sensible defaults are provided however - -### Abricate -Screens contigs for antimicrobial and virulence genes. If you wish to use a different Abricate database you may need to update the container you use. - -- abricate - - singularity: Abricate singularity container - - docker: Abricate docker container - - **args**: Can be a string of additional command line arguments to pass to abricate - - report_tag: This field determines the name of the Abricate output in the final summary file. Do no touch this unless doing pipeline development. - - header_p: This field tells the report module that the Abricate output contains headers. Do no touch this unless doing pipeline development. - -### Raw Read Metrics -A custom Python script that gathers quality metrics for each fastq file. - -- raw_reads - - high_precision: When set to true, floating point precision of values output are accurate down to very small decimal places. Leave this setting as false to use the standard floats in Python, as it is much faster and having such precise decimal places does not quite make sense for the purpose this module fills. - - report_tag: this field determines the name of the Raw Read Metric field in the final summary report. Do no touch this unless doing pipeline development. 
- -### Coreutils -Some processes only utilize bash scripting, normally Nextflow will utilize system binaries if they are available and no container is specified. But in the case of a process where only scripting and binutils are being utilized as container was specified for reproducibility. - -- coreutils - - singularity: coreutils singularity container - - docker: coreutils docker container - - -### Python -Some scripts require Python, and to prevent someone requiring a Python we are just putting the requirement into a container for you. Also as all the scripts within mikrokondo use only the the standard library you can swap these containers to use **pypy3** and get a massive performance boost from the scripts! - -- python3 - - singularity: Python3 singularity container - - docker: Python3 docker container - -### KAT -Kat was previously used to estimate genome size, however at the time of writing KAT appears to be only infrequently updated and newer versions would have issues running/sometimes giving an incorrect output due to failures in peak recognition KAT has been removed from the pipeline. It's code still remains but it **will be removed in the future**. - -### Seqtk -Seqtk is used for both the sub-sampling of reads and conversion of fasta files to fastq files in mikrokondo. The usage of seqtk to convert a fasta to a fastq is needed to use Shigatyper as it requires fastq files as input, and do pass the reads to Shigatyper could results in a reduction of generalizability of the subtyping workflow. - -- seqtk - - singularity: singularity container for seqtk - - docker: docker container for seqtk - - seed: A seed value for sub-sampling - - reads_ext: Extension of reads after sub-sampling, do not touch alter this unless doing pipeline development. - - assembly_fastq: Extension of the fastas after being converted to fastq files. Do no touch this unless doing pipeline development. - - report_tag: Name of seqtk data in the final summary report. 
Do not touch this unless doing pipeline development. - -### FastP -FastP is a fast and widely used program for gathering read quality metrics, adapter trimming, read filtering and read trimming. FastP has extensive options for configuration which are detailed in its documentation, but sensible defaults have been set. **Adapter trimming in FastP is performed using overlap analysis; however, if you do not trust this you can specify the sequencing adapters used directly in the additional arguments for FastP**. - -- fastp - - singularity: singularity container for FastP - - docker: docker container for FastP - - fastq_ext: Extension of the output FastP trimmed reads, do not touch this unless doing pipeline development. - - html_ext: Extension of the html report output by FastP, do not touch unless doing pipeline development. - - json_ext: Extension of the json report output by FastP, do not touch unless doing pipeline development. - - report_tag: Title of FastP data in the summary report. - - **average_quality_e**: If a read/read-pair quality is less than this value it is discarded - - **cut_mean_quality**: The quality to trim reads to - - **qualified_quality_phred**: The quality of a base to be qualified if filtering by unqualified bases - - **unqualified_percent_limit**: The percentage of bases that are allowed to be unqualified in a read. This parameter is affected by the above qualified_quality_phred parameter - - **illumina_length_min**: The minimum read length to be allowed in Illumina data - - **single_end_length_min**: The minimum read length allowed in Pacbio or Nanopore data - - **dedup_reads**: A parameter to be turned on to allow for deduplication of reads.
- - **illumina_args**: The command string passed to FastP when using Illumina data; if you override this parameter, other set parameters such as average_quality_e must be overridden as well, as the command string will be passed to FastP as written - - **single_end_args**: The command string passed to FastP if single end data is used, e.g. Pacbio or Nanopore data. If this option is overridden you must specify all parameters passed to FastP, as this string is passed to FastP as written. - - report_exclude_fields: Fields in the summary json to be excluded from the final aggregated report. Do not alter this field unless doing pipeline development - -### Chopper -Chopper was originally used for trimming of Nanopore reads, but FastP was able to do the same work so Chopper is no longer used. Its code still remains but it cannot be run in the pipeline. - -### Flye -Flye is used for assembly of Nanopore data. - -- flye - - nanopore - - raw: corresponds to the option in Flye of `--nano-raw` - - corr: corresponds to the option in Flye of `--nano-corr` - - hq: corresponds to the option in Flye of `--nano-hq` - - pacbio - - raw: corresponds to the option in Flye of `--pacbio-raw` - - corr: corresponds to the option in Flye of `--pacbio-corr` - - hifi: corresponds to the option in Flye of `--pacbio-hifi` - - singularity: Singularity container for Flye - - docker: Docker container for Flye - - fasta_ext: The file extension for fasta files. Do not alter this field unless doing pipeline development - - gfa_ext: The file extension for gfa files. Do not alter this field unless doing pipeline development - - gv_ext: The file extension for gv files. Do not alter this field unless doing pipeline development - - txt_ext: The file extension for txt files. Do not alter this field unless doing pipeline development - - log_ext: The file extension for the Flye log files. Do not alter this field unless doing pipeline development - - json_ext: The file extension for the Flye json files.
Do not alter this field unless doing pipeline development - - **polishing_iterations**: The number of polishing iterations for Flye. - - ext_args: Extra command line options to pass to Flye - -### Spades -Used for paired-end read assembly. - -- spades - - singularity: Singularity container for spades - - docker: Docker container for spades - - scaffolds_ext: The file extension for the scaffolds file. Do not alter this field unless doing pipeline development - - contigs_ext: The file extension containing assembled contigs. Do not alter this field unless doing pipeline development - - transcripts_ext: The file extension for the assembled transcripts. Do not alter this field unless doing pipeline development - - assembly_graphs_ext: The file extension of the assembly graphs. Do not alter this field unless doing pipeline development - - log_ext: The file extension for the log files. Do not alter this field unless doing pipeline development - - outdir: The name of the output directory for assemblies. Do not alter this field unless doing pipeline development - -### FastQC -This is a default tool added to nf-core pipelines. This feature will likely be removed in the future, but for those fond of it, the outputs of FastQC still remain. - -- fastqc - - html_ext: The file extension of the FastQC html file. Do not alter this field unless doing pipeline development - - zip_ext: The file extension of the zipped FastQC outputs. Do not alter this field unless doing pipeline development - -### Quast -Quast is used to gather assembly metrics, to which automated quality control criteria are then applied. - -- quast - - singularity: Singularity container for quast. - - docker: Docker container for quast. - - suffix: The suffix attached to quast outputs. Do not alter this field unless doing pipeline development. - - report_base: The base term for output quast files to be used in reporting. Do not alter this field unless doing pipeline development.
- - report_prefix: The prefix of the quast outputs to be used in reporting. Do not alter this field unless doing pipeline development. - - **min_contig_length**: The minimum length for contigs to be used in Quast's generation of metrics. - - **args**: A command string to pass to quast; altering this is unadvised as certain options may affect your reporting output. This string will be passed to quast verbatim. - - header_p: This tells the pipeline that the Quast report outputs contain a header. Do not alter this field unless doing pipeline development. - -### Quast Filter -Assemblies can be prevented from going into further analyses based on the Quast output. The options for the mentioned filter are listed here. - -- quast_filter - - n50_field: The name of the field to search for and filter. Do not alter this field unless doing pipeline development. - - n50_value: The minimum value the field specified is allowed to contain. - - nr_contigs_field: The name of the field in the Quast report to filter on. Do not alter this field unless doing pipeline development. - - nr_contigs_value: The minimum number of contigs an assembly must have to proceed further through the pipeline. - - sample_header: The column name in the Quast output containing the sample information. Do not alter this field unless doing pipeline development. - -### CheckM -CheckM is used within the pipeline for assessing contamination in assemblies. - -- checkm - - singularity: Singularity container containing CheckM - - docker: Docker container containing CheckM - - alignment_ext: Extension of the genes alignment within CheckM. Do not alter this field unless doing pipeline development. - - results_ext: The extension of the file containing the CheckM results. Do not alter this field unless doing pipeline development. - - tsv_ext: The extension containing the tsv results from CheckM.
Do not alter this field unless doing pipeline development. - - folder_name: The name of the folder containing the outputs from CheckM. Do not alter this field unless doing pipeline development. - - gzip_ext: The compression extension for CheckM. Do not alter this field unless doing pipeline development. - - lineage_ms: The name of the lineages.ms file output by CheckM. Do not alter this field unless doing pipeline development. - - threads: The number of threads to use in CheckM. Do not alter this field unless doing pipeline development. - - report_tag: The name of the CheckM data in the summary report. Do not alter this field unless doing pipeline development. - - header_p: Denotes that the result used by the pipeline in generation of the summary report contains a header. Do not alter this field unless doing pipeline development. - -### Kraken2 -Kraken2 can be used as a substitute for Mash in speciation of samples, and it is used to bin contigs of metagenomic samples. - -- kraken - - singularity: Singularity container for Kraken2. - - docker: Docker container for Kraken2. - - classified_suffix: Suffix for classified data from Kraken2. Do not alter this field unless doing pipeline development. - - unclassified_suffix: Suffix for unclassified data from Kraken2. Do not alter this field unless doing pipeline development. - - report_suffix: The name of the report output by Kraken2. - - output_suffix: The name of the output file from Kraken2. Do not alter this field unless doing pipeline development. - - **tophit_level**: The taxonomic level to classify a sample at, e.g. the default is `S` for species but you could use `S1` or `F`. - - save_output_fastqs: Option to save the output fastq files from Kraken2. Do not alter this field unless doing pipeline development. - - save_read_assignments: Option to save how Kraken2 assigns reads. Do not alter this field unless doing pipeline development.
- - **run_kraken_quick**: This option can be set to `true` if one wishes to run Kraken2 in quick mode. - - report_tag: The name of the Kraken2 data in the final report. Do not alter this field unless doing pipeline development. - - header_p: Tells the pipeline whether the file used for reporting contains header data. Do not alter this field unless doing pipeline development. - - headers: A list of headers in the Kraken2 report. Do not alter this field unless doing pipeline development. - - -### Seven Gene MLST -Run Torsten Seemann's seven gene MLST program. - -- mlst - - singularity: Singularity container for mlst. - - docker: Docker container for mlst. - - **args**: Additional arguments to pass to mlst. - - tsv_ext: Extension of the mlst tabular file. Do not alter this field unless doing pipeline development. - - json_ext: Extension of the mlst output JSON file. Do not alter this field unless doing pipeline development. - - report_tag: Name of the data outputs in the final report. Do not alter this field unless doing pipeline development. - -### Mash -Mash is used repeatedly throughout the pipeline for estimation of genome size from reads, contamination detection and for determining the final species of an assembly. - -- mash - - singularity: Singularity container for mash. - - docker: Docker container for mash. - - mash_ext: Extension of the mash screen file. Do not alter this field unless doing pipeline development. - - output_reads_ext: Extension of mash outputs when run on reads. Do not alter this field unless doing pipeline development. - - output_taxa_ext: Extension of mash output when run on contigs. Do not alter this field unless doing pipeline development. - - mash_sketch: The GTDB sketch used by the pipeline; this sketch is special as it contains the taxonomic paths used in the classification step of the pipeline. It can as of 2023-10-05 be found here: https://zenodo.org/record/8408361 - - sketch_ext: File extension of a mash sketch.
Do not alter this field unless doing pipeline development. - - json_ext: File extension of json data output by Mash. Do not alter this field unless doing pipeline development. - - sketch_kmer_size: The size of the kmers used in the sketching in genome size estimation. - - **min_kmer**: The minimum number of kmer copies required to pass the noise filter. This value is used in estimation of genome size from reads. The default value is 10 as it seems to work well for Illumina data. - - final_sketch_name: **to be removed** This parameter was originally part of a subworkflow included in the pipeline for generation of the GTDB sketch, but this has been removed and replaced with scripting. - - report_tag: Report tag for Mash in the summary report. Do not alter this field unless doing pipeline development. - - header_p: Tells the pipeline if the output data contains headers. Do not alter this field unless doing pipeline development. - - headers: A list of the headers the output of mash should contain. Do not alter this field unless doing pipeline development. - -### Mash Meta -This process is used to determine if a sample is metagenomic or not. - -- mash_meta - - report_tag: The name of this output field in the summary report. Do not alter this field unless doing pipeline development. - -### top_hit_species -As either Kraken2 or Mash can be used for determining the species present in the pipeline, they share a common report tag. - -- top_hit_species - - report_tag: The name of the determined species in the final report. Do not alter this field unless doing pipeline development. - -### Contamination Removal -This step is used to remove contaminants from read data; it exists to perform dehosting and removal of kitomes. - -- r_contaminants - - singularity: Singularity container used to perform dehosting; this container contains minimap2 and samtools. - - docker: Docker container used to perform dehosting; this container contains minimap2 and samtools.
- - phix_fa: The path to the file containing the phiX fasta. - - homo_sapiens_fa: The path to the file containing the human genome fasta. - - pacbio_mg: The path to the file containing the Pacbio sequencing control. - - output_ext: The extension of the deconned fastq files. Do not alter this field unless doing pipeline development. - - mega_mm2_idx: The path to the minimap2 index used for dehosting. Do not alter this field unless doing pipeline development. - - mm2_illumina: The arguments passed to minimap2 for Illumina data. Do not alter this field unless doing pipeline development. - - mm2_pac: The arguments passed to minimap2 for Pacbio data. Do not alter this field unless doing pipeline development. - - mm2_ont: The arguments passed to minimap2 for Nanopore data. Do not alter this field unless doing pipeline development. - - samtools_output_ext: The extension of the output from samtools. Do not alter this field unless doing pipeline development. - - samtools_singletons_ext: The extension of singleton reads from samtools. Do not alter this field unless doing pipeline development. - - output_ext: The name of the files output from samtools. Do not alter this field unless doing pipeline development. - - output_dir: The directory where deconned reads are placed. Do not alter this field unless doing pipeline development. - -### Minimap2 -Minimap2 is used frequently throughout the pipeline for decontamination and mapping reads back to assemblies for polishing. - -- minimap2 - - singularity: The singularity container for minimap2; the same one is used for contamination removal. - - docker: The Docker container for minimap2; the same one is used for contamination removal. - - index_outdir: The directory where created indices are output. Do not alter this field unless doing pipeline development. - - index_ext: The file extension of created indices. Do not alter this field unless doing pipeline development. - -### Samtools -Samtools is used for sam to bam conversion in the pipeline.
- -- samtools - - singularity: The Singularity container containing samtools; the same container is used as the one in contamination removal. - - docker: The Docker container containing samtools; the same container is used as the one in contamination removal. - - bam_ext: The extension of the bam file from samtools. Do not alter this field unless doing pipeline development. - - bai_ext: The extension of the bam index from samtools. Do not alter this field unless doing pipeline development. - -### Racon -Racon is used as a first pass for polishing assemblies. - -- racon - - singularity: The Singularity container containing racon. - - docker: The Docker container containing racon. - - consensus_suffix: The suffix for Racon's outputs. Do not alter this field unless doing pipeline development. - - consensus_ext: The file extension for the racon consensus sequence. Do not alter this field unless doing pipeline development. - - outdir: The directory containing the polished sequences. Do not alter this field unless doing pipeline development. - -### Pilon -Pilon was added to the pipeline, but it is run iteratively, which at the time of writing this pipeline was not well supported in Nextflow, so a separate script and containers are provided to utilize Pilon. The code for Pilon remains in the pipeline so that, when it can be done easily, iterative Pilon polishing can be integrated directly into the pipeline. - -### Pilon Iterative Polishing -This process is a wrapper around minimap2, samtools and Pilon for iterative polishing; containers are built, **but if you ever have problems with this step, disabling polishing will fix your issue (at the cost of polishing)**. - -- pilon_iterative - - singularity: The container containing the iterative Pilon program. If you ever have issues with the singularity image you can use the Docker image, as Nextflow will automatically convert the Docker image into a singularity image. - - docker: The Docker container for the Pilon iterative polisher.
- - outdir: The directory where polished data is output. Do not alter this field unless doing pipeline development. - - fasta_ext: File extension for the fasta to be polished. Do not alter this field unless doing pipeline development. - - fasta_outdir: The output directory name for the polished fastas. Do not alter this field unless doing pipeline development. - - vcf_ext: File extension for the VCF output by Pilon. Do not alter this field unless doing pipeline development. - - vcf_outdir: Output directory containing the VCF files from Pilon. Do not alter this field unless doing pipeline development. - - bam_ext: Bam file extension. Do not alter this field unless doing pipeline development. - - bai_ext: Bam index file extension. Do not alter this field unless doing pipeline development. - - changes_ext: File extension for the Pilon output containing the changes applied to the assembly. Do not alter this field unless doing pipeline development. - - changes_outdir: The output directory for the Pilon changes. Do not alter this field unless doing pipeline development. - - max_memory_multiplier: On failure this program will try again with more memory; the multiplier is the factor by which the amount of memory passed to the program will be increased. Do not alter this field unless doing pipeline development. - - **max_polishing_illumina**: Number of iterations for polishing an Illumina assembly with Illumina reads. - - **max_polishing_nanopre**: Number of iterations to polish a Nanopore assembly with (will use Illumina reads if provided). - - **max_polishing_pacbio**: Number of iterations to polish a Pacbio assembly with (will use Illumina reads if provided). - -### Medaka Polishing -Medaka is used for polishing of Nanopore assemblies; make sure you specify a Medaka model when using the pipeline so the correct settings are applied. If you have issues with Medaka running, try disabling resume, or alternatively **disable polishing**, as Medaka can be troublesome to run.
- -- medaka - - singularity: Singularity container with Medaka. - - docker: Docker container with Medaka. - - model: This parameter will be autofilled with the model specified at the top level by the `nanopore_chemistry` option. Do not alter this field unless doing pipeline development. - - fasta_ext: Polished fasta output. Do not alter this field unless doing pipeline development. - - batch_size: The batch size passed to Medaka; this can improve performance. Do not alter this field unless doing pipeline development. - -### Unicycler -Unicycler is an option provided for hybrid assembly; it is a great option and outputs an excellent assembly, but it requires **a lot** of resources, which is why the alternate hybrid assembly option using Flye->Racon->Pilon is available. As well, there can be a fairly cryptic Spades error generated by Unicycler that usually relates to memory usage; it will typically say something involving `tputs`. - -- unicycler - - singularity: The Singularity container containing Unicycler. - - docker: The Docker container containing Unicycler. - - scaffolds_ext: The scaffolds file extension output by Unicycler. Do not alter this field unless doing pipeline development. - - assembly_ext: The assembly extension output by Unicycler. Do not alter this field unless doing pipeline development. - - log_ext: The log file output by Unicycler. Do not alter this field unless doing pipeline development. - - outdir: The output directory the Unicycler data is sent to. Do not alter this field unless doing pipeline development. - - mem_modifier: Specifies a high amount of memory for Unicycler to prevent a common Spades error that is fairly cryptic. Do not alter this field unless doing pipeline development. - - threads_increase_factor: Factor to increase the number of threads passed to Unicycler. Do not alter this field unless doing pipeline development. - - -### Mob-suite Recon -mob-suite recon provides annotation of plasmids in your data.
- -- mobsuite_recon - - singularity: The singularity container containing mob-suite recon. - - docker: The Docker container containing mob-suite recon. - - **args**: Additional arguments to pass to mob-suite. - - fasta_ext: The file extension for FASTAs. Do not alter this field unless doing pipeline development. - - results_ext: The file extension for results in mob-suite. Do not alter this field unless doing pipeline development. - - mob_results_file: The final results to be included in the final report by mob-suite. Do not alter this field unless doing pipeline development. - - report_tag: The field name of mob-suite data in the final report. Do not alter this field unless doing pipeline development. - - header_p: Default is `true` and indicates that the results output by mob-suite contain a header. Do not alter this field unless doing pipeline development. - -### StarAMR -StarAMR provides annotation of antimicrobial resistance genes within your data. The process will alter FASTA headers of input files to ensure headers are less than 50 characters long. - -- staramr - - singularity: The singularity container containing staramr. - - docker: The Docker container containing staramr. - - **db**: The database for StarAMR. The default value of `null` tells the pipeline to use the database included in the StarAMR container. However, you can specify a path to a valid StarAMR database and use that instead. - - tsv_ext: File extension of the reports from StarAMR. Do not alter this field unless doing pipeline development. - - txt_ext: File extension of the text reports from StarAMR. Do not alter this field unless doing pipeline development. - - xlsx_ext: File extension of the Excel spreadsheet from StarAMR. Do not alter this field unless doing pipeline development. - - **args**: Additional arguments to pass to StarAMR. - - point_finder_dbs: A list containing the valid databases StarAMR supports for pointfinder.
The way they are structured matches what StarAMR needs for input. Do not alter this field unless doing pipeline development. - - report_tag: The field name of StarAMR in the final summary report. Do not alter this field unless doing pipeline development. - - header_p: Indicates the final report from StarAMR contains a header line. Do not alter this field unless doing pipeline development. - -### Bakta -Bakta is used to provide annotation of genomes; it is very reliable but it can be slow. - -- bakta - - singularity: The singularity container containing Bakta. - - docker: The Docker container containing Bakta. - - **db**: The path where the downloaded Bakta database is located. - - output_dir: The name of the folder where Bakta data is saved to. Do not alter this field unless doing pipeline development. - - embl_ext: File extension of the embl file. Do not alter this field unless doing pipeline development. - - faa_ext: File extension of the faa file. Do not alter this field unless doing pipeline development. - - ffn_ext: File extension of the ffn file. Do not alter this field unless doing pipeline development. - - fna_ext: File extension of the fna file. Do not alter this field unless doing pipeline development. - - gbff_ext: File extension of the gbff file. Do not alter this field unless doing pipeline development. - - gff_ext: File extension of the GFF file. Do not alter this field unless doing pipeline development. - - threads: Number of threads for Bakta to use; remember more is not always better. Do not alter this field unless doing pipeline development. - - hypotheticals_tsv_ext: File extension for hypothetical genes. Do not alter this field unless doing pipeline development. - - hypotheticals_faa_ext: File extension of the hypothetical genes fasta. Do not alter this field unless doing pipeline development. - - tsv_ext: The file extension of the final Bakta tsv report.
Do not alter this field unless doing pipeline development. - - txt_ext: The file extension of the txt report. Do not alter this field unless doing pipeline development. - - min_contig_length: The minimum contig length to be annotated by Bakta. - -### Bandage -Bandage is included to make Bandage plots of the initial assemblies, e.g. from Spades, Flye or Unicycler. These images can be useful in determining the quality of an assembly. - -- bandage - - singularity: The path to the singularity image containing Bandage. - - docker: The path to the docker file containing Bandage. - - svg_ext: The extension of the SVG file created by Bandage. Do not alter this field unless doing pipeline development. - - outdir: The output directory of the Bandage images. - -### Subtyping Report -All subtyping report tools contain a common report tag so that they can be identified by the program. - -- subtyping_report - - report_tag: Subtyping report name. Do not alter this field unless doing pipeline development. - -### ECTyper -ECTyper is used to perform *in-silico* typing of *Escherichia coli* and is automatically triggered by the pipeline. - -- ectyper - - singularity: The path to the singularity container containing ECTyper. - - docker: The path to the Docker container containing ECTyper. - - log_ext: File extension of the ECTyper log file. Do not alter this field unless doing pipeline development. - - tsv_ext: File extension of the ECTyper tsv file. Do not alter this field unless doing pipeline development. - - txt_ext: Text file extension of the ECTyper output. Do not alter this field unless doing pipeline development. - - report_tag: Report tag for ECTyper data. Do not alter this field unless doing pipeline development. - - header_p: Denotes if the table output from ECTyper contains a header. Do not alter this field unless doing pipeline development. - -### Kleborate -Kleborate performs automatic typing of *Klebsiella*.
- -- kleborate - - singularity: The path to the singularity container containing Kleborate. - - docker: The path to the docker container containing Kleborate. - - txt_ext: The file extension of the Kleborate text output. Do not alter this field unless doing pipeline development. - - report_tag: The report tag for Kleborate. Do not alter this field unless doing pipeline development. - - header_p: Denotes the Kleborate table contains a header. Do not alter this field unless doing pipeline development. - -### Spatyper -Performs typing of *Staphylococcus* species. - -- spatyper - - singularity: The path to the singularity container containing Spatyper. - - docker: The path to the docker container containing Spatyper. - - tsv_ext: The file extension of the Spatyper output. Do not alter this field unless doing pipeline development. - - report_tag: The report tag for Spatyper. Do not alter this field unless doing pipeline development. - - header_p: Denotes whether or not the output table contains a header. Do not alter this field unless doing pipeline development. - - repeats: An optional file specifying repeats can be passed to Spatyper. - - repeat_order: An optional file containing a repeat order to pass to Spatyper. - -### SISTR -*In-silico Salmonella* serotype prediction. - -- sistr - - singularity: The path to the singularity container containing SISTR. - - docker: The path to the Docker container containing SISTR. - - tsv_ext: The file extension of the SISTR output. Do not alter this field unless doing pipeline development. - - allele_fasta_ext: The extension of the alleles identified by SISTR. Do not alter this field unless doing pipeline development. - - allele_json_ext: The extension of the output JSON file from SISTR. Do not alter this field unless doing pipeline development. - - cgmlst_tag: The extension of the cgMLST file from SISTR. Do not alter this field unless doing pipeline development. - - report_tag: The report tag for SISTR.
Do not alter this field unless doing pipeline development. - - header_p: Denotes whether or not the output table contains a header. Do not alter this field unless doing pipeline development. - -### Lissero -*In-silico Listeria* typing. - -- lissero - - singularity: The path to the singularity container containing Lissero. - - docker: The path to the docker container containing Lissero. - - tsv_ext: The file extension of the Lissero output. Do not alter this field unless doing pipeline development. - - report_tag: The report tag for Lissero. Do not alter this field unless doing pipeline development. - - header_p: Denotes if the output table of Lissero contains a header. Do not alter this field unless doing pipeline development. - -### Shigeifinder -*In-silico Shigella* typing. **NOTE:** It is unlikely this subtyper will be triggered, as GTDB has merged *E. coli* and *Shigella*; an updated sketch and an updated ECTyper will be released soon to address the shortfalls of this sketch. If you are relying on *Shigella* detection add `--run_kraken true` to your command line or update the value in the `.nextflow.config`, as Kraken2 (while slower) can still detect *Shigella*. - -- shigeifinder - - singularity: The Singularity container containing Shigeifinder. - - docker: The path to the Docker container containing Shigeifinder. - - container_version: The version number **to be updated with the containers**, as Shigeifinder does not currently have a version number tracked in the command. - - tsv_ext: Extension of the output report. - - report_tag: The name of the output report for Shigeifinder. - - header_p: Denotes that the output from Shigeifinder includes header values. - - -### Shigatyper (Replaced with Shigeifinder) -Code still remains but it will likely be removed later on. - -- shigatyper - - singularity: The Singularity container containing Shigatyper. - - docker: The path to the Docker container containing Shigatyper. - - tsv_ext: The tsv file extension.
Do not alter this field unless doing pipeline development. - - report_tag: The report tag for Shigatyper. Do not alter this field unless doing pipeline development. - - header_p: Denotes if the report output contains a header. Do not alter this field unless doing pipeline development. - -### Kraken2 Contig Binning -Bins contigs based on the Kraken2 output for contaminated/metagenomic samples. This is implemented using a custom script. - -- kraken_bin - - **taxonomic_level**: The taxonomic level to bin contigs at. Binning at species level is not recommended; the default is to bin at the genus level, which is specified by the character `G`. To bin at a higher level such as family you would specify `F`. - - fasta_ext: The extension of the fasta files output. Do not alter this field unless doing pipeline development. - - -## Quality Control Report -Tread carefully here, as this will require modification of the `nextflow.config` file. **Make sure you have saved a backup of your `nextflow.config` file before playing with these options** - -#### After you have backed up your `nextflow.config` please proceed - -### QCReport field description -The section of interest is the `QCReport` fields in the params section of the `nextflow.config`. There are multiple sections with values that can be modified, or you can add data for a different organism. The default values in the pipeline are set up for **Illumina data**, so you may need to adjust settings for Nanopore or Pacbio data. - -An example of the QCReport structure is shown below, with annotations describing the values. **NOTE** The values below do not affect the running of the pipeline; they only affect the final quality messages output by the pipeline. -``` -QCReport { - escherichia // Generic top-level name for the field; its name is technically arbitrary but a nice field name keeps things organized - { - search = "Escherichia coli" // The phrase that is searched for in the species_top_hit field mentioned above.
The search is for containment so if you wanted to look for E.coli and E.albertii you could just set the value too "Escherichia" - raw_average_quality = 30 // Minimum raw average quality of all bases in the sequencing data. This value is generated before the decontamination procedure. - min_n50 = 95000 // The minimum n50 value allowed from quast - max_n50 = 6000000 // The maximum n50 value allowed from quast - min_nr_contigs = 1 // the minimum number of contigs a sample is allowed to have, a value of 1 works as a sanity check - max_nr_contigs = 500 // The maximum number of contigs the organism in the search field is allowed to have. to many contigs could indicate a bad assembly or contamination - min_length = 4500000 // The minimum genome length allowed for the organism specified in the search field - max_length = 6000000 // The maxmimum genome length the organism in the search field is allowed to have - max_checkm_contamination = 3.0 // The maximum level of allowed contamination allowed by CheckM - min_average_coverage = 30 // The minimum average coverage allowed - } - // DO NOT REMOVE THE FALLTRHOUGH FIELD AS IT IS NEEDED TO CAPTURE OTHER ORGANISMS - fallthrough // The fallthrough field exist as a default value to capture organisms where no quality control data has been specified - { - search = "No organism specific QC data available." 
- raw_average_quality = 30 - min_n50 = null - max_n50 = null - min_nr_contigs = null - max_nr_contigs = null - min_length = null - max_length = null - max_checkm_contamination = 3.0 - min_average_coverage = 30 - } -} -``` - -### Example adding quality control data for *Salmonella* - -If you wanted to add quality control data for *Salmonella* you can start off by using the template below: - -``` -VAR_NAME { // Replace VAR name with the genus name of your sample, only use ASCII (a-zA-Z) alphabet characters in the name and replace spaces, punctuation and other special characters with underscores (_) - search = "Search phrase" // Search phrase for your species top_hit, Note the quotes - raw_average_quality = // 30 is a default value please change it as needed - min_n50 = // Set your minimum n50 value - max_n50 = // Set a maximum n50 value - min_nr_contigs = // Set a minimum number of contigs - max_nr_contigs = // The maximum number of contings - min_length = // Set a minimum genome length - max_length = // set a maximum genome length - max_checkm_contamination = // Set a maximum level of contamination to use - min_average_coverage = // Set the minimum coverage value -} -``` - -For *Salmonella* I would fill in the values like so. -``` -salmonella { - search = "Salmonella" - raw_average_quality = 30 - min_n50 = 95000 - max_n50 = 6000000 - min_nr_contigs = 1 - max_nr_contigs = 200 - min_length = 4400000 - max_length = 6000000 - max_checkm_contamination = 3.0 - min_average_coverage = 30 -} -``` - - -After having my values filled out, I can simply add them to the QCReport section in the `nextflow.config` file. 
- -``` - QCReport { - escherichia { - search = "Escherichia coli" - raw_average_quality = 30 - min_n50 = 95000 - max_n50 = 6000000 - min_nr_contigs = 1 - max_nr_contigs = 500 - min_length = 4500000 - max_length = 6000000 - max_checkm_contamination = 3.0 - min_average_coverage = 30 - } salmonella { // NOTE watch the opening and closing brackets - search = "Salmonella" - raw_average_quality = 30 - min_n50 = 95000 - max_n50 = 6000000 - min_nr_contigs = 1 - max_nr_contigs = 200 - min_length = 4400000 - max_length = 6000000 - max_checkm_contamination = 3.0 - min_average_coverage = 30 - } - fallthrough { - search = "No organism specific QC data available." - raw_average_quality = 30 - min_n50 = null - max_n50 = null - min_nr_contigs = null - max_nr_contigs = null - min_length = null - max_length = null - max_checkm_contamination = 3.0 - min_average_coverage = 30 - } - } -``` - -### The current default settings are listed below -``` -QCReport { - escherichia { - search = "Escherichia coli" - raw_average_quality = 30 - min_n50 = 95000 - max_n50 = 6000000 - min_nr_contigs = 1 - max_nr_contigs = 500 - min_length = 4500000 - max_length = 6000000 - max_checkm_contamination = 3.0 - min_average_coverage = 30 - } - salmonella { - search = "Salmonella" - raw_average_quality = 30 - min_n50 = 95000 - max_n50 = 6000000 - min_nr_contigs = 1 - max_nr_contigs = 200 - min_length = 4400000 - max_length = 6000000 - max_checkm_contamination = 3.0 - min_average_coverage = 30 - } - shigella { - search = "Shigella" - raw_average_quality = 30 - min_n50 = 17500 - max_n50 = 5000000 - min_nr_contigs = 1 - max_nr_contigs = 500 - min_length = 4300000 - max_length = 5000000 - max_checkm_contamination = 3.0 - min_average_coverage = 30 - } - listeria { - search = "Listeria" - raw_average_quality = 30 - min_n50 = 45000 - max_n50 = 3200000 - min_nr_contigs = 1 - max_nr_contigs = 200 - min_length = 2700000 - max_length = 3200000 - max_checkm_contamination = 3.0 - min_average_coverage = 30 - } - 
campylobacter { - search = "Campylobacter" - raw_average_quality = 30 - min_n50 = 9500 - max_n50 = 2000000 - min_nr_contigs = 1 - max_nr_contigs = 150 - min_length = 1400000 - max_length = 2000000 - max_checkm_contamination = 3.0 - min_average_coverage = 30 - } - vibrio { - search = "Vibrio" - raw_average_quality = 30 - min_n50 = 95000 - max_n50 = 4300000 - min_nr_contigs = 1 - max_nr_contigs = 150 - min_length = 3800000 - max_length = 4300000 - max_checkm_contamination = 3.0 - min_average_coverage = 30 - } - // Some of these defaults are made up - klebsiella { - search = "Klebsiella" - raw_average_quality = 30 - min_n50 = 100000 - max_n50 = 6000000 - min_nr_contigs = 1 - max_nr_contigs = 500 - min_length = 4500000 - max_length = 6000000 - max_checkm_contamination = 3.0 - min_average_coverage = 30 - } - staphylococcus { - search = "Staphylococcus" - raw_average_quality = 30 - min_n50 = 100000 - max_n50 = 3500000 - min_nr_contigs = 1 - max_nr_contigs = 550 - min_length = 2000000 - max_length = 3500000 - max_checkm_contamination = 3.0 - min_average_coverage = 30 - } - fallthrough { - search = "No organism specific QC data available." - raw_average_quality = 30 - min_n50 = null - max_n50 = null - min_nr_contigs = null - max_nr_contigs = null - min_length = null - max_length = null - max_checkm_contamination = 3.0 - min_average_coverage = 30 - } -} -``` - - -## Quality Control Fields -This section affects the behaviours of the final summary quality control messages and is noted in the `QCReportFields` within the `nextflow.config`. **I would advise against manipulating this section unless you really know what you are doing**. - -TODO test what happens if no quality msg is available for the bool fields types. - -Each value in the QC report fields contains the following fields. 
-
-- Field name
-  - path: path to the information in the summary report JSON
-  - coerce_type: Type to be coreced too, can be a Float, Integer, or Bool
-  - compare_fields: A list of fields corresponding to fields in the `QCReport` section of the `nextflow.config`. If two values are specified it will be assumed you wish to check that a value is in between a range of values.
-  - comp_type: The comparison type specified, 'ge' for greater or equal, 'le' for less than or equal, 'bool' for true or false or 'range' for checking if a value is between two values.
-  - on: A boolean value for disabling a comparison
-  - low_msg: A message for if a value is less than its compared value (optional)
-  - high_msg: A message for if value is above a certain value (optional)
-
-An example of what these fields look like is:
-
-```
-QCReportFields {
-    raw_average_quality {
-        path = [params.raw_reads.report_tag, "combined", "qual_mean"]
-        coerce_type = 'Float'
-        compare_fields = ['raw_average_quality']
-        comp_type = "ge"
-        on = true
-        low_msg = "Base quality is poor, resequencing is recommended."
-    }
-}
-```
-
+# Configuration
+## Configuration files overview
+
+The following files contain configuration settings:
+
+- `conf/base.config`: where CPU, memory and time parameters can be set for the different workflow processes. **You will likely need to adjust parameters within this file for your computing environment**.
+
+- `conf/modules.config`: contains error strategy, output directory structure and execution instruction parameters. **Do not alter this file unless involved in pipeline development or tuning the pipeline to a specific system.**
+
+- `nextflow.config`: contains default tool settings that tie to CLI options. These options can be set directly within the `params` section of this file once a user has identified the flags they will use every time the pipeline is run.
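Rather than editing these files in place, settings from any of them can also be overridden at run time with a small user-supplied configuration file passed to Nextflow's `-c` option. A minimal sketch is shown below; the file name `custom.config` and the resource values are illustrative, not shipped defaults:

```groovy
// custom.config -- illustrative run-time overrides (values are examples, not defaults)
params {
    // any option from the params section of nextflow.config can be set here
    outdir = "results"
}

process {
    // bump the resources granted to processes carrying the process_high label
    withLabel: process_high {
        cpus   = 16
        memory = 64.GB
    }
}
```

Running with `nextflow run main.nf -c custom.config ...` makes the settings given via `-c` take precedence over the defaults in `nextflow.config`, so the shipped configuration files stay untouched.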
+
+### Base configuration (conf/base.config)
+Within this file, computing resources can be configured for each process. Mikrokondo uses labels to define resource requirements for each process; their definitions are:
+
+- `process_single`: processes requiring only a single core and low memory (e.g., listing of directories).
+- `process_low`: processes that would typically run easily on a small laptop (e.g., staging of data in a Python script).
+- `process_medium`: processes that would typically run on a desktop computer equipped for playing newer video games (memory- or computationally-intensive applications that can be parallelized, e.g., rendering, processing large files in memory or running BLAST).
+- `process_high`: processes that would typically run on a high-performance desktop computer (memory- or computationally-intensive applications, e.g., performing *de novo* assembly or running BLAST searches against large databases).
+- `process_long`: overwrites the amount of time allowed for any of the above processes, letting certain jobs take longer (e.g., performing *de novo* assembly with fewer computational resources or performing global alignments on divergent sequences).
+- `process_high_memory`: overwrites the amount of memory given to a process, granting it significantly more memory (aids in metagenomic assembly or clustering of large datasets).
+
+For the actual resource amounts allotted to each process definition, see the _Process-specific resource requirements_ section of the `conf/base.config` file.
+
+### Hardcoded tool configuration (nextflow.config)
+All command-line arguments and their defaults can be set and/or altered in the _params_ section of the `nextflow.config` file. For a full list of parameters that can be altered, please refer to the `nextflow.config` file in the repo.
Some common arguments are listed in the [Common command line arguments](/usage/useage/#common-command-line-arguments) section of the docs, and further description of tool parameters can be found in [tool specific parameters](/usage/tool_params/).
+
+> **Example:** if your laboratory typically sequences using Nanopore chemistry "r1041_e82_400bps_hac_v4.2.0", the following code would be substituted in the _params_ section of the `nextflow.config` file:
+>
+>```
+>nanopore_chemistry = "r1041_e82_400bps_hac_v4.2.0" // Note the quotes around the value
+>```
+>
+>With this change, you would no longer need to explicitly state the Nanopore chemistry as an extra CLI argument when running mikrokondo.
+
+## Quality control report configuration
+> **WARNING:** Tread carefully here, as this requires modification of the `nextflow.config` file. **Make sure you have saved a backup of your `nextflow.config` file before modifying these options.**
+
+### QCReport field description
+The section of interest is the `QCReport` field in the params section of the `nextflow.config`. There are multiple sections with values that can be modified, or you can add data for a different organism. The default values in the pipeline are set up for **Illumina data**, so you may need to adjust settings for Nanopore or Pacbio data.
+
+An example of the QCReport structure is shown below, with annotations describing the values.
+>**NOTE:** The values below do not affect the running of the pipeline; they only affect the final quality messages output by the pipeline.
+```
+QCReport {
+    escherichia // Generic top-level name for the field; the name is technically arbitrary, but a descriptive field name keeps things organized
+    {
+        search = "Escherichia coli" // The phrase that is searched for in the species_top_hit field mentioned above.
The search is by containment, so if you wanted to look for both E.coli and E.albertii you could just set the value to "Escherichia"
+        raw_average_quality = 30 // Minimum raw average quality of all bases in the sequencing data. This value is generated before the decontamination procedure.
+        min_n50 = 95000 // The minimum n50 value allowed from Quast
+        max_n50 = 6000000 // The maximum n50 value allowed from Quast
+        min_nr_contigs = 1 // The minimum number of contigs a sample is allowed to have; a value of 1 works as a sanity check
+        max_nr_contigs = 500 // The maximum number of contigs the organism in the search field is allowed to have. Too many contigs could indicate a bad assembly or contamination
+        min_length = 4500000 // The minimum genome length allowed for the organism specified in the search field
+        max_length = 6000000 // The maximum genome length the organism in the search field is allowed to have
+        max_checkm_contamination = 3.0 // The maximum level of contamination allowed by CheckM
+        min_average_coverage = 30 // The minimum average coverage allowed
+    }
+    // DO NOT REMOVE THE FALLTHROUGH FIELD AS IT IS NEEDED TO CAPTURE OTHER ORGANISMS
+    fallthrough // The fallthrough field exists as a default value to capture organisms where no quality control data has been specified
+    {
+        search = "No organism specific QC data available."
+        raw_average_quality = 30
+        min_n50 = null
+        max_n50 = null
+        min_nr_contigs = null
+        max_nr_contigs = null
+        min_length = null
+        max_length = null
+        max_checkm_contamination = 3.0
+        min_average_coverage = 30
+    }
+}
+```
+
+### Example adding quality control data for *Salmonella*
+
+If you want to add quality control data for *Salmonella*, you can start off by using the template below:
+
+```
+VAR_NAME { // Replace VAR_NAME with the genus name of your sample; only use ASCII (a-zA-Z) alphabet characters in the name, and replace spaces, punctuation and other special characters with underscores (_)
+    search = "Search phrase" // Search phrase for your species top_hit. Note the quotes
+    raw_average_quality = // 30 is a default value; change it as needed
+    min_n50 = // Set your minimum n50 value
+    max_n50 = // Set a maximum n50 value
+    min_nr_contigs = // Set a minimum number of contigs
+    max_nr_contigs = // Set a maximum number of contigs
+    min_length = // Set a minimum genome length
+    max_length = // Set a maximum genome length
+    max_checkm_contamination = // Set a maximum level of contamination to use
+    min_average_coverage = // Set the minimum coverage value
+}
+```
+
+For *Salmonella*, I would fill in the values like so:
+```
+salmonella {
+    search = "Salmonella"
+    raw_average_quality = 30
+    min_n50 = 95000
+    max_n50 = 6000000
+    min_nr_contigs = 1
+    max_nr_contigs = 200
+    min_length = 4400000
+    max_length = 6000000
+    max_checkm_contamination = 3.0
+    min_average_coverage = 30
+}
+```
+
+With my values filled out, I can simply add them to the QCReport section in the `nextflow.config` file.
+
+```
+    QCReport {
+        escherichia {
+            search = "Escherichia coli"
+            raw_average_quality = 30
+            min_n50 = 95000
+            max_n50 = 6000000
+            min_nr_contigs = 1
+            max_nr_contigs = 500
+            min_length = 4500000
+            max_length = 6000000
+            max_checkm_contamination = 3.0
+            min_average_coverage = 30
+        } salmonella { // NOTE watch the opening and closing brackets
+            search = "Salmonella"
+            raw_average_quality = 30
+            min_n50 = 95000
+            max_n50 = 6000000
+            min_nr_contigs = 1
+            max_nr_contigs = 200
+            min_length = 4400000
+            max_length = 6000000
+            max_checkm_contamination = 3.0
+            min_average_coverage = 30
+        }
+        fallthrough {
+            search = "No organism specific QC data available."
+            raw_average_quality = 30
+            min_n50 = null
+            max_n50 = null
+            min_nr_contigs = null
+            max_nr_contigs = null
+            min_length = null
+            max_length = null
+            max_checkm_contamination = 3.0
+            min_average_coverage = 30
+        }
+    }
+```
+
+## Quality Control Fields
+This section affects the behaviour of the final summary quality control messages and is defined in the `QCReportFields` section within the `nextflow.config`. **I would advise against manipulating this section unless you really know what you are doing.**
+
+Each value in the QC report fields contains the following fields.
+
+- Field name
+  - path: path to the information in the summary report JSON
+  - coerce_type: type the value is coerced to; can be a Float, Integer, or Bool
+  - compare_fields: a list of fields corresponding to fields in the `QCReport` section of the `nextflow.config`. If two values are specified, it is assumed you wish to check that a value lies within a range of values.
+  - comp_type: the comparison type: 'ge' for greater than or equal, 'le' for less than or equal, 'bool' for true or false, or 'range' for checking if a value is between two values.
+  - on: A boolean value for turning a comparison on or off
+  - low_msg: A message shown if a value is below its compared value (optional)
+  - high_msg: A message shown if a value is above its compared value (optional)
+
+An example of what these fields look like is:
+
+```
+QCReportFields {
+    raw_average_quality {
+        path = [params.raw_reads.report_tag, "combined", "qual_mean"]
+        coerce_type = 'Float'
+        compare_fields = ['raw_average_quality']
+        comp_type = "ge"
+        on = true
+        low_msg = "Base quality is poor, resequencing is recommended."
+    }
+}
+```
+
diff --git a/docs/usage/examples.md b/docs/usage/examples.md
index cf749121..01bd0383 100644
--- a/docs/usage/examples.md
+++ b/docs/usage/examples.md
@@ -1,34 +1,34 @@
-# Command Line Examples
-
-Some example commands of running mikrokondo are provided below:
-
-## Running paired-end illumina data skipping Bakta
-`nextflow run main.nf --input sample_sheet.csv --skip_bakta true --platform illumina --outdir ../test_illumina -profile singularity -resume`
-
-The above command would run paired-end Illumina data, using Singulairty as a container service, using resume (e.g if picks up where the pipeline left off if being run again), skipping Bakta and outputting results in a folder called `test_illumina` one directory back from where the pipeline is run. **Note: your sample sheet does not need to be called sample_sheet.csv**
-
-## Running paired-end illumina data using Kraken2 for classifying the top species hit
-
-`nextflow run main.nf --input sample_sheet.csv --skip_bakta true --run_kraken true --platform illumina --outdir ../test_illumina_kraken -profile singularity -resume`
-
-The above command would run paired-end Illumina data, using Singulairty as a container service, using resume (e.g if picks up where the pipeline left off if being run again), skipping Bakta, using kraken2 to classify the species top hit and outputting results in a folder called `test_illumina_kraken` one directory back from where the pipeline is run.
**Note: your sample sheet does not need to be called sample_sheet.csv** - -## Running nanopore data -`nextflow run main.nf --input sample_sheet.csv --skip_ont_header_cleaning true --nanopore_chemistry r941_min_hac_g507 --platform nanopore --outdir ../test_nanopore -profile docker -resume` - -The above command would run single-end Nanopore data using Docker as a container service, using resume (e.g if picks up where the pipeline left off if being run again), outputting data into a folder called `../test_nanopore` and skipping the process of verifying all Nanopore fastq data headers are unique. **Note: your sample sheet does not need to be called sample_sheet.csv** - -## Running a hybrid assembly using Unicycler -`nextflow run main.nf --input sample_sheet.csv --hybrid_unicycler true --nanopore_chemistry r941_min_hac_g507 --platform hybrid --outdir ../test_hybrid -profile apptainer -resume` - -The above command would run single-end Nanopore and paired-end Illumina data using apptainer as a container service, using resume (e.g if picks up where the pipeline left off if being run again), outputting data into a folder called `../test_hybrid` and using Unicycler for assembly. **Note: your sample sheet does not need to be called sample_sheet.csv** - -## Running a hybrid assembly without Unicycler -`nextflow run main.nf --input sample_sheet.csv --platform hybrid --outdir ../test_hybrid -profile singularity -resume` - -The above command would run single-end Nanopore and paired-end Illumina data using singularity as a container service, using resume (e.g if picks up where the pipeline left off if being run again), outputting data into a folder called `../test_hybrid`. 
**Note: your sample sheet does not need to be called sample_sheet.csv**
-
-## Running metagenomic Nanopore data
-`nextflow run main.nf --skip_depth_sampling true --input sample_sheet.csv --skip_polishing true --skip_bakta true --metagenomic_run true --nanopore_chemistry r941_prom_hac_g507 --platform nanopore --outdir ../test_nanopore_meta -profile singularity -resume`
-
+# Command Line Examples
+
+Some example commands for running mikrokondo are provided below:
+
+## Running paired-end Illumina data skipping Bakta
+`nextflow run main.nf --input sample_sheet.csv --skip_bakta true --platform illumina --outdir ../test_illumina -profile singularity -resume`
+
+The above command would run paired-end Illumina data, using Singularity as a container service, using resume (i.e. it picks up where the pipeline left off if run again), skipping Bakta and outputting results in a folder called `test_illumina` one directory back from where the pipeline is run. **Note: your sample sheet does not need to be called sample_sheet.csv**
+
+## Running paired-end Illumina data using Kraken2 for classifying the top species hit
+
+`nextflow run main.nf --input sample_sheet.csv --skip_bakta true --run_kraken true --platform illumina --outdir ../test_illumina_kraken -profile singularity -resume`
+
+The above command would run paired-end Illumina data, using Singularity as a container service, using resume (i.e. it picks up where the pipeline left off if run again), skipping Bakta, using Kraken2 to classify the species top hit and outputting results in a folder called `test_illumina_kraken` one directory back from where the pipeline is run.
**Note: your sample sheet does not need to be called sample_sheet.csv**
+
+## Running Nanopore data
+`nextflow run main.nf --input sample_sheet.csv --skip_ont_header_cleaning true --nanopore_chemistry r941_min_hac_g507 --platform nanopore --outdir ../test_nanopore -profile docker -resume`
+
+The above command would run single-end Nanopore data using Docker as a container service, using resume (i.e. it picks up where the pipeline left off if run again), outputting data into a folder called `../test_nanopore` and skipping the process of verifying that all Nanopore fastq data headers are unique. **Note: your sample sheet does not need to be called sample_sheet.csv**
+
+## Running a hybrid assembly using Unicycler
+`nextflow run main.nf --input sample_sheet.csv --hybrid_unicycler true --nanopore_chemistry r941_min_hac_g507 --platform hybrid --outdir ../test_hybrid -profile apptainer -resume`
+
+The above command would run single-end Nanopore and paired-end Illumina data using Apptainer as a container service, using resume (i.e. it picks up where the pipeline left off if run again), outputting data into a folder called `../test_hybrid` and using Unicycler for assembly. **Note: your sample sheet does not need to be called sample_sheet.csv**
+
+## Running a hybrid assembly without Unicycler
+`nextflow run main.nf --input sample_sheet.csv --platform hybrid --outdir ../test_hybrid -profile singularity -resume`
+
+The above command would run single-end Nanopore and paired-end Illumina data using Singularity as a container service, using resume (i.e. it picks up where the pipeline left off if run again), outputting data into a folder called `../test_hybrid`.
**Note: your sample sheet does not need to be called sample_sheet.csv**
+
+## Running metagenomic Nanopore data
+`nextflow run main.nf --skip_depth_sampling true --input sample_sheet.csv --skip_polishing true --skip_bakta true --metagenomic_run true --nanopore_chemistry r941_prom_hac_g507 --platform nanopore --outdir ../test_nanopore_meta -profile singularity -resume`
+
The above command would run single-end Nanopore data using Singularity as a container service, using resume (i.e. it picks up where the pipeline left off if run again), outputting data into a folder called `../test_nanopore_meta`; all samples would be treated as metagenomic, assembly polishing would be turned off, annotation of assemblies with Bakta would not be performed, and depth sampling would not be performed either. **Note: your sample sheet does not need to be called sample_sheet.csv**
\ No newline at end of file
diff --git a/docs/usage/installation.md b/docs/usage/installation.md
index 4459dc5b..9a1b81a9 100644
--- a/docs/usage/installation.md
+++ b/docs/usage/installation.md
@@ -1,60 +1,64 @@
-# Installation
-
-## Installing Nextflow
-Nextflow is required to run mikrokondo, but fortunately it is not too hard to install (Linux is required). The instructions for installing Nextflow can be found at either either resource: [Nextflow Home](https://www.nextflow.io/) or [Nextflow Documentation](https://www.nextflow.io/docs/latest/getstarted.html#installation)
-
-## Container Engine
-Nextflow and Mikrokondo only supports running the pipeline using containers such as: Docker, Singularity (now apptainer), podman, gitpod, sifter and charliecloud. Currently only usage with Singularity has been tested, but support for each of the container services exists. Note: Singularity was adopted by the Linux Foundation and is now called Apptainer. Singularity still exists, but it is likely newer installs will use Apptainer.
-
-## Docker or Singularity?
-Docker or Singularity (Apptainer) Docker requires root privileges which can can make it a hassle to install on computing clusters (there are work arounds). Apptainer/Singularity does not, so running the pipeline using Apptainer/Singularity is the recommended method for running the pipeline. - -### Issues -Containers are not perfect, below is a list of some issues you may face using containers in mikrokondo, fixes for each issue will be detailed here as they are identified. -- Exit code 137, likely means your docker container used to much memory. - -## Dependencies -Besides the Nextflow run time (requires Java), and container engine the dependencies required by mikrokondo are fairly minimal requiring only Python 3.10 (more recent Python versions will work as well) to run. Currently mikrokondo has been tested with fully with Singularity (partially with Apptainer, containers all work not all workflow paths tested) and partially tested with Docker (not all workflow paths tested). **Dependencies can be installed with Conda (e.g. Nextflow and Python)**. To download the pipeline run: - -`git clone https://github.com/phac-nml/mikrokondo.git` - -### Dependencies listed - -- Python (3.10>=) -- Nextflow (22.10.1>=) -- Container service (Docker, Singularity, Apptainer have been tested) -- The source code: `git clone https://github.com/phac-nml/mikrokondo.git` - - - -## Resources to download -- [GTDB Mash Sketch](https://zenodo.org/record/8408361): required for speciation and determination if sample is metagenomic -- [Decontamination Index](https://zenodo.org/record/8408557): Required for decontamination of reads (it is simply a minimap2 index) -- [Kraken2 nt database](https://benlangmead.github.io/aws-indexes/k2): Required for binning of metagenommic data and is an alternative to using Mash for speciation -- [Bakta database](https://zenodo.org/record/7669534): Running Bakta is optional and there is a light database option, however the full one is recommended. 
You will have to unzip and un-tar the database for usage.
-
-### Fields to update with resources
-The above downloadable resources must be updated in the following places in your `nextflow.config`. A good place to store them is within the `databases` folder in the mikrokondo folder, if you do so you can just simply update the name of the database. The spots to update in the params section of the `nextflow.config` are listed below:
-
-```
-// Bakta db path, note the quotation marks
-bakta {
-  db = "/PATH/TO/BAKTA/DB"
-}
-
-// Decontamination minimap2 index, note the quotation marks
-r_contaminants {
-  mega_mm2_idx = "/PATH/TO/DECONTAMINATION/INDEX"
-}
-
-// kraken db path, not the quotation marks
-kraken {
-  db = "/PATH/TO/KRAKEN/DATABASE/"
-}
-
-// GTDB Mash sketch, note the quotation marks
-mash {
-  mash_sketch = "/PATH/TO/MASH/SKETCH/"
-}
-
-```
+# Installation
+
+## Dependencies
+- Python (>=3.10)
+- Nextflow (>=22.10.1)
+- A container service (Docker, Singularity and Apptainer have been tested)
+- The source code: `git clone https://github.com/phac-nml/mikrokondo.git`
+
+**Dependencies can be installed with Conda (e.g. Nextflow and Python)**.
+
+## To install mikrokondo
+Once all dependencies are installed (see below for instructions), download the pipeline by running:
+
+`git clone https://github.com/phac-nml/mikrokondo.git`
+
+## Installing Nextflow
+Nextflow is required to run mikrokondo (Linux is required), and instructions for its installation can be found at either: [Nextflow Home](https://www.nextflow.io/) or [Nextflow Documentation](https://www.nextflow.io/docs/latest/getstarted.html#installation)
+
+## Container Engine
+Nextflow and Mikrokondo require a container engine to run the pipeline, such as: Docker, Singularity (now Apptainer), Podman, Gitpod, Shifter or Charliecloud.
+
+> **NOTE:** Singularity was adopted by the Linux Foundation and is now called Apptainer. Singularity still exists; however, newer installs will likely use Apptainer.
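Before running the pipeline, it can be useful to confirm that at least one supported engine is actually installed. The short shell sketch below simply probes the `PATH` for the engine names mentioned above (the list of names checked is illustrative, not exhaustive):

```shell
#!/bin/sh
# Report which (if any) supported container engines are available on this machine.
# If none are found, mikrokondo cannot run until one is installed.
found=""
for engine in docker singularity apptainer podman charliecloud; do
    if command -v "$engine" >/dev/null 2>&1; then
        found="$found $engine"
    fi
done
if [ -n "$found" ]; then
    echo "available container engines:$found"
else
    echo "no supported container engine found on PATH"
fi
```

Whichever engine this reports is the one to select with the matching `-profile` flag when invoking the pipeline.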
+
+## Docker or Singularity?
+Docker requires root privileges, which can make it a hassle to install on computing clusters (though there are workarounds); Apptainer/Singularity does not. Therefore, using Apptainer/Singularity is the recommended method for running the mikrokondo pipeline.
+
+### Issues
+Containers are not perfect; below is a list of some issues you may face using containers in mikrokondo. Fixes for each issue will be detailed here as they are identified.
+
+- **Exit code 137:** usually means the Docker container used too much memory.
+
+## Resources to download
+- [GTDB Mash Sketch](https://zenodo.org/record/8408361): required for speciation and for determining whether a sample is metagenomic
+- [Decontamination Index](https://zenodo.org/record/8408557): required for decontamination of reads (this is a minimap2 index)
+- [Kraken2 std database](https://benlangmead.github.io/aws-indexes/k2): required for binning of metagenomic data; an alternative to using Mash for speciation
+- [Bakta database](https://zenodo.org/record/7669534): running Bakta is optional and there is a light database option, however the full one is recommended. You will have to unzip and un-tar the database for usage.
+
+### Fields to update with resources
+It is recommended to store the above resources within the `databases` folder in the mikrokondo folder; this allows you to update just the database names in `nextflow.config` rather than giving full paths.
+
+Below shows where to update database resources in the `params` section of the `nextflow.config` file:
+
+```
+// Bakta db path, note the quotation marks
+bakta {
+  db = "/PATH/TO/BAKTA/DB"
+}
+
+// Decontamination minimap2 index, note the quotation marks
+r_contaminants {
+  mega_mm2_idx = "/PATH/TO/DECONTAMINATION/INDEX"
+}
+
+// Kraken db path, note the quotation marks
+kraken {
+  db = "/PATH/TO/KRAKEN/DATABASE/"
+}
+
+// GTDB Mash sketch, note the quotation marks
+mash {
+  mash_sketch = "/PATH/TO/MASH/SKETCH/"
+}
+
+```
diff --git a/docs/usage/tool_params.md b/docs/usage/tool_params.md
new file mode 100644
index 00000000..59b8f668
--- /dev/null
+++ b/docs/usage/tool_params.md
@@ -0,0 +1,466 @@
+# Tool Specific Parameters
+To access tool-specific parameters from the command line you must use the dot operator. For the sake of organization and readability, the documentation below is nested to indicate where the dot operator is used. For example:
+```
+- quast
+  - min_contig_length NUM
+```
+translates to `--quast.min_contig_length NUM` on the CLI.
+
+>**Note:** Easily changed parameters are bolded. Sensible defaults are provided.
+
+### Abricate
+Screens contigs for antimicrobial resistance and virulence genes. If you wish to use a different Abricate database you may need to update the container you use.
+
+- abricate
+  - singularity: Abricate singularity container
+  - docker: Abricate docker container
+  - **args**: A string of additional command line arguments to pass to Abricate
+  - report_tag: Determines the name of the Abricate output in the final summary file. **Do not touch this unless doing pipeline development.**
+  - header_p: Tells the report module that the Abricate output contains headers. **Do not touch this unless doing pipeline development.**
+
+### Raw Read Metrics
+A custom Python script that gathers quality metrics for each fastq file.
+
+- raw_reads
+  - high_precision: When set to true, the floating point values output are accurate down to very small decimal places. It is recommended to leave this setting as false (use the standard floats): it is much faster, and such precise decimal places are not needed for this module.
+  - report_tag: This field determines the name of the Raw Read Metric field in the final summary report. **Do not touch this unless doing pipeline development.**
+
+### Coreutils
+In cases where a process uses bash scripting only, Nextflow by default will utilize system binaries when they are available and no container is specified. For reproducibility, we have chosen to use containers in such cases. When a better container is available, you can direct the pipeline to use it via the options below:
+
+- coreutils
+  - singularity: coreutils singularity container
+  - docker: coreutils docker container
+
+
+### Python
+Some scripts require Python3, therefore a well tested Python3 container is provided for reproducibility. However, as all the scripts within mikrokondo use only the standard library, you can swap these containers for any Python interpreter version. For instance, swapping in **pypy3** may result in a performance boost from the scripts, though this is currently untested.
+
+- python3
+  - singularity: Python3 singularity container
+  - docker: Python3 docker container
+
+### KAT
+KAT was previously used to estimate genome size; however, at the time of writing KAT appears to be only infrequently updated, and newer versions had issues running or sometimes gave incorrect output due to failures in peak recognition. Therefore, KAT has been removed from the pipeline. Its code still remains but **will be removed in the future**.
+
+### Seqtk
+Seqtk is used for both the sub-sampling of reads and conversion of fasta files to fastq files in mikrokondo.
The usage of seqtk to convert a fasta to a fastq is needed by certain typing tools that require reads as input (this was a design decision for generalizability of the pipeline).
+
+- seqtk
+  - singularity: Singularity container for seqtk
+  - docker: Docker container for seqtk
+  - seed: A seed value for sub-sampling
+  - reads_ext: Extension of reads after sub-sampling. Do not alter this unless doing pipeline development.
+  - assembly_fastq: Extension of the fastas after being converted to fastq files. Do not touch this unless doing pipeline development.
+  - report_tag: Name of seqtk data in the final summary report. Do not touch this unless doing pipeline development.
+
+### FastP
+FastP is a fast and widely used program for gathering read quality metrics, adapter trimming, read filtering and read trimming. FastP has extensive configuration options which are detailed in its documentation, but sensible defaults have been set. **Adapter trimming in FastP is performed using overlap analysis; however, if you do not trust this you can specify the sequencing adapters used directly in the additional arguments for FastP**.
+
+- fastp
+  - singularity: Singularity container for FastP
+  - docker: Docker container for FastP
+  - fastq_ext: Extension of the output FastP trimmed reads. Do not touch this unless doing pipeline development.
+  - html_ext: Extension of the html report output by FastP. Do not touch this unless doing pipeline development.
+  - json_ext: Extension of the json report output by FastP. Do not touch this unless doing pipeline development.
+  - report_tag: Title of FastP data in the summary report.
+  - **average_quality_e**: If a read/read-pair quality is less than this value it is discarded
+  - **cut_mean_quality**: The quality to trim reads to
+  - **qualified_quality_phred**: The quality a base must have to count as qualified when filtering by unqualified bases
+  - **unqualified_percent_limit**: The percent of bases that are allowed to be unqualified in a read.
This parameter is affected by the above qualified_quality_phred parameter.
+  - **illumina_length_min**: The minimum read length to be allowed in Illumina data
+  - **single_end_length_min**: The minimum read length allowed in Pacbio or Nanopore data
+  - **dedup_reads**: A parameter that can be turned on to deduplicate reads.
+  - **illumina_args**: The command string passed to FastP when using Illumina data. If you override this parameter, other set parameters such as average_quality_e must be overridden as well, as the command string is passed to FastP as written.
+  - **single_end_args**: The command string passed to FastP if single end data is used, e.g. Pacbio or Nanopore data. If this option is overridden you must specify all parameters passed to FastP, as this string is passed to FastP as written.
+  - report_exclude_fields: Fields in the summary json to be excluded from the final aggregated report. Do not alter this field unless doing pipeline development.
+
+### Chopper
+Chopper was originally used for trimming of Nanopore reads, but FastP was able to do the same work, so Chopper is no longer used. Its code currently remains but it cannot be run in the pipeline.
+
+### Flye
+Flye is used for assembly of Nanopore data.
+
+- flye
+  - nanopore
+    - raw: corresponds to the Flye option `--nano-raw`
+    - corr: corresponds to the Flye option `--nano-corr`
+    - hq: corresponds to the Flye option `--nano-hq`
+  - pacbio
+    - raw: corresponds to the Flye option `--pacbio-raw`
+    - corr: corresponds to the Flye option `--pacbio-corr`
+    - hifi: corresponds to the Flye option `--pacbio-hifi`
+  - singularity: Singularity container for Flye
+  - docker: Docker container for Flye
+  - fasta_ext: The file extension for fasta files. Do not alter this field unless doing pipeline development.
+  - gfa_ext: The file extension for gfa files. Do not alter this field unless doing pipeline development.
+  - gv_ext: The file extension for gv files.
Do not alter this field unless doing pipeline development.
+  - txt_ext: The file extension for txt files. Do not alter this field unless doing pipeline development.
+  - log_ext: The file extension for the Flye log files. Do not alter this field unless doing pipeline development.
+  - json_ext: The file extension for the Flye json files. Do not alter this field unless doing pipeline development.
+  - **polishing_iterations**: The number of polishing iterations for Flye.
+  - ext_args: Extra command line options to pass to Flye
+
+### Spades
+Used for paired-end read assembly.
+
+- spades
+  - singularity: Singularity container for Spades
+  - docker: Docker container for Spades
+  - scaffolds_ext: The file extension for the scaffolds file. Do not alter this field unless doing pipeline development.
+  - contigs_ext: The file extension of the file containing assembled contigs. Do not alter this field unless doing pipeline development.
+  - transcripts_ext: The file extension for the assembled transcripts. Do not alter this field unless doing pipeline development.
+  - assembly_graphs_ext: The file extension of the assembly graphs. Do not alter this field unless doing pipeline development.
+  - log_ext: The file extension for the log files. Do not alter this field unless doing pipeline development.
+  - outdir: The name of the output directory for assemblies. Do not alter this field unless doing pipeline development.
+
+### FastQC
+This is a default tool added to nf-core pipelines. This feature will likely be removed in the future, but for those fond of it, the outputs of FastQC still remain.
+
+- fastqc
+  - html_ext: The file extension of the FastQC html file. Do not alter this field unless doing pipeline development.
+  - zip_ext: The file extension of the zipped FastQC outputs. Do not alter this field unless doing pipeline development.
+
+### Quast
+Quast is used to gather assembly metrics, to which automated quality control criteria are then applied.
+
+- quast
+  - singularity: Singularity container for Quast.
+  - docker: Docker container for Quast.
+  - suffix: The suffix attached to Quast outputs. Do not alter this field unless doing pipeline development.
+  - report_base: The base term for output Quast files to be used in reporting. Do not alter this field unless doing pipeline development.
+  - report_prefix: The prefix of the Quast outputs to be used in reporting. Do not alter this field unless doing pipeline development.
+  - **min_contig_length**: The minimum contig length for Quast to use when generating metrics.
+  - **args**: A command string to pass to Quast; altering this is inadvisable as certain options may affect your reporting output. This string is passed to Quast verbatim.
+  - header_p: This tells the pipeline that the Quast report outputs contain a header. Do not alter this field unless doing pipeline development.
+
+### Quast Filter
+Assemblies can be prevented from going into further analyses based on the Quast output. The options for this filter are listed here.
+
+- quast_filter
+  - n50_field: The name of the field to search for and filter on. Do not alter this field unless doing pipeline development.
+  - n50_value: The minimum value the field specified is allowed to contain.
+  - nr_contigs_field: The name of the field in the Quast report to filter on. Do not alter this field unless doing pipeline development.
+  - nr_contigs_value: The minimum number of contigs an assembly must have to proceed further through the pipeline.
+  - sample_header: The column name in the Quast output containing the sample information. Do not alter this field unless doing pipeline development.
+
+### CheckM
+CheckM is used within the pipeline for assessing contamination in assemblies.
+
+- checkm
+  - singularity: Singularity container containing CheckM
+  - docker: Docker container containing CheckM
+  - alignment_ext: Extension of the gene alignment output within CheckM. Do not alter this field unless doing pipeline development.
+  - results_ext: The extension of the file containing the CheckM results. Do not alter this field unless doing pipeline development.
+  - tsv_ext: The extension of the file containing the tsv results from CheckM. Do not alter this field unless doing pipeline development.
+  - folder_name: The name of the folder containing the outputs from CheckM. Do not alter this field unless doing pipeline development.
+  - gzip_ext: The compression extension for CheckM. Do not alter this field unless doing pipeline development.
+  - lineage_ms: The name of the lineages.ms file output by CheckM. Do not alter this field unless doing pipeline development.
+  - threads: The number of threads to use in CheckM. Do not alter this field unless doing pipeline development.
+  - report_tag: The name of the CheckM data in the summary report. Do not alter this field unless doing pipeline development.
+  - header_p: Denotes that the result used by the pipeline in generation of the summary report contains a header. Do not alter this field unless doing pipeline development.
+
+### Kraken2
+Kraken2 can be used as a substitute for Mash in speciation of samples, and it is used to bin contigs of metagenomic samples.
+
+- kraken
+  - singularity: Singularity container for Kraken2.
+  - docker: Docker container for Kraken2.
+  - classified_suffix: Suffix for classified data from Kraken2. Do not alter this field unless doing pipeline development.
+  - unclassified_suffix: Suffix for unclassified data from Kraken2. Do not alter this field unless doing pipeline development.
+  - report_suffix: The name of the report output by Kraken2.
+  - output_suffix: The name of the output file from Kraken2. Do not alter this field unless doing pipeline development.
+
+  - **tophit_level**: The taxonomic level to classify a sample at, e.g. the default is `S` for species, but you could use `S1` or `F`.
+  - save_output_fastqs: Option to save the output fastq files from Kraken2. Do not alter this field unless doing pipeline development.
+  - save_read_assignments: Option to save how Kraken2 assigns reads. Do not alter this field unless doing pipeline development.
+  - **run_kraken_quick**: This option can be set to `true` if one wishes to run Kraken2 in quick mode.
+  - report_tag: The name of the Kraken2 data in the final report. Do not alter this field unless doing pipeline development.
+  - header_p: Tells the pipeline whether the file used for reporting contains header data. Do not alter this field unless doing pipeline development.
+  - headers: A list of headers in the Kraken2 report. Do not alter this field unless doing pipeline development.
+
+### Seven Gene MLST
+Runs Torsten Seemann's seven-gene MLST program.
+
+- mlst
+  - singularity: Singularity container for mlst.
+  - docker: Docker container for mlst.
+  - **args**: Additional arguments to pass to mlst.
+  - tsv_ext: Extension of the mlst tabular file. Do not alter this field unless doing pipeline development.
+  - json_ext: Extension of the mlst output JSON file. Do not alter this field unless doing pipeline development.
+  - report_tag: Name of the data outputs in the final report. Do not alter this field unless doing pipeline development.
+
+### Mash
+Mash is used repeatedly throughout the pipeline for estimation of genome size from reads, contamination detection and determining the final species of an assembly.
+
+- mash
+  - singularity: Singularity container for Mash.
+  - docker: Docker container for Mash.
+  - mash_ext: Extension of the mash screen file. Do not alter this field unless doing pipeline development.
+  - output_reads_ext: Extension of Mash outputs when run on reads. Do not alter this field unless doing pipeline development.
+
+  - output_taxa_ext: Extension of Mash output when run on contigs. Do not alter this field unless doing pipeline development.
+  - mash_sketch: The GTDB sketch used by the pipeline. This sketch is special as it contains the taxonomic paths used in the classification step of the pipeline. As of 2023-10-05 it can be found here: https://zenodo.org/record/8408361
+  - sketch_ext: File extension of a mash sketch. Do not alter this field unless doing pipeline development.
+  - json_ext: File extension of json data output by Mash. Do not alter this field unless doing pipeline development.
+  - sketch_kmer_size: The size of the kmers used in sketching for genome size estimation.
+  - **min_kmer**: The minimum number of kmer copies required to pass the noise filter. This value is used in estimation of genome size from reads. The default value is 10, as it seems to work well for Illumina data.
+  - final_sketch_name: **to be removed** This parameter was originally part of a subworkflow included in the pipeline for generation of the GTDB sketch, but that has been removed and replaced with scripting.
+  - report_tag: Report tag for Mash in the summary report. Do not alter this field unless doing pipeline development.
+  - header_p: Tells the pipeline if the output data contains headers. Do not alter this field unless doing pipeline development.
+  - headers: A list of the headers the output of Mash should contain. Do not alter this field unless doing pipeline development.
+
+### Mash Meta
+This process is used to determine if a sample is metagenomic or not.
+
+- mash_meta
+  - report_tag: The name of this output field in the summary report. Do not alter this field unless doing pipeline development.
+
+### Top Hit Species
+As either Kraken2 or Mash can be used for determining the species present in the pipeline, they share a common report tag.
+
+- top_hit_species
+  - report_tag: The name of the determined species in the final report. Do not alter this field unless doing pipeline development.
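+
+As a worked example of the dot-operator convention for the speciation tools above, the Mash noise filter could be loosened for a low-coverage run directly on the command line (the sample sheet and output paths here are placeholders):
+
+```
+nextflow run main.nf \
+  --input samplesheet.csv \
+  --outdir results \
+  --platform illumina \
+  --mash.min_kmer 5 \
+  -profile singularity
+```
+
+Dropping `min_kmer` below the default of 10 keeps more low-copy kmers, which can help genome size estimation on shallow data at the cost of admitting more noise.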
+
+### Contamination Removal
+This step removes contaminants from read data; it performs dehosting and removal of kitomes.
+
+- r_contaminants
+  - singularity: Singularity container used to perform dehosting; this container contains minimap2 and samtools.
+  - docker: Docker container used to perform dehosting; this container contains minimap2 and samtools.
+  - phix_fa: The path to the file containing the phiX fasta.
+  - homo_sapiens_fa: The path to the file containing the human genome fasta.
+  - pacbio_mg: The path to the file containing the Pacbio sequencing control.
+  - output_ext: The extension of the deconned fastq files. Do not alter this field unless doing pipeline development.
+  - mega_mm2_idx: The path to the minimap2 index used for dehosting. Do not alter this field unless doing pipeline development.
+  - mm2_illumina: The arguments passed to minimap2 for Illumina data. Do not alter this field unless doing pipeline development.
+  - mm2_pac: The arguments passed to minimap2 for Pacbio data. Do not alter this field unless doing pipeline development.
+  - mm2_ont: The arguments passed to minimap2 for Nanopore data. Do not alter this field unless doing pipeline development.
+  - samtools_output_ext: The extension of the output from samtools. Do not alter this field unless doing pipeline development.
+  - samtools_singletons_ext: The extension of singleton reads from samtools. Do not alter this field unless doing pipeline development.
+  - output_ext: The name of the files output from samtools. Do not alter this field unless doing pipeline development.
+  - output_dir: The directory where deconned reads are placed. Do not alter this field unless doing pipeline development.
+
+### Minimap2
+Minimap2 is used frequently throughout the pipeline for decontamination and for mapping reads back to assemblies for polishing.
+
+- minimap2
+  - singularity: The singularity container for minimap2; the same one is used for contamination removal.
+
+  - docker: The Docker container for minimap2; the same one is used for contamination removal.
+  - index_outdir: The directory where created indices are output. Do not alter this field unless doing pipeline development.
+  - index_ext: The file extension of created indices. Do not alter this field unless doing pipeline development.
+
+### Samtools
+Samtools is used for sam to bam conversion in the pipeline.
+
+- samtools
+  - singularity: The Singularity container containing samtools; the same container is used as the one in contamination removal.
+  - docker: The Docker container containing samtools; the same container is used as the one in contamination removal.
+  - bam_ext: The extension of the bam file from samtools. Do not alter this field unless doing pipeline development.
+  - bai_ext: The extension of the bam index from samtools. Do not alter this field unless doing pipeline development.
+
+### Racon
+Racon is used as a first pass for polishing assemblies.
+
+- racon
+  - singularity: The Singularity container containing Racon.
+  - docker: The Docker container containing Racon.
+  - consensus_suffix: The suffix for Racon's outputs. Do not alter this field unless doing pipeline development.
+  - consensus_ext: The file extension for the Racon consensus sequence. Do not alter this field unless doing pipeline development.
+  - outdir: The directory containing the polished sequences. Do not alter this field unless doing pipeline development.
+
+### Pilon
+Pilon was added to the pipeline, but it is run iteratively, which at the time of writing was not well supported in Nextflow, so a separate script and containers are provided to run Pilon. The Pilon code remains in the pipeline so that iterative Pilon polishing can be integrated directly once this becomes easy to do.
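+
+For reference, one round of the iterative polishing the wrapper performs looks roughly like the following (a sketch of the assumed behavior, not the actual script; file names are placeholders):
+
+```
+# Map reads back to the current assembly, then polish it with Pilon
+minimap2 -ax sr assembly.fasta reads_R1.fastq.gz reads_R2.fastq.gz > aln.sam
+samtools sort -o aln.bam aln.sam && samtools index aln.bam
+pilon --genome assembly.fasta --frags aln.bam --output polished --changes
+# If the .changes file is empty, polishing has converged and iteration stops early
+```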
+
+### Pilon Iterative Polishing
+This process is a wrapper around minimap2, samtools and Pilon for iterative polishing; dedicated containers are built for it, **but if you ever have problems with this step, disabling polishing will fix your issue (at the cost of polishing)**.
+
+- pilon_iterative
+  - singularity: The container containing the iterative Pilon program. If you ever have issues with the singularity image you can use the Docker image, as Nextflow will automatically convert the Docker image into a singularity image.
+  - docker: The Docker container for the Pilon iterative polisher.
+  - outdir: The directory where polished data is output. Do not alter this field unless doing pipeline development.
+  - fasta_ext: File extension for the fasta to be polished. Do not alter this field unless doing pipeline development.
+  - fasta_outdir: The output directory name for the polished fastas. Do not alter this field unless doing pipeline development.
+  - vcf_ext: File extension for the VCF output by Pilon. Do not alter this field unless doing pipeline development.
+  - vcf_outdir: Output directory containing the VCF files from Pilon. Do not alter this field unless doing pipeline development.
+  - bam_ext: Bam file extension. Do not alter this field unless doing pipeline development.
+  - bai_ext: Bam index file extension. Do not alter this field unless doing pipeline development.
+  - changes_ext: File extension for the Pilon output containing the changes applied to the assembly. Do not alter this field unless doing pipeline development.
+  - changes_outdir: The output directory for the Pilon changes. Do not alter this field unless doing pipeline development.
+  - max_memory_multiplier: On failure this program will try again with more memory; the multiplier is the factor by which the amount of memory passed to the program is increased. Do not alter this field unless doing pipeline development.
+
+  - **max_polishing_illumina**: Number of iterations for polishing an Illumina assembly with Illumina reads.
+  - **max_polishing_nanopore**: Number of iterations to polish a Nanopore assembly with (will use Illumina reads if provided).
+  - **max_polishing_pacbio**: Number of iterations to polish a Pacbio assembly with (will use Illumina reads if provided).
+
+### Medaka Polishing
+Medaka is used for polishing Nanopore assemblies; make sure you specify a Medaka model when using the pipeline so the correct settings are applied. If you have issues with Medaka running, try disabling resume, or alternatively **disable polishing**, as Medaka can be troublesome to run.
+
+- medaka
+  - singularity: Singularity container with Medaka.
+  - docker: Docker container with Medaka.
+  - model: This parameter will be autofilled with the model specified at the top level by the `nanopore_chemistry` option. Do not alter this field unless doing pipeline development.
+  - fasta_ext: Polished fasta output. Do not alter this field unless doing pipeline development.
+  - batch_size: The batch size passed to Medaka; this can improve performance. Do not alter this field unless doing pipeline development.
+
+### Unicycler
+Unicycler is an option provided for hybrid assembly; it is a great option and outputs an excellent assembly, but it requires **a lot** of resources, which is why the alternate hybrid assembly option using Flye->Racon->Pilon is available. Additionally, Unicycler can generate a fairly cryptic Spades error that usually relates to memory usage; it will typically say something involving `tputs`.
+
+- unicycler
+  - singularity: The Singularity container containing Unicycler.
+  - docker: The Docker container containing Unicycler.
+  - scaffolds_ext: The scaffolds file extension output by Unicycler. Do not alter this field unless doing pipeline development.
+  - assembly_ext: The assembly extension output by Unicycler. Do not alter this field unless doing pipeline development.
+
+  - log_ext: The log file extension output by Unicycler. Do not alter this field unless doing pipeline development.
+  - outdir: The output directory the Unicycler data is sent to. Do not alter this field unless doing pipeline development.
+  - mem_modifier: Specifies a high amount of memory for Unicycler to prevent a common, fairly cryptic Spades error. Do not alter this field unless doing pipeline development.
+  - threads_increase_factor: Factor to increase the number of threads passed to Unicycler. Do not alter this field unless doing pipeline development.
+
+
+### Mob-suite Recon
+Mob-suite recon provides annotation of plasmids in the assembly data.
+
+- mobsuite_recon
+  - singularity: The singularity container containing mob-suite recon.
+  - docker: The Docker container containing mob-suite recon.
+  - **args**: Additional arguments to pass to mob-suite.
+  - fasta_ext: The file extension for FASTAs. Do not alter this field unless doing pipeline development.
+  - results_ext: The file extension for results in mob-suite. Do not alter this field unless doing pipeline development.
+  - mob_results_file: The mob-suite results to be included in the final report. Do not alter this field unless doing pipeline development.
+  - report_tag: The field name of mob-suite data in the final report. Do not alter this field unless doing pipeline development.
+  - header_p: Default is `true`, indicating that the results output by mob-suite contain a header. Do not alter this field unless doing pipeline development.
+
+### StarAMR
+StarAMR provides annotation of antimicrobial resistance genes within your data. The process will alter FASTA headers of input files to ensure each header is fewer than 50 characters long.
+
+- staramr
+  - singularity: The singularity container containing StarAMR.
+  - docker: The Docker container containing StarAMR.
+  - **db**: The database for StarAMR. The default value of `null` tells the pipeline to use the database included in the StarAMR container.
However, you can specify a path to a valid StarAMR database and use that instead.
+  - tsv_ext: File extension of the reports from StarAMR. Do not alter this field unless doing pipeline development.
+  - txt_ext: File extension of the text reports from StarAMR. Do not alter this field unless doing pipeline development.
+  - xlsx_ext: File extension of the Excel spreadsheet from StarAMR. Do not alter this field unless doing pipeline development.
+  - **args**: Additional arguments to pass to StarAMR.
+  - point_finder_dbs: A list containing the valid databases StarAMR supports for PointFinder, structured the way StarAMR needs for input. Do not alter this field unless doing pipeline development.
+  - report_tag: The field name of StarAMR in the final summary report. Do not alter this field unless doing pipeline development.
+  - header_p: Indicates the final report from StarAMR contains a header line. Do not alter this field unless doing pipeline development.
+
+### Bakta
+Bakta is used to provide annotation of genomes; it is very reliable but it can be slow.
+
+- bakta
+  - singularity: The singularity container containing Bakta.
+  - docker: The Docker container containing Bakta.
+  - **db**: The path to the downloaded Bakta database.
+  - output_dir: The name of the folder where Bakta data is saved to. Do not alter this field unless doing pipeline development.
+  - embl_ext: File extension of the embl file. Do not alter this field unless doing pipeline development.
+  - faa_ext: File extension of the faa file. Do not alter this field unless doing pipeline development.
+  - ffn_ext: File extension of the ffn file. Do not alter this field unless doing pipeline development.
+  - fna_ext: File extension of the fna file. Do not alter this field unless doing pipeline development.
+  - gbff_ext: File extension of the gbff file.
Do not alter this field unless doing pipeline development.
+  - gff_ext: File extension of the GFF file. Do not alter this field unless doing pipeline development.
+  - threads: Number of threads for Bakta to use; remember more is not always better. Do not alter this field unless doing pipeline development.
+  - hypotheticals_tsv_ext: File extension for hypothetical genes. Do not alter this field unless doing pipeline development.
+  - hypotheticals_faa_ext: File extension of the hypothetical genes fasta. Do not alter this field unless doing pipeline development.
+  - tsv_ext: The file extension of the final Bakta tsv report. Do not alter this field unless doing pipeline development.
+  - txt_ext: The file extension of the txt report. Do not alter this field unless doing pipeline development.
+  - min_contig_length: The minimum contig length to be annotated by Bakta.
+
+### Bandage
+Bandage is included to make Bandage plots of the initial assemblies, e.g. from Spades, Flye or Unicycler. These images can be useful in determining the quality of an assembly.
+
+- bandage
+  - singularity: The path to the singularity image containing Bandage.
+  - docker: The path to the docker file containing Bandage.
+  - svg_ext: The extension of the SVG file created by Bandage. Do not alter this field unless doing pipeline development.
+  - outdir: The output directory of the Bandage images.
+
+### Subtyping Report
+All subtyping report tools contain a common report tag so that they can be identified by the program.
+
+- subtyping_report
+  - report_tag: Subtyping report name. Do not alter this field unless doing pipeline development.
+
+### ECTyper
+ECTyper is used to perform *in-silico* typing of *Escherichia coli* and is automatically triggered by the pipeline.
+
+- ectyper
+  - singularity: The path to the singularity container containing ECTyper.
+  - docker: The path to the Docker container containing ECTyper.
+  - log_ext: File extension of the ECTyper log file.
Do not alter this field unless doing pipeline development.
+  - tsv_ext: File extension of the ECTyper tsv file. Do not alter this field unless doing pipeline development.
+  - txt_ext: Text file extension of ECTyper output. Do not alter this field unless doing pipeline development.
+  - report_tag: Report tag for ECTyper data. Do not alter this field unless doing pipeline development.
+  - header_p: Denotes if the table output from ECTyper contains a header. Do not alter this field unless doing pipeline development.
+
+### Kleborate
+Kleborate performs automatic typing of *Klebsiella*.
+
+- kleborate
+  - singularity: The path to the singularity container containing Kleborate.
+  - docker: The path to the docker container containing Kleborate.
+  - txt_ext: The file extension of the Kleborate report. Do not alter this field unless doing pipeline development.
+  - report_tag: The report tag for Kleborate. Do not alter this field unless doing pipeline development.
+  - header_p: Denotes the Kleborate table contains a header. Do not alter this field unless doing pipeline development.
+
+### Spatyper
+Performs typing of *Staphylococcus* species.
+
+- spatyper
+  - singularity: The path to the singularity container containing Spatyper.
+  - docker: The path to the docker container containing Spatyper.
+  - tsv_ext: The file extension of the Spatyper output. Do not alter this field unless doing pipeline development.
+  - report_tag: The report tag for Spatyper. Do not alter this field unless doing pipeline development.
+  - header_p: Denotes whether or not the output table contains a header. Do not alter this field unless doing pipeline development.
+  - repeats: An optional file specifying repeats that can be passed to Spatyper.
+  - repeat_order: An optional file containing a repeat order to pass to Spatyper.
+
+### SISTR
+*In-silico Salmonella* serotype prediction.
+
+- sistr
+  - singularity: The path to the singularity container containing SISTR.
+
+  - docker: The path to the Docker container containing SISTR.
+  - tsv_ext: The file extension of the SISTR output. Do not alter this field unless doing pipeline development.
+  - allele_fasta_ext: The extension of the alleles identified by SISTR. Do not alter this field unless doing pipeline development.
+  - allele_json_ext: The extension of the output JSON file from SISTR. Do not alter this field unless doing pipeline development.
+  - cgmlst_tag: The extension of the cgMLST file from SISTR. Do not alter this field unless doing pipeline development.
+  - report_tag: The report tag for SISTR. Do not alter this field unless doing pipeline development.
+  - header_p: Denotes whether or not the output table contains a header. Do not alter this field unless doing pipeline development.
+
+### Lissero
+*In-silico Listeria* typing.
+
+- lissero
+  - singularity: The path to the singularity container containing Lissero.
+  - docker: The path to the docker container containing Lissero.
+  - tsv_ext: The file extension of the Lissero output. Do not alter this field unless doing pipeline development.
+  - report_tag: The report tag for Lissero. Do not alter this field unless doing pipeline development.
+  - header_p: Denotes if the output table of Lissero contains a header. Do not alter this field unless doing pipeline development.
+
+### Shigeifinder
+*In-silico Shigella* typing.
+>**NOTE:** It is unlikely this subtyper will be triggered, as GTDB has merged *E. coli* and *Shigella* in an updated sketch. An updated version of ECTyper will be released soon to address the shortfalls of this sketch. If you are relying on *Shigella* detection, add `--run_kraken true` to your command line or update the value in the `.nextflow.config`, as Kraken2 (while slower) can still detect *Shigella*.
+
+- shigeifinder
+  - singularity: The Singularity container containing Shigeifinder.
+  - docker: The path to the Docker container containing Shigeifinder.
+ - container_version: The version number **to be updated with the containers**, as Shigeifinder does not currently track a version number in the command. + - tsv_ext: Extension of the output report. + - report_tag: The name of the output report for Shigeifinder. + - header_p: Denotes whether the output from Shigeifinder includes header values. + + +### Shigatyper (Replaced with Shigeifinder) +The code still remains but will likely be removed later on. + +- shigatyper + - singularity: The Singularity container containing Shigatyper. + - docker: The path to the Docker container containing Shigatyper. + - tsv_ext: The TSV file extension. Do not alter this field unless doing pipeline development. + - report_tag: The report tag for Shigatyper. Do not alter this field unless doing pipeline development. + - header_p: Denotes if the report output contains a header. Do not alter this field unless doing pipeline development. + +### Kraken2 Contig Binning +Bins contigs based on the Kraken2 output for contaminated/metagenomic samples. This is implemented using a custom script. + +- kraken_bin + - **taxonomic_level**: The taxonomic level to bin contigs at. Binning at the species level is not recommended; the default is to bin at the genus level, which is specified with the character `G`. To bin at a higher level such as family, you would specify `F`. + - fasta_ext: The extension of the output fasta files. Do not alter this field unless doing pipeline development. diff --git a/docs/usage/useage.md b/docs/usage/useage.md new file mode 100644 index 00000000..5063519f --- /dev/null +++ b/docs/usage/useage.md @@ -0,0 +1,120 @@ +# Running MikroKondo + +### Samplesheet +Mikrokondo requires a sample sheet to be run. This FOFN (file of file names) contains the sample names and allows a user to combine read-sets based on that name if provided.
The sample-sheet can utilize the following header fields: + +- sample +- fastq_1 +- fastq_2 +- long_reads +- assembly + +**The sample sheet must be in CSV format and sample files must be gzipped.** + +Example layouts for different sample-sheets include: + +_Illumina paired-end data_ + +|sample|fastq_1|fastq_2| +|------|-------|-------| +|sample_name|path_to_forward_reads|path_to_reverse_reads| + +_Nanopore_ + +|sample|long_reads| +|------|----------| +|sample_name|path_to_reads| + +_Hybrid Assembly_ + +|sample|fastq_1|fastq_2|long_reads| +|-------|-------|------|----------| +|sample_name|path_to_forward_reads|path_to_reverse_reads|path_to_long_reads| + +_Starting with assembly only_ + +|sample|assembly| +|------|--------| +|sample_name|path_to_assembly| + +## Usage + +MikroKondo can be run like most other Nextflow pipelines. The most basic usage is as follows: +`nextflow run main.nf --input PATH_TO_SAMPLE_SHEET --outdir OUTPUT_DIR --platform SEQUENCING_PLATFORM -profile CONTAINER_TYPE` + +Many parameters can be altered or accessed from the command line. For a full list of parameters that can be altered, please refer to the `nextflow.config` file in the repo. + +> **Note:** All the below settings can be permanently changed in the `nextflow.config` file within the `params` section. For example, to permanently set a nanopore chemistry and use Kraken for speciation: +``` +run_kraken = true // Note the lack of quotes +nanopore_chemistry = "r1041_e82_400bps_hac_v4.2.0" // Note the quotes used here +``` + +### Common command line arguments + +#### Nf-core boilerplate options + +- `--publish_dir_mode`: Method used to save pipeline results to the output directory. +- `--email`: Email address for completion summary. +- `--email_on_fail`: An email address to send a summary email to - ONLY sent if the pipeline does not exit successfully. +- `--plaintext_email`: Send plain-text email instead of HTML.
+- `--monochrome_logs`: Do not use coloured log outputs. +- `--hook_url`: Incoming hook URL for messaging service. Currently, MS Teams and Slack are supported. +- `--help`: Display help text. +- `--version`: Display version and exit. +- `--validate_params`: Boolean specifying whether to validate parameters against the schema at runtime. +- `--show_hidden_params`: By default, parameters set as _hidden_ in the schema are not shown on the command line when a user runs with `--help`. Specifying this option will tell the pipeline to show all parameters. + +#### General tool options +- `--fly_read_type VALUE`: Flye allows for different assembly options. The default value is set to `hq` (high-quality Nanopore reads, or PacBio HiFi reads). User options include `hq`, `corr` and `raw`, and a default value can be specified in the `nextflow.config` file. +- `--hybrid_unicycler true`: to use Unicycler for assembly instead of Flye->Racon->Pilon. + >**Note:** You may need to check the `conf/base.config` `process_high_memory` declaration and provide it upwards of 1000GB of memory if you get errors mentioning `tputs`. This error is sadly not very clear, but increasing resources available to the process will help. +- `--metagenomic_run true`: users can specify that samples are metagenomic via this flag; the pipeline will skip the Mash contamination check and proceed with metagenomic assembly. +- `--min_reads NUM`: refers to the minimum number of reads required after the fastp step to progress a set of sample reads to assembly (default: 1000). +- `--nanopore_chemistry YOUR_MODEL_HERE`: a Medaka model must be specified for polishing. A list of allowed models can be found here: [Medaka models python script](https://github.com/nanoporetech/medaka/blob/master/medaka/options.py) or [Medaka models available for download](https://github.com/nanoporetech/medaka/tree/master/medaka/data) +- `--run_kraken true`: can be used to enable Kraken2 for speciation instead of Mash.
+- `--target_depth`: refers to the target bp depth for a set of reads. When sample read sets have an estimated depth higher than this target, they are down-sampled to achieve this depth. No down-sampling occurs when the estimated depth is lower than this value (default 100). + + +#### Skip Options + +Numerous steps within mikrokondo can be turned off without compromising the stability of the pipeline. These skip options can reduce run-time of the pipeline or allow for completion of the pipeline despite errors. +**All of the following options can be turned on by entering `--{skip_option} true` in the command line arguments to the pipeline.** + +- `--skip_abricate`: turn off Abricate AMR detection. +- `--skip_bakta`: turn off the Bakta annotation step (generally a slow step, requiring a database to be specified). +- `--skip_checkm`: CheckM is used as part of the contamination detection within mikrokondo; its run time can be lengthy and its resource usage high. +- `--skip_depth_sampling`: the genome size is estimated using Mash and reads can be down-sampled to the target depth for a better assembly; if this is of no interest to you, this flag will skip this step entirely. **If you have specified that your run is metagenomic, down-sampling is turned off.** +- `--skip_mobrecon`: turn off MOB-suite recon. +- `--skip_ont_header_cleaning`: Nanopore data may fail in the pipeline due to duplicate headers; while rare, this can cause assemblies to fail. Unlike the other options on this list, skipping header cleaning is defaulted to `true`. +- `--skip_polishing`: if running a metagenomic assembly or encountering issues with polishing steps, this flag will disable polishing and retrieve the assembly directly from SPAdes/Flye. **This does not apply to hybrid assemblies.** +- `--skip_report`: prevents the generation of the final summary report.
+- `--skip_species_classification`: prevents Mash or Kraken2 from being run on the assembled genome, and also **prevents the subtyping workflow from triggering.** +- `--skip_starmar`: turn off StarAMR AMR detection. +- `--skip_subtyping`: to turn off automatic triggering of subtyping in the pipeline (useful when the target organism does not have a subtyping tool installed within mikrokondo). +- `--skip_version_gathering`: prevents the collation of tool versions. This process generally takes a couple of minutes (at worst), so skipping it can be useful during recurrent runs of the pipeline (like when testing settings). + +#### Containers + +Different container services can be specified from the command line when running mikrokondo via the `-profile` option. This option is specified at the end of your command line. Examples of different container services are specified below: + +- For Docker: `nextflow run main.nf MY_OPTIONS -profile docker` +- For Singularity: `nextflow run main.nf MY_OPTIONS -profile singularity` +- For Apptainer: `nextflow run main.nf MY_OPTIONS -profile apptainer` +- For Shifter: `nextflow run main.nf MY_OPTIONS -profile shifter` +- For Charliecloud: `nextflow run main.nf MY_OPTIONS -profile charliecloud` +- For Gitpod: `nextflow run main.nf MY_OPTIONS -profile gitpod` +- For Podman: `nextflow run main.nf MY_OPTIONS -profile podman` + +#### Platform specification + +- `--platform illumina` for Illumina. +- `--platform nanopore` for Nanopore. +- `--platform pacbio` for Pacbio. +- `--platform hybrid` for hybrid assemblies. + > **Note:** when denoting your run as using a hybrid platform, you must also add the `long_read_opt` parameter, as the default value is nanopore: `--long_read_opt nanopore` for Nanopore or `--long_read_opt pacbio` for Pacbio. + +#### Slurm options + +- `--slurm_p true`: the Slurm executor will be used. +- `--slurm_profile STRING`: a string to allow the user to specify which Slurm partition to use.
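Putting the options above together, a minimal end-to-end invocation might look like the following sketch. The sample name, read paths, and output directory are placeholders, and the command is echoed rather than executed so the snippet works without Nextflow installed — substitute your own values and drop the `echo` to run it:

```shell
#!/usr/bin/env sh
set -eu

# Build a minimal Illumina paired-end samplesheet (paths are placeholders).
cat > samplesheet.csv <<'EOF'
sample,fastq_1,fastq_2
sample1,/data/reads/sample1_R1.fastq.gz,/data/reads/sample1_R2.fastq.gz
EOF

# A typical run: Illumina platform, Singularity containers,
# Kraken2 speciation, and depth down-sampling skipped.
echo nextflow run main.nf \
  --input samplesheet.csv \
  --outdir results \
  --platform illumina \
  --run_kraken true \
  --skip_depth_sampling true \
  -profile singularity
```

The same pattern applies to Nanopore or hybrid runs — swap `--platform` and the samplesheet columns accordingly.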
diff --git a/docs/workflows/CleanAssemble.md b/docs/workflows/CleanAssemble.md index 57970d38..defd6e6c 100644 --- a/docs/workflows/CleanAssemble.md +++ b/docs/workflows/CleanAssemble.md @@ -1,38 +1,39 @@ -# Clean Assemble -## workflows/local/CleanAssemble - -## Included sub-workflows - -- `input_check.nf` -- `clean_reads.nf` -- `assemble_reads.nf` -- `hybrid_assembly.nf` -- `polish_assemblies.nf` - -## Steps -1. **QC reads** subworkflow steps in brief are listed below, for further information see (clean_reads.nf) - - Reads are checked for known sequencing contamination - - Quality metrics are calculated - - Reads are trimmed - - Coverage is estimated - - Sample is subsampled to set level (OPTIONAL) - - Read set is assessed to be either an isolate or metagenomic sample (from presence of multiple taxa) - -2. **Assemble reads** using the '' flag, read sets will be diverted to either the assemble_reads (short reads) or hybrid_assembly (short and/or long reads) workflow. Though the data is handled differently in eash subworklow, both generate a contigs file and a bandage image and have an option of initial polishing via Racon. See (assemble_reads.nf) and (hybrid_assembly.nf) subworkflow pages for more details. - -3. **Polish assembles** (OPTIONAL) Polishing of contigs can be added (polish_assemblies.nf). To make changes to the default workflow, see setting 'optional flags' page - -## Input -- Next generation sequencing reads: - + Short read - Illumina - + Long read: - * Nanopore - * Pacbio - -## Output -- quality trimmed and deconned reads (fastq) -- estimated genome size -- estimated heterozygozity -- assembled contigs (fasta) -- bandage image (png) -- software versions +# Clean Assemble +## workflows/local/CleanAssemble + +## Included sub-workflows + +- `assemble_reads.nf` +- `clean_reads.nf` +- `hybrid_assembly.nf` +- `input_check.nf` +- `polish_assemblies.nf` + + +## Steps +1. 
**[QC reads](subworkflows/clean_reads)** subworkflow steps are listed in brief below; for further information see [clean_reads.nf](subworkflows/local/clean_reads.nf) + - Reads are checked for known sequencing contamination + - Quality metrics are calculated + - Reads are trimmed + - Coverage is estimated + - Read set is subsampled to a set level (OPTIONAL) + - Read set is assessed to be either an isolate or metagenomic sample (from presence of multiple taxa) + +2. **[Assemble reads](/subworkflows/assemble_reads)** using the `params.platform` flag, read sets will be diverted to either the assemble_reads (short reads) or hybrid_assembly (short and/or long reads) workflow. Though the data is handled differently in each subworkflow, both generate a contigs file and a bandage image, with an option of initial polishing via Racon. See [assemble_reads.nf](subworkflows/local/assemble_reads.nf) and [hybrid_assembly.nf](subworkflows/local/hybrid_assembly.nf) subworkflow pages for more details. + +3. **[Polish assemblies](/subworkflows/polish_assemblies)** (OPTIONAL) Polishing of contigs can be added via [polish_assemblies.nf](subworkflows/local/polish_assemblies.nf). To make changes to the default workflow, see the 'optional flags' page. + +## Input +- Next-generation sequencing reads: + + Short read - Illumina + + Long read: + * Nanopore + * Pacbio + +## Output +- quality trimmed and deconned reads (fastq) +- estimated genome size +- estimated heterozygosity +- assembled contigs (fasta) +- bandage image (png) +- software versions diff --git a/docs/workflows/PostAssembly.md index e05dd8aa..fd884c66 100644 --- a/docs/workflows/PostAssembly.md +++ b/docs/workflows/PostAssembly.md @@ -1,30 +1,31 @@ -# Post assembly -## workflows/local/PostAssembly - -This workflow is triggered if only assemblies are input to the pipeline and is triggered after the `CleanAssemble.nf` workflow.
Here Quast, CheckM, species determination (Using Kraken2 or Mash), annotation and subtyping are all performed. - -## Included sub-workflows - -- `qc_assemblies.nf` -- `determine_species.nf` -- `split_metagenomic.nf` -- `subtype_genome.nf` - -## Steps -1. **Determine type** - a. Isolate: proceeds to step 2. - b. Metagenomic: runs the following two modules before proceeding to step 2. - i. Kraken - ii. Bin contigs -2. **QC Assemblies** (OPTIONAL) -3. **Determine species** (OPTIONAL) -4. **Subtype genome** (OPTIONAL) -5. **Annotate genome** (OPTIONAL) -6. Multiqc? - -## Input -- Contig file (fasta) - -## Output -- Tab delimited file containing collated results from all subworkflows -- JSON file containing output of workflow outputs +# Post assembly +## workflows/local/PostAssembly + +This workflow is triggered in two ways: 1. when assemblies are used for initial input to the pipeline; and 2. after the `CleanAssemble.nf` workflow completes. Within this workflow, Quast, CheckM, species determination (using Kraken2 or Mash), annotation and subtyping are all performed. + +## Included sub-workflows + +- `annotate_genomes.nf` +- `determine_species.nf` +- `polish_assemblies.nf` +- `qc_assemblies.nf` +- `split_metagenomic.nf` +- `subtype_genome.nf` + +## Steps +1. **Determine type**: using the `metagenomic_samples` flag, this workflow directs assemblies down one of two paths: + a. Isolate: proceeds to step 2. + b. Metagenomic: runs the following two modules before proceeding to step 2. + i. [kraken2.nf](modules/local/kraken.nf) runs Kraken2 on contigs + ii. [bin_kraken2.nf](modules/local/bin_kraken2.nf) bins contigs into their respective genus-level taxa +2. **[QC assemblies](/subworkflows/qc_assembly)** (OPTIONAL) runs Quast and assigns quality metrics to the generated assemblies +3. **[Determine species](/subworkflows/determine_species)** (OPTIONAL) runs a classifier tool (default: [Mash](https://github.com/marbl/Mash)) to determine the sample or binned species +4.
**[Subtype genome](/subworkflows/subtype_genome)** (OPTIONAL) species-specific subtyping tools are launched using the generated Mash screen report. +5. **[Annotate genome](/subworkflows/genomes_annotate)** (OPTIONAL) tools for annotation and identification of genes of interest are launched as a part of this step. + +## Input +- Contig file (fasta) + +## Output +- Tab-delimited file containing collated results from all subworkflows +- JSON file containing the collated workflow outputs diff --git a/mkdocs.yml index b4045cf8..508492d9 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -3,12 +3,17 @@ theme: name: material features: - navigation.tabs + - navigation.tabs.sticky - navigation.sections - - toc.integrate + - navigation.expand - navigation.top + - toc.integrate - search.suggest - search.highlight - content.tabs.link - content.code.annotation - content.code.copy language: en +plugins: + - search + - awesome-page