Commit
Merge pull request #125 from phac-nml/rasusa
Rasusa
mattheww95 authored Oct 3, 2024
2 parents 645ad7f + 667b67d commit 7a3231a
Showing 14 changed files with 1,487 additions and 81 deletions.
2 changes: 2 additions & 0 deletions .wordlist.txt
@@ -169,3 +169,5 @@ dependentRequired
errorMessage
Samplesheet
TSeemann's
RASUSA
downsampling
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## Unreleased

### `Changed`

- Added RASUSA for downsampling of Nanopore or PacBio data. [PR 125](https://github.com/phac-nml/mikrokondo/pull/125)

### `Updated`

- Documentation and the workflow diagram have been updated. [PR 123](https://github.com/phac-nml/mikrokondo/pull/123)
14 changes: 14 additions & 0 deletions conf/modules.config
@@ -332,6 +332,20 @@ process {
]
}

withName: RASUSA {
    ext.args = ""
    ext.parameters = params.rasusa
    publishDir = [
        [
            path: { [ "${task.read_downsampled_directory_name}", "Rasusa" ].join(File.separator) },
            mode: params.publish_dir_mode,
            pattern: "*${params.rasusa.reads_ext}",
            saveAs: { filename -> filename.equals('versions.yml') ? null : reformat_output(filename, "reads", "rasusa.sample", meta) }
        ]
    ]
}


withName: SEQTK_SIZE {
ext.args = ""
98 changes: 49 additions & 49 deletions docs/subworkflows/clean_reads.md
@@ -1,49 +1,49 @@
# Read Quality Control

## subworkflows/local/clean_reads

## Steps
1. **Reads are decontaminated** using [minimap2](https://github.com/lh3/minimap2), against a 'sequencing off-target' index. This index contains:
- Reads associated with Humans (de-hosting)
- Known sequencing controls (phiX)

2. **Read quality filtering and trimming** is performed using [FastP](https://github.com/OpenGene/fastp)
- Currently no adapters are specified within FastP when it is run and auto-detection is used.
- FastP parameters can be altered within the [nextflow.config](https://github.com/phac-nml/mikrokondo/blob/main/nextflow.config) file.
- Long read data is also run through FastP for gathering of summary data, however long read (un-paired reads) trimming is not performed and only summary metrics are generated. [Chopper](https://github.com/wdecoster/chopper) is currently integrated in MikroKondo but it has been removed from this workflow due to a lack of interest in quality trimming of long read data. It may be reintroduced in the future upon request.

3. **Genome size estimation** is performed using [Mash](https://github.com/marbl/Mash) Sketch of reads and estimated genome size is output.

4. **Read downsampling** (OPTIONAL) an estimated depth threshold can be specified to downsample large read sets. This step can be used to improve genome assembly quality, and can be found in other assembly pipelines such as [Shovill](https://github.com/tseemann/shovill). To disable downsampling, add `--skip_depth_sampling true` to your command line.
    - Depth is estimated using the estimated genome size output from [Mash](https://github.com/marbl/Mash)
    - Total base pairs are taken from [FastP](https://github.com/OpenGene/fastp)
    - Read downsampling is then performed using [Seqtk](https://github.com/lh3/seqtk) (Illumina) or [Rasusa](https://github.com/mbhall88/rasusa) (Nanopore or PacBio).

5. **Metagenomic assessment** is performed using a custom [Mash](https://github.com/marbl/Mash) 'sketch' file generated from the Genome Taxonomy Database [GTDB](https://gtdb.ecogenomic.org/) and the mash_screen module. This step assesses how many bacterial genera are present in a sample with greater than 90% identity according to Mash (e.g. a contaminated or metagenomic sample may have more than one genus of bacteria present). When more than one taxon is present, the metagenomic tag is set, turning on metagenomic assembly in later steps. Additionally, Kraken2 is run on metagenomic assemblies and contigs are binned at a defined taxonomic level (default level: genus).

6. **Nanopore ID screening** duplicate Nanopore read IDs have been known to cause issues downstream in the pipeline. To bypass this issue, an option can be toggled where a script reads in the Nanopore reads and appends a unique ID to each header. This process can be slow, so the default is to skip it: `--skip_ont_header_cleaning true`.
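The depth check in step 4 can be sketched as follows (a minimal sketch with hypothetical values; in the pipeline, total base pairs come from FastP and the genome size estimate from Mash):

```shell
# Estimated depth = total bases / estimated genome size.
# When the estimate exceeds the target depth, the fraction of
# reads to keep is target depth / estimated depth.
total_bases=500000000   # hypothetical value, reported by FastP
genome_size=5000000     # hypothetical value, estimated by Mash
target_depth=50

est_depth=$(( total_bases / genome_size ))
if [ "$est_depth" -le "$target_depth" ]; then
    fraction=1   # already at or below the target; no downsampling
else
    fraction=$(awk -v t="$target_depth" -v d="$est_depth" 'BEGIN { print t / d }')
fi
echo "$fraction"   # 0.5 for the values above
```

The resulting fraction is what gets handed to Seqtk or Rasusa in the downsampling step.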

## Input
- Next generation sequencing reads:
+ Short read - Illumina
+ Long read:
* Nanopore
* PacBio
- User submitted sample sheet


## Outputs
- Reads
- FinalReads
- SAMPLE
- Processing
- Dehosting
- Trimmed
- FastP
- Seqtk
- MashSketches
- Quality
- RawReadQuality
- Trimmed
- FastP
- MashScreen
27 changes: 18 additions & 9 deletions docs/usage/tool_params.md
@@ -15,15 +15,15 @@ Screens contigs for antimicrobial and virulence genes. If you wish to use a diff
- singularity: Abricate singularity container
- docker: Abricate docker container
- **args**: Can be a string of additional command line arguments to pass to abricate
- report_tag: determines the name of the Abricate output in the final summary file. **Do not change this unless doing pipeline development.**
- header_p: This field tells the report module that the Abricate output contains headers. **Do not change this unless doing pipeline development.**

### Raw Read Metrics
A custom Python script that gathers quality metrics for each fastq file.

- raw_reads
- high_precision: When set to true, floating point values in the output are reported to very small decimal places. It is recommended to leave this setting as false (use the standard floats); it is much faster, and such precise decimal places are not needed for this module.
- report_tag: this field determines the name of the Raw Read Metric field in the final summary report. **Do not change this unless doing pipeline development.**

### Coreutils
In cases where a process uses bash scripting only, Nextflow will by default use system binaries when they are available and no container is specified. For reproducibility, we have chosen to use containers in such cases. When a better container is available, you can direct the pipeline to use it via the commands below:
@@ -47,21 +47,30 @@ Kat was previously used to estimate genome size, however at the time of writing
Seqtk is used both for sub-sampling reads and for converting fasta files to fastq files in mikrokondo. Converting a fasta to a fastq is needed by certain typing tools that require reads as input (a design decision that keeps the pipeline generalizable).

- seqtk
- singularity: Singularity container for seqtk
- docker: Docker container for seqtk
- seed: A seed value for sub-sampling
- reads_ext: Extension of reads after sub-sampling. Do not change this unless doing pipeline development.
- assembly_fastq: Extension of the fastas after being converted to fastq files. Do not change this unless doing pipeline development.
- report_tag: Name of seqtk data in the final summary report. Do not change this unless doing pipeline development.
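For Illumina data, a resolved Seqtk invocation might look like the following (a sketch only; file names and the fraction are hypothetical, and the command strings are assembled rather than executed here). Using the same seed for both mates keeps read pairs in sync:

```shell
seed=42        # the seed value from the seqtk parameters
fraction=0.5   # computed from estimated depth vs. target depth

# seqtk sample prints the subsampled reads to stdout;
# identical seeds on R1 and R2 preserve read pairing
cmd_r1="seqtk sample -s ${seed} reads_1.fastq.gz ${fraction}"
cmd_r2="seqtk sample -s ${seed} reads_2.fastq.gz ${fraction}"
echo "$cmd_r1"
```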

### Rasusa
For long-read data, Rasusa is used for downsampling, as it takes read length into consideration when downsampling.

- rasusa
- singularity: Singularity container for Rasusa
- docker: Docker container for Rasusa
- seed: A seed value for sub-sampling
- reads_ext: The extension of the generated fastq files. Do not change this unless doing pipeline development.

### FastP
FastP is a fast and widely used program for gathering read quality metrics, adapter trimming, read filtering, and read trimming. FastP has extensive configuration options, which are detailed in its documentation, but sensible defaults have been set. **Adapter trimming in FastP is performed using overlap analysis; however, if you do not trust this, you can specify the sequencing adapters used directly in the additional arguments for FastP**.

- fastp
- singularity: Singularity container for FastP
- docker: Docker container for FastP
- fastq_ext: Extension of the output FastP-trimmed reads. Do not change this unless doing pipeline development.
- html_ext: Extension of the HTML report output by FastP. Do not change this unless doing pipeline development.
- json_ext: Extension of the JSON report output by FastP. Do not change this unless doing pipeline development.
- report_tag: Title of FastP data in the summary report.
- **average_quality_e**: If a read/read-pair quality is less than this value it is discarded. Can be set from the command line with `--fp_average_quality`.
26 changes: 26 additions & 0 deletions modules/local/rasusa.nf
@@ -0,0 +1,26 @@
/*
Downsample long reads with Rasusa
*/

process RASUSA {
    tag "$meta.id"
    label 'process_low'
    container "${workflow.containerEngine == 'singularity' || workflow.containerEngine == 'apptainer' ? task.ext.parameters.get('singularity') : task.ext.parameters.get('docker')}"

    input:
    tuple val(meta), path(reads), val(sample_fraction)

    output:
    tuple val(meta), path("*${params.rasusa.reads_ext}"), val(sample_fraction), emit: sampled_reads
    path "versions.yml", emit: versions

    script:
    def prefix = task.ext.prefix ?: "${meta.id}"
    """
    rasusa reads -f ${sample_fraction} -s ${params.rasusa.seed} -O g -o ${prefix}${params.rasusa.reads_ext} ${reads.join(" ")}
    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        rasusa: \$(rasusa --version 2>&1 | sed -e "s/rasusa //g")
    END_VERSIONS
    """
}
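With the defaults shipped in `nextflow.config`, the command this process assembles for a hypothetical long-read sample would resolve to something like the following (a sketch only; the sample name, input file, and fraction are made up, and the command string is built but rasusa is not executed here):

```shell
prefix="sample1"                 # hypothetical sample ID (meta.id)
reads_ext=".sampled.fastq.gz"    # default reads extension from the config
seed=42                          # default seed from the config
fraction=0.5                     # computed sampling fraction

# -f: fraction of reads to keep, -s: seed, -O g: gzip-compressed output
cmd="rasusa reads -f ${fraction} -s ${seed} -O g -o ${prefix}${reads_ext} reads.fastq.gz"
echo "$cmd"
```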
1 change: 0 additions & 1 deletion modules/local/remove_contaminants.nf
@@ -57,7 +57,6 @@ process REMOVE_CONTAMINANTS {
reads_out = "-1 ${meta.id}.R1.${params.r_contaminants.samtools_output_suffix}${params.r_contaminants.samtools_output_ext} -2 ${meta.id}.R2.${params.r_contaminants.samtools_output_suffix}${params.r_contaminants.samtools_output_ext} -s ${meta.id}${params.r_contaminants.samtools_singletons_ext}"
}
def zip_singletons = singled_ended ? "" : "gzip *${params.r_contaminants.samtools_singletons_ext}"
// TODO currently using a megaindex, but there may be a better way

// -f4 in samtool view filters out unmapped reads
// -N added to add /1 and /2 to reads with the same name
2 changes: 1 addition & 1 deletion modules/local/seqtk_sample.nf
@@ -10,7 +10,7 @@ process SEQTK_SAMPLE{
tuple val(meta), path(reads), val(sample_fraction)

output:
// Outputting sample_fraction to match the cardinality of the non-sampled read set; ideally it would not be passed through the process.
tuple val(meta), path("*${params.seqtk.reads_ext}"), val(sample_fraction), emit: sampled_reads
path "versions.yml", emit: versions

7 changes: 7 additions & 0 deletions nextflow.config
@@ -197,6 +197,13 @@ params {
docker = "biocontainers/python:3.12"
}

rasusa {
    docker = "biocontainers/rasusa:2.1.0--h715e4b3_0"
    singularity = "https://depot.galaxyproject.org/singularity/rasusa:2.1.0--h715e4b3_0"
    reads_ext = ".sampled.fastq.gz"
    seed = 42
}

seqtk {
singularity = 'https://depot.galaxyproject.org/singularity/seqtk%3A1.4--he4a0461_1'
docker = 'biocontainers/seqtk:1.4--he4a0461_1'