Add Prodigal #240

Merged
merged 9 commits into from
Oct 29, 2021
Changes from 6 commits
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### `Added`

- [#240](https://github.com/nf-core/mag/pull/240) - Add Prodigal to predict protein-coding genes for assemblies

### `Changed`

### `Fixed`
2 changes: 2 additions & 0 deletions CITATIONS.md
@@ -55,6 +55,8 @@

* [Porechop](https://github.com/rrwick/Porechop)

* [Prodigal](https://github.com/hyattpd/Prodigal)

* [SAMtools](https://doi.org/10.1093/bioinformatics/btp352)
> Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., … 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079. doi: 10.1093/bioinformatics/btp352.

1 change: 1 addition & 0 deletions README.md
@@ -33,6 +33,7 @@ The pipeline then:

* assigns taxonomy to reads using [Centrifuge](https://ccb.jhu.edu/software/centrifuge/) and/or [Kraken2](https://github.com/DerrickWood/kraken2/wiki)
* performs assembly using [MEGAHIT](https://github.com/voutcn/megahit) and [SPAdes](http://cab.spbu.ru/software/spades/), and checks their quality using [Quast](http://quast.sourceforge.net/quast)
* predicts protein-coding genes for the assemblies using [Prodigal](https://github.com/hyattpd/Prodigal)
* performs metagenome binning using [MetaBAT2](https://bitbucket.org/berkeleylab/metabat/src/master/), and checks the quality of the genome bins using [Busco](https://busco.ezlab.org/)
* assigns taxonomy to bins using [GTDB-Tk](https://github.com/Ecogenomics/GTDBTk) and/or [CAT](https://github.com/dutilh/CAT)

4 changes: 4 additions & 0 deletions conf/modules.config
@@ -161,5 +161,9 @@ params {
        'multiqc' {
            args = ""
        }
        prodigal {
            publish_dir   = "Prodigal"
            output_format = "gff"
        }
    }
}
16 changes: 16 additions & 0 deletions docs/output.md
@@ -13,6 +13,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
* [Quality control](#quality-control) of input reads - trimming and contaminant removal
* [Taxonomic classification of trimmed reads](#taxonomic-classification-of-trimmed-reads)
* [Assembly](#assembly) of trimmed reads
* [Protein-coding gene prediction](#gene-prediction) of assemblies
* [Binning](#binning) of assembled contigs
* [Taxonomic classification of binned genomes](#taxonomic-classification-of-binned-genomes)
* [Additional summary for binned genomes](#additional-summary-for-binned-genomes)
@@ -214,6 +215,21 @@ SPAdesHybrid is a part of the [SPAdes](http://cab.spbu.ru/software/spades/) soft

</details>

## Gene prediction

Protein-coding genes are predicted for each assembly using [Prodigal](https://github.com/hyattpd/Prodigal).

<details markdown="1">
<summary>Output files</summary>

* `Prodigal/`
    * `[sample/group].gff`: Gene coordinates in GFF format
    * `[sample/group].faa`: Protein translations of all predicted genes from all input sequences, in multi-FASTA format
    * `[sample/group].fna`: Nucleotide sequences of the predicted genes, written in the DNA alphabet rather than mRNA (so you will see 'T' in the output and not 'U')
    * `[sample/group]_all.txt`: Starts file listing all potential genes together with their scores

</details>
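
As a rough illustration of how the GFF output listed above might be consumed downstream, the Python sketch below counts predicted CDS features per contig. The function name `count_genes_per_contig` and the example records are hypothetical and not part of the pipeline. The sketch assumes standard nine-column, tab-separated GFF lines with `CDS` in the feature-type column.

```python
from collections import Counter


def count_genes_per_contig(gff_lines):
    """Count CDS features per sequence ID in an iterable of GFF lines."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        # Skip blank lines and header/comment lines
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A GFF feature line has exactly nine tab-separated columns
        if len(fields) != 9:
            continue
        seqid, feature_type = fields[0], fields[2]
        if feature_type == "CDS":
            counts[seqid] += 1
    return dict(counts)


# Made-up records for illustration only (not real Prodigal output)
EXAMPLE_GFF = [
    "##gff-version  3",
    "contig_1\tProdigal_v2.6.3\tCDS\t3\t98\t5.1\t+\t0\tID=1_1",
    "contig_1\tProdigal_v2.6.3\tCDS\t150\t560\t12.4\t-\t0\tID=1_2",
    "contig_2\tProdigal_v2.6.3\tCDS\t10\t400\t8.9\t+\t0\tID=2_1",
]

if __name__ == "__main__":
    print(count_genes_per_contig(EXAMPLE_GFF))
```

In practice the lines would come from an open file handle on `Prodigal/[sample/group].gff` rather than from an in-memory list.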

## Binning

### Contig sequencing depth
3 changes: 3 additions & 0 deletions modules.json
@@ -8,6 +8,9 @@
            },
            "fastqc": {
                "git_sha": "e937c7950af70930d1f34bb961403d9d2aa81c7d"
            },
            "prodigal": {
                "git_sha": "49da8642876ae4d91128168cd0db4f1c858d7792"
            }
        }
    }
78 changes: 78 additions & 0 deletions modules/nf-core/modules/prodigal/functions.nf
@@ -0,0 +1,78 @@
//
//  Utility functions used in nf-core DSL2 module files
//

//
// Extract name of software tool from process name using $task.process
//
def getSoftwareName(task_process) {
    return task_process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()
}

//
// Extract name of module from process name using $task.process
//
def getProcessName(task_process) {
    return task_process.tokenize(':')[-1]
}

//
// Function to initialise default values and to generate a Groovy Map of available options for nf-core modules
//
def initOptions(Map args) {
    def Map options = [:]
    options.args            = args.args ?: ''
    options.args2           = args.args2 ?: ''
    options.args3           = args.args3 ?: ''
    options.publish_by_meta = args.publish_by_meta ?: []
    options.publish_dir     = args.publish_dir ?: ''
    options.publish_files   = args.publish_files
    options.suffix          = args.suffix ?: ''
    return options
}

//
// Tidy up and join elements of a list to return a path string
//
def getPathFromList(path_list) {
    def paths = path_list.findAll { item -> !item?.trim().isEmpty() }      // Remove empty entries
    paths     = paths.collect { it.trim().replaceAll("^[/]+|[/]+\$", "") } // Trim whitespace and leading/trailing slashes
    return paths.join('/')
}

//
// Function to save/publish module results
//
def saveFiles(Map args) {
    def ioptions  = initOptions(args.options)
    def path_list = [ ioptions.publish_dir ?: args.publish_dir ]

    // Do not publish versions.yml unless running from pytest workflow
    if (args.filename.equals('versions.yml') && !System.getenv("NF_CORE_MODULES_TEST")) {
        return null
    }
    if (ioptions.publish_by_meta) {
        def key_list = ioptions.publish_by_meta instanceof List ? ioptions.publish_by_meta : args.publish_by_meta
        for (key in key_list) {
            if (args.meta && key instanceof String) {
                def path = key
                if (args.meta.containsKey(key)) {
                    path = args.meta[key] instanceof Boolean ? "${key}_${args.meta[key]}".toString() : args.meta[key]
                }
                path = path instanceof String ? path : ''
                path_list.add(path)
            }
        }
    }
    if (ioptions.publish_files instanceof Map) {
        for (ext in ioptions.publish_files) {
            if (args.filename.endsWith(ext.key)) {
                def ext_list = path_list.collect()
                ext_list.add(ext.value)
                return "${getPathFromList(ext_list)}/$args.filename"
            }
        }
    } else if (ioptions.publish_files == null) {
        return "${getPathFromList(path_list)}/$args.filename"
    }
}
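
To make the publish-path logic in `getPathFromList` and `saveFiles` easier to follow, here is a loose Python transliteration. It is only a sketch, not the Groovy code itself: the names `get_path_from_list` and `save_path` are hypothetical, and the `versions.yml` check and the `publish_files` branch are deliberately left out.

```python
import re


def get_path_from_list(path_list):
    """Drop blank entries, trim whitespace and outer slashes, join with '/'."""
    paths = [p for p in path_list if p.strip()]
    paths = [re.sub(r"^/+|/+$", "", p.strip()) for p in paths]
    return "/".join(paths)


def save_path(filename, publish_dir, meta=None, publish_by_meta=()):
    """Simplified view of how a published file path is assembled.

    Assumption: only the publish_dir and publish_by_meta handling is
    modelled; versions.yml and publish_files are ignored.
    """
    path_list = [publish_dir]
    if meta:
        for key in publish_by_meta:
            # Fall back to the key itself when meta has no such entry
            value = meta.get(key, key)
            if isinstance(value, bool):
                value = f"{key}_{str(value).lower()}"
            # Non-string values contribute no path component
            path_list.append(value if isinstance(value, str) else "")
    return f"{get_path_from_list(path_list)}/{filename}"


if __name__ == "__main__":
    # With publish_dir = "Prodigal" and no publish_by_meta option set
    print(save_path("sample1.gff", "Prodigal"))
    # With a publish_by_meta key, a per-sample subdirectory is added
    print(save_path("sample1.gff", "Prodigal", {"id": "sample1"}, ["id"]))
```

Under these assumptions, a `publish_dir` of `"Prodigal"` with no `publish_by_meta` option yields paths of the form `Prodigal/<filename>`, which matches the layout described in `docs/output.md`.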
48 changes: 48 additions & 0 deletions modules/nf-core/modules/prodigal/main.nf
@@ -0,0 +1,48 @@
// Import generic module functions
include { initOptions; saveFiles; getSoftwareName; getProcessName } from './functions'

params.options = [:]
options        = initOptions(params.options)

process PRODIGAL {
    tag "$meta.id"
    label 'process_low'
    publishDir "${params.outdir}",
        mode: params.publish_dir_mode,
        saveAs: { filename -> saveFiles(filename:filename, options:params.options, publish_dir:getSoftwareName(task.process), meta:meta, publish_by_meta:['id']) }

    conda (params.enable_conda ? "bioconda::prodigal=2.6.3" : null)
    if (workflow.containerEngine == 'singularity' && !params.singularity_pull_docker_container) {
        container "https://depot.galaxyproject.org/singularity/prodigal:2.6.3--h516909a_2"
    } else {
        container "quay.io/biocontainers/prodigal:2.6.3--h516909a_2"
    }

    input:
    tuple val(meta), path(genome)
    val(output_format)

    output:
    tuple val(meta), path("${prefix}.${output_format}"), emit: gene_annotations
    tuple val(meta), path("${prefix}.fna"), emit: nucleotide_fasta
    tuple val(meta), path("${prefix}.faa"), emit: amino_acid_fasta
    tuple val(meta), path("${prefix}_all.txt"), emit: all_gene_annotations
    path "versions.yml" , emit: versions

    script:
    prefix = options.suffix ? "${meta.id}${options.suffix}" : "${meta.id}"
    """
    prodigal -i "${genome}" \\
        $options.args \\
        -f $output_format \\
        -d "${prefix}.fna" \\
        -o "${prefix}.${output_format}" \\
        -a "${prefix}.faa" \\
        -s "${prefix}_all.txt"

    cat <<-END_VERSIONS > versions.yml
    ${getProcessName(task.process)}:
        ${getSoftwareName(task.process)}: \$(prodigal -v 2>&1 | sed -n 's/Prodigal V\\(.*\\):.*/\\1/p')
    END_VERSIONS
    """
}
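
The version capture in the `versions.yml` heredoc relies on a `sed` substitution. The Python sketch below mimics that substitution to show what it extracts. The helper names, the sample banner string and the process name `NFCORE_MAG:MAG:PRODIGAL` are assumptions made for illustration; the real `prodigal -v` output and the fully qualified process name may differ.

```python
import re


def extract_prodigal_version(banner):
    """Mimic: sed -n 's/Prodigal V\\(.*\\):.*/\\1/p' (first matching line only)."""
    for line in banner.splitlines():
        replaced, n_subs = re.subn(r"Prodigal V(.*):.*", r"\1", line, count=1)
        if n_subs:
            return replaced
    return None


def versions_yml(task_process, banner):
    """Build text shaped like versions.yml (indent width is an assumption)."""
    process_name = task_process.split(":")[-1]             # cf. getProcessName
    software_name = process_name.split("_")[0].lower()     # cf. getSoftwareName
    version = extract_prodigal_version(banner)
    return f"{process_name}:\n    {software_name}: {version}\n"


# Assumed banner text, used here only as sample input
SAMPLE_BANNER = "\nProdigal V2.6.3: February, 2016\n\n"

if __name__ == "__main__":
    print(versions_yml("NFCORE_MAG:MAG:PRODIGAL", SAMPLE_BANNER), end="")
```

Unlike `sed -n ... p`, which prints every matching line, this sketch returns only the first match.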
41 changes: 41 additions & 0 deletions modules/nf-core/modules/prodigal/meta.yml
@@ -0,0 +1,41 @@
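name: prodigal
description: Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm) is a microbial (bacterial and archaeal) gene finding program
keywords:
  - gene finding
tools:
  - prodigal:
      description: Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm) is a microbial (bacterial and archaeal) gene finding program
      homepage: {}
      documentation: {}
      tool_dev_url: {}
      doi: ""
      licence: ["GPL v3"]

input:
  - meta:
      type: map
      description: |
        Groovy Map containing sample information
        e.g. [ id:'test', single_end:false ]
  - genome:
      type: file
      description: Genome or assembly sequences passed to Prodigal with `-i`
  - output_format:
      type: string
      description: Output format passed to Prodigal with `-f`, e.g. "gff"

output:
  - meta:
      type: map
      description: |
        Groovy Map containing sample information
        e.g. [ id:'test', single_end:false ]
  - versions:
      type: file
      description: File containing software versions
      pattern: "versions.yml"
  - gene_annotations:
      type: file
      description: Gene annotations; the file extension matches `output_format`
  - nucleotide_fasta:
      type: file
      description: Nucleotide sequences of the predicted genes
      pattern: "*.fna"
  - amino_acid_fasta:
      type: file
      description: Protein translations of the predicted genes
      pattern: "*.faa"
  - all_gene_annotations:
      type: file
      description: All potential genes with their scores
      pattern: "*_all.txt"

authors:
  - "@grst"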
13 changes: 13 additions & 0 deletions workflows/mag.nf
@@ -107,6 +107,7 @@ include { GTDBTK } from '../subworkflows/local/gtdbtk'
include { FASTQC as FASTQC_RAW } from '../modules/nf-core/modules/fastqc/main' addParams( options: modules['fastqc_raw'] )
include { FASTQC as FASTQC_TRIMMED } from '../modules/nf-core/modules/fastqc/main' addParams( options: modules['fastqc_trimmed'] )
include { FASTP } from '../modules/nf-core/modules/fastp/main' addParams( options: modules['fastp'] )
include { PRODIGAL } from '../modules/nf-core/modules/prodigal/main' addParams( options: modules['prodigal'] )

////////////////////////////////////////////////////
/* -- Create channel for reference databases -- */
@@ -466,6 +467,18 @@ workflow MAG {
        ch_software_versions = ch_software_versions.mix(QUAST.out.version.first().ifEmpty(null))
    }

    /*
    ================================================================================
                                    Predict proteins
    ================================================================================
    */

    PRODIGAL (
        ch_assemblies,
        modules['prodigal']['output_format']
    )
    ch_software_versions = ch_software_versions.mix(PRODIGAL.out.versions.first().ifEmpty(null))

    /*
    ================================================================================
                                    Binning