Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update code for busco4.1.3 and add a few improvements #104

Closed
wants to merge 7 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,12 +15,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Add MetaBAT2 RNG seed parameter `--metabat_rng_seed` and set the default to 1 which ensures reproducible binning results
- Add parameters `--megahit_fix_cpu_1`, `--spades_fix_cpus` and `--spadeshybrid_fix_cpus` to ensure reproducible results from assembly tools
- Add `nextflow_schema.json`
- Add parameter `--save_busco_reference` [#104](https://github.com/nf-core/mag/pull/104)

### `Changed`

- Update to new nf-core 1.10.2 `TEMPLATE` [#88](https://github.com/nf-core/mag/pull/88)
- `--reads` is now removed, use `--input` instead [#88](https://github.com/nf-core/mag/pull/88)
- Prevented PhiX alignments from being stored in work directory [#101](https://github.com/nf-core/mag/pull/101)
- Update BUSCO from v3.0.2 to v4.1.3 [#104](https://github.com/nf-core/mag/pull/104)

### `Fixed`

Expand Down
8 changes: 8 additions & 0 deletions conf/base.config
Original file line number Diff line number Diff line change
Expand Up @@ -111,6 +111,14 @@ process {
memory = { check_max (20.GB * task.attempt, 'memory' ) }
time = { check_max (8.h * task.attempt, 'time' ) }
}
withName: busco {
container = 'nfcore/magbusco:dev'
conda = "$baseDir/containers/busco/environment.yml"
}
withName: busco_plot {
container = 'nfcore/magbusco:dev'
conda = "$baseDir/containers/busco/environment.yml"
}
withName:get_software_versions {
cache = false
}
Expand Down
21 changes: 21 additions & 0 deletions containers/busco/Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
FROM nfcore/base:1.10.2
LABEL authors="Hadrien Gourlé <[email protected]>, Daniel Straub <[email protected]>, Sabrina Krakau <[email protected]>" \
description="Docker image containing BUSCO requirements for the nf-core/mag pipeline"

# Install the conda environment
COPY environment.yml /
RUN conda env create --quiet -f /environment.yml && conda clean -a

# For BUSCO to avoid segfault error with R 4.0.2
RUN apt-get update
RUN apt-get install -y libxt6

# Add conda installation dir to PATH (instead of doing 'conda activate')
ENV PATH /opt/conda/envs/nf-core-mag-busco-1.1.0dev/bin:$PATH

# Dump the details of the installed packages to a file for posterity
RUN conda env export --name nf-core-mag-busco-1.1.0dev > nf-core-mag-busco-1.1.0dev.yml

# Instruct R processes to use these empty files instead of clashing with a local version
RUN touch .Rprofile
RUN touch .Renviron
21 changes: 21 additions & 0 deletions containers/busco/environment.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
# You can use this file to create a conda environment for this pipeline:
# conda env create -f environment.yml
name: nf-core-mag-busco-1.1.0dev
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- conda-forge::python=3.7.3
- bioconda::prodigal=2.6.3 # fix versions of third-party tools used by busco to avoid problems
- bioconda::augustus=3.3.3
- bioconda::diamond=2.0.1
- bioconda::hmmer=3.1b2
- bioconda::blast=2.10.1
- bioconda::sepp=4.3.10
- conda-forge::cairo=1.16.0 # to avoid segfault error with R 4.0.2
- conda-forge::r-rgtk2=2.20.36 # same as above
- conda-forge::fonts-conda-ecosystem=1 # to avoid missing fonts in R (https://github.com/conda-forge/r-base-feedstock/issues/91)
- conda-forge::r-base=4.0.2 # busco requires R >= 4
- conda-forge::r-ggplot2=3.3.2
- bioconda::busco=4.1.3
6 changes: 6 additions & 0 deletions docs/output.md
Original file line number Diff line number Diff line change
Expand Up @@ -225,6 +225,12 @@ Files in these two folders contain all contigs of an assembly.
* `[assembler]-[bin]_busco.fna`: Nucleotide sequence of all identified BUSCOs
* `[assembler]-[bin]_busco.faa`: Aminoacid sequence of all identified BUSCOs

If the parameter `--save_busco_reference` is set the used BUSCO reference is stored.

**Output directory: `GenomeBinning/QC/BUSCO/reference`**

* `*.tar.gz`: BUSCO reference file

**Output directory: `GenomeBinning/QC`**

* `busco_summary.txt`: A summary table of the BUSCO results, with % of marker genes found
Expand Down
2 changes: 2 additions & 0 deletions docs/usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,8 @@ You can fix this by using the prameter `--megahit_fix_cpu_1`. In both cases, do

MetaBAT2 is run by default with a fixed seed within this pipeline, thus producing reproducible results.

To allow also reproducible bin QC with BUSCO, the parameter `--save_busco_reference` can be used to store the reference database. This may be useful since BUSCO datasets are frequently updated and old versions do not always remain accessible.

## Core Nextflow arguments

> **NB:** These options are part of Nextflow and use a _single_ hyphen (pipeline parameters use a double-hyphen).
Expand Down
3 changes: 1 addition & 2 deletions environment.yml
Original file line number Diff line number Diff line change
Expand Up @@ -19,13 +19,12 @@ dependencies:
- bioconda::bowtie2=2.3.5
- bioconda::quast=5.0.2
- bioconda::prodigal=2.6.3
- bioconda::diamond=0.9.24 # 0.9.25 conflicts with Busco over boost
- bioconda::diamond=0.9.24 # 0.9.25 conflicted with Busco over boost
- conda-forge::r=3.6
- conda-forge::biopython=1.74
- bioconda::krona=2.7.1
- conda-forge::r-markdown=1.0
- conda-forge::r-ggplot2=3.2.0
- bioconda::busco=3.0.2
- bioconda::nanoplot=1.26.3
- bioconda::filtlong=0.2.0
- bioconda::porechop=0.2.3_seqan2.1.1
Expand Down
193 changes: 108 additions & 85 deletions main.nf
Original file line number Diff line number Diff line change
Expand Up @@ -86,7 +86,8 @@ def helpMessage() {
Bin quality check:
--skip_busco [bool] Disable bin QC with BUSCO (default: false)
--busco_reference [file] Download path for BUSCO database, available databases are listed here: https://busco.ezlab.org/
(default: https://busco-archive.ezlab.org/v3/datasets/bacteria_odb9.tar.gz)
(default: https://busco-data.ezlab.org/v4/data/lineages/bacteria_odb10.2020-03-06.tar.gz)
--save_busco_reference [bool] Save BUSCO reference. Useful to allow reproducibility, as BUSCO datasets are frequently updated and old versions do not always remain accessible.

Reproducibility options:
--megahit_fix_cpu_1 [bool] Fix number of CPUs for MEGAHIT to 1. Not increased with retries (default: false)
Expand Down Expand Up @@ -1132,12 +1133,15 @@ process metabat {

process busco_download_db {
tag "${database.baseName}"
publishDir "${params.outdir}/GenomeBinning/QC/BUSCO/", mode: params.publish_dir_mode,
saveAs: {filename -> (params.save_busco_reference && filename.indexOf(".tar.gz") > 0) ? "reference/$filename" : null}

input:
file(database) from file_busco_db

output:
set val("${database.toString().replace(".tar.gz", "")}"), file("buscodb/*") into busco_db
file("buscodb/*") into busco_db
file(database)

script:
"""
Expand All @@ -1155,114 +1159,133 @@ metabat_bins
* BUSCO: Quantitative measures for the assessment of genome assembly
*/
process busco {
tag "${assembly}"
tag "${bin}"
publishDir "${params.outdir}/GenomeBinning/QC/BUSCO/", mode: params.publish_dir_mode

input:
set val(assembler), val(sample), file(assembly), val(db_name), file(db) from metabat_db_busco
set val(assembler), val(sample), file(bin), file(db) from metabat_db_busco

output:
file("short_summary_${assembly}.txt") into (busco_summary_to_multiqc, busco_summary_to_plot)
val("$assembler-$sample") into busco_assembler_sample_to_plot
file("${assembly}_busco.log")
file("${assembly}_buscos.faa")
file("${assembly}_buscos.fna")
set val(assembler), val(sample), file("short_summary.specific.*.${bin}.txt") into (ch_busco_multiqc, ch_busco_to_summary, ch_busco_plot)
file("${bin}_busco.log")
file("${bin}_buscos.faa") optional true
file("${bin}_buscos.fna") optional true

script:
if( workflow.profile.toString().indexOf("conda") == -1) {
"""
if( workflow.profile.toString().indexOf("conda") == -1)
cp_augustus_config = "Y"
else
cp_augustus_config = "N"

"""
# get path to custom config file for busco (already configured during conda installation)
busco_path="\$(which busco)"
config_file="\${busco_path%bin/busco}share/busco/config.ini"

# ensure augustus has write access to config directory
if [ ${cp_augustus_config} = "Y" ] ; then
cp -r /opt/conda/pkgs/augustus*/config augustus_config/
export AUGUSTUS_CONFIG_PATH=augustus_config

run_BUSCO.py \
--in ${assembly} \
--lineage_path $db_name \
--cpu "${task.cpus}" \
--blast_single_core \
--mode genome \
--out ${assembly} \
>${assembly}_busco.log
cp run_${assembly}/short_summary_${assembly}.txt short_summary_${assembly}.txt

for f in run_${assembly}/single_copy_busco_sequences/*faa; do
[ -e "\$f" ] && cat run_${assembly}/single_copy_busco_sequences/*faa >${assembly}_buscos.faa || touch ${assembly}_buscos.faa
break
done
for f in run_${assembly}/single_copy_busco_sequences/*fna; do
[ -e "\$f" ] && cat run_${assembly}/single_copy_busco_sequences/*fna >${assembly}_buscos.fna || touch ${assembly}_buscos.fna
break
done
"""
} else {
"""
run_BUSCO.py \
--in ${assembly} \
--lineage_path $db_name \
--cpu "${task.cpus}" \
--blast_single_core \
--mode genome \
--out ${assembly} \
>${assembly}_busco.log
cp run_${assembly}/short_summary_${assembly}.txt short_summary_${assembly}.txt

for f in run_${assembly}/single_copy_busco_sequences/*faa; do
[ -e "\$f" ] && cat run_${assembly}/single_copy_busco_sequences/*faa >${assembly}_buscos.faa || touch ${assembly}_buscos.faa
break
done
for f in run_${assembly}/single_copy_busco_sequences/*fna; do
[ -e "\$f" ] && cat run_${assembly}/single_copy_busco_sequences/*fna >${assembly}_buscos.fna || touch ${assembly}_buscos.fna
break
done
"""
}
fi

# place db in extra folder to ensure BUSCO recognizes it as path (instead of downloading it)
mkdir dataset
mv ${db} dataset/

busco --lineage_dataset dataset/${db} \
--mode genome \
--in ${bin} \
--config \${config_file} \
--cpu "${task.cpus}" \
--out "BUSCO" > ${bin}_busco.log

# get used db name
# (set nullgob: if pattern matches no files, expand to a null string rather than to itself)
shopt -s nullglob
summaries=(BUSCO/short_summary.specific.*.BUSCO.txt)
if [ \${#summaries[@]} -ne 1 ]; then
echo "ERROR: none or multiple 'BUSCO/short_summary.specific.*.BUSCO.txt' files found. Expected one."
exit 1
fi
[[ \$summaries =~ BUSCO/short_summary.specific.(.*).BUSCO.txt ]];
db_name="\${BASH_REMATCH[1]}"
echo "Used database: \${db_name}"

cp BUSCO/short_summary.specific.\${db_name}.BUSCO.txt short_summary.specific.\${db_name}.${bin}.txt

for f in BUSCO/run_\${db_name}/busco_sequences/single_copy_busco_sequences/*faa; do
cat BUSCO/run_\${db_name}/busco_sequences/single_copy_busco_sequences/*faa >${bin}_buscos.faa
break
done
for f in BUSCO/run_\${db_name}/busco_sequences/single_copy_busco_sequences/*fna; do
cat BUSCO/run_\${db_name}/busco_sequences/single_copy_busco_sequences/*fna >${bin}_buscos.fna
break
done
"""
}

// preprare channels for downstream processes
ch_busco_multiqc = ch_busco_multiqc.map{it[2]}
ch_busco_to_summary = ch_busco_to_summary.map{it[2]}

process busco_plot {
publishDir "${params.outdir}/GenomeBinning/QC/", mode: params.publish_dir_mode
// group by assembler and sample for plotting
ch_busco_plot = ch_busco_plot.groupTuple(by: [0,1])

process busco_plot {
tag "$assembler-$sample"
publishDir "${params.outdir}/GenomeBinning/QC/BUSCO/", mode: params.publish_dir_mode

input:
file(summaries) from busco_summary_to_plot.collect()
val(assemblersample) from busco_assembler_sample_to_plot.collect()
set val(assembler), val(sample), file(summaries) from ch_busco_plot

output:
file("*busco_figure.png")
file("BUSCO/*busco_figure.R")
file("BUSCO/*busco_summary.txt")
file("busco_summary.txt") into busco_summary
file("${assembler}-${sample}-busco_figure.png")
file("${assembler}-${sample}-busco_figure.R")
file("${assembler}-${sample}-busco_summary.txt")

script:
def assemblersampleunique = assemblersample.unique()
def name = "${assembler}-${sample}"
"""
#for each assembler and sample:
assemblersample=\$(echo \"$assemblersampleunique\" | sed 's/[][]//g')
IFS=', ' read -r -a assemblersamples <<< \"\$assemblersample\"
# replace dots in bin names within summary file names by underscores
# currently (BUSCO v4.1.3) generate_plot.py does not allow further dots
for sum in ${summaries}; do
[[ \${sum} =~ short_summary.(.*).${name}.(.*).txt ]];
db_name=\${BASH_REMATCH[1]}
bin="${name}.\${BASH_REMATCH[2]}"
bin_new="\${bin//./_}"
mv \${sum} short_summary.\${db_name}.\${bin_new}.txt
done
generate_plot.py --working_directory .

mkdir BUSCO
mv busco_figure.png ${assembler}-${sample}-busco_figure.png
mv busco_figure.R ${assembler}-${sample}-busco_figure.R

for name in \"\${assemblersamples[@]}\"; do
mkdir \${name}
cp short_summary_\${name}* \${name}/
generate_plot.py --working_directory \${name}

cp \${name}/busco_figure.png \${name}-busco_figure.png
cp \${name}/busco_figure.R \${name}-busco_figure.R
summary_busco.py short_summary.*.txt > ${assembler}-${sample}-busco_summary.txt
"""
}

summary_busco.py \${name}/short_summary_*.txt >BUSCO/\${name}-busco_summary.txt
done
process busco_summary {
publishDir "${params.outdir}/GenomeBinning/QC/", mode: params.publish_dir_mode

input:
file("short_summary.*.txt") from ch_busco_to_summary.collect()

cp *-busco_figure.R BUSCO/
output:
file("busco_summary.txt") into busco_summary

summary_busco.py short_summary_*.txt >busco_summary.txt
script:
"""
summary_busco.py short_summary.*.txt > busco_summary.txt
"""
}


process quast_bins {
tag "$assembler-$sample"
publishDir "${params.outdir}/GenomeBinning/QC/", mode: params.publish_dir_mode

input:
set val(assembler), val(sample), file(assembly) from metabat_bins_quast_bins
set val(assembler), val(sample), file(bins) from metabat_bins_quast_bins

output:
path("QUAST/*") type('dir')
Expand All @@ -1273,15 +1296,15 @@ process quast_bins {

script:
"""
ASSEMBLIES=\$(echo \"$assembly\" | sed 's/[][]//g')
IFS=', ' read -r -a assemblies <<< \"\$ASSEMBLIES\"
BINS=\$(echo \"$bins\" | sed 's/[][]//g')
IFS=', ' read -r -a bins <<< \"\$BINS\"

for assembly in \"\${assemblies[@]}\"; do
metaquast.py --threads "${task.cpus}" --max-ref-number 0 --rna-finding --gene-finding -l "\${assembly}" "\${assembly}" -o "QUAST/\${assembly}"
for bin in \"\${bins[@]}\"; do
metaquast.py --threads "${task.cpus}" --max-ref-number 0 --rna-finding --gene-finding -l "\${bin}" "\${bin}" -o "QUAST/\${bin}"
if ! [ -f "QUAST/${assembler}-${sample}-quast_summary.tsv" ]; then
cp "QUAST/\${assembly}/transposed_report.tsv" "QUAST/${assembler}-${sample}-quast_summary.tsv"
cp "QUAST/\${bin}/transposed_report.tsv" "QUAST/${assembler}-${sample}-quast_summary.tsv"
else
tail -n +2 "QUAST/\${assembly}/transposed_report.tsv" >> "QUAST/${assembler}-${sample}-quast_summary.tsv"
tail -n +2 "QUAST/\${bin}/transposed_report.tsv" >> "QUAST/${assembler}-${sample}-quast_summary.tsv"
fi
done
"""
Expand Down Expand Up @@ -1383,7 +1406,7 @@ process multiqc {
file (fastqc_trimmed:'fastqc/*') from fastqc_results_trimmed.collect().ifEmpty([])
file (host_removal) from ch_host_removed_log.collect().ifEmpty([])
file ('quast*/*') from quast_results.collect().ifEmpty([])
file (short_summary) from busco_summary_to_multiqc.collect().ifEmpty([])
file (short_summary) from ch_busco_multiqc.collect().ifEmpty([])
file ('software_versions/*') from ch_software_versions_yaml.collect()
file workflow_summary from ch_workflow_summary.collectFile(name: "workflow_summary_mqc.yaml")

Expand Down
Loading