Bin QC Improvements #707

Open
wants to merge 32 commits into
base: dev
53864d2
feat: Migrate from CheckM to CheckM2
dialvarezs Oct 27, 2024
ca2f97b
fix: Bring back CheckM
dialvarezs Oct 27, 2024
05e3393
fix: One more thing
dialvarezs Oct 27, 2024
3e49dbf
fix: Linting
dialvarezs Oct 27, 2024
b1b6518
fix: Option and checks in bin summary script
dialvarezs Oct 27, 2024
73b5794
Fix: missing import
dialvarezs Oct 27, 2024
325310b
docs: Output, changelog and citation for CheckM2
dialvarezs Oct 28, 2024
b88302a
docs: readme
dialvarezs Oct 28, 2024
d62a671
Merge remote-tracking branch 'upstream/dev' into dev-checkm2
dialvarezs Oct 28, 2024
3da7441
docs: Update readme
dialvarezs Oct 28, 2024
3d98415
refactor: Merge checkm and checkm2 subworkflows in a single one
dialvarezs Oct 28, 2024
8b4fcc7
fix: Restore mistakenly deleted code
dialvarezs Oct 28, 2024
76dac5c
Bin QC workflow
dialvarezs Oct 31, 2024
0ebab07
Merge remote-tracking branch 'upstream/dev' into dev-checkm2
dialvarezs Oct 31, 2024
38ce756
Cleanup
dialvarezs Oct 31, 2024
3702329
Final touches
dialvarezs Oct 31, 2024
0eb167a
Integrate GUNC in BIN_QC subworkflow
dialvarezs Nov 1, 2024
f0a6999
Code style improvements
dialvarezs Nov 1, 2024
7e21b62
Merge branch 'dev' into dev-checkm2
jfy133 Nov 25, 2024
3158c52
Address several review comments
dialvarezs Nov 27, 2024
6152696
Update modules
dialvarezs Nov 27, 2024
ef1828c
Make binqc_tool an input for BIN_SUMMARY
dialvarezs Nov 27, 2024
bcae7f9
Move bin qc database setup to the subworkflow
dialvarezs Nov 27, 2024
158ee12
Fix
dialvarezs Nov 27, 2024
5738215
Remove another it
dialvarezs Nov 27, 2024
49ebb71
Remove checkm/checkm2 ci tests
dialvarezs Nov 27, 2024
4ecd36f
gtdbtk: add check when params.busco_db is not defined
dialvarezs Nov 28, 2024
1cf1da7
Don't flatten bins for GUNC
dialvarezs Nov 28, 2024
ea9075f
Update changelog
dialvarezs Nov 28, 2024
a079596
Emit bash version on busco_save_download
dialvarezs Nov 28, 2024
80641b0
Fix declaration
dialvarezs Nov 28, 2024
4b084b1
Update checkm/qa module
dialvarezs Dec 1, 2024
33 changes: 0 additions & 33 deletions .github/workflows/ci.yml
@@ -130,36 +130,3 @@ jobs:
- name: Run pipeline with ${{ matrix.test_name }} test profile
run: |
nextflow run ${GITHUB_WORKSPACE} -profile ${{ matrix.test_name }},docker --outdir ./results

checkm:
name: Run single test to checkm due to database download
# Only run on push if this is the nf-core dev branch (merged PRs)
if: ${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/mag') }}
runs-on: ubuntu-latest

steps:
- name: Free some space
run: |
sudo rm -rf "/usr/local/share/boost"
sudo rm -rf "$AGENT_TOOLSDIRECTORY"

- name: Check out pipeline code
uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4

- name: Install Nextflow
run: |
wget -qO- get.nextflow.io | bash
sudo mv nextflow /usr/local/bin/

- name: Clean up Disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1

- name: Download and prepare CheckM database
run: |
mkdir -p databases/checkm
wget https://zenodo.org/records/7401545/files/checkm_data_2015_01_16.tar.gz -P databases/checkm
tar xzvf databases/checkm/checkm_data_2015_01_16.tar.gz -C databases/checkm/

- name: Run pipeline with ${{ matrix.profile }} test profile
run: |
nextflow run ${GITHUB_WORKSPACE} -profile test,docker --outdir ./results --binqc_tool checkm --checkm_db databases/checkm
7 changes: 6 additions & 1 deletion CHANGELOG.md
@@ -9,19 +9,24 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- [#692](https://github.com/nf-core/mag/pull/692) - Added Nanoq as optional longread filtering tool (added by @muabnezor)
- [#692](https://github.com/nf-core/mag/pull/692) - Added chopper as optional longread filtering tool and/or phage lambda removal tool (added by @muabnezor)
- [#707](https://github.com/nf-core/mag/pull/707) - Make Bin QC a subworkflow (added by @dialvarezs)
- [#707](https://github.com/nf-core/mag/pull/707) - Added CheckM2 as an alternative bin completeness and QC tool (added by @dialvarezs)
- [#708](https://github.com/nf-core/mag/pull/708) - Added `--exclude_unbins_from_postbinning` parameter to exclude unbinned contigs from post-binning processes, speeding up Prokka in some cases (added by @dialvarezs)

### `Changed`

### `Fixed`

- [#708](https://github.com/nf-core/mag/pull/708) - Fixed channel passed as GUNC input (added by @dialvarezs)
- [#707](https://github.com/nf-core/mag/pull/707) - Fixed channel passed as GUNC input (added by @dialvarezs)

### `Dependencies`

| Tool | Previous version | New version |
| ------- | ---------------- | ----------- |
| CheckM | 1.2.1 | 1.2.3 |
| CheckM2 | | 1.0.2 |
| chopper | | 0.9.0 |
| GUNC | 1.0.5 | 1.0.6 |
| nanoq | | 0.10.0 |

### `Deprecated`
4 changes: 4 additions & 0 deletions CITATIONS.md
@@ -40,6 +40,10 @@

> Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P., & Tyson, G. W. (2015). CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25(7), 1043–1055. doi: 10.1101/gr.186072.114

- [CheckM2](https://doi.org/10.1038/s41592-023-01940-w)

> Chklovski, A., Parks, D. H., Woodcroft, B. J., & Tyson, G. W. (2023). CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nature Methods, 20(8), 1203-1212. doi: 10.1038/s41592-023-01940-w

- [Chopper](https://doi.org/10.1093/bioinformatics/bty149)

> De Coster W, D'Hert S, Schultz DT, Cruts M, Van Broeckhoven C. NanoPack: visualizing and processing long-read sequencing data. Bioinformatics. 2018 Aug 1;34(15):2666-2669. doi: 10.1093/bioinformatics/bty149
3 changes: 2 additions & 1 deletion README.md
@@ -35,7 +35,7 @@ The pipeline then:
- performs assembly using [MEGAHIT](https://github.com/voutcn/megahit) and [SPAdes](http://cab.spbu.ru/software/spades/), and checks their quality using [Quast](http://quast.sourceforge.net/quast)
- (optionally) performs ancient DNA assembly validation using [PyDamage](https://github.com/maxibor/pydamage) and contig consensus sequence recalling with [Freebayes](https://github.com/freebayes/freebayes) and [BCFtools](http://samtools.github.io/bcftools/bcftools.html)
- predicts protein-coding genes for the assemblies using [Prodigal](https://github.com/hyattpd/Prodigal), and bins with [Prokka](https://github.com/tseemann/prokka) and optionally [MetaEuk](https://www.google.com/search?channel=fs&client=ubuntu-sn&q=MetaEuk)
- performs metagenome binning using [MetaBAT2](https://bitbucket.org/berkeleylab/metabat/src/master/), [MaxBin2](https://sourceforge.net/projects/maxbin2/), and/or with [CONCOCT](https://github.com/BinPro/CONCOCT), and checks the quality of the genome bins using [Busco](https://busco.ezlab.org/), or [CheckM](https://ecogenomics.github.io/CheckM/), and optionally [GUNC](https://grp-bork.embl-community.io/gunc/).
- performs metagenome binning using [MetaBAT2](https://bitbucket.org/berkeleylab/metabat/src/master/), [MaxBin2](https://sourceforge.net/projects/maxbin2/), and/or with [CONCOCT](https://github.com/BinPro/CONCOCT), and checks the quality of the genome bins using [Busco](https://busco.ezlab.org/), [CheckM](https://ecogenomics.github.io/CheckM/), or [CheckM2](https://github.com/chklovski/CheckM2) and optionally [GUNC](https://grp-bork.embl-community.io/gunc/).
- Performs ancient DNA validation and repair with [pyDamage](https://github.com/maxibor/pydamage) and [freebayes](https://github.com/freebayes/freebayes)
- optionally refines bins with [DAS Tool](https://github.com/cmks/DAS_Tool)
- assigns taxonomy to bins using [GTDB-Tk](https://github.com/Ecogenomics/GTDBTk) and/or [CAT](https://github.com/dutilh/CAT) and optionally identifies viruses in assemblies using [geNomad](https://github.com/apcamargo/genomad), or Eukaryotes with [Tiara](https://github.com/ibe-uw/tiara)
@@ -90,6 +90,7 @@ Other code contributors include:
- [Phil Palmer](https://github.com/PhilPalmer)
- [@willros](https://github.com/willros)
- [Adam Rosenbaum](https://github.com/muabnezor)
- [Diego Alvarez](https://github.com/dialvarezs)

Long read processing was inspired by [caspargross/HybridAssembly](https://github.com/caspargross/HybridAssembly) written by Caspar Gross [@caspargross](https://github.com/caspargross)

82 changes: 43 additions & 39 deletions bin/combine_tables.py
@@ -3,10 +3,9 @@
## Originally written by Daniel Straub and Sabrina Krakau and released under the MIT license.
## See git repository (https://github.com/nf-core/mag) for full license text.


import sys
import argparse
import os.path
import sys

import pandas as pd


@@ -19,19 +18,14 @@ def parse_args(args=None):
metavar="FILE",
help="Bin depths summary file.",
)
parser.add_argument("-b", "--binqc_summary", metavar="FILE", help="Bin QC summary file (BUSCO, CheckM or CheckM2).")
parser.add_argument("-q", "--quast_summary", metavar="FILE", help="QUAST BINS summary file.")
parser.add_argument("-g", "--gtdbtk_summary", metavar="FILE", help="GTDB-Tk summary file.")
parser.add_argument("-a", "--cat_summary", metavar="FILE", help="CAT table file.")
parser.add_argument(
"-b", "--busco_summary", metavar="FILE", help="BUSCO summary file."
)
parser.add_argument(
"-c", "--checkm_summary", metavar="FILE", help="CheckM summary file."
)
parser.add_argument(
"-q", "--quast_summary", metavar="FILE", help="QUAST BINS summary file."
)
parser.add_argument(
"-g", "--gtdbtk_summary", metavar="FILE", help="GTDB-Tk summary file."
"-t", "--binqc_tool", help="Bin QC tool used", choices=["busco", "checkm", "checkm2"]
)
parser.add_argument("-a", "--cat_summary", metavar="FILE", help="CAT table file.")

parser.add_argument(
"-o",
"--out",
@@ -81,9 +75,7 @@ def parse_cat_table(cat_table):
)
# merge all rank columns into a single column
df["CAT_rank"] = (
df.filter(regex="rank_\d+")
.apply(lambda x: ";".join(x.dropna()), axis=1)
.str.lstrip()
df.filter(regex=r"rank_\d+").apply(lambda x: ";".join(x.dropna()), axis=1).str.lstrip()
)
# remove rank_* columns
df.drop(df.filter(regex=r"rank_\d+").columns, axis=1, inplace=True)
@@ -95,39 +87,36 @@ def main(args=None):
args = parse_args(args)

if (
not args.busco_summary
and not args.checkm_summary
not args.binqc_summary
and not args.quast_summary
and not args.gtdbtk_summary
):
sys.exit(
"No summary specified! Please specify at least BUSCO, CheckM or QUAST summary."
"No summary specified! "
"Please specify at least BUSCO, CheckM, CheckM2 or QUAST summary."
)

# GTDB-Tk can only be run in combination with BUSCO or CheckM
if args.gtdbtk_summary and not (args.busco_summary or args.checkm_summary):
# GTDB-Tk can only be run in combination with BUSCO, CheckM or CheckM2
if args.gtdbtk_summary and not args.binqc_summary:
sys.exit(
"Invalid parameter combination: GTDB-TK summary specified, but no BUSCO or CheckM summary!"
"Invalid parameter combination: "
"GTDB-TK summary specified, but no BUSCO, CheckM or CheckM2 summary!"
)

# handle bin depths
results = pd.read_csv(args.depths_summary, sep="\t")
results.columns = [
"Depth " + str(col) if col != "bin" else col for col in results.columns
]
results.columns = ["Depth " + str(col) if col != "bin" else col for col in results.columns]
bins = results["bin"].sort_values().reset_index(drop=True)

if args.busco_summary:
busco_results = pd.read_csv(args.busco_summary, sep="\t")
if not bins.equals(
busco_results["GenomeBin"].sort_values().reset_index(drop=True)
):
if args.binqc_summary and args.binqc_tool == "busco":
busco_results = pd.read_csv(args.binqc_summary, sep="\t")
if not bins.equals(busco_results["GenomeBin"].sort_values().reset_index(drop=True)):
sys.exit("Bins in BUSCO summary do not match bins in bin depths summary!")
results = pd.merge(
results, busco_results, left_on="bin", right_on="GenomeBin", how="outer"
) # assuming depths for all bins are given

if args.checkm_summary:
if args.binqc_summary and args.binqc_tool == "checkm":
use_columns = [
"Bin Id",
"Marker lineage",
@@ -147,22 +136,37 @@ def main(args=None):
"4",
"5+",
]
checkm_results = pd.read_csv(args.checkm_summary, usecols=use_columns, sep="\t")
checkm_results = pd.read_csv(args.binqc_summary, usecols=use_columns, sep="\t")
checkm_results["Bin Id"] = checkm_results["Bin Id"] + ".fa"
if not bins.equals(
checkm_results["Bin Id"].sort_values().reset_index(drop=True)
):
if not bins.equals(checkm_results["Bin Id"].sort_values().reset_index(drop=True)):
sys.exit("Bins in CheckM summary do not match bins in bin depths summary!")
results = pd.merge(
results, checkm_results, left_on="bin", right_on="Bin Id", how="outer"
) # assuming depths for all bins are given
results["Bin Id"] = results["Bin Id"].str.removesuffix(".fa")

if args.binqc_summary and args.binqc_tool == "checkm2":
use_columns = [
"Name",
"Completeness",
"Contamination",
"Completeness_Model_Used",
"Coding_Density",
"Translation_Table_Used",
"Total_Coding_Sequences",
]
checkm2_results = pd.read_csv(args.binqc_summary, usecols=use_columns, sep="\t")
checkm2_results["Name"] = checkm2_results["Name"] + ".fa"
if not set(checkm2_results["Name"]).issubset(set(bins)):
sys.exit("Bins in CheckM2 summary do not match bins in bin depths summary!")
results = pd.merge(
results, checkm2_results, left_on="bin", right_on="Name", how="outer"
) # assuming depths for all bins are given
results["Name"] = results["Name"].str.removesuffix(".fa")

if args.quast_summary:
quast_results = pd.read_csv(args.quast_summary, sep="\t")
if not bins.equals(
quast_results["Assembly"].sort_values().reset_index(drop=True)
):
if not bins.equals(quast_results["Assembly"].sort_values().reset_index(drop=True)):
sys.exit("Bins in QUAST summary do not match bins in bin depths summary!")
results = pd.merge(
results, quast_results, left_on="bin", right_on="Assembly", how="outer"
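The CheckM2 branch above differs from the BUSCO, CheckM and QUAST branches in one respect: it only requires the CheckM2 bins to be a subset of the bins in the depths table, rather than an exact match. This is presumably because ignored CheckM2 failures can leave some bins without a row. The following is a dependency-free sketch of that behaviour, not the script itself (which uses pandas); the helper name `merge_checkm2` and the sample rows are hypothetical.

```python
# Sketch of the CheckM2 branch in combine_tables.py without pandas.
# `merge_checkm2` and the sample data are hypothetical illustrations.


def merge_checkm2(depth_rows, checkm2_rows):
    """Join CheckM2 rows onto depth rows, keyed by bin file name."""
    bins = {row["bin"] for row in depth_rows}
    # CheckM2 reports names without the FASTA extension, so add it back for matching.
    by_name = {row["Name"] + ".fa": row for row in checkm2_rows}
    # Subset check: every CheckM2 bin must be known, but a bin may lack a CheckM2 row.
    if not set(by_name).issubset(bins):
        raise SystemExit("Bins in CheckM2 summary do not match bins in bin depths summary!")
    merged = []
    for row in depth_rows:
        qc = by_name.get(row["bin"], {})
        # After the subset check, this left join gives the same rows as an outer join.
        merged.append(
            {
                **row,
                "Name": qc.get("Name"),
                "Completeness": qc.get("Completeness"),
                "Contamination": qc.get("Contamination"),
            }
        )
    return merged


depths = [
    {"bin": "bin.1.fa", "Depth sample1": 12.5},
    {"bin": "bin.2.fa", "Depth sample1": 3.1},
]
checkm2 = [{"Name": "bin.1", "Completeness": 97.2, "Contamination": 1.1}]
merged = merge_checkm2(depths, checkm2)
```

Bins without a CheckM2 result keep their depth columns and get empty QC fields, while a CheckM2 row for an unknown bin aborts the run.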
6 changes: 4 additions & 2 deletions conf/base.config
@@ -160,12 +160,14 @@ process {
cpus = { 8 * task.attempt }
memory = { 20.GB * task.attempt }
}

withName: MAXBIN2 {
errorStrategy = { task.exitStatus in [1, 255] ? 'ignore' : 'retry' }
}

withName: DASTOOL_DASTOOL {
errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : task.exitStatus == 1 ? 'ignore' : 'finish' }
}
//CheckM2 returns exit code 1 when Diamond doesn't find any hits
withName: CHECKM2_PREDICT {
errorStrategy = { task.exitStatus in (130..145) ? 'retry' : task.exitStatus == 1 ? 'ignore' : 'finish' }
}
}
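The `errorStrategy` closure added for `CHECKM2_PREDICT` encodes a three-way decision on the task exit status. As a rough illustration only (Nextflow evaluates the Groovy closure itself, and the function name here is hypothetical), the mapping could be sketched in Python as:

```python
# Hypothetical Python rendering of the CHECKM2_PREDICT errorStrategy closure.


def checkm2_error_strategy(exit_status: int) -> str:
    """Map a task exit status to the action named in conf/base.config."""
    # Groovy's (130..145) range is inclusive at both ends.
    if 130 <= exit_status <= 145:
        return "retry"
    # Exit code 1 is what CheckM2 returns when Diamond finds no hits.
    if exit_status == 1:
        return "ignore"
    return "finish"


actions = {status: checkm2_error_strategy(status) for status in (1, 2, 137, 145, 146)}
```

With `ignore`, a bin for which Diamond finds no hits does not stop the run; any other unexpected status lets running tasks finish and then ends the pipeline.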
34 changes: 30 additions & 4 deletions conf/modules.config
@@ -405,7 +405,11 @@ process {
withName: CHECKM_LINEAGEWF {
tag = { "${meta.assembler}-${meta.binner}-${meta.domain}-${meta.refinement}-${meta.id}" }
ext.prefix = { "${meta.assembler}-${meta.binner}-${meta.domain}-${meta.refinement}-${meta.id}_wf" }
publishDir = [path: { "${params.outdir}/GenomeBinning/QC/CheckM" }, mode: params.publish_dir_mode, saveAs: { filename -> filename.equals('versions.yml') ? null : filename }]
publishDir = [
path: { "${params.outdir}/GenomeBinning/QC/CheckM" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: CHECKM_QA {
@@ -418,9 +422,31 @@ process {
]
}

withName: COMBINE_CHECKM_TSV {
ext.prefix = { "checkm_summary" }
publishDir = [path: { "${params.outdir}/GenomeBinning/QC" }, mode: params.publish_dir_mode, saveAs: { filename -> filename.equals('versions.yml') ? null : filename }]
withName: COMBINE_BINQC_TSV {
ext.prefix = { "${params.binqc_tool}_summary" }
publishDir = [
path: { "${params.outdir}/GenomeBinning/QC" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: CHECKM2_DATABASEDOWNLOAD {
publishDir = [
path: { "${params.outdir}/GenomeBinning/QC/CheckM2/checkm2_downloads" },
mode: params.publish_dir_mode, overwrite: false,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
enabled: params.save_checkm2_data
]
}

withName: CHECKM2_PREDICT {
ext.prefix = { "${meta.assembler}-${meta.binner}-${meta.domain}-${meta.refinement}-${meta.id}" }
publishDir = [
path: { "${params.outdir}/GenomeBinning/QC/CheckM2" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: GUNC_DOWNLOADDB {
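Two small rules in the `COMBINE_BINQC_TSV` block determine what ends up in `GenomeBinning/QC`: the file prefix follows the selected tool, and `versions.yml` is never published. A minimal Python sketch of both rules follows; the function names are hypothetical, and the `.tsv` suffix is an assumption taken from the file names listed in `docs/output.md`.

```python
# Hypothetical sketch of the ext.prefix and saveAs rules for COMBINE_BINQC_TSV.

BINQC_TOOLS = ("busco", "checkm", "checkm2")  # choices accepted by --binqc_tool


def summary_filename(binqc_tool: str) -> str:
    """Build the summary file name from the tool, mirroring "${params.binqc_tool}_summary"."""
    if binqc_tool not in BINQC_TOOLS:
        raise ValueError(f"Unknown bin QC tool: {binqc_tool}")
    # Assumption: the module appends ".tsv" to ext.prefix.
    return f"{binqc_tool}_summary.tsv"


def save_as(filename: str):
    """Return None to skip publishing versions.yml, otherwise keep the name."""
    return None if filename == "versions.yml" else filename


published = [name for name in ("checkm2_summary.tsv", "versions.yml") if save_as(name)]
```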
30 changes: 28 additions & 2 deletions docs/output.md
@@ -554,7 +554,7 @@ Besides the reference files or output files created by BUSCO, the following summ

#### CheckM

[CheckM](https://ecogenomics.github.io/CheckM/) CheckM provides a set of tools for assessing the quality of genomes recovered from isolates, single cells, or metagenomes. It provides robust estimates of genome completeness and contamination by using collocated sets of genes that are ubiquitous and single-copy within a phylogenetic lineage
[CheckM](https://ecogenomics.github.io/CheckM/) provides a set of tools for assessing the quality of genomes recovered from isolates, single cells, or metagenomes. It provides robust estimates of genome completeness and contamination by using collocated sets of genes that are ubiquitous and single-copy within a phylogenetic lineage.

By default, nf-core/mag runs CheckM with the `check_lineage` workflow, which places genome bins on a reference tree to define lineage-marker sets and checks completeness and contamination based on lineage-specific marker genes. It then runs `qa` to generate the summary files.

@@ -564,7 +564,8 @@ By default, nf-core/mag runs CheckM with the `check_lineage` workflow that place
- `GenomeBinning/QC/CheckM/`
- `[assembler]-[binner]-[domain]-[refinement]-[sample/group]_qa.txt`: Detailed statistics about bins informing completeness and contamination scores (output of `checkm qa`). This should normally be your main file to use to evaluate your results.
- `[assembler]-[binner]-[domain]-[refinement]-[sample/group]_wf.tsv`: Overall summary file for completeness and contamination (output of `checkm lineage_wf`).
- `[assembler]-[binner]-[domain]-[refinement]-[sample/group]/`: intermediate files for CheckM results, including CheckM generated annotations, log, lineage markers etc.
- `[assembler]-[binner]-[domain]-[refinement]-[sample/group]/`: Intermediate files for CheckM results, including CheckM generated annotations, log, lineage markers etc.
- `GenomeBinning/QC/`
- `checkm_summary.tsv`: A summary table of the CheckM results for all bins (output of `checkm qa`).

</details>
@@ -580,6 +581,31 @@ If the parameter `--save_checkm_reference` is set, additionally the used the Che

</details>

#### CheckM2

[CheckM2](https://github.com/chklovski/CheckM2) is a tool for assessing the quality of metagenome-derived genomes. It uses a machine learning approach to predict the completeness and contamination of a genome regardless of its taxonomic lineage.

<details markdown="1">
<summary>Output files</summary>

- `GenomeBinning/QC/CheckM2/`
- `[assembler]-[binner]-[domain]-[refinement]-[sample/group]/quality_report.tsv`: Detailed statistics about bins informing completeness and contamination scores. This should normally be your main file to use to evaluate your results.
- `[assembler]-[binner]-[domain]-[refinement]-[sample/group]/`: Intermediate files for CheckM2 results, including CheckM2 generated annotations, log, and DIAMOND alignment results.
- `GenomeBinning/QC/`
- `checkm2_summary.tsv`: A summary table of the CheckM2 results for all bins.

</details>

If the parameter `--save_checkm2_data` is set, the CheckM2 reference datasets will be stored in the output directory.

<details markdown="1">
<summary>Output files</summary>

- `GenomeBinning/QC/CheckM2/`
- `checkm2_downloads/CheckM2_database/*.dmnd`: Diamond database used by CheckM2.

</details>
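As an illustration of how the CheckM2 summary might be used downstream, the sketch below filters bins by completeness and contamination using only the Python standard library. The column names `Name`, `Completeness` and `Contamination` are the ones read by `bin/combine_tables.py`; the function name, the thresholds and the sample rows are hypothetical and should be adapted to your own quality criteria.

```python
# Hypothetical helper for filtering a CheckM2 summary table (tab-separated).
import csv
import io


def select_bins(tsv_handle, min_completeness=90.0, max_contamination=5.0):
    """Return the names of bins passing both (assumed) thresholds."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return [
        row["Name"]
        for row in reader
        if float(row["Completeness"]) >= min_completeness
        and float(row["Contamination"]) <= max_contamination
    ]


# In-memory stand-in for a checkm2_summary.tsv file; values are made up.
example = io.StringIO(
    "Name\tCompleteness\tContamination\n"
    "bin.1\t97.2\t1.1\n"
    "bin.2\t54.0\t0.3\n"
    "bin.3\t95.0\t12.4\n"
)
selected = select_bins(example)
```

To run this on real output, pass an open file handle for `checkm2_summary.tsv` instead of the in-memory example.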

#### GUNC

[Genome UNClutterer (GUNC)](https://grp-bork.embl-community.io/gunc/index.html) is a tool for detection of chimerism and contamination in prokaryotic genomes resulting from mis-binning of genomic contigs from unrelated lineages. It does so by applying an entropy-based score on taxonomic assignment and contig location of all genes in a genome. It is generally considered an additional complement to CheckM results.