Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for CheckM as alternative to BUSCO #350

Merged
merged 26 commits into from
Jan 17, 2023
Merged

Add support for CheckM as alternative to BUSCO #350

merged 26 commits into from
Jan 17, 2023

Conversation

jfy133
Copy link
Member

@jfy133 jfy133 commented Nov 4, 2022

Note this does not include the integration with the bin_summary reports etc., I will need somone else to contribute that as it reuqires python script updates (which I'm not really comfortble with)

Closes #326

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
    • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
    • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@github-actions
Copy link

github-actions bot commented Nov 4, 2022

This PR is against the master branch ❌

  • Do not close this PR
  • Click Edit and change the base to dev
  • This CI test will remain failed until you push a new commit

Hi @jfy133,

It looks like this pull-request is has been made against the nf-core/mag master branch.
The master branch on nf-core repositories should always contain code from the latest release.
Because of this, PRs to master are only allowed if they come from the nf-core/mag dev branch.

You do not need to close this PR, you can change the target branch to dev by clicking the "Edit" button at the top of this page.
Note that even after this, the test will continue to show as failing until you push a new commit.

Thanks again for your contribution!

@jfy133 jfy133 changed the base branch from master to dev November 4, 2022 08:37
@github-actions
Copy link

github-actions bot commented Nov 4, 2022

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit d5b66a1

+| ✅ 156 tests passed       |+
#| ❔   1 tests were ignored |#
!| ❗   1 tests had warnings |!

❗ Test warnings:

  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your prefered methods description, e.g. add publication citation for this pipeline

❔ Tests ignored:

  • files_unchanged - File ignored due to lint config: lib/NfcoreTemplate.groovy

✅ Tests passed:

Run details

  • nf-core/tools version 2.7.2
  • Run at 2023-01-17 08:50:21

@skrakau skrakau self-assigned this Nov 18, 2022
CHANGELOG.md Outdated
@@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### `Changed`

- [#340](https://github.com/nf-core/mag/pull/340) - Update to nf-core 2.6.1 `TEMPLATE`
- [#350](https://github.com/nf-core/mag/pull/350) - Adds support for CheckM as alternative bin completeness and QC tool (added by @jfy133).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a new convention how to state individual contributions? One could also think about changing the contributions section in the README...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly, but if you tag people in the README, on release they get shown as contributors to that specific release which is a nice acknowledgment of the specific work they do (rather than just the general 'catch all' on the README. I did this on the last 2 MAG releases too:

https://github.com/nf-core/mag/releases/tag/2.2.1
https://github.com/nf-core/mag/releases/tag/2.2.0

But I'm happy to remove it if you don' tlike it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I just thought it would make sense to do this consistently

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can also do that! Tell me which way you prefer and I'll update accordingly :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you do this in other pipelines as well? I don't mind, we can keep it like this

Copy link
Member Author

@jfy133 jfy133 Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I don't mind either way. I'll suggest then mention each person for each PR in the CHANGELOG (then it makes it easier to just paste the changelog straight into the release notes and it'll already be there, rather than us having to go back and check each PR afterwards), but ultimately this is your pipeline so you can choose ;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I was thinking again about this issue and if I should add my name now or not, and in the end I would somehow prefer not to start this (at least as long it's not nf-core wide like this), because then everyone who is not doing this would be excluded.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, well actually in a later PR (from yday), I've gone back through all the merged PRs for the upcoming release and I've added everyone:

#350

So we won't have missed anyone now. Note that these tags only reflect what is displayed in each release (not the contribtuors thorughout the entire project as a whole, which is still displayed here).

Things such as this are not necessary to be nf-core wide to be fair. But as this displaying of the contributors in the release stuff is pretty new anyway, I was sort of thinking maybe nf-core/mag could be the 'forerunner' of a nice practise ;).

But if you prefer, I could also do a poll on slack to ask if people think we should do it nf-core wide or not?

I sort of like it because it again makes it much more visible peoples contributors, and will help encourage people to keep contributing in the long run. But it's up to you! I can also remove all the tags from PR #350 again if you prefer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you mean #380, sorry didn't see it.

A poll would be nice though, then we could see how to do it consistently. Even if it doesn't necessarily has to be consistent, it just easier to stick to a certain way across multiple pipelines.
So don't remove it for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skrakau
Copy link
Member

skrakau commented Nov 18, 2022

I will have a more thorough look here and also look at the python scripts asap :)

@skrakau
Copy link
Member

skrakau commented Nov 30, 2022

Hi @jfy133 and @d4straub, do you have any opinion about which of the checkm metrics should end up in the bin_summary?
e.g. Bin Id Marker lineage # genomes # markers # marker sets Completeness Contamination Strain heterogeneity or more?

Ok, maybe the question is more: the output of checkm lineage_wf or checkm qa?

Edit: will just use all checkm qa results, if you don't object

@jfy133
Copy link
Member Author

jfy133 commented Nov 30, 2022

@alexhbnr and @maxibor might also have thoughts on this

@jfy133
Copy link
Member Author

jfy133 commented Nov 30, 2022

But I believe Alex only actually uses output of qa

@alexhbnr
Copy link
Contributor

alexhbnr commented Nov 30, 2022

Yes, I usually first run the wf command to get the output and use the qa command with the additional parameters -o 2 --tab_table to get a nicely formatted TSV table. The latter is required to merge the output with the tool GUNC, which provides yet another approach to estimate the contamination of a MAG.

@d4straub
Copy link
Collaborator

I actually have no preference for CheckM, because I am not used to it. Usually the more the merrier ;)

@jfy133
Copy link
Member Author

jfy133 commented Dec 1, 2022

@alexhbnr what are your preferred columns?

@alexhbnr
Copy link
Contributor

alexhbnr commented Dec 1, 2022

Here is the list that I usually would have a look at:

  • Marker lineage
  • Completeness
  • Contamination
  • Strain heterogeneity
  • Genome size (bp)
  • "# contigs"
  • N50 (contigs)
  • GC
  • Coding density

@skrakau
Copy link
Member

skrakau commented Dec 2, 2022

  • I added a checkm_summary.tsv file including all bins, as for BUSCO and QUAST
  • add using the CheckM results also for the filtering step prior running GTDB-Tk
  • and included the CheckM QA results into bin_summary.tsv (currently all columns)

@jfy133 maybe you can have a look if this seems alright to you.

However, the bin_summary.tsv needs to be adjusted. So far the original column names are used, but QUAST and CheckM generate redundant results, with overlapping column names (e.g. "# contigs", -> becomes "# contigs_x", "# contigs_y"). Also this might get quite confusing if there are multiple N50* columns from different tools, but maybe one can expect that the user knows the exact column names from the different tools. Any preferences how to handle this?

@jfy133
Copy link
Member Author

jfy133 commented Dec 2, 2022

  • Table looks OK to me based on my minimal experience (I'll attach an example below, @alexhbnr , looks OK to you?)
  • bin_summary.tsv: I think keeping just QUAST would be most reliable. It would then give you consistent results regardless of the QC tool; and if you wanted to compare you can always look specifically at CheckM (/BUSCO, even if it doesn't produce that info) results. I don't think it makes sense to add loads of extra redundant columns as it makes it harder to evaluate.

checkm_summary.zip

@github-actions
Copy link

Python linting (black) is failing

To keep the code consistent with lots of contributors, we run automated code consistency checks.
To fix this CI test, please run:

  • Install black: pip install black
  • Fix formatting errors in your pipeline: black .

Once you push these changes the test should pass, and you can hide this comment 👍

We highly recommend setting up Black in your code editor so that this formatting is done automatically on save. Ask about it on Slack for help!

Thanks again for your contribution!

@jfy133 jfy133 requested a review from skrakau January 12, 2023 14:57
@skrakau
Copy link
Member

skrakau commented Jan 13, 2023

maybe worth mentioning: the CheckM results "Bin Id" does not contain a ".fa" extension and I didn't change it, while for the other tools and summaries the name for the bin does contain it.

docs/usage.md Outdated
Comment on lines 321 to 323
It is _highly_ recommended to pass utilise the `--checkm_db` flag to pass pre-downloaded and uncompressed directory of required CheckM references.

If you do not use this, this will result in the tool detecting the database is missing, and download the files for EVERY CheckM process that is executed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be the case anymore, right?

@jfy133 jfy133 requested a review from skrakau January 13, 2023 14:54
@jfy133
Copy link
Member Author

jfy133 commented Jan 13, 2023

ts "Bin Id" does not contain a ".fa" extension and I didn't change it, while for the other tools and summaries the name for the bin does contain it.

Do you mean inside the output table?

Otherwise I should have adressed all your other comments @skrakau :)

@jfy133
Copy link
Member Author

jfy133 commented Jan 13, 2023

Ok for some reason the Black python linting error is now coming up even though I didn't touch that ... when I black locally nothing changes... could you try running it and see if it fixes it for you @skrakau ?

@skrakau
Copy link
Member

skrakau commented Jan 17, 2023

ts "Bin Id" does not contain a ".fa" extension and I didn't change it, while for the other tools and summaries the name for the bin does contain it.

Do you mean inside the output table?

Yes

Copy link
Member

@skrakau skrakau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :)

@jfy133 jfy133 merged commit 36f879d into dev Jan 17, 2023
@jfy133 jfy133 deleted the checkm branch January 17, 2023 10:48
@jfy133 jfy133 mentioned this pull request Feb 27, 2023
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants