vgp assembly: flowcharts for meryl and purge_dups #4863

Merged
merged 3 commits into from
May 21, 2024

abueg
Contributor

@abueg abueg commented Mar 25, 2024

hello! 👋🏼

added some new graphics for the meryl section (trying to clarify how we're working on a collection for batch jobs -> merging the outputs), and the purge_dups section

@abueg abueg requested a review from a team as a code owner March 25, 2024 19:08
Delphine-L
Delphine-L previously approved these changes Apr 1, 2024
Contributor

@Delphine-L Delphine-L left a comment

The graphics look great! I suggested different formulations in a couple of places; I think they could improve readability.

@@ -324,6 +324,10 @@ Meryl will allow us to generate the *k*-mer profile by decomposing the sequencin
>
{: .comment}

In order to do genome profile analysis, first we need the *k*-mer spectrum of the raw reads, which should hopefully contain information about the genome that you sequenced. These *k*-mers are used to build a histogram (the *k*-mer spectrum), and then the GenomeScope model that fits that data will help infer genome characteristics. To count *k*-mers, we first count them on the separate FASTA files, before merging the counts and generating a histogram based on that. This is a way of parallelizing our work.
Contributor

Suggestion: In order to identify the key metrics of the genome (its profile), we start by generating a histogram of the k-mer distribution in the raw reads (the k-mer spectrum). Then, GenomeScope creates a model fitting the spectrum that allows us to estimate the genome characteristics. We work in parallel on each set of raw reads, creating a database of k-mer counts for each set, then merge all the databases by summing the counts of each k-mer to build the histogram.

Contributor Author

incorporated this wording with some edits, does this sound ok? let me know if not !!
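
For readers of this thread, the count-then-merge idea discussed here can be sketched in plain Python. This is only an illustration under simplifying assumptions, not how Meryl is implemented: the function names (`count_kmers`, `merge_counts`, `kmer_spectrum`) and the toy reads are hypothetical, and the sketch ignores canonical (strand-collapsed) k-mers.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer in one set of reads (one 'database' per read set)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def merge_counts(databases):
    """Merge per-set databases by summing the counts of identical k-mers."""
    merged = Counter()
    for db in databases:
        merged.update(db)  # Counter.update adds counts rather than replacing them
    return merged


def kmer_spectrum(merged):
    """Histogram: multiplicity -> number of distinct k-mers with that multiplicity."""
    return Counter(merged.values())


# Hypothetical toy input: two separate read sets, as in a collection of FASTA files.
read_sets = [["ACGTAC", "CGTACG"], ["ACGTAC"]]

# Each read set is counted independently (this is the step that can run in parallel).
databases = [count_kmers(reads, k=3) for reads in read_sets]

# The per-set databases are then merged, and the histogram is built from the merged counts.
merged = merge_counts(databases)
spectrum = kmer_spectrum(merged)
```

Because merging sums the counts, the result is the same as counting all reads in a single pass, which is what makes it safe to split the work across files.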


![Post-processing with purge_dups](../../images/vgp_assembly/purge_dupspipeline.png "Purge_dups pipeline. Adapted from github.com/dfguan/purge_dups. Purge_dups is integrated in a multi-step pipeline consisting in three main substages. Red indicates the steps which require to use Minimap2.")

The way purging is incorporated in the VGP pipeline, first the **primary assembly** is purged, resulting in a clean (purged) primary assembly and a set of contigs that were *removed* from those contigs. These will often contain haplotigs representing alternate alleles. We would like to keep that in the alternate assembly, so the next step is adding (concatenating) this file to the original alternate assembly. This file then undergoes purging as well, to remove any junk or overlaps.
Contributor

Suggestion: Purging may be used in the VGP pipeline when there are suspicions of false duplications (Figure 1). In such cases, we start by purging the primary assembly, resulting in a clean (purged) primary assembly and a set of contigs that were removed from it. These removed contigs will often contain haplotigs representing alternate alleles. We would like to keep them in the alternate assembly, so the next step is adding (concatenating) this file to the original alternate assembly. To make sure we don't introduce redundancies in the alternate assembly that way, we purge it as well to remove any junk or overlaps.

Contributor Author

incorporated! does this look OK?
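
The order of operations described in this thread (purge the primary, add the removed contigs to the alternate, purge the alternate) can be sketched as below. This is a toy under loud assumptions: `toy_purge` is a hypothetical stand-in that only drops exact duplicate sequences, whereas purge_dups itself relies on read depth and self-alignment, and the contig names and sequences are made up.

```python
def toy_purge(contigs):
    """Hypothetical stand-in for purge_dups.

    Keeps the first copy of each sequence and reports later exact copies
    as removed. The real tool uses read depth and alignments instead.
    """
    kept, removed, seen = {}, {}, set()
    for name, seq in contigs.items():
        if seq in seen:
            removed[name] = seq
        else:
            seen.add(seq)
            kept[name] = seq
    return kept, removed


# Made-up assemblies: contig name -> sequence.
primary = {"p1": "AAAA", "p2": "CCCC", "p3": "AAAA"}
alternate = {"a1": "GGGG", "a2": "AAAA"}

# Step 1: purge the primary assembly.
purged_primary, removed_from_primary = toy_purge(primary)

# Step 2: concatenate the removed contigs with the original alternate assembly.
combined_alternate = {**alternate, **removed_from_primary}

# Step 3: purge the combined alternate to remove redundancy introduced in step 2.
purged_alternate, _ = toy_purge(combined_alternate)
```

In this toy run, the contig removed from the primary duplicates one already present in the alternate, so the second purge drops it again; that is the redundancy the final step is meant to catch.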

clarifying text in kmer counting parallelization alt. text, thank you hexylena!

Co-authored-by: Helena <[email protected]>
@shiltemann
Member

Thanks @abueg! and thanks for the review @Delphine-L !

@shiltemann shiltemann merged commit 408fd00 into galaxyproject:main May 21, 2024
3 checks passed
4 participants