vgp assembly: flowcharts for meryl and purge_dups #4863
Conversation
The graphics look great! I suggested different formulations in a couple of places; I think they could improve readability.
@@ -324,6 +324,10 @@ Meryl will allow us to generate the *k*-mer profile by decomposing the sequencin
>
{: .comment}

In order to do genome profile analysis, first we need the *k*-mer spectrum of the raw reads, which should hopefully contain information about the genome that you sequenced. These *k*-mers are used to build a histogram (the *k*-mer spectrum), and then the GenomeScope model that fits that data will help infer genome characteristics. To count *k*-mers, we first count them on the separate FASTA files, before merging the counts and generating a histogram based on that. This is a way of parallelizing our work.
Suggestion: In order to identify the key metrics of the genome (profile), we start by generating a histogram of the k-mer distribution in the raw reads (the k-mer spectrum). Then, GenomeScope creates a model fitting the spectrum that allows us to estimate the genome characteristics. We work in parallel on each set of raw reads, creating a database of k-mer counts for each, then merge all the databases by adding the counts of the k-mers to build the histogram.
Incorporated this wording with some edits, does this sound ok? Let me know if not!
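For anyone following this exchange who wants the count-then-merge idea in concrete terms, here is a minimal Python sketch. It is not meryl or GenomeScope: the toy reads, the small k, and every function name are invented purely for illustration. The point is that per-file k-mer counts are additive, so counting each input separately and then summing the databases gives the same spectrum as counting everything at once, which is what makes the per-file step safe to parallelize.

```python
# Minimal sketch (not meryl itself) of the count-then-merge idea:
# count k-mers per input independently, sum the counts, then turn the
# merged counts into a k-mer spectrum for GenomeScope-style model fitting.
from collections import Counter

K = 4  # toy k-mer size; the real workflow uses a much larger k (e.g. 21)

def count_kmers(reads, k=K):
    """One 'database' per input: k-mer -> number of occurrences."""
    counts = Counter()
    for seq in reads:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

# Step 1: count each read set independently (the parallelisable part).
read_sets = {
    "reads_1": ["ACGTACGTACGT", "TTTTACGTAAAA"],
    "reads_2": ["ACGTACGTTTTT"],
}
databases = [count_kmers(reads) for reads in read_sets.values()]

# Step 2: merge by adding the counts per k-mer (a union-sum of the databases).
merged = Counter()
for db in databases:
    merged.update(db)

# Step 3: the k-mer spectrum: for each multiplicity x, how many distinct
# k-mers occur exactly x times. GenomeScope fits its model to this histogram.
spectrum = Counter(merged.values())
for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```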
![Post-processing with purge_dups](../../images/vgp_assembly/purge_dupspipeline.png "Purge_dups pipeline. Adapted from github.com/dfguan/purge_dups. Purge_dups is integrated into a multi-step pipeline consisting of three main substages. Red indicates the steps that require Minimap2.")
The way purging is incorporated in the VGP pipeline, first the **primary assembly** is purged, resulting in a clean (purged) primary assembly and a set of contigs that were *removed* from it. These removed contigs will often be haplotigs representing alternate alleles. We would like to keep those in the alternate assembly, so the next step is adding (concatenating) this file to the original alternate assembly. The combined file then undergoes purging as well, to remove any junk or overlaps.
Suggestion: Purging may be used in the VGP pipeline when there are suspicions of false duplications (Figure 1). In such cases, we start by purging the primary assembly, resulting in a clean (purged) primary assembly and a set of contigs removed from it. These removed contigs will often contain haplotigs representing alternate alleles. We would like to keep those in the alternate assembly, so the next step is adding (concatenating) this file to the original alternate assembly. To make sure we don't introduce redundancies into the alternate assembly that way, we perform purging on it as well to remove any junk or overlaps.
Incorporated! Does this look OK?
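To make the order of operations concrete, here is a small, hypothetical Python sketch of the data flow only. `purge_assembly()` is a stand-in for the whole Minimap2 + purge_dups substage, and the "looks duplicated" sets replace the real coverage and self-alignment evidence; the sketch only shows how the purged primary, the removed haplotigs, and the alternate assembly are chained together.

```python
# Hypothetical sketch of the purging data flow in this part of the pipeline.
# purge_assembly() stands in for the Minimap2 + purge_dups substage; the
# looks_duplicated sets are fake evidence used only for illustration.

def purge_assembly(contigs, looks_duplicated):
    """Split contigs into (kept, removed) using placeholder evidence."""
    kept = [c for c in contigs if c not in looks_duplicated]
    removed = [c for c in contigs if c in looks_duplicated]
    return kept, removed

# Hypothetical contig names for each haplotype assembly.
primary = ["ptg1", "ptg2", "ptg3"]
alternate = ["htg1", "htg2"]

# 1) Purge the primary assembly: ptg3 is flagged as a retained haplotig here.
purged_primary, removed_haplotigs = purge_assembly(primary, {"ptg3"})

# 2) The removed contigs often represent alternate alleles, so they are
#    concatenated onto the alternate assembly rather than discarded.
alternate_plus_haplotigs = alternate + removed_haplotigs

# 3) Purge the combined alternate as well, so the concatenation does not
#    leave junk or overlaps behind (htg2 is flagged in this toy example).
purged_alternate, _ = purge_assembly(alternate_plus_haplotigs, {"htg2"})

print("purged primary:  ", purged_primary)    # ['ptg1', 'ptg2']
print("purged alternate:", purged_alternate)  # ['htg1', 'ptg3']
```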
clarifying text in kmer counting parallelization alt. text, thank you hexylena! Co-authored-by: Helena <[email protected]>
Thanks @abueg! And thanks for the review @Delphine-L!
hello! 👋🏼
Added some new graphics for the meryl section (trying to clarify how we're working on a collection for batch jobs -> merging the outputs) and for the purge_dups section.