Proper concurrent querying of alleles #149
Pretty sure this leaves a massive memory footprint in the index. So I have implemented a change where we query all alleles marked with an even marker concurrently. Meaning for a site AG 5C6T6A5 AG with three alleles, C and T get queried using a single SA interval. To get the A as well, I think we need to change the data representation to: AG 5C6T6A6 AG. Otherwise we have to query that last allele on its own, because it sorts with the beginning of the site (pair of 5's), and we do not have a guarantee that it sorts with all the 6's in the suffix array.
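To make the proposed re-encoding concrete, here is a minimal sketch — not the actual gramtools code; `reencode` and the integer layout are illustrative assumptions — of rewriting the closing odd marker of each site so that every allele, including the last, ends on the even marker:

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

using Marker = uint64_t;

// Assumed linearised-PRG convention: 1-4 encode ACGT; integers > 4 are
// variant markers. An odd marker opens and closes a site, and odd + 1
// (the even marker) separates its alleles.
std::vector<Marker> reencode(std::vector<Marker> prg) {
    std::unordered_set<Marker> open_sites;
    for (auto& m : prg) {
        if (m > 4 && m % 2 == 1) {
            if (open_sites.insert(m).second)
                continue;  // first occurrence: site-opening odd marker, keep
            open_sites.erase(m);
            m += 1;        // second occurrence: closing 5 -> 6, so the last
                           // allele also ends on an even marker
        }
    }
    return prg;
}

// AG 5C6T6A5 AG (A=1, C=2, G=3, T=4) becomes AG 5C6T6A6 AG:
// reencode({1,3, 5,2,6,4,6,1,5, 1,3}) == {1,3, 5,2,6,4,6,1,6, 1,3}
```

With every allele ending on the even marker, all of a site's allele ends sort into one contiguous block of the suffix array, so a single SA interval can cover them.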
Here's the result of running this change on the pf3k + DBLMSP vcf of Plasmodium on chromosomes 1 and 2 only: [file-size comparison chart; the x-axis shows the index files]. Also, in this dataset the fm-index and the various masks are relatively large, but on all chromosomes and at k=11, the four files mentioned above take up 9.8 GB of the 11 GB index. As for build time, it went down from 107 to 66 seconds.
And now for run-time: [run-time comparison chart]
So that's a 4-fold decrease in run-time! Sanity check: the coverage generated is exactly identical for both versions.
This is great!
What does this mean for a whole-genome PRG, BTW?
@iqbal-lab sorry for the late reply. I've been working on the last part of this comment: #149 (comment)
I ran 0736327 on the Pf Chroms 1 & 2 (#149 (comment)) and runtime is down to 5.25 hours (CPU time: 24.16 hours). Max and average RAM don't change.
Whole genome: we can't build the index for a kmer size of 10 or 11 on a6e9094 on yoda: it exceeds 116 GB of peak RAM at k=11 (average RAM: 43.7 GB). This is before proper concurrent allele querying. After the modifications, on 73e4835, we can build k=11 using 38 GB peak RAM and 12.4 GB average RAM. So we get at least a 3-fold RAM reduction. In general I think the RAM and time reduction due to these changes is proportional to the average number of alleles per site genome-wide. Good to close now, @iqbal-lab?
Agreed!
So the gramtools site & allele marker encoding is supposed to allow alleles to be queried concurrently.
Here's the relevant excerpt from Sorina et al.'s paper: [excerpt figure not shown]
Yet, as it turns out, this is not actually done in the code. Here's the relevant snippet:
gramtools/libgramtools/src/search/search.cpp, lines 358 to 384 in 41a1dbd
The for loop initialises one size-one SA Interval per allele. The advantage of doing this is that we immediately specify the allele ID attached to each SearchState. But the disadvantage is that if you have 5000 alleles, you will have 5000 SearchStates, and each time one of those hits another variant site, it will seed one new SearchState per allele. Etc...
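For illustration, here is a hypothetical sketch of the two seeding strategies; `SearchState`, `SA_Interval` and the function names are simplified stand-ins, not the actual gramtools types:

```cpp
#include <cstdint>
#include <vector>

using SA_Index = uint64_t;
struct SA_Interval { SA_Index begin, end; };  // inclusive SA range

struct SearchState {
    SA_Interval sa_interval;
    int allele_id;  // -1 = not yet resolved
};

// Current approach: one size-one SA Interval per allele. With 5000 alleles
// this seeds 5000 SearchStates, each of which fans out again at the next
// variant site it crosses.
std::vector<SearchState>
seed_per_allele(const std::vector<SA_Index>& allele_marker_sa_positions) {
    std::vector<SearchState> states;
    int allele_id = 1;
    for (SA_Index sa_index : allele_marker_sa_positions)
        states.push_back({{sa_index, sa_index}, allele_id++});
    return states;
}

// Concurrent approach: once every allele ends on the even marker, all of a
// site's allele ends occupy one contiguous SA block, so a single SearchState
// covers them all; the allele ID is recovered later, only for the states
// that survive the backward search.
SearchState seed_concurrently(SA_Interval even_marker_block) {
    return {even_marker_block, /*allele_id=*/-1};
}
```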