
Proper concurrent querying of alleles #149

Closed
bricoletc opened this issue Jul 22, 2019 · 9 comments
@bricoletc
Member

So the gramtools site & allele marker encoding is supposed to allow alleles to be queried concurrently.

Here's the relevant excerpt from Sorina et al.'s paper:

If the search cursor is next to the end of a site, all alternative alleles from that site need to be queried. Their ends are sorted together in the BWM because of the markers, so they can be queried concurrently by adding the SA-interval of suffixes starting with all numbers marking that site (even and odd).

Yet, as it turns out, this is not actually done in the code. Here's the relevant snippet:

    SearchStates get_allele_search_states(const Marker &site_boundary_marker,
                                          const SA_Interval &allele_marker_sa_interval,
                                          const SearchState &current_search_state,
                                          const PRG_Info &prg_info) {
        SearchStates search_states = {};
        const auto first_sa_interval_index = allele_marker_sa_interval.first;
        const auto last_sa_interval_index = allele_marker_sa_interval.second;
        // One iteration, and thus one size-one SA interval, per allele of the site.
        for (auto allele_marker_sa_index = first_sa_interval_index;
             allele_marker_sa_index <= last_sa_interval_index;
             ++allele_marker_sa_index) {
            SearchState search_state = current_search_state;
            search_state.sa_interval.first = allele_marker_sa_index;
            search_state.sa_interval.second = allele_marker_sa_index;
            search_state.variant_site_state
                    = SearchVariantSiteState::within_variant_site;
            // Populate the cache with which site/allele combination the `SearchState` maps into.
            // The alleles are not sorted by ID in the SA, so a specific routine is needed.
            auto allele_number = get_allele_id(allele_marker_sa_index, prg_info);
            search_state.variant_site_path.push_front(
                    VariantLocus{site_boundary_marker, allele_number});
            search_states.emplace_back(search_state);
        }
        return search_states;
    }

The for loop initialises one size-one SA interval per allele.

The advantage of doing this is that we immediately know the allele ID attached to each SearchState. The disadvantage is that a site with 5,000 alleles produces 5,000 SearchStates, and each time one of those hits another variant site, it seeds one new SearchState per allele of that site, and so on.
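
For illustration, here is a minimal sketch of the concurrent alternative: a single SearchState whose SA interval spans the ends of all the site's alleles, with allele ID resolution deferred until the search terminates. The types mirror the snippet above; ALLELE_ID_UNKNOWN is an assumed sentinel, not a gramtools identifier, and this is not the actual implementation.

    // Sketch only: query all alleles of a site with one SearchState.
    // Types (SearchState, SA_Interval, VariantLocus, ...) follow the
    // snippet above; ALLELE_ID_UNKNOWN is a hypothetical sentinel.
    SearchState get_site_search_state(const Marker &site_boundary_marker,
                                      const SA_Interval &allele_marker_sa_interval,
                                      const SearchState &current_search_state) {
        SearchState search_state = current_search_state;
        // One SA interval covering the ends of *all* alleles of the site,
        // instead of one size-one interval per allele.
        search_state.sa_interval = allele_marker_sa_interval;
        search_state.variant_site_state
                = SearchVariantSiteState::within_variant_site;
        // The allele ID is not known yet: record the site with a placeholder
        // and resolve IDs once the search ends, when each SA index in the
        // final interval maps back to a single allele.
        search_state.variant_site_path.push_front(
                VariantLocus{site_boundary_marker, ALLELE_ID_UNKNOWN});
        return search_state;
    }

A site with 5,000 alleles then costs one SearchState instead of 5,000, at the price of a later ID-resolution pass.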

@bricoletc
Member Author

Pretty sure this leaves a massive memory footprint in the index.

So I have implemented a change where we query all alleles marked with an even marker concurrently.

Meaning for a site AG 5C6T6A5 AG with three alleles, C and T get queried using a single SearchState.

To get the A, I think we need to change the data representation to: AG 5C6T6A6 AG. Otherwise we have to query that last allele on its own, because it sorts with the beginning of the site (the pair of 5's), and we have no guarantee that it sorts with all the 6's in the suffix array.
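
As a sketch of that representation change (assuming the usual integer-encoded PRG where literal bases are < 5, an odd number opens and closes a site, each odd marker occurs exactly twice, i.e. no nesting, and even numbers separate its alleles; this is not the gramtools implementation):

    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    using Marker = uint64_t;  // assumed: symbols >= 5 are markers

    // Rewrite the closing (second) occurrence of each odd site marker to the
    // site's even allele marker, so that every allele end, including the last
    // one, sorts together in the suffix array:
    //   5 C 6 T 6 A 5  ->  5 C 6 T 6 A 6
    std::vector<Marker> reencode_site_ends(std::vector<Marker> prg) {
        std::unordered_set<Marker> seen_site_markers;
        for (auto &symbol : prg) {
            bool is_odd_marker = symbol >= 5 && symbol % 2 == 1;
            if (!is_odd_marker) continue;
            if (seen_site_markers.count(symbol))
                symbol += 1;  // closing occurrence: odd -> even marker
            else
                seen_site_markers.insert(symbol);  // opening occurrence: keep
        }
        return prg;
    }

With the closing 5 rewritten to a 6, the last allele's end sorts into the same suffix-array block as all the other 6's, so a single-SearchState query covers every allele of the site.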

@bricoletc
Member Author

Here's the result of running this change on the pf3k + DBLMSP vcf of Plasmodium on chromosomes 1 and 2 only:
Space_used.pdf

On the x-axis are the files that build produces, here using k=8 and the --all-kmers flag.
The actual index is stored in kmers, kmers_stats, sa_intervals and paths. For the latter two we get about a 3-fold size reduction!

Also, in this dataset the fm-index and the various masks are relatively large, but on all chromosomes and at k=11 the four files mentioned above take up 9.8 GB of the 11 GB of build files: sa_intervals is 3 GB and paths is 5.7 GB.

As for build time, it went down from 107 to 66 seconds.

@bricoletc
Member Author

And now for quasimap. I mapped 23.9 million reads from a single Plasmodium sample to the index generated above, using 10 threads.

    Commit                      Run time (h)   CPU time (h)   Max. RAM (GB)   Avg. RAM (GB)
    a6e9094 (before changes)    31             123            8.7             5.3
    8b02af1 (after changes)     6.8            30.9           2.8             2.2

So that's a ~4.5-fold decrease in run time, and a 2-3 fold decrease in RAM!
It's also interesting that the gap between max and average RAM narrows: the ratio is 1.64 before the changes and 1.27 after. This could be due to less 'blowup' for reads mapping to variant-dense regions.

Sanity check: the coverage generated is exactly identical for allele_sum_coverage and for grouped_allele_counts_coverage.json. (Note that for the latter a simple diff is not enough and I had to check each site's dictionary; this could be because multithreading creates allele groups in a different order.)

allele_base_coverage.json does not have exactly the same output: 37 sites out of 88,538 differ.
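
For reference, a sketch of the kind of per-site check used here in place of a plain text diff. The layout of grouped_allele_counts_coverage.json (one dictionary per site) is an assumption, as is the use of nlohmann/json, whose operator== compares objects independently of key order; that is exactly what a textual diff cannot do.

    #include <cstdlib>
    #include <fstream>
    #include <iostream>
    #include <nlohmann/json.hpp>

    // Sketch of an order-insensitive sanity check between two coverage files.
    // Assumes each file holds one dictionary per site; the exact layout of
    // grouped_allele_counts_coverage.json is an assumption here.
    int main(int argc, char **argv) {
        if (argc != 3) return EXIT_FAILURE;
        std::ifstream file_a(argv[1]), file_b(argv[2]);
        nlohmann::json a, b;
        file_a >> a;
        file_b >> b;

        // nlohmann::json stores objects as sorted maps, so != ignores the
        // key order that multithreading may shuffle between runs.
        std::size_t num_differing = 0;
        for (std::size_t site = 0; site < a.size(); ++site)
            if (a.at(site) != b.at(site)) ++num_differing;

        std::cout << num_differing << " sites differ out of " << a.size() << "\n";
        return EXIT_SUCCESS;
    }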

@iqbal-lab
Collaborator

this is great!

@iqbal-lab
Collaborator

What does this mean for a whole genome prg BTW?

iqbal-lab reopened this Jul 27, 2019
@bricoletc
Member Author

@iqbal-lab sorry for the late reply. I've been working on the last part of this comment: #149 (comment)
I will run with those changes as soon as I'm convinced it works, and also on the full genome, and report results!

@bricoletc
Member Author

bricoletc commented Aug 7, 2019

I ran 0736327 on the Pf chroms 1 & 2 (#149 (comment)) and runtime is down to 5.25 hours (CPU time: 24.16 hours). Max and average RAM don't change.

@bricoletc
Member Author

bricoletc commented Oct 27, 2019

Whole genome: we can't build the index for a kmer size of 10 or 11 on a6e9094 on yoda: it exceeds 116 GB of peak RAM at k=11 (average RAM: 43.7 GB). This is before proper concurrent allele querying.

After the modifications, on 73e4835, we can build k=11 using 38 GB peak RAM and 12.4 GB average RAM.

So we get at least a 3-fold RAM reduction.

In general I think the RAM and time reduction from these changes is proportional to the average number of alleles per site genome-wide.

Good to close now @iqbal-lab ?

@iqbal-lab
Collaborator

Agreed!
