
Proper concurrent querying of alleles #149

Closed
bricoletc opened this issue Jul 22, 2019 · 9 comments
@bricoletc
Member

So the gramtools site & allele marker encoding is supposed to allow alleles to be queried concurrently.

Here's the relevant excerpt from Sorina et al.'s paper:

If the search cursor is next to the end of a site, all alternative alleles from that site need to be queried. Their ends are sorted together in the BWM because of the markers, so they can be queried concurrently by adding the SA-interval of suffixes starting with all numbers marking that site (even and odd).

Yet, as it turns out, this is not actually done in the code. Here's the relevant snippet:

    SearchStates get_allele_search_states(const Marker &site_boundary_marker,
                                          const SA_Interval &allele_marker_sa_interval,
                                          const SearchState &current_search_state,
                                          const PRG_Info &prg_info) {
        SearchStates search_states = {};
        const auto first_sa_interval_index = allele_marker_sa_interval.first;
        const auto last_sa_interval_index = allele_marker_sa_interval.second;
        // One iteration, and thus one size-one SA interval, per allele of the site.
        for (auto allele_marker_sa_index = first_sa_interval_index;
             allele_marker_sa_index <= last_sa_interval_index;
             ++allele_marker_sa_index) {
            SearchState search_state = current_search_state;
            search_state.sa_interval.first = allele_marker_sa_index;
            search_state.sa_interval.second = allele_marker_sa_index;
            search_state.variant_site_state
                    = SearchVariantSiteState::within_variant_site;
            // Populate the cache with which site/allele combination the `SearchState` maps into.
            // The alleles are not sorted by ID in the SA, so a specific routine is needed.
            auto allele_number = get_allele_id(allele_marker_sa_index, prg_info);
            search_state.variant_site_path.push_front(
                    VariantLocus{site_boundary_marker, allele_number});
            search_states.emplace_back(search_state);
        }
        return search_states;
    }

The for loop initialises one size-one SA interval per allele.

The advantage of doing this is that we immediately know the allele ID attached to each SearchState. The disadvantage is that a site with 5,000 alleles produces 5,000 SearchStates, and each time one of those hits another variant site, it seeds one new SearchState per allele of that site, and so on.
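
For illustration, here is a minimal sketch of the concurrent alternative: a single SearchState whose SA interval spans the ends of all the site's alleles, with allele ID resolution deferred until the search terminates. The types mirror the snippet above; ALLELE_ID_UNKNOWN is an assumed sentinel, not a gramtools identifier, and this is not the actual implementation.

    // Sketch only: query all alleles of a site with one SearchState.
    // Types (SearchState, SA_Interval, VariantLocus, ...) follow the
    // snippet above; ALLELE_ID_UNKNOWN is a hypothetical sentinel.
    SearchState get_site_search_state(const Marker &site_boundary_marker,
                                      const SA_Interval &allele_marker_sa_interval,
                                      const SearchState &current_search_state) {
        SearchState search_state = current_search_state;
        // One SA interval covering the ends of *all* alleles of the site,
        // instead of one size-one interval per allele.
        search_state.sa_interval = allele_marker_sa_interval;
        search_state.variant_site_state
                = SearchVariantSiteState::within_variant_site;
        // The allele ID is not known yet: record the site with a placeholder
        // and resolve IDs once the search ends, when each SA index in the
        // final interval maps back to a single allele.
        search_state.variant_site_path.push_front(
                VariantLocus{site_boundary_marker, ALLELE_ID_UNKNOWN});
        return search_state;
    }

A site with 5,000 alleles then costs one SearchState instead of 5,000, at the price of a later ID-resolution pass.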

@bricoletc
Member Author

Pretty sure this leaves a massive memory footprint in the index.

So I have implemented a change where we query all alleles marked with an even marker concurrently.

Meaning for a site AG 5C6T6A5 AG with three alleles, C and T get queried using a single SearchState.

To get the A, I think we need to change the data representation to: AG 5C6T6A6 AG. Otherwise we have to query that last allele on its own, because it sorts with the beginning of the site (the pair of 5's), and we have no guarantee that it sorts with all the 6's in the suffix array.
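
As a sketch of that representation change (assuming the usual integer-encoded PRG where literal bases are < 5, an odd number opens and closes a site, each odd marker occurs exactly twice, i.e. no nesting, and even numbers separate its alleles; this is not the gramtools implementation):

    #include <cstdint>
    #include <unordered_set>
    #include <vector>

    using Marker = uint64_t;  // assumed: symbols >= 5 are markers

    // Rewrite the closing (second) occurrence of each odd site marker to the
    // site's even allele marker, so that every allele end, including the last
    // one, sorts together in the suffix array:
    //   5 C 6 T 6 A 5  ->  5 C 6 T 6 A 6
    std::vector<Marker> reencode_site_ends(std::vector<Marker> prg) {
        std::unordered_set<Marker> seen_site_markers;
        for (auto &symbol : prg) {
            bool is_odd_marker = symbol >= 5 && symbol % 2 == 1;
            if (!is_odd_marker) continue;
            if (seen_site_markers.count(symbol))
                symbol += 1;  // closing occurrence: odd -> even marker
            else
                seen_site_markers.insert(symbol);  // opening occurrence: keep
        }
        return prg;
    }

With the closing 5 rewritten to a 6, the last allele's end sorts into the same suffix-array block as all the other 6's, so a single-SearchState query covers every allele of the site.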

@bricoletc
Member Author

Here's the result of running this change on the pf3k + DBLMSP vcf of Plasmodium on chromosomes 1 and 2 only:
Space_used.pdf

On the x-axis are the files that build produces, here using k=8 and the --all-kmers flag.
The actual index is stored in kmers, kmers_stats, sa_intervals and paths. For the latter two we get about a 3-fold size reduction!

Also, in this dataset the fm-index and the various masks are relatively large, but on all chromosomes and at k=11 the four files mentioned above take up 9.8 GB of the 11 GB of build files: sa_intervals is 3 GB and paths is 5.7 GB.

As for build time, it went down from 107 to 66 seconds.

@bricoletc
Member Author

And now for quasimap. I mapped 23.9 million reads from a single Plasmodium sample to the index generated above, using 10 threads.

    Commit                      Run time (h)   CPU time (h)   Max. RAM (GB)   Avg. RAM (GB)
    a6e9094 (before changes)    31             123            8.7             5.3
    8b02af1 (after changes)     6.8            30.9           2.8             2.2

So that's a ~4.5-fold decrease in run time, and a 2-3 fold decrease in RAM!
It's also interesting that the gap between max and average RAM narrows: the ratio is 1.64 before the changes and 1.27 after. This could be due to less 'blowup' for reads mapping to variant-dense regions.

Sanity check: the coverage generated is exactly identical for allele_sum_coverage and for grouped_allele_counts_coverage.json. (Note that for the latter a simple diff is not enough and I had to check each site's dictionary; this could be because multithreading creates allele groups in a different order.)

allele_base_coverage.json does not have exactly the same output: 37 sites out of 88,538 differ.
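
For reference, a sketch of the kind of per-site check used here in place of a plain text diff. The layout of grouped_allele_counts_coverage.json (one dictionary per site) is an assumption, as is the use of nlohmann/json, whose operator== compares objects independently of key order; that is exactly what a textual diff cannot do.

    #include <cstdlib>
    #include <fstream>
    #include <iostream>
    #include <nlohmann/json.hpp>

    // Sketch of an order-insensitive sanity check between two coverage files.
    // Assumes each file holds one dictionary per site; the exact layout of
    // grouped_allele_counts_coverage.json is an assumption here.
    int main(int argc, char **argv) {
        if (argc != 3) return EXIT_FAILURE;
        std::ifstream file_a(argv[1]), file_b(argv[2]);
        nlohmann::json a, b;
        file_a >> a;
        file_b >> b;

        // nlohmann::json stores objects as sorted maps, so != ignores the
        // key order that multithreading may shuffle between runs.
        std::size_t num_differing = 0;
        for (std::size_t site = 0; site < a.size(); ++site)
            if (a.at(site) != b.at(site)) ++num_differing;

        std::cout << num_differing << " sites differ out of " << a.size() << "\n";
        return EXIT_SUCCESS;
    }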

@iqbal-lab
Collaborator

this is great!

@iqbal-lab
Collaborator

What does this mean for a whole genome prg BTW?

iqbal-lab reopened this Jul 27, 2019
@bricoletc
Member Author

@iqbal-lab sorry for the late reply. I've been working on the last part of this comment: #149 (comment)
I will run with those changes as soon as I'm convinced it works, and also on the full genome, and report results!

@bricoletc
Member Author

bricoletc commented Aug 7, 2019

I ran 0736327 on the Pf chroms 1 & 2 (#149 (comment)) and runtime is down to 5.25 hours (CPU time: 24.16 hours). Max and average RAM don't change.

@bricoletc
Member Author

bricoletc commented Oct 27, 2019

Whole genome: we can't build the index for a kmer size of 10 or 11 on a6e9094 on yoda: it exceeds 116 GB of peak RAM at k=11 (average RAM: 43.7 GB). This is before proper concurrent allele querying.

After the modifications, on 73e4835, we can build k=11 using 38 GB peak RAM and 12.4 GB average RAM.

So we get at least a 3-fold RAM reduction.

In general I think the RAM and time reduction from these changes is proportional to the average number of alleles per site genome-wide.

Good to close now @iqbal-lab ?

@iqbal-lab
Collaborator

Agreed!
