Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python algorithm without S avl tree #2121

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

GertjanBisschop
Copy link
Member

Proposal for adapting algorithms.py (see #1993) such that we can track the number of samples each segment is ancestral to without having to rely on AVL tree S.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fantastic, a major simplification!

algorithms.py Outdated Show resolved Hide resolved
@jeromekelleher
Copy link
Member

@benjeffery any idea what's up with CircleCI here?

@jeromekelleher
Copy link
Member

Can we remove the bintrees dependency or are we still using it somewhere?

I think the main issues here are with the verification code - how much have you run this? The tests run it a little bit but it's good to run the algorithm for a longish time to really shake out weird problems.

@benjeffery
Copy link
Member

@benjeffery any idea what's up with CircleCI here?

Try pushing again, seen some discussion that this might be transitory.

@GertjanBisschop
Copy link
Member Author

Can we remove the bintrees dependency or are we still using it somewhere?

I think the main issues here are with the verification code - how much have you run this? The tests run it a little bit but it's good to run the algorithm for a longish time to really shake out weird problems.

Don't think I have run it enough then. Will do that.

@GertjanBisschop
Copy link
Member Author

GertjanBisschop commented Nov 7, 2022

Can we remove the bintrees dependency or are we still using it somewhere?

I think the main issues here are with the verification code - how much have you run this? The tests run it a little bit but it's good to run the algorithm for a longish time to really shake out weird problems.

bintrees is still used here: dtwf_generation()

@GertjanBisschop
Copy link
Member Author

First benchmarks are not looking good:

  • dev branch

Human: length: 75000000.0, sample size:1000
Function _run() {} Took 45.8196 seconds
Human: length: 75000000.0, sample size:10000
Function _run() {} Took 78.5620 seconds
Drosophila: length: 10000000, sample size:100
Function _run() {} Took 61.1237 seconds

  • main branch

Human: length: 75000000.0, sample size:1000
Function _run() {} Took 15.0323 seconds
Human: length: 75000000.0, sample size:10000
Function _run() {} Took 17.4346 seconds
Drosophila: length: 10000000, sample size:100
Function _run() {} Took 23.6431 seconds

This is for a single run for each of the parameter settings.
The performance penalty seems to be mostly due to the fact that when keeping track of ancestral_to per segment, being able to refactor the segment chain becomes quite a rare event. This is because we cannot merge two adjoining segments that are ancestral to a different number of samples. This leads to a clear increase in the total and mean number of segments (see graph). So instead of updating the avl tree. We seem to be spending our time looping over a larger number of segments.

Dev - branch: human like parameters, averaged across 50 reps.

human_dev

Main - branch:
human_main

@jeromekelleher
Copy link
Member

Dammit, didn't see that coming ☹️

I'm surprised that there's so many adjacent segments with identical nodes but different ancestral_to values, I thought that would be rare...

@jeromekelleher
Copy link
Member

Can you push the changes to the C code please @GertjanBisschop - I'd like to have a look at the segment squashing logic

@jeromekelleher
Copy link
Member

Hmm, maybe we're not defragging often enough. Can you set it up so that we always defrag the segment chain after msp_merge_two_segments (probably just set defrag_required=true) and look at those numbers of segment plots again?

@GertjanBisschop
Copy link
Member Author

This does not have any impact:
human_main_defrag

@jeromekelleher
Copy link
Member

This is a real head scratcher... The extra segments all seem to be late in the simulation, when they should be pretty widely separated within the lineage (I would have thought!)

Can you paste in the top of the perf report (from a full run) for the Drosophila example please?

@GertjanBisschop
Copy link
Member Author

Sorry @jeromekelleher, forgot to mention that the time (x-axis) is in generations, and not on the coalescent time scale. So this is the very beginning of the coalescent process.

@GertjanBisschop
Copy link
Member Author

Perf report for the Drosophila example:

35.37% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] msp_merge_two_ancestors
8.90% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] rate_map_position_to_mass
7.67% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] avl_rebalance
6.68% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] fenwick_set_value
6.32% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] msp_defrag_segment_chain
5.31% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] avl_at
4.66% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] fenwick_find
4.35% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] cmp_individual
2.90% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] avl_search_closest
2.57% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] msp_set_segment_mass
2.06% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] fenwick_increment
1.62% python3 libm-2.31.so [.] __log1p
1.42% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] rate_map_mass_between
0.82% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] msp_run
0.77% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] msp_choose_uniform_breakpoint
0.63% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] msp_get_segment.isra.0
0.61% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] avl_unlink_node
0.45% python3 libgsl.so.23.1.0 [.] gsl_rng_uniform_int
0.39% python3 libc-2.31.so [.] msort_with_tmp.part.0
0.36% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] cmp_node_mapping
0.28% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] fenwick_get_cumulative_sum
0.25% python3 libgsl.so.23.1.0 [.] gsl_ran_exponential

@jeromekelleher
Copy link
Member

Yes, that looks about right for your hypothesis. I'm still surprised there are so many adjacent, squashable segments with different ancestral_to values - how can this happen?

@GertjanBisschop
Copy link
Member Author

For every pair of overlapping segments with non-corresponding starts and ends I think you create three squashable segments in the new merged lineage. One section will correspond to a coalescence event (ancestral_to = sum of both) and the two other bordering segments have the ancestral_to value of one of the original segments.

@jeromekelleher
Copy link
Member

Right, I guess that's it.

At times I've wondered if actually keeping track of these values is worth it at all. In principle, you could just look through all segments periodically and when you see an interval that contains only one segment, throw it away. There would a tradeoff, of course, between the frequency of doing this vs generated unnecessary recombination events. Do you think this would be worth prototyping?

@petrelharp
Copy link
Contributor

For every pair of overlapping segments with non-corresponding starts and ends I think you create three squashable segments in the new merged lineage. One section will correspond to a coalescence event (ancestral_to = sum of both) and the two other bordering segments have the ancestral_to value of one of the original segments.

Gee, this sounds related to the 'extend edges' thing again! (although maybe only tangentially)

@jeromekelleher
Copy link
Member

Gee, this sounds related to the 'extend edges' thing again! (although maybe only tangentially)

Hah, yes! Really great point Peter.

This is definitely worth exploring, but we should look at it in the algorithms.py before diving into C. What we want to do is do merge_ancestors in two passes: first check if there's any coalescence, and if so we map both lineages to the new node at all positions along the genome whether they coalesce or not. This is something we'll want to support in msprime anyway, so good to explore it in this context in case it fixes these performance issues.

@GertjanBisschop
Copy link
Member Author

Cool. These issues are related indeed.
Just thinking. When mapping both lineages to the new node, wouldn't we still have to merge all segments of both original lineages? So we would still be iterating over the extra segments we create by using ancestral_to.

@jeromekelleher
Copy link
Member

So we would still be iterating over the extra segments we create by using ancestral_to.

Yes, you're right. So maybe this idea of "keeping more of the ARG" is orthogonal and we should investigate any performance benefits we might get separately. Let's talk about this offline.

A thought that just struck me then is that if we are keeping this extra ARG information then we could probably take the max of the ancestral_to values in adjacent squashable segments, and it might all work out quite nicely.

@not-a-feature
Copy link

Additional reason to switch is that bintrees is deprecated and the development has stopped:
https://github.com/mozman/bintrees

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants