Python algorithm without S avl tree #2121
This looks fantastic, a major simplification!
@benjeffery any idea what's up with CircleCI here?
Can we remove the bintrees dependency or are we still using it somewhere? I think the main issues here are with the verification code - how much have you run this? The tests run it a little bit but it's good to run the algorithm for a longish time to really shake out weird problems.
Try pushing again, seen some discussion that this might be transitory.
Don't think I have run it enough then. Will do that.
bintrees is still used here:
First benchmarks are not looking good:
Human: length: 75000000.0, sample size: 1000
This is for a single run for each of the parameter settings.
Dev - branch: human-like parameters, averaged across 50 reps.
Dammit, didn't see that coming. I'm surprised that there are so many adjacent segments with identical nodes but different ancestral_to values, I thought that would be rare...
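To make the squashing condition under discussion concrete, here is a minimal sketch of a defrag pass. It assumes a segment carries `left`, `right`, `node` and `ancestral_to`, and that two neighbours are squashable only when they abut and agree on both `node` and `ancestral_to`. The `Segment` class and `defrag` function are hypothetical stand-ins for illustration, not the code in algorithms.py or the C library.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical minimal segment; not the real msprime Segment class.
    left: int
    right: int
    node: int
    ancestral_to: int


def defrag(chain):
    """Squash abutting segments that agree on node AND ancestral_to.

    `chain` is a list of segments sorted by left coordinate.
    Returns a new list; the input segments are not modified.
    """
    out = []
    for seg in chain:
        prev = out[-1] if out else None
        if (
            prev is not None
            and prev.right == seg.left
            and prev.node == seg.node
            and prev.ancestral_to == seg.ancestral_to
        ):
            # Extend the previous (copied) segment instead of keeping a new one.
            prev.right = seg.right
        else:
            out.append(Segment(seg.left, seg.right, seg.node, seg.ancestral_to))
    return out


chain = [
    Segment(0, 5, node=7, ancestral_to=2),
    Segment(5, 10, node=7, ancestral_to=2),   # squashable with the first
    Segment(10, 15, node=7, ancestral_to=5),  # same node, different ancestral_to
]
squashed = defrag(chain)
```

In this example the first two segments collapse into one, while the third stays separate purely because its `ancestral_to` value differs, which is the situation that appears to be inflating the segment count.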
Can you push the changes to the C code please @GertjanBisschop - I'd like to have a look at the segment squashing logic
Hmm, maybe we're not defragging often enough. Can you set it up so that we always defrag the segment chain after msp_merge_two_segments (probably just set
This is a real head scratcher... The extra segments all seem to be late in the simulation, when they should be pretty widely separated within the lineage (I would have thought!) Can you paste in the top of the perf report (from a full run) for the Drosophila example please?
Sorry @jeromekelleher, forgot to mention that the time (x-axis) is in generations, and not on the coalescent time scale. So this is the very beginning of the coalescent process.
Perf report for the Drosophila example: 35.37% python3 _msprime.cpython-38-x86_64-linux-gnu.so [.] msp_merge_two_ancestors
Yes, that looks about right for your hypothesis. I'm still surprised there are so many adjacent, squashable segments with different
For every pair of overlapping segments with non-corresponding starts and ends, I think you create three squashable segments in the new merged lineage. The middle section corresponds to a coalescence event (ancestral_to = sum of both), and the two bordering segments each keep the ancestral_to value of one of the original segments.
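A small sketch of that three-way split, under the assumption that a segment is just an interval plus an `ancestral_to` count. `Seg` and `merge_pair` are hypothetical names for illustration only and do not mirror the real merge routine:

```python
from dataclasses import dataclass


@dataclass
class Seg:
    # Hypothetical interval with a count of samples it is ancestral to.
    left: int
    right: int
    ancestral_to: int


def merge_pair(x, y):
    """Merge two segments into flank / overlap / flank pieces."""
    lo, hi = max(x.left, y.left), min(x.right, y.right)
    if lo >= hi:
        # No overlap: nothing coalesces, both segments pass through.
        return sorted([x, y], key=lambda s: s.left)
    pieces = []
    first = x if x.left < y.left else y
    if first.left < lo:
        # Left flank keeps the count of whichever segment starts first.
        pieces.append(Seg(first.left, lo, first.ancestral_to))
    # Overlap: a coalescence, so the counts add up.
    pieces.append(Seg(lo, hi, x.ancestral_to + y.ancestral_to))
    last = x if x.right > y.right else y
    if last.right > hi:
        # Right flank keeps the count of whichever segment ends last.
        pieces.append(Seg(hi, last.right, last.ancestral_to))
    return pieces


merged = merge_pair(Seg(0, 10, 2), Seg(5, 15, 3))
# merged == [Seg(0, 5, 2), Seg(5, 10, 5), Seg(10, 15, 3)]
```

The three pieces abut one another but carry three different `ancestral_to` values, so a squash that compares `ancestral_to` cannot rejoin them.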
Right, I guess that's it. At times I've wondered if actually keeping track of these values is worth it at all. In principle, you could just look through all segments periodically and when you see an interval that contains only one segment, throw it away. There would be a tradeoff, of course, between the frequency of doing this vs generating unnecessary recombination events. Do you think this would be worth prototyping?
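A rough prototype of that periodic sweep could look like the following. It assumes each lineage is a sorted list of non-overlapping `(left, right)` intervals and ignores nodes entirely. The function name `prune_singly_covered` is made up for this sketch, and the scan is deliberately naive (quadratic) to keep the idea visible.

```python
def prune_singly_covered(lineages):
    """Drop every stretch of genome that is covered by exactly one lineage.

    `lineages` is a list of lineages, each a sorted list of (left, right)
    tuples. Returns new lists; the input is left untouched.
    """
    # All segment endpoints split the genome into elementary intervals.
    breakpoints = sorted({p for lin in lineages for seg in lin for p in seg})
    pruned = [[] for _ in lineages]
    for a, b in zip(breakpoints, breakpoints[1:]):
        # Which lineages carry ancestral material over [a, b)?
        owners = [
            i
            for i, lin in enumerate(lineages)
            if any(left <= a and b <= right for left, right in lin)
        ]
        if len(owners) > 1:
            for i in owners:
                if pruned[i] and pruned[i][-1][1] == a:
                    # Re-join with the previous kept piece of this lineage.
                    pruned[i][-1] = (pruned[i][-1][0], b)
                else:
                    pruned[i].append((a, b))
        # Intervals with a single owner are simply not copied over.
    return pruned


lineages = [[(0, 10)], [(5, 15)], [(12, 20)]]
result = prune_singly_covered(lineages)
# result == [[(5, 10)], [(5, 10), (12, 15)], [(12, 15)]]
```

Running the sweep rarely keeps it cheap but leaves singly covered material around longer, where it can still attract recombination events; running it often does the opposite. Note that this sketch also re-joins abutting pieces within a lineage, which a real implementation tracking nodes would need to handle more carefully.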
Gee, this sounds related to the 'extend edges' thing again! (although maybe only tangentially)
Hah, yes! Really great point Peter. This is definitely worth exploring, but we should look at it in the algorithms.py before diving into C. What we want to do is do
Cool. These issues are related indeed.
Yes, you're right. So maybe this idea of "keeping more of the ARG" is orthogonal and we should investigate any performance benefits we might get separately. Let's talk about this offline. A thought that just struck me then is that if we are keeping this extra ARG information then we could probably take the max of the
Additional reason to switch is that
Proposal for adapting algorithms.py (see #1993) such that we can track the number of samples each segment is ancestral to without having to rely on the AVL tree S.