-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Index required for efficient random access for trees #684
Comments
The intervals are non-overlapping, so an interval tree should be unnecessary---just binary search the sorted list of start coordinates. |
They are overlapping though, aren't they? We should be able to get them from the existing indexes though, by binary searching the in-order and out-order indexes... |
See also #685 |
I've thought about this again, and no, I can't see anyway of doing this efficiently without a different type of index. The indexes we have tell us the order in which edges go in and out in a left-to-right traversal, basically by sorting on the left and right coordinates. Since the edges for a given tree are not adjacent within these lists in general, then binary searching for a given position will only give you the edges that start/end closest to that point. You can have long edges (say) spanning the entire interval which are not inserted/removed anywhere near this point along the genome. So, we will need some sort of data structure that'll support interval overlap queries if we're going to implement this (and edge finding in general, #685) efficiently. |
To do this you'd need to find the first in edge |
Heh, seems we wrote at the same time. |
Oh, yes. Sorry for the misunderstanding. So you possibly want nested containment lists. https://academic.oup.com/bioinformatics/article/23/11/1386/199545 |
That's can't be true, can it? Suppose we have one edge (0, L, a, b) at spans the entire TS. This appears at the start of the edges_in list and the end of the edges_out list. Then, suppose we have lots of trees, and we want to produce the one at L / 2. Searching for L / 2 won't bring us anywhere near this edge in either the edges_in or edges_out liest. |
Yes, long edges make this the same as normal iteration. |
This looks promising: https://github.com/biocore-ntnu/ncls shame the C is so tied to the python. There is also this https://github.com/databio/AIList/ although it is GPL :( |
I've renamed this issue, as it's really about what type of index we need for this access. There's a bunch of features we can build on this, and it's a significant new chunk of functionality, so I've created a project for it: https://github.com/tskit-dev/tskit/projects/5 |
Paper that might be useful about the "interval skip list" https://link.springer.com/chapter/10.1007/BFb0028258 "Searching an IS-list containing n intervals to find intervals overlapping a point takes expected time O(log n+L) where L is the number of matching intervals." I'm thinking that using the AVL tree as in msprime might be the easiest way forward as at least that is familiar. |
Apparently if the tree sequence is discrete then you can get a faster method. Not looked into this too deeply, but interesting nonetheless. https://link.springer.com/chapter/10.1007/978-3-642-10631-6_18 |
#2661 solves this partially by making seeking for the null tree much more efficient. There is still some linear operations, but they are much faster than before. It's unlikely we're doing to do anything much better in the medium term, so going to close this for now. |
Current TreeSequence.at() and other methods to seek to a particular tree work by iterating along the trees. This isn't very efficient, if we want to get a tree in the middle of the sequence.
Can we seek efficiently using the current indexes, or do we need to build some sort of interval tree?
The text was updated successfully, but these errors were encountered: