Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index required for efficient random access for trees #684

Closed
jeromekelleher opened this issue Jun 18, 2020 · 15 comments
Closed

Index required for efficient random access for trees #684

jeromekelleher opened this issue Jun 18, 2020 · 15 comments
Labels
C API Issue is about the C API enhancement New feature or request Python API Issue is about the Python API

Comments

@jeromekelleher
Copy link
Member

Current TreeSequence.at() and other methods to seek to a particular tree work by iterating along the trees. This isn't very efficient, if we want to get a tree in the middle of the sequence.

Can we seek efficiently using the current indexes, or do we need to build some sort of interval tree?

@jeromekelleher
Copy link
Member Author

Related: #24 and #4

@grahamgower
Copy link
Member

The intervals are non-overlapping, so an interval tree should be unnecessary---just binary search the sorted list of start coordinates.

@jeromekelleher
Copy link
Member Author

They are overlapping though, aren't they? We should be able to get them from the existing indexes though, by binary searching the in-order and out-order indexes...

@jeromekelleher
Copy link
Member Author

See also #685

@jeromekelleher
Copy link
Member Author

I've thought about this again, and no, I can't see anyway of doing this efficiently without a different type of index. The indexes we have tell us the order in which edges go in and out in a left-to-right traversal, basically by sorting on the left and right coordinates. Since the edges for a given tree are not adjacent within these lists in general, then binary searching for a given position will only give you the edges that start/end closest to that point. You can have long edges (say) spanning the entire interval which are not inserted/removed anywhere near this point along the genome.

So, we will need some sort of data structure that'll support interval overlap queries if we're going to implement this (and edge finding in general, #685) efficiently.

@benjeffery
Copy link
Member

They are overlapping though, aren't they? We should be able to get them from the existing indexes though, by binary searching the in-order and out-order indexes...

To do this you'd need to find the first in edge i that has i.right > target, then you also find the first out edge o that has o.right > i.left. Then you could continue normal tree iteration from that point.
I'm not sure this is so simple as binary search though as in is not sorted by right and vice-versa.

@benjeffery
Copy link
Member

Heh, seems we wrote at the same time.

@grahamgower
Copy link
Member

grahamgower commented Jun 18, 2020

Oh, yes. Sorry for the misunderstanding. So you possibly want nested containment lists. https://academic.oup.com/bioinformatics/article/23/11/1386/199545

@jeromekelleher
Copy link
Member Author

To do this you'd need to find the first in edge i that has i.right > target, then you also find the first out edge o that has o.right > i.left. Then you could continue normal tree iteration from that point.

That's can't be true, can it? Suppose we have one edge (0, L, a, b) at spans the entire TS. This appears at the start of the edges_in list and the end of the edges_out list. Then, suppose we have lots of trees, and we want to produce the one at L / 2. Searching for L / 2 won't bring us anywhere near this edge in either the edges_in or edges_out liest.

@benjeffery
Copy link
Member

Yes, long edges make this the same as normal iteration.

@benjeffery
Copy link
Member

benjeffery commented Jun 18, 2020

This looks promising: https://github.com/biocore-ntnu/ncls shame the C is so tied to the python. There is also this https://github.com/databio/AIList/ although it is GPL :(

@jeromekelleher jeromekelleher added C API Issue is about the C API enhancement New feature or request Python API Issue is about the Python API labels Sep 29, 2020
@jeromekelleher jeromekelleher changed the title Efficient random access for trees Index required for efficient random access for trees Sep 29, 2020
@jeromekelleher
Copy link
Member Author

I've renamed this issue, as it's really about what type of index we need for this access. There's a bunch of features we can build on this, and it's a significant new chunk of functionality, so I've created a project for it: https://github.com/tskit-dev/tskit/projects/5

@benjeffery
Copy link
Member

benjeffery commented Jun 15, 2022

Paper that might be useful about the "interval skip list" https://link.springer.com/chapter/10.1007/BFb0028258 "Searching an IS-list containing n intervals to find intervals overlapping a point takes expected time O(log n+L) where L is the number of matching intervals."

I'm thinking that using the AVL tree as in msprime might be the easiest way forward as at least that is familiar.

@benjeffery
Copy link
Member

Apparently if the tree sequence is discrete then you can get a faster method. Not looked into this too deeply, but interesting nonetheless. https://link.springer.com/chapter/10.1007/978-3-642-10631-6_18

@jeromekelleher
Copy link
Member Author

#2661 solves this partially by making seeking for the null tree much more efficient. There is still some linear operations, but they are much faster than before. It's unlikely we're doing to do anything much better in the medium term, so going to close this for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C API Issue is about the C API enhancement New feature or request Python API Issue is about the Python API
Projects
None yet
Development

No branches or pull requests

3 participants