Index required for efficient random access for trees #684

jeromekelleher · 2020-06-18T07:35:05Z

Currently, TreeSequence.at() and other methods that seek to a particular tree work by iterating along the trees. This isn't very efficient if we want to get a tree in the middle of the sequence.
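To make the cost concrete, here is a minimal pure-Python sketch of seeking by iteration. It is not tskit's implementation: the `seek_linear` function, the tuple layout for edges, and the toy data are all assumptions made for illustration. It sweeps left to right using edges sorted by left (in-order) and by right (out-order), so reaching a tree in the middle means applying every edge change before it.

```python
from typing import List, Tuple

# Hypothetical edge layout: (left, right, parent, child), half-open [left, right).
Edge = Tuple[float, float, int, int]


def seek_linear(edges: List[Edge], num_nodes: int, sequence_length: float,
                position: float):
    """Sweep left to right until the tree covering `position` is reached.

    Returns (left, right, parent_array, trees_visited).
    """
    # In-order: edge ids sorted by left; out-order: edge ids sorted by right.
    order_in = sorted(range(len(edges)), key=lambda j: edges[j][0])
    order_out = sorted(range(len(edges)), key=lambda j: edges[j][1])
    parent = [-1] * num_nodes
    j = k = 0
    left = 0.0
    trees_visited = 0
    while left < sequence_length:
        # Remove the edges that end at the current left coordinate.
        while k < len(edges) and edges[order_out[k]][1] == left:
            parent[edges[order_out[k]][3]] = -1
            k += 1
        # Insert the edges that start at the current left coordinate.
        while j < len(edges) and edges[order_in[j]][0] == left:
            parent[edges[order_in[j]][3]] = edges[order_in[j]][2]
            j += 1
        # The tree extends to the next coordinate where any edge changes.
        right = sequence_length
        if j < len(edges):
            right = min(right, edges[order_in[j]][0])
        if k < len(edges):
            right = min(right, edges[order_out[k]][1])
        trees_visited += 1
        if left <= position < right:
            return left, right, parent, trees_visited
        left = right
    raise ValueError("position is outside [0, sequence_length)")


# Toy data: two trees over nodes 0..4, with a breakpoint at 5.0.
EDGES: List[Edge] = [
    (0.0, 5.0, 3, 0), (0.0, 10.0, 3, 1), (5.0, 10.0, 3, 2),
    (0.0, 5.0, 4, 2), (5.0, 10.0, 4, 0), (0.0, 10.0, 4, 3),
]
print(seek_linear(EDGES, num_nodes=5, sequence_length=10.0, position=7.0))
```

The `trees_visited` counter grows with the number of trees to the left of the target, which is the linear behaviour being complained about here.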

Can we seek efficiently using the current indexes, or do we need to build some sort of interval tree?

jeromekelleher · 2020-06-18T07:36:31Z

Related: #24 and #4

grahamgower · 2020-06-18T08:32:05Z

The intervals are non-overlapping, so an interval tree should be unnecessary---just binary search the sorted list of start coordinates.
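A minimal sketch of that binary search, assuming the sorted tree breakpoints are already available as a plain list (the `tree_index_at` helper and the sample breakpoints are hypothetical). Note that this only locates which tree interval covers a position; it does not say which edges make up that tree.

```python
from bisect import bisect_right
from typing import Sequence


def tree_index_at(breakpoints: Sequence[float], position: float) -> int:
    """Index of the half-open interval [b[i], b[i + 1]) containing `position`."""
    if not breakpoints[0] <= position < breakpoints[-1]:
        raise ValueError("position is outside the sequence")
    # bisect_right counts the breakpoints <= position; the last of those
    # is the left coordinate of the interval we want.
    return bisect_right(breakpoints, position) - 1


# Three trees covering [0, 2.5), [2.5, 5) and [5, 10).
BREAKPOINTS = [0.0, 2.5, 5.0, 10.0]
print(tree_index_at(BREAKPOINTS, 7.0))
```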

jeromekelleher · 2020-06-18T09:04:23Z

They are overlapping though, aren't they? We should be able to get them from the existing indexes though, by binary searching the in-order and out-order indexes...

jeromekelleher · 2020-06-18T09:04:57Z

See also #685

jeromekelleher · 2020-06-18T09:38:47Z

I've thought about this again, and no, I can't see any way of doing this efficiently without a different type of index. The indexes we have tell us the order in which edges go in and out in a left-to-right traversal, basically by sorting on the left and right coordinates. Since the edges for a given tree are not, in general, adjacent within these lists, binary searching for a given position will only give you the edges that start/end closest to that point. You can have long edges (say, spanning the entire interval) which are not inserted/removed anywhere near this point along the genome.

So, we will need some sort of data structure that'll support interval overlap queries if we're going to implement this (and edge finding in general, #685) efficiently.
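For reference, the query such a data structure would need to answer can be written as a brute-force scan. This is only a baseline sketch under assumed names (`edges_overlapping`, `parent_array_at`, and the tuple layout are hypothetical, not tskit API); it touches every edge, whereas a suitable interval index would aim to return the same set without a full scan.

```python
from typing import List, Tuple

# Hypothetical edge layout: (left, right, parent, child), half-open [left, right).
Edge = Tuple[float, float, int, int]


def edges_overlapping(edges: List[Edge], position: float) -> List[int]:
    """Ids of all edges whose interval contains `position` (O(n) scan)."""
    return [j for j, (left, right, _, _) in enumerate(edges)
            if left <= position < right]


def parent_array_at(edges: List[Edge], num_nodes: int, position: float) -> List[int]:
    """Build the tree at `position` directly from the overlapping edges."""
    parent = [-1] * num_nodes
    for j in edges_overlapping(edges, position):
        _, _, p, c = edges[j]
        parent[c] = p
    return parent


# Toy data: two trees over nodes 0..4, with a breakpoint at 5.0.
EDGES: List[Edge] = [
    (0.0, 5.0, 3, 0), (0.0, 10.0, 3, 1), (5.0, 10.0, 3, 2),
    (0.0, 5.0, 4, 2), (5.0, 10.0, 4, 0), (0.0, 10.0, 4, 3),
]
print(edges_overlapping(EDGES, 7.0))
print(parent_array_at(EDGES, 5, 7.0))
```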

benjeffery · 2020-06-18T09:39:08Z

> They are overlapping though, aren't they? We should be able to get them from the existing indexes though, by binary searching the in-order and out-order indexes...

To do this you'd need to find the first in edge i that has i.right > target, then you also find the first out edge o that has o.right > i.left. Then you could continue normal tree iteration from that point.
I'm not sure this is as simple as a binary search, though, as the in-order index is not sorted by right and vice versa.

benjeffery · 2020-06-18T09:39:43Z

Heh, seems we wrote at the same time.

grahamgower · 2020-06-18T09:48:09Z

Oh, yes. Sorry for the misunderstanding. So you possibly want nested containment lists. https://academic.oup.com/bioinformatics/article/23/11/1386/199545

jeromekelleher · 2020-06-18T09:53:30Z

> To do this you'd need to find the first in edge i that has i.right > target, then you also find the first out edge o that has o.right > i.left. Then you could continue normal tree iteration from that point.

That can't be true, can it? Suppose we have one edge (0, L, a, b) that spans the entire TS. This appears at the start of the edges_in list and the end of the edges_out list. Then, suppose we have lots of trees, and we want to produce the one at L / 2. Searching for L / 2 won't bring us anywhere near this edge in either the edges_in or edges_out list.
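The counterexample can be checked with a small sketch. The data below is a toy set of intervals rather than a valid tree sequence, and the node ids and variable names are made up for illustration: one edge spans [0, L) and 99 short edges cover [i, i + 1).

```python
from bisect import bisect_right

L = 100.0
# Edge 0 spans everything; edges 1..99 are short. Layout: (left, right, parent, child).
edges = [(0.0, L, 101, 100)] + [(float(i), float(i + 1), 101, i) for i in range(99)]

# The two existing indexes: edge ids sorted by left, and edge ids sorted by right.
edges_in = sorted(range(len(edges)), key=lambda j: edges[j][0])
edges_out = sorted(range(len(edges)), key=lambda j: edges[j][1])

target = L / 2
# Where a binary search for the target lands in each index.
in_pos = bisect_right([edges[j][0] for j in edges_in], target)
out_pos = bisect_right([edges[j][1] for j in edges_out], target)

# Where the spanning edge actually sits in each index.
span_in = edges_in.index(0)
span_out = edges_out.index(0)

# Ground truth: the edges that overlap the target position.
overlapping = [j for j, (left, right, _, _) in enumerate(edges)
               if left <= target < right]

print("binary search lands at", in_pos, "(in) and", out_pos, "(out)")
print("spanning edge sits at", span_in, "(in) and", span_out, "(out)")
print("edges overlapping the target:", overlapping)
```

The spanning edge is part of the tree at L / 2, yet it sits at the very start of one index and the very end of the other, roughly 50 entries away from where either binary search lands.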

benjeffery · 2020-06-18T09:55:48Z

Yes, long edges make this the same as normal iteration.

benjeffery · 2020-06-18T10:03:45Z

This looks promising: https://github.com/biocore-ntnu/ncls (shame the C is so tied to the Python). There is also https://github.com/databio/AIList/ although it is GPL :(

jeromekelleher · 2020-09-29T14:08:24Z

I've renamed this issue, as it's really about what type of index we need for this access. There's a bunch of features we can build on this, and it's a significant new chunk of functionality, so I've created a project for it: https://github.com/tskit-dev/tskit/projects/5


benjeffery · 2022-06-15T15:24:13Z

Paper that might be useful about the "interval skip list" https://link.springer.com/chapter/10.1007/BFb0028258 "Searching an IS-list containing n intervals to find intervals overlapping a point takes expected time O(log n+L) where L is the number of matching intervals."

I'm thinking that using the AVL tree as in msprime might be the easiest way forward as at least that is familiar.

benjeffery · 2022-06-15T15:39:35Z

Apparently if the tree sequence is discrete then you can get a faster method. Not looked into this too deeply, but interesting nonetheless. https://link.springer.com/chapter/10.1007/978-3-642-10631-6_18

jeromekelleher · 2023-04-14T16:41:55Z

#2661 solves this partially by making seeking for the null tree much more efficient. There are still some linear operations, but they are much faster than before. It's unlikely we're going to do anything much better in the medium term, so I'm going to close this for now.

jeromekelleher mentioned this issue Jun 18, 2020

Find edge for a given mutation #685

Closed

jeromekelleher added the C API (Issue is about the C API), enhancement (New feature or request), and Python API (Issue is about the Python API) labels Sep 29, 2020

jeromekelleher changed the title ~~Efficient random access for trees~~ Index required for efficient random access for trees Sep 29, 2020

jeromekelleher mentioned this issue Nov 21, 2020

Random split polytomy #815

Merged

benjeffery mentioned this issue Jun 21, 2021

Port the tables and trees sections of the tskit tut tskit-dev/tutorials#75

Merged

hyanwong added a commit to hyanwong/tutorials that referenced this issue Jun 21, 2021

Add in a callout to add uses for random tree access

52275d1

Issue tskit-dev/tskit#684

hyanwong added a commit to hyanwong/tutorials that referenced this issue Jun 21, 2021

Add in a callout to add uses for random tree access

ab86d34

Issue tskit-dev/tskit#684

jeromekelleher mentioned this issue Dec 13, 2022

Efficient seeking to trees. #2661

Merged

jeromekelleher closed this as completed Apr 14, 2023

kitchensjn mentioned this issue Nov 11, 2024

Add a method for sparsely sampling trees from a tskit.treeSequence kitchensjn/terracotta#2

Open
