Add RTree Index (+ ST_AsSVG and simple ST_GenerateRandomPoints)
#383
Merged
…ng. Also add reldebug to makefile
…e future if we want to
Thank you very much
This rocks - thank you!!!
This PR adds the long-awaited support for spatial indexes by implementing an R-Tree-based extension index type for DuckDB.
More info to come but here's a quick rundown.
What is an RTree?
RTree indexes store the bounding boxes of geometries in a hierarchical tree data structure, where every node's bounding box also contains the bounding boxes of all its child nodes. This makes it fast to check which geometries intersect a given bounding box, as you can quickly prune out a lot of candidates while recursively moving down the tree.
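As a rough illustration of the pruning idea described above (not the PR's actual implementation), here is a minimal in-memory sketch. The `Node`, `leaf`, `branch`, and `search` names are hypothetical, and boxes are assumed to be `(min_x, min_y, max_x, max_y)` tuples.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def intersects(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def union(boxes: List[Box]) -> Box:
    """Smallest box that covers every box in the list."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


@dataclass
class Node:
    box: Box
    children: List["Node"] = field(default_factory=list)  # empty for leaf entries
    row_id: int = -1  # only meaningful for leaf entries


def leaf(row_id: int, box: Box) -> Node:
    return Node(box=box, row_id=row_id)


def branch(children: List[Node]) -> Node:
    # A parent's box always contains the boxes of all of its children.
    return Node(box=union([c.box for c in children]), children=children)


def search(node: Node, query: Box, visited: List[Box]) -> List[int]:
    """Collect row ids whose boxes intersect the query, skipping disjoint subtrees."""
    visited.append(node.box)
    if not intersects(node.box, query):
        return []  # nothing below this node can match, so prune the whole subtree
    if not node.children:
        return [node.row_id]
    hits: List[int] = []
    for child in node.children:
        hits.extend(search(child, query, visited))
    return hits


west = branch([leaf(0, (0, 0, 1, 1)), leaf(1, (2, 2, 3, 3))])
east = branch([leaf(2, (10, 10, 11, 11)), leaf(3, (12, 12, 13, 13))])
root = branch([west, east])

visited: List[Box] = []
result = search(root, (0.5, 0.5, 2.5, 2.5), visited)
```

In this toy tree the `east` subtree is rejected after a single box comparison, so its two leaf entries are never inspected.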
What can it do?
The RTree index implemented in this PR is:
That said, creating the index on top of an already populated table will be much faster and produce a "better looking tree", i.e. a more efficient index, because the bulk-loading algorithm is better at minimizing overlap between internal nodes, which makes searching the tree faster. We don't have any vacuuming implemented yet, so you may want to periodically drop and re-create the index if you notice that index query performance starts to degrade after a series of large table-spanning updates.
For now, an RTree index will ONLY be used to perform an "index scan" and speed up simple queries where a set of columns are selected from a table with a WHERE clause containing one of 10 binary spatial predicate functions filtering on the indexed geometry column in the table and a geometry value that is constant, i.e. able to be fully evaluated before the query is executed. The 10 binary predicates are:
ST_Equals
ST_Intersects
ST_Touches
ST_Crosses
ST_Within
ST_Contains
ST_Overlaps
ST_Covers
ST_CoveredBy
ST_ContainsProperly
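All ten predicates above can only be true when the two geometries share at least one point, so a bounding-box intersection test is a safe first filter for each of them. The sketch below shows the general filter-and-refine pattern under that assumption; it is not a description of how this PR evaluates predicates. Circles stand in for geometries, a linear scan stands in for the tree lookup, and all function names are hypothetical.

```python
import math
from typing import List, Tuple

Circle = Tuple[float, float, float]  # (center_x, center_y, radius), a stand-in geometry
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def bbox(c: Circle) -> Box:
    x, y, r = c
    return (x - r, y - r, x + r, y + r)


def boxes_intersect(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def circles_intersect(a: Circle, b: Circle) -> bool:
    """Exact predicate: the two discs share at least one point."""
    return math.hypot(a[0] - b[0], a[1] - b[1]) <= a[2] + b[2]


def filter_and_refine(table: List[Circle], constant: Circle):
    # The query box is computed once because the geometry argument is constant.
    query_box = bbox(constant)
    # Filter: cheap box test (an index would answer this without a full scan).
    candidates = [i for i, g in enumerate(table) if boxes_intersect(bbox(g), query_box)]
    # Refine: run the exact predicate only on the surviving candidates.
    matches = [i for i in candidates if circles_intersect(table[i], constant)]
    return candidates, matches


table = [
    (1.0, 0.0, 0.5),    # truly intersects the query circle
    (1.8, 1.8, 1.0),    # boxes overlap at a corner, but the circles do not touch
    (10.0, 10.0, 1.0),  # far away, rejected by the box test alone
]
candidates, matches = filter_and_refine(table, (0.0, 0.0, 1.0))
```

Row 1 is a false positive of the box test that the exact predicate then removes, which is why the filter step alone is not enough to answer the query.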
In the future, the plan is to add other optimizations, such as performing index scans for ST_Distance/ST_DWithin and for speeding up index joins.
How do I use it?
You can also pass options when creating the RTree index to configure the maximum and minimum capacity for each node within the tree. E.g.
The defaults are min_node_capacity = 50, max_node_capacity = 128
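To get a feel for what the node capacity controls, the following back-of-the-envelope sketch estimates how many nodes each tree level would need. It assumes every node is filled to capacity, which is a best case and not something the PR guarantees, and `nodes_per_level` is a hypothetical helper rather than part of the extension.

```python
from typing import List


def nodes_per_level(num_entries: int, node_capacity: int) -> List[int]:
    """Node counts from the leaf level up to the root, assuming full nodes."""
    if num_entries < 1 or node_capacity < 2:
        raise ValueError("need at least one entry and a capacity of two or more")
    levels: List[int] = []
    count = num_entries
    while True:
        count = -(-count // node_capacity)  # integer ceiling division
        levels.append(count)
        if count == 1:
            return levels


default_shape = nodes_per_level(10_000_000, 128)
small_node_shape = nodes_per_level(10_000_000, 16)
```

Under these assumptions, ten million entries fit in four levels with a capacity of 128 and need six levels with a capacity of 16, so larger nodes give a shallower tree at the cost of more entries to scan per visited node.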
Microbenchmarks
On my machine, running the above query as part of a micro-benchmark with and without an index produces the following results:
Without index
With index
Creating the index itself (after loading the table) takes about 1.2s.
Creating the index before loading the table takes about 5.3s.
Running a similar query, but with a 158 MB trimmed-down Parquet file of building footprints extracted from Overture:
Without index:
With index:
Technical details
Even though the R*-tree is usually regarded as the best-in-class variant, it is also significantly harder to implement properly, particularly when dealing with rebalancing. This implementation is more like an R+Tree in that it only stores row ids in the leaf nodes. It rebalances itself during updates and deletes by propagating splits up the tree and naively re-inserting the entries of under-full nodes from the root.
The splitting algorithm, however, is based on the Corner Based Splitting (CBS) [1] algorithm, which from my testing seems to produce very good-looking trees considering how fast it is and how easy it is to implement. The resulting nodes look slightly better than those from the quadratic/linear split methods, but maybe slightly worse than those from R*-splitting.
Bulk loads are performed using the Sort Tile Recursive (STR) [2] technique. All memory is tracked and handled by the buffer manager, and the bulk load uses DuckDB's existing sorting operator as a pre-processing step, so it should be able to handle larger-than-memory bulk loads without too much trouble. Only a single larger buffer has to be kept in memory at all times during the bulk-load process, which is going to be roughly sqrt(ceil(total_number_of_leaf_entries / node_capacity)) * node_capacity * 40 bytes. However, this is also tracked by DuckDB, so it will count towards any set memory limit.
Additionally, row-ids are stored in sorted order in the leaf nodes, making updates and deletes much faster.
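The bulk-load description above can be sketched as follows. This is a simplified, fully in-memory version of STR leaf packing as described in the cited paper, not the PR's buffer-managed implementation, together with the buffer estimate from the formula above. The function names and the `(row_id, box)` entry layout are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)
Entry = Tuple[int, Box]  # (row_id, bounding box)


def center(box: Box) -> Tuple[float, float]:
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def str_pack_leaves(entries: List[Entry], node_capacity: int) -> List[List[Entry]]:
    """Group entries into leaves: sort by x, cut vertical slices, sort each slice by y."""
    if not entries:
        return []
    num_leaves = math.ceil(len(entries) / node_capacity)
    num_slices = math.ceil(math.sqrt(num_leaves))
    slice_size = num_slices * node_capacity  # entries per vertical slice
    by_x = sorted(entries, key=lambda e: center(e[1])[0])
    leaves: List[List[Entry]] = []
    for i in range(0, len(by_x), slice_size):
        vertical_slice = sorted(by_x[i:i + slice_size], key=lambda e: center(e[1])[1])
        for j in range(0, len(vertical_slice), node_capacity):
            leaves.append(vertical_slice[j:j + node_capacity])
    return leaves


def bulk_load_buffer_bytes(total_leaf_entries: int, node_capacity: int) -> float:
    """Rough size of the one resident buffer, following the formula in the text."""
    return math.sqrt(math.ceil(total_leaf_entries / node_capacity)) * node_capacity * 40


# A 4x4 grid of points, stored as degenerate boxes, packed with 4 entries per leaf.
grid = [(x * 4 + y, (float(x), float(y), float(x), float(y)))
        for x in range(4) for y in range(4)]
leaves = str_pack_leaves(grid, node_capacity=4)

estimate = bulk_load_buffer_bytes(10_000_000, 128)
```

On the grid, each leaf ends up covering one compact 2x2 block of points, which is the low-overlap layout that bulk loading is meant to produce. With the default capacity of 128 and ten million entries, the formula gives an estimate of roughly 1.4 MB. One way to read the formula is as a single vertical slice of entries at about 40 bytes each, though that interpretation is an inference and not something the PR states.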
Future work
More rewrite-rules that make use of the index, such as index-joins and filter pushdown for more advanced filters, such as multiple intersection queries and distance filters.
Configurable leaf node capacity that can differ from the branch node capacity.
Optional Hilbert-sort bulk loading algorithm (if the user supplies a boundary).
Closes #7
This PR also adds ST_AsSVG, and a ST_GenerateRandomPoints table function that only works with rectangle inputs for now. It also adds new overloads for ST_Hilbert.
Footnotes
[1] Sleit, Azzam & Al-Nsour, Esam. (2014). Corner-based splitting: An improved node splitting algorithm for R-tree. Journal of Information Science. 40. 222-236. 10.1177/0165551513516709.
[2] S. T. Leutenegger, M. A. Lopez and J. Edgington, "STR: a simple and efficient algorithm for R-tree packing," Proceedings 13th International Conference on Data Engineering, Birmingham, UK, 1997, pp. 497-506, doi: 10.1109/ICDE.1997.582015.