
Add RTree Index (+ ST_AsSVG and simple ST_GenerateRandomPoints) #383

Merged (42 commits) on Sep 5, 2024

Conversation

@Maxxen (Member) commented on Aug 30, 2024

This PR adds the long-awaited support for spatial indexes by implementing an R-Tree-based extension index type for DuckDB.

More info to come but here's a quick rundown.

What is an RTree?

RTree indexes store the bounding boxes of geometries in a hierarchical tree structure in which every node's bounding box contains the bounding boxes of all of its child nodes. This makes it fast to find the geometries that intersect a given bounding box, since entire subtrees of candidates can be pruned while recursively descending the tree.
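The pruning idea can be sketched in a few lines of Python. This is a simplified illustration only, not the PR's C++ implementation; the node layout and names are made up:

```python
# Minimal R-Tree-style search sketch: each internal node stores a bounding
# box covering all of its children, so any subtree whose box does not
# intersect the query box can be skipped entirely.

def intersects(a, b):
    # Boxes are (min_x, min_y, max_x, max_y)
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, query, results):
    # node is {"box": ..., "children": [...]} or a leaf {"box": ..., "row_id": ...}
    if not intersects(node["box"], query):
        return  # prune the entire subtree
    if "row_id" in node:
        results.append(node["row_id"])
    else:
        for child in node["children"]:
            search(child, query, results)

leaf = lambda rid, box: {"row_id": rid, "box": box}
tree = {
    "box": (0, 0, 10, 10),
    "children": [
        {"box": (0, 0, 4, 4), "children": [leaf(1, (1, 1, 2, 2)), leaf(2, (3, 3, 4, 4))]},
        {"box": (6, 6, 10, 10), "children": [leaf(3, (7, 7, 8, 8))]},
    ],
}
hits = []
search(tree, (0, 0, 5, 5), hits)
print(hits)  # row ids whose boxes intersect the query box -> [1, 2]
```

Note how the subtree rooted at (6, 6, 10, 10) is never descended into: its box does not intersect the query box, so its leaves are rejected without being examined.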

What can it do?

The RTree index implemented in this PR is:

  • persistent: When creating a disk-backed database, the index is persisted as part of the database file.
  • buffer managed: All memory used by the index is tracked by DuckDB's memory management system. While it currently won't "unload" blocks from memory (a limitation shared with DuckDB's ART-index), persisted nodes will be loaded from disk lazily, and the index memory usage will respect DuckDB's configurable memory limit (if set).
  • mutable: Both inserts and deletes on the base table are supported after the index is created, and the index will update and rebalance itself accordingly.

That said, creating the index on an already populated table will be much faster and produce a "better looking tree", i.e. a more efficient index, as the bulk-loading algorithm is better at minimizing overlap between internal nodes, which makes searching the tree faster. We don't have any vacuuming implemented yet, so you may want to periodically drop and re-create the index if you notice that index query performance is starting to degrade after a series of large table-spanning updates.

For now, an RTree index will ONLY be used to perform an "index scan" and speed up simple queries in which a set of columns is selected from a table with a WHERE clause containing one of 10 binary spatial predicate functions. The predicate must filter on the indexed geometry column and a geometry value that is constant, i.e. one that can be fully evaluated before the query is executed. The 10 binary predicates are:

  • ST_Equals
  • ST_Intersects
  • ST_Touches
  • ST_Crosses
  • ST_Within
  • ST_Contains
  • ST_Overlaps
  • ST_Covers
  • ST_CoveredBy
  • ST_ContainsProperly

In the future, the plan is to add other optimizations, such as performing index scans for ST_Distance/ST_DWithin and speeding up index joins.

How do I use it?

-- Create a table with 10_000_000 random points
CREATE TABLE t1 AS SELECT point::GEOMETRY as geom
FROM st_generatepoints({min_x: 0, min_y: 0, max_x: 10000, max_y: 10000}::BOX_2D, 10_000_000, 1337);

-- Create an index on the table.
CREATE INDEX my_idx ON t1 USING RTREE (geom);

-- Perform a query with a "spatial predicate" on the indexed geometry column
-- Note how the second argument, the ST_MakeEnvelope call, is in this case a "constant"
SELECT count(*) FROM t1 WHERE ST_Within(geom, ST_MakeEnvelope(450, 450, 650, 650));
----
3986


-- We can check for ourselves (e.g. with EXPLAIN) that an RTree index scan is used:
┌───────────────────────────┐
│    UNGROUPED_AGGREGATE    │
│    ────────────────────   │
│        Aggregates:        │
│        count_star()       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│ ST_Within(geom, '...')    │ 
│                           │
│       ~2000000 Rows       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│     RTREE_INDEX_SCAN      │
│    ────────────────────   │
│   t1 (RTREE INDEX SCAN :  │
│           my_idx)         │
│                           │
│     Projections: geom     │
│                           │
│       ~10000000 Rows      │
└───────────────────────────┘
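Conceptually, the plan above is a two-phase filter: the index scan produces candidate rows whose bounding boxes intersect the query geometry's bounding box, and the FILTER node then re-checks the exact predicate on those candidates. A rough Python sketch of that flow, using axis-aligned rectangles so the "exact" check stays trivial (names and data are illustrative):

```python
# Two-phase spatial filtering sketch: a cheap bounding-box prefilter
# ("index scan") followed by an exact predicate check ("FILTER") on the
# surviving candidates. Geometries here are plain rectangles.

def bbox_intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def within(geom, env):
    # Exact ST_Within for rectangles: geom lies fully inside env.
    return env[0] <= geom[0] and env[1] <= geom[1] and geom[2] <= env[2] and geom[3] <= env[3]

rows = {  # row_id -> geometry bounding box (min_x, min_y, max_x, max_y)
    1: (1, 1, 2, 2),
    2: (4, 4, 7, 7),   # straddles the envelope: passes the prefilter, fails the exact check
    3: (8, 8, 9, 9),   # pruned by the prefilter
}
envelope = (0, 0, 5, 5)

candidates = [rid for rid, box in rows.items() if bbox_intersects(box, envelope)]  # "index scan"
exact = [rid for rid in candidates if within(rows[rid], envelope)]                 # "FILTER"
print(candidates, exact)  # [1, 2] [1]
```

This also explains the row estimates in the plan: the index scan narrows the ~10M rows down to a candidate set, and the exact predicate trims the remaining false positives.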

You can also pass options when creating the RTree index to configure the maximum and minimum capacity for each node within the tree. E.g.

CREATE INDEX my_idx ON t1 USING RTREE (geom) WITH (max_node_capacity = 128, min_node_capacity = 64);

The defaults are min_node_capacity = 50 and max_node_capacity = 128.

Microbenchmarks

On my machine, running the above query as part of a micro-benchmark with and without an index produces the following results:

Without index

name    run     timing
benchmark/rtree_points_noindex.benchmark        1       2.272783
benchmark/rtree_points_noindex.benchmark        2       2.261915
benchmark/rtree_points_noindex.benchmark        3       2.258901
benchmark/rtree_points_noindex.benchmark        4       2.262764
benchmark/rtree_points_noindex.benchmark        5       2.263000

With index

name    run     timing
benchmark/rtree_points_index.benchmark  1       0.050214
benchmark/rtree_points_index.benchmark  2       0.052856
benchmark/rtree_points_index.benchmark  3       0.051340
benchmark/rtree_points_index.benchmark  4       0.052707
benchmark/rtree_points_index.benchmark  5       0.050918

Creating the index itself (after loading the table) takes roughly 1.2 s.
Creating the index before loading the table takes roughly 5.3 s.

Running a similar query, but against a 158 MB trimmed-down Parquet file of building footprints extracted from Overture:

CREATE TABLE t1 AS SELECT ST_GeomFromWKB(wkb) AS geom, id
FROM read_parquet('test/data/nyc_taxi/overture_nyc_buildings.parquet');
CREATE INDEX my_idx ON t1 USING RTREE (geom);

run
SELECT count(*) FROM t1 WHERE ST_Within(geom, ST_MakeEnvelope(-74.004936,40.725275,-73.982620,40.745046));

Without index:

benchmark/rtree_noindex.benchmark       1       0.435056
benchmark/rtree_noindex.benchmark       2       0.432082
benchmark/rtree_noindex.benchmark       3       0.434190
benchmark/rtree_noindex.benchmark       4       0.434224
benchmark/rtree_noindex.benchmark       5       0.431633

With index:

benchmark/rtree_index.benchmark 1       0.003337
benchmark/rtree_index.benchmark 2       0.003444
benchmark/rtree_index.benchmark 3       0.003385
benchmark/rtree_index.benchmark 4       0.003798
benchmark/rtree_index.benchmark 5       0.003850

Technical details

Even though the R*-tree is usually regarded as the best-in-class variant, it is also significantly harder to implement properly, particularly when it comes to rebalancing. This implementation is closer to an R+Tree in that it only stores row ids in the leaf nodes, and it rebalances itself during updates and deletes by propagating splits up the tree and naively re-inserting the entries of under-full nodes from the root.

The splitting algorithm, however, is based on the Corner-Based Splitting (CBS)1 algorithm, which in my testing produces very good-looking trees considering how fast it is and how easy it is to implement. The resulting nodes look slightly better than those produced by the quadratic/linear split methods, but perhaps slightly worse than R*-splitting.

Bulk loads are performed using the Sort-Tile-Recursive (STR)2 technique. All memory is tracked and handled by the buffer manager, and DuckDB's existing sorting operator is used as a pre-processing step, so larger-than-memory bulk loads should work without too much trouble. Only a single larger buffer has to be kept in memory at all times during the bulk-load process, roughly sqrt(ceil(total_number_of_leaf_entries / node_capacity)) * node_capacity * 40 bytes in size. This buffer is also tracked by DuckDB, so it counts towards any configured memory limit.
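A toy version of STR packing, together with the buffer-size estimate above, might look like the following. This is a simplified, fully in-memory sketch under my own reading of the STR paper; the actual implementation streams through DuckDB's sorter, and the function names here are made up:

```python
# Sort-Tile-Recursive (STR) packing sketch: sort entries by x, cut them into
# ~sqrt(P) vertical slices (P = number of leaf nodes needed), then sort each
# slice by y and cut it into nodes of `capacity` entries.
import math

def str_pack(points, capacity):
    n = len(points)
    num_nodes = math.ceil(n / capacity)           # P: leaf nodes needed
    num_slices = math.ceil(math.sqrt(num_nodes))  # S: vertical slices
    slice_size = num_slices * capacity            # entries per slice
    pts = sorted(points)                          # sort by x (then y)
    nodes = []
    for i in range(0, n, slice_size):
        sl = sorted(pts[i:i + slice_size], key=lambda p: p[1])  # sort slice by y
        for j in range(0, len(sl), capacity):
            nodes.append(sl[j:j + capacity])
    return nodes

def bulk_load_buffer_bytes(num_entries, capacity, entry_bytes=40):
    # The single large buffer mentioned above, roughly:
    # sqrt(ceil(entries / capacity)) * capacity * 40 bytes
    return math.ceil(math.sqrt(math.ceil(num_entries / capacity))) * capacity * entry_bytes

pts = [(x, y) for x in range(10) for y in range(10)]  # 100 points on a grid
nodes = str_pack(pts, capacity=10)
print(len(nodes))                               # 10 leaf nodes of 10 entries each
print(bulk_load_buffer_bytes(10_000_000, 128))  # rough buffer estimate in bytes
```

For the 10M-point example above with the default max_node_capacity of 128, the estimate works out to a buffer on the order of a megabyte, which is why even large bulk loads stay comfortably within typical memory limits.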

Additionally, row-ids are stored in sorted order in the leaf nodes, making updates and deletes much faster.
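Keeping each leaf's row-ids sorted means a delete can locate its entry with a binary search instead of a linear scan. In Python terms (an illustration of the idea, not the PR's code):

```python
# With sorted row-ids, membership tests and deletes within a leaf are
# O(log n) binary searches (via bisect) rather than O(n) scans.
import bisect

leaf_row_ids = [3, 8, 15, 42, 99]  # kept sorted at all times

def delete_row(row_ids, rid):
    i = bisect.bisect_left(row_ids, rid)
    if i < len(row_ids) and row_ids[i] == rid:
        row_ids.pop(i)
        return True
    return False

print(delete_row(leaf_row_ids, 42))  # True
print(leaf_row_ids)                  # [3, 8, 15, 99]
```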

Future work

  • More rewrite-rules that make use of the index, such as index-joins and filter pushdown for more advanced filters, such as multiple intersection queries and distance filters.

  • Configurable leaf node capacity that can differ from the branch node capacity.

  • Optional Hilbert-sort bulk loading algorithm (if the user supplies a boundary).

Closes #7


This PR also adds ST_AsSVG, and an ST_GenerateRandomPoints table function that only works with rectangular inputs for now. It also adds new overloads for ST_Hilbert.

Footnotes

  1. Sleit, Azzam & Al-Nsour, Esam. (2014). Corner-based splitting: An improved node splitting algorithm for R-tree. Journal of Information Science. 40. 222-236. 10.1177/0165551513516709.

  2. S. T. Leutenegger, M. A. Lopez and J. Edgington, "STR: a simple and efficient algorithm for R-tree packing," Proceedings 13th International Conference on Data Engineering, Birmingham, UK, 1997, pp. 497-506, doi: 10.1109/ICDE.1997.582015.

@Maxxen merged commit ceaf512 into duckdb:main on Sep 5, 2024 (22 checks passed)
@aborruso (Contributor) commented on Sep 5, 2024

Thank you very much

@raphaellaude commented

This rocks - thank you!!!

Linked issue closed by this PR: Feature: Add Spatial Index support