Ship function documentation, improve remote st_read calls #291

Merged
merged 9 commits into from
Mar 23, 2024
350 changes: 15 additions & 335 deletions README.md

Large diffs are not rendered by default.

1,901 changes: 0 additions & 1,901 deletions docs/docs.md

This file was deleted.

172 changes: 172 additions & 0 deletions docs/example.md

Large diffs are not rendered by default.

2,225 changes: 2,225 additions & 0 deletions docs/functions.md

Large diffs are not rendered by default.

51 changes: 28 additions & 23 deletions docs/src/functions/table/st_drivers.md → docs/internals.md
@@ -1,29 +1,26 @@
---
{
"type": "table_function",
"title": "ST_Drivers",
"id": "st_drivers",
"signatures": [
{
"parameters": []
}
],
"summary": "Returns the list of supported GDAL drivers",
"tags": []
}
---
# Spatial Internals

### Description
## Multi-tiered Geometry Type System
This extension implements 5 different geometry types. Like almost all geospatial databases, we include a `GEOMETRY` type that (at least) strives to follow the Simple Features geometry model. This includes support for the standard subtypes that we all know and love, such as `POINT`, `LINESTRING`, `POLYGON`, `MULTIPOINT`, `MULTILINESTRING`, `MULTIPOLYGON`, and `GEOMETRYCOLLECTION`, internally represented in a row-wise fashion on top of DuckDB `BLOB`s. The internal binary format is very similar to the one used by PostGIS (basically `double`-aligned WKB), and we may eventually look into making the format properly compatible with PostGIS, which may be useful for the PostGIS scanner extension. Most functions implemented for this type use the [GEOS library](https://github.com/libgeos/geos), a battle-tested C++ port of the famous `JTS` library, to perform the actual operations on the geometries.

Returns the list of supported GDAL drivers and file formats for [ST_Read](##st_read).
While a flexible and dynamic `GEOMETRY` type is great to have, it is comparatively rare to work with columns containing mixed geometries after the initial import and cleanup step. In fact, in most OLAP use cases you will probably only have a single geometry type in a table, and in those cases you're paying the performance cost to de/serialize and branch on the internal geometry format unnecessarily, i.e. you're paying for flexibility you're not using. For those cases we implement a set of non-standard DuckDB "native" geometry types: `POINT_2D`, `LINESTRING_2D`, `POLYGON_2D`, and `BOX_2D`. These types are built on DuckDB's `STRUCT` and `LIST` types, and are stored in a columnar fashion with the coordinate dimensions stored in separate "vectors". This makes it possible to leverage DuckDB's per-column statistics, compress much more efficiently, and perform spatial operations on these geometries without having to de/serialize them first. Storing the coordinate dimensions in separate vectors also allows casting and converting between geometries with different numbers of dimensions basically for free. And if you truly need to mix a couple of different geometry types, you can always use a DuckDB [UNION type](https://duckdb.org/docs/sql/data_types/union).

Note that far from all of these drivers have been tested properly, and some may require additional options to be passed to work as expected. If you run into any issues please first [consult the GDAL docs](https://gdal.org/drivers/vector/index.html).
For now only a small number of spatial functions are overloaded for these native types, but since they can be implicitly cast to `GEOMETRY`, you can use any of the functions implemented for `GEOMETRY` on them in the meantime (although with a de/serialization penalty) while we work on adding more.

### Examples
This extension also includes a `WKB_BLOB` type as an alias for `BLOB` that is used to indicate that the blob contains valid WKB encoded geometry.
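
As a rough illustration of how these types relate, the following sketch constructs the same point as `GEOMETRY`, as `POINT_2D`, and as `WKB_BLOB`, and inspects the resulting types. It assumes the extension is installed under the name `spatial` and that `ST_Point`, `ST_Point2D`, `ST_AsWKB` and `ST_AsText` are available in your build; treat it as an example rather than a reference.

```sql
INSTALL spatial;
LOAD spatial;

-- The general-purpose, row-wise GEOMETRY type
SELECT ST_Point(1, 2) AS geom;

-- Inspect the type of each representation of the same point
SELECT
    typeof(ST_Point(1, 2))           AS geometry_type,
    typeof(ST_Point(1, 2)::POINT_2D) AS native_type,
    typeof(ST_AsWKB(ST_Point(1, 2))) AS wkb_type;

-- A native POINT_2D passed to a spatial function; where no native
-- overload exists, the value is implicitly cast to GEOMETRY first
SELECT ST_AsText(ST_Point2D(1, 2)) AS wkt;
```

If a column only ever holds points, storing it as `POINT_2D` and casting to `GEOMETRY` only where a function requires it keeps the columnar benefits described above.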

```sql
SELECT * FROM ST_Drivers();
```
## Per-thread Arena Allocation for Geometry Objects
When materializing `GEOMETRY` objects from the internal binary format we use per-thread arena allocation backed by DuckDB's buffer manager to amortize the contention and performance cost of performing lots of small heap allocations and frees, which allows us to fully utilize DuckDB's multi-threaded, vectorized, out-of-core execution. While most spatial functions are implemented by wrapping `GEOS`, which requires an extra copy/allocation step anyway, the plan is to incrementally implement our own versions of the simpler functions that can operate directly on our own `GEOMETRY` representation in order to greatly accelerate geospatial processing.

## Embedded PROJ Database
[PROJ](https://proj.org/#) is a generic coordinate transformation library that transforms geospatial coordinates from one projected coordinate reference system (CRS) to another. This extension experiments with including an embedded version of the PROJ database inside the extension binary itself so that you don't have to worry about installing the PROJ library separately. This also opens up the possibility to use this functionality in WASM.
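
As a small sketch of what the embedded database enables, a coordinate transformation can look up both CRS definitions by their authority codes without any external PROJ installation. The coordinates below are arbitrary example values, and you should check the axis order expected by each CRS before relying on the result.

```sql
LOAD spatial;

-- Transform a point from EPSG:4326 to EPSG:3857.
-- Both CRS definitions are resolved from the embedded PROJ database.
SELECT ST_AsText(
    ST_Transform(ST_Point(52.37, 4.90), 'EPSG:4326', 'EPSG:3857')
) AS transformed;
```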

## Embedded GDAL based Input/Output Functions
[GDAL](https://github.com/OSGeo/gdal) is a translator library for raster and vector geospatial data formats. This extension includes and exposes a subset of the GDAL vector drivers through the `ST_Read` table function and the `COPY ... TO ... WITH (FORMAT GDAL)` copy function, which read and write geometry data from and to a variety of file formats as if they were DuckDB tables. We currently support over 50 GDAL formats. Check for yourself by running
<details>
<summary>
SELECT * FROM st_drivers();
</summary>

| short_name | long_name | can_create | can_copy | can_open | help_url |
|----------------|------------------------------------------------------|------------|----------|----------|----------------------------------------------------|
@@ -75,8 +72,16 @@ SELECT * FROM ST_Drivers();
| MVT | Mapbox Vector Tiles | true | false | true | https://gdal.org/drivers/vector/mvt.html |
| NGW | NextGIS Web | true | true | true | https://gdal.org/drivers/vector/ngw.html |
| MapML | MapML | true | false | true | https://gdal.org/drivers/vector/mapml.html |
| PMTiles | ProtoMap Tiles | true | false | true | https://gdal.org/drivers/vector/pmtiles.html |
| JSONFG | OGC Features and Geometries JSON | true | false | true | https://gdal.org/drivers/vector/jsonfg.html |
| TIGER | U.S. Census TIGER/Line | false | false | true | https://gdal.org/drivers/vector/tiger.html |
| AVCBin | Arc/Info Binary Coverage | false | false | true | https://gdal.org/drivers/vector/avcbin.html |
| AVCE00 | Arc/Info E00 (ASCII) Coverage | false | false | true | https://gdal.org/drivers/vector/avce00.html |

</details>

Note that far from all of these formats have been tested properly. If you run into any issues, please first [consult the GDAL docs](https://gdal.org/drivers/vector/index.html), or open an issue here on GitHub.
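
A minimal round trip through the GDAL-based functions might look like the sketch below. The file names `parcels.shp` and `parcels.geojson` and the table name `parcels` are hypothetical, and the `DRIVER` value must match a `short_name` reported by `ST_Drivers()` with `can_create` set.

```sql
LOAD spatial;

-- Read a vector file as if it were a table (hypothetical input file)
CREATE TABLE parcels AS
SELECT * FROM ST_Read('parcels.shp');

-- Write the table back out through a GDAL driver (hypothetical output file)
COPY parcels TO 'parcels.geojson' WITH (FORMAT GDAL, DRIVER 'GeoJSON');

-- List only the drivers that can be used for writing
SELECT short_name, long_name
FROM ST_Drivers()
WHERE can_create;
```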


`ST_Read` also has limited support for predicate pushdown and spatial filtering (if the underlying GDAL driver supports it), but column pruning (projection pushdown), while technically feasible, is not yet implemented.
`ST_Read` also allows using GDAL's virtual filesystem abstractions to read data from remote sources such as S3, or from compressed archives such as zip files.
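
The following sketch shows how these options might be combined. All paths and the bucket name are hypothetical, the `/vsizip/` and `/vsis3/` prefixes are GDAL's virtual filesystem conventions, and the `spatial_filter_box` parameter with its `BOX_2D` field names is an assumption about the `ST_Read` signature that you should verify against the function documentation for your version. Remote access through `/vsis3/` typically also requires credentials to be configured for GDAL.

```sql
LOAD spatial;

-- Read a shapefile directly out of a zip archive (hypothetical path)
SELECT * FROM ST_Read('/vsizip/data/archive.zip/roads.shp');

-- Read a file from S3 through GDAL's virtual filesystem (hypothetical bucket)
SELECT * FROM ST_Read('/vsis3/example-bucket/roads.geojson');

-- Restrict the scan to a bounding box, if the driver supports spatial filtering
-- (parameter name and struct fields are assumptions, see the note above)
SELECT *
FROM ST_Read(
    'roads.gpkg',
    spatial_filter_box = {min_x: 0, min_y: 0, max_x: 10, max_y: 10}::BOX_2D
);
```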

**Note**: This functionality does not make full use of parallelism because GDAL is not thread-safe, so you should expect it to be slower than using e.g. the DuckDB Parquet extension to read the same GeoParquet file, or DuckDB's native CSV reader to read CSV files. Once we implement support for reading more vector formats natively through this extension (e.g. GeoJSON, GeoBuf, ShapeFile) we will probably split this entire GDAL part into a separate extension.
80 changes: 0 additions & 80 deletions docs/src/README.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/src/aggregate_functions.json

This file was deleted.

31 changes: 0 additions & 31 deletions docs/src/functions/aggregate/st_envelope_agg.md

This file was deleted.

31 changes: 0 additions & 31 deletions docs/src/functions/aggregate/st_intersection_agg.md

This file was deleted.

31 changes: 0 additions & 31 deletions docs/src/functions/aggregate/st_union_agg.md

This file was deleted.

73 changes: 0 additions & 73 deletions docs/src/functions/scalar/st_area.md

This file was deleted.

44 changes: 0 additions & 44 deletions docs/src/functions/scalar/st_area_spheroid.md

This file was deleted.
