Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parameterize types with CRS? #441

Open
paleolimbot opened this issue Oct 27, 2024 · 3 comments
Open

Parameterize types with CRS? #441

paleolimbot opened this issue Oct 27, 2024 · 3 comments

Comments

@paleolimbot
Copy link

Apologies if I've already opened this! I have a vague recollection of discussing this at some point and having the answer be that it wasn't possible to register a parameterized type from an extension (at least then, which was probably quite a while ago). I'm having a go at porting s2geography as a DuckDB extension and ran across:

https://github.com/duckdb/duckdb/blob/e791508e9bc2eb84bc87eb794074f4893093b743/test/extension/loadable_extension_demo.cpp#L255-L261

...which seems to be a parameterized "bounded" type registered from an extension. So maybe it's now possible?

(Also loads of precedent for item-level CRS specification with PostGIS and alikes, but also quite a bit for type-level CRS with GeoParquet/GeoPandas/GeoArrow/R+sf!)

@Maxxen
Copy link
Member

Maxxen commented Oct 27, 2024

Great find haha! Yes and no.. I added basic support for parameterizing types in order to support specifying the CRS for the GEOMETRY in the spatial extension, but im not sure the current implementation is ready to be used for that just yet. The casting semantics between parameterized/non-parameterized types is hard-coded and its not very elegant to propagate type mods across function calls (you have to implement a custom bind hook and inspect the argument types and set the output type manually). Additionally you can only pass numbers or strings (not identifiers) as type mods, their position matters (e.g. A(B,C) is not equal to A(C,B)) and they cannot be "named".

So feel free to give it a shot, but just know this is very new code and it's not really used anywhere else yet and we may make more changes to it in the future.

As a side note, why parameterized the GEOGRAPHY type? I thought s2 only handled pure sphere calculations... I guess for the metadata?

Another side note, we'll probably adda a GEOGRAPHY type to spatial at some point in the future as well, would be cool if the extensions could interoperate nicely!

@Maxxen
Copy link
Member

Maxxen commented Oct 27, 2024

Now, I think the bigger problem with tagging the type with the CRS is, what do you actually use as the modifier? In PostGIS there is a SPATIAL_REF_SYS table that maps integer SRID's to the supported projections. It just so happens that e.g. 4326 maps to EPSG:4326, but that's not necessarily guaranteed. DuckDB spatial embeds the default proj database and will thus recognize a lot of different identifiers for many of the common projections but it doesn't have a SPATIAL_REF_SYS table where it can map SRIDs to.

And honestly, Im not sure it makes a lot of sense to ship a SPATIAL_REF_SYS table for spatial either given how people usually use duckdb for a lot of quick ad-hoc analysis spanning multiple databases and external formats, compared to PostGIS where id expect most workloads to run within a single Postgres instance. Spatialite (for sqlite) has a InitSpatialMetaData() pragma that will populate the currently attached database with default SRID mappings, but that seems like pretty bad UX to me.

  • It's another "required" command you need to run before you can get started using spatial, and you'd pretty much always need to run it immediately when running spatial in in-memory mode.
  • I don't think users like that their databases gets polluted with system tables that they're not supposed to mess with
  • A lot of duckdb users do a lot of manual partitioning of their datasets into multiple duckdb databases both for structural and performance reasons, they'd have to instantiate the spatial ref sys table in each database file even though it might not differ at all.
  • How do we know what SPATIAL_REF_SYS table to use if there are multiple duckdb databases attached?
  • How do we know what SPATIAL_REF_SYS table to use in queries that select from multiple attached duckdb databases?
  • How do we handle e.g. conflicting user defined SRID definitions across multiple duckdb databases?

There's also some technical limitations, DuckDB can't really query a table to do a SRID lookup from inside a function call effectively so the actual backing storage would have to be something else... but we still want users to be able to insert their own CRS definitions so it can't just be a VIEW either.

I suppose it would be possible to tag the GEOMETRY type not with an integer SRID but instead a more elaborate CRS definition inline... but I don't think that's great UX either if users have to provide e.g. a whole PROJJSON string in the DDL, like
CREATE TABLE t1 (g GEOMETRY("{"$schema": "https://proj.org/schemas/v0.4/projjson.schema.json", "type": "GeographicCRS", ...)) would be pretty ridiculous.

I'd be happy to get your thoughts on this!

@paleolimbot
Copy link
Author

I added basic support for parameterizing types in order to support specifying the CRS for the GEOMETRY in the spatial extension, but im not sure the current implementation is ready to be used for that just yet.

Cool! No pressure to do this if it's not ready yet, although I do hope you go for the "CRS is a property of the type" rather than the "item level SRID" model. This is not a critique in any way because I know this is an infinite amount of work and that you know this already, but silently dropping the CRS everywhere is a common source of silently erroneous answers (and causes general distrust from the hardcore spatial folks).

As a side note, why parameterized the GEOGRAPHY type?

The engine I'm using (s2geometry) doesn't care, but others (geographiclib) do because they use the elipsoid for distance calculation and (sometimes) edge intersection. There's also more than one lon/lat (e.g., NAD83/NAD27/WGS84), which applies equally to both. I'll probably punt this problem for later but I thought maybe eating that complexity early would prevent a massive refactor. It sounds like I might need a refactor either way until this gets sorted from DuckDB's end so I'll probably punt 🙂.

Now, I think the bigger problem with tagging the type with the CRS is, what do you actually use as the modifier?

A string! People get into some pretty heavy flame wars about exactly what to use here but a string is what PROJ needs to construct a CRS and is what your (external) data sources will provide. PROJJSON is nice if you (or another extension) want(s) to circumvent PROJ to accelerate some common transformations but really all you need is something that works from GDAL/PROJ's "make me a CRS from a string".

In PostGIS there is a SPATIAL_REF_SYS table that maps integer SRID's to the supported projections.

I agree you don't want this here...this is one of the best arguments against item level CRS-as-srid for me: you can't very well store a full CRS definition at the element level, and if you don't do that you need a place to look them up. The only reason to do this is if you absolutely cannot parameterize your type (e.g., PostgreSQL) or if you need to integrate tightly with PostGIS as a source.

CREATE TABLE t1 (g GEOMETRY("{"$schema": "https://proj.org/schemas/v0.4/projjson.schema.json", "type": "GeographicCRS", ...)) would be pretty ridiculous.

Typing that is obviously garbage but in terms of ensuring people reading from your table aren't going to get silently incorrect answers, you really do want a full definition of some kind there. Not sure if function calls are allowed there but CREATE TABLE t1 (g GEOMETRY(CRS_PROJJSON("OGC:CRS84")), ...).

Another side note, we'll probably adda a GEOGRAPHY type to spatial at some point in the future as well, would be cool if the extensions could interoperate nicely!

I'm not sure if my thing will amount to any more than a fun project to learn about DuckDB, but it's worth considering where this extension ends and another begins. There is more than one instance in the spatial universe of everything getting lumped into one big kitchen sink of spatial and I hope that can be avoided here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants