-
-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add promote_to_multitype keyword in write_dataframe #75
Add promote_to_multitype keyword in write_dataframe #75
Conversation
+ don't cleanup ogr_geometry as it is cleaned up by the forct_to_multitype functions
+ improve test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @theroggy - good to use OGR directly to promote mixed geometries to multi types and get around annoyance of warnings on write to GPKG and exceptions on write to FGB.
I'm not 100% about the default being to promote them for specific drivers, as the user can easily add promote_to_multi
in those cases after hitting an exception. Will have to think about it a bit more.
Co-authored-by: Brendan Ward <[email protected]>
Co-authored-by: Brendan Ward <[email protected]>
in english words aren't stuck together as they are in dutch :-) Co-authored-by: Brendan Ward <[email protected]>
- rename promote_to_multitype to promote_to_multi - remove type annotations - rename suffix to ext - some other small changes
@jorisvandenbossche @martinfleis do you have any opinions on default behavior when outputting mixed geometry plurality (e.g., mix of Point and MultiPoint) to drivers that do not support mixed plurality (e.g., FlatGeobuf)? On the one hand, automatically promoting to multi in this case avoids exceptions on write, but on the other, it modifies output geometries without notification, which concerns me. After looking more closely at what ogr2ogr does (requires command line flag), my instinct is that we should not automatically promote on write, and instead on error (or warning in case of GPKG) the user should use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this!
Added a few inline nitpicks
Yeah, not fully sure myself (and no strong opinion). On the one hand it is somewhat convenient that this would work out of the box, on the other hand it is good to be aware of this as a user (and thus an error with a clear error message to set |
This might be tricky. In the case of errors, we intercept those, so we can probably add this, but the GDAL warning for GPKG not supporting mixed plurality gets written to stdout, which we don't intercept... But maybe we only need to add this for true errors. |
An additional note that FlatGeobuf actually does support mixed geometries, if you set the geometry type to "Unknown". That might actually be the best default? (which relates to my comment above at #75 (comment)). That preserves the data as is, but on the other hand creates potentially less performant / useful files (and without a warning). |
That would be my thinking but as @jorisvandenbossche said, we need a clear error message offering a solution in such a case.
While I am not sure about the consequences of this, it does not feel as a right thing to do. |
My understanding is that in that case you have a "per-feature geometry type", see https://github.com/flatgeobuf/flatgeobuf/blob/7c11b7d3f5bc63c5737258f2b13b2ec0ae9de14c/src/fbs/header.fbs#L70 Given that geopandas also doesn't enforce a single geometry type, it certainly feels as the most "faithfull" way to store mixed geometries in FlatGeobuf (in case you have mixed geometries). |
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
The reasoning of my (mild) preference for the automatic handling is that it is more convenient, and that as far as I know/think the data isn't really changed when converting single types to multi types. I don't know of any difference when working with one or the other. In shapefile the difference isn't even made at the file/layer level. It is also a pity that for geopackage GDAL only gives a warning instead of an error, because that easily leads to "invalid files" being created. In my opionion doing an automatic promotion to multi is less problematic. Putting "unknown" on them is a bit of a pity as well in my opinion, because it makes files a lot less self-describing... and only because of a technicality. In my experience really mixed datasets (eg. combination of lines and polygons) are quite rare. A third option could be to keep an automatic mode by passing in "None", but make promote_to_multi=False the default, so users need to keep making an active choice once... but once they have thought about it they can switch to the automatic handling if they like. Some rather offtopic ramblings:
|
Using I guess I'm struggling a bit with default behavior and fidelity of inputs to outputs under those defaults. Certainly, some drivers mess with the data (e.g., GeoJSONSeq) in ways that mean outputs don't always match inputs to the write operation. But my instinct is that we leave such modifications in the domain of GDAL (outside our control, and hopefully well-documented on their respective driver pages), and where we do have control, we don't otherwise modify the data if we can help it, even if it would be slightly more convenient. It's not terribly hard to add a keyword when you call it if you hit an exception or can even head it off yourself by inspecting your data before writing. "Unknown" does not feel right for mixed plurality but same overall type. I've never been keen on mixed geometry types (mix of points, lines, etc in a single geometry column), and some drivers don't accept this either, so I don't think we should fall back to "Unknown" in these cases to avoid an error. I do wonder - as a separate discussion - if we should support passing in geometry type, so that callers that know they have mixed geometry types AND know they can successfully write that to a given format can pass in "Unknown" in order to be able to faithfully write their dataframe. |
Personally, I think that is exactly what we should do? Most formats support the case of multiple geometry types, and using "Unknown" is what is used to signal to GDAL that there isn't a single geometry type. Small code example: import pyogrio
import geopandas
from shapely.geometry import Point, LineString, Polygon, box
df = geopandas.GeoDataFrame(
{"col": [1, 2, 3]},
geometry=[Point(0, 0), LineString([(0, 0), (1, 1)]), box(0, 0, 1, 1)]
)
>>> pyogrio.write_dataframe(df, "test_mixed.fbg", driver="FlatGeobuf")
...
ERROR 1: ICreateFeature: Mismatched geometry type
---------------------------------------------------------------------------
FeatureError Traceback (most recent call last)
...
FeatureError: Could not add feature to layer at index 1
# modified main branch to allow manually specify the geometry type -> works fine
>>> pyogrio.write_dataframe(df, "test_mixed2.fbg", driver="FlatGeobuf", geometry_type="Unknown") So FlatGeobuf actually supports writing mixed geometry types (signaled with "Unknown"), but fails if you say the file is a certain type but then write different types. This currently happens because of the way we determine an "incorrect" For a format like GeoJSON, which doesn't store geometry type information in the file itself, it doesn't actually matter which |
Now, in my last comment I am maybe mixing the discussions of "mixed but same 'type' geometry types" (single/multi of the same type) vs "really mixed geometry types". But I think the discussion about "should we promote to multi by default or not" is mainly for the "single/multi of same type mixed geometry types"? |
Yes. We should take up proper default handling of truly mixed geometries in a separate issue. |
I had to move/duplicate/rewrite/... some code to be able to decouple both keywords, but I think it does what was agreed on now... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for continuing to work on this @theroggy , it is shaping up nicely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another round of comments!
(".geojson", None, "Point", ["MultiPolygon", "Polygon"], "Unknown"), | ||
(".geojson", True, "Unknown", ["MultiPolygon"], "MultiPolygon"), | ||
(".gpkg", False, "Unknown", ["MultiPolygon", "Polygon"], "Unknown"), | ||
(".gpkg", None, "Unknown", ["MultiPolygon"], "Unknown"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this line result in the same as the one above? (so if you explicitly set the type to Unknown, do not automatic promote_to_multi)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so:
- This would break the idea that the layer_geometry_type doesn't change/influence the geometry type of the data
- This would create a kind of odd interference between both parameters while it is already easy to accomplish this behaviour: if you just use promote_to_multi=False, this will evade promotion + will result in layer geometry type "Unknown". Check out the test test_write_dataframe_promote_to_multi where this is demonstrated.
(".geojson", True, "Unknown", ["MultiPolygon"], "MultiPolygon"), | ||
(".gpkg", False, "Unknown", ["MultiPolygon", "Polygon"], "Unknown"), | ||
(".gpkg", None, "Unknown", ["MultiPolygon"], "Unknown"), | ||
(".gpkg", None, "Polygon", ["MultiPolygon"], "Polygon"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also for this case, should we infer this as promote_to_multi=False
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all your hard work on this @theroggy and putting up with us as we tried to get our thoughts in order about how best to proceed.
I think the only thing outstanding is the comment from @jorisvandenbossche about whether or not GPKG get autopromoted by default (if I interpreted it correctly). I don't have a strong opinion about it, but the idea of producing valid GPKG by default and not having the warning appeals to me (current implementation).
To be precise (assuming it is about #75 (comment)), it's not about the completely default behaviour, but about the interplay between More specifically, consider the case of DataFrame with a mixture of polygons and multipolygons (like the naturalearth countries) and doing There are two options for how to infer the default "promote" behaviour:
The current test ensures the second behaviour, and I was wondering if we wouldn't rather want the first (so using the information of the geometry type the user specified). I think the general question is: for inferring the default
I can also live with both options, but to be clear: both options would produce a valid GPKG and not give a warning. A GPKG with "Unknown" ("Geometry") set as geometry is valid and can contain both Polygons/MultiPolygons (so there is no strict need to promote to only MultiPolygons). On the one hand, I think the option 1 that I suggested gives the more expected behaviour (if the user sets the geometry type explicitly to "Unknown", I think they typically don't want to promote to multi). On the other hand, the current option 2 keeps the logic of both keywords a bit more separated (promote_to_multi already influences geometry_type, but not yet the other way around), which will be a bit simpler. |
Thanks for clarifying! I understand better now the tradeoffs you were getting at. I like option 2: no interaction between the two parameters (for now), and If we get feedback that this is causing issues for users, it seems like we can revisit the interaction of these parameters in a future PR. |
Co-authored-by: Brendan Ward <[email protected]>
To be clear, as I mentioned in the comment above (#75 (comment)), it's not needed to specify both, just specifying
This is the test that tests and demonstrates this behaviour: the combination of
|
I'm going to go ahead and merge so that we can start getting closer to 0.4 release (want to create an alpha release soon, to test our wheels), assuming consent from @jorisvandenbossche , but if need be we can revisit the interaction between |
Thanks again @theroggy for all your work on this PR! 🎉 |
This might have already been covered above, but just discovered in further testing that GDAL seems to ignore "Unkown" when set by default when writing |
Yes, it is also covered in a test. |
Closes #56
I also gave a shot at an automatic mode: if None is passed (the default):
-> promote_to_multitype=True
I also made some changes to the write_dataframe test so it also tests the automatic detection of the driver + all supported file types
PS: in an ideal world, I think ?
sort_geodataframe
function would be available as a parameter inassert_geodataframe_is_equal
to_multipolygon
function would be available in GeoPandas (or rather a version that supports all geometry types)PPS: when testing on 550.000 polygons, promote_to_multitype=True doesn't give any performance hit when writing a GeoDataFrame