Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add promote_to_multitype keyword in write_dataframe #75

Merged

Conversation

theroggy
Copy link
Member

@theroggy theroggy commented Apr 19, 2022

Closes #56

I also gave a shot at an automatic mode: if None is passed (the default):

  • if the geodataframe is mixed multi+single for the same primitive type + the output format doesn't support mixed types
    -> promote_to_multitype=True
  • otherwise it is False

I also made some changes to the write_dataframe test so it also tests the automatic detection of the driver + all supported file types

PS: in an ideal world, I think ?

  • the sort_geodataframe function would be available as a parameter in assert_geodataframe_is_equal
  • the to_multipolygon function would be available in GeoPandas (or rather a version that supports all geometry types)

PPS: when testing on 550.000 polygons, promote_to_multitype=True doesn't give any performance hit when writing a GeoDataFrame

Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @theroggy - good to use OGR directly to promote mixed geometries to multi types and get around annoyance of warnings on write to GPKG and exceptions on write to FGB.

I'm not 100% about the default being to promote them for specific drivers, as the user can easily add promote_to_multi in those cases after hitting an exception. Will have to think about it a bit more.

pyogrio/geopandas.py Outdated Show resolved Hide resolved
pyogrio/raw.py Outdated Show resolved Hide resolved
pyogrio/raw.py Outdated Show resolved Hide resolved
CHANGES.md Outdated Show resolved Hide resolved
pyogrio/_geometry.pxd Outdated Show resolved Hide resolved
pyogrio/_geometry.pyx Outdated Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
theroggy and others added 4 commits April 20, 2022 00:58
Co-authored-by: Brendan Ward <[email protected]>
Co-authored-by: Brendan Ward <[email protected]>
in english words aren't stuck together as they are in dutch :-)

Co-authored-by: Brendan Ward <[email protected]>
- rename promote_to_multitype to promote_to_multi
- remove type annotations
- rename suffix to ext
- some other small changes
@brendan-ward
Copy link
Member

@jorisvandenbossche @martinfleis do you have any opinions on default behavior when outputting mixed geometry plurality (e.g., mix of Point and MultiPoint) to drivers that do not support mixed plurality (e.g., FlatGeobuf)?

On the one hand, automatically promoting to multi in this case avoids exceptions on write, but on the other, it modifies output geometries without notification, which concerns me.

After looking more closely at what ogr2ogr does (requires command line flag), my instinct is that we should not automatically promote on write, and instead on error (or warning in case of GPKG) the user should use promote_to_multi parameter to get around it.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!

Added a few inline nitpicks

pyogrio/_geometry.pyx Outdated Show resolved Hide resolved
pyogrio/_io.pyx Outdated Show resolved Hide resolved
pyogrio/geopandas.py Outdated Show resolved Hide resolved
pyogrio/geopandas.py Outdated Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member

do you have any opinions on default behavior when outputting mixed geometry plurality (e.g., mix of Point and MultiPoint) to drivers that do not support mixed plurality (e.g., FlatGeobuf)?

Yeah, not fully sure myself (and no strong opinion). On the one hand it is somewhat convenient that this would work out of the box, on the other hand it is good to be aware of this as a user (and thus an error with a clear error message to set promote_to_multi=True to solve it, would ensure users are aware).

@brendan-ward
Copy link
Member

an error with a clear error message

This might be tricky. In the case of errors, we intercept those, so we can probably add this, but the GDAL warning for GPKG not supporting mixed plurality gets written to stdout, which we don't intercept... But maybe we only need to add this for true errors.

@jorisvandenbossche
Copy link
Member

An additional note that FlatGeobuf actually does support mixed geometries, if you set the geometry type to "Unknown". That might actually be the best default? (which relates to my comment above at #75 (comment)). That preserves the data as is, but on the other hand creates potentially less performant / useful files (and without a warning).

@martinfleis
Copy link
Member

we should not automatically promote on write, and instead on error (or warning in case of GPKG) the user should use promote_to_multi parameter to get around it.

That would be my thinking but as @jorisvandenbossche said, we need a clear error message offering a solution in such a case.

An additional note that FlatGeobuf actually does support mixed geometries, if you set the geometry type to "Unknown". That might actually be the best default?

While I am not sure about the consequences of this, it does not feel as a right thing to do.

@jorisvandenbossche
Copy link
Member

An additional note that FlatGeobuf actually does support mixed geometries, if you set the geometry type to "Unknown". That might actually be the best default?

While I am not sure about the consequences of this, it does not feel as a right thing to do.

My understanding is that in that case you have a "per-feature geometry type", see https://github.com/flatgeobuf/flatgeobuf/blob/7c11b7d3f5bc63c5737258f2b13b2ec0ae9de14c/src/fbs/header.fbs#L70
It seems that FlatGeobuf supports both a single geometry type for the whole dataset, as mixed geometry types by keeping track of the type per feature/geometry (https://github.com/flatgeobuf/flatgeobuf/blob/7c11b7d3f5bc63c5737258f2b13b2ec0ae9de14c/src/fbs/feature.fbs#L12)

Given that geopandas also doesn't enforce a single geometry type, it certainly feels as the most "faithfull" way to store mixed geometries in FlatGeobuf (in case you have mixed geometries).
(but no idea about the consequence on the resulting file, e.g. efficiency to read that. Although given that the format is row-oriented anyway, I might imagine that doesn't matter much)

theroggy and others added 3 commits April 21, 2022 13:11
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
@theroggy
Copy link
Member Author

theroggy commented Apr 21, 2022

The reasoning of my (mild) preference for the automatic handling is that it is more convenient, and that as far as I know/think the data isn't really changed when converting single types to multi types. I don't know of any difference when working with one or the other. In shapefile the difference isn't even made at the file/layer level.
It would even be possible to losslessly convert them back to single when reading a file if that would have any advantage/use.

It is also a pity that for geopackage GDAL only gives a warning instead of an error, because that easily leads to "invalid files" being created. In my opionion doing an automatic promotion to multi is less problematic.

Putting "unknown" on them is a bit of a pity as well in my opinion, because it makes files a lot less self-describing... and only because of a technicality. In my experience really mixed datasets (eg. combination of lines and polygons) are quite rare.

A third option could be to keep an automatic mode by passing in "None", but make promote_to_multi=False the default, so users need to keep making an active choice once... but once they have thought about it they can switch to the automatic handling if they like.

Some rather offtopic ramblings:

  • It's a pity that this issue was created in the new file formats (gpkg and flatgeobuf). Shapefile has a huge amount of problems, but treats this in a good way in my opinion... so I don't understand why this issue was "created".
  • Depending on the format some conversions have actually a lot more impact because they are "lossy". In Oracle Spatial, Json, WKT,... points are stored as numbers with a fixed number of decimals, while in shapefile, GPKG, WKB,... they are stored as doubles. These conversions are done without warning even though they will even lead to geometries becoming invalid due to self intersections appearing,... so they have a lot more impact.

@brendan-ward
Copy link
Member

Using promote_to_multi=True|False|None as a tri-state where None means that default behaviour is invoked feels weird to me, yet converting this to a tri-state string promote_to_multi="yes"|"no"|"auto" feels a bit verbose.

I guess I'm struggling a bit with default behavior and fidelity of inputs to outputs under those defaults. Certainly, some drivers mess with the data (e.g., GeoJSONSeq) in ways that mean outputs don't always match inputs to the write operation. But my instinct is that we leave such modifications in the domain of GDAL (outside our control, and hopefully well-documented on their respective driver pages), and where we do have control, we don't otherwise modify the data if we can help it, even if it would be slightly more convenient. It's not terribly hard to add a keyword when you call it if you hit an exception or can even head it off yourself by inspecting your data before writing.

"Unknown" does not feel right for mixed plurality but same overall type. I've never been keen on mixed geometry types (mix of points, lines, etc in a single geometry column), and some drivers don't accept this either, so I don't think we should fall back to "Unknown" in these cases to avoid an error. I do wonder - as a separate discussion - if we should support passing in geometry type, so that callers that know they have mixed geometry types AND know they can successfully write that to a given format can pass in "Unknown" in order to be able to faithfully write their dataframe.

@jorisvandenbossche
Copy link
Member

"Unknown" does not feel right for mixed plurality but same overall type. I've never been keen on mixed geometry types (mix of points, lines, etc in a single geometry column), and some drivers don't accept this either, so I don't think we should fall back to "Unknown" in these cases to avoid an error.

Personally, I think that is exactly what we should do? Most formats support the case of multiple geometry types, and using "Unknown" is what is used to signal to GDAL that there isn't a single geometry type.
I won't say that I recommend to use a mixture of geometry types :), but if you have such data, why would we prohibit writing those?

Small code example:

import pyogrio
import geopandas
from shapely.geometry import Point, LineString, Polygon, box

df = geopandas.GeoDataFrame(
    {"col": [1, 2, 3]},
    geometry=[Point(0, 0), LineString([(0, 0), (1, 1)]), box(0, 0, 1, 1)]
)

>>> pyogrio.write_dataframe(df, "test_mixed.fbg", driver="FlatGeobuf")
...
ERROR 1: ICreateFeature: Mismatched geometry type
---------------------------------------------------------------------------
FeatureError                              Traceback (most recent call last)
...
FeatureError: Could not add feature to layer at index 1

# modified main branch to allow manually specify the geometry type -> works fine
>>> pyogrio.write_dataframe(df, "test_mixed2.fbg", driver="FlatGeobuf", geometry_type="Unknown")

So FlatGeobuf actually supports writing mixed geometry types (signaled with "Unknown"), but fails if you say the file is a certain type but then write different types. This currently happens because of the way we determine an "incorrect" geometry_type, and pass that through.
The same holds true for GeoPackage (it only warns instead of errorring). GeoPackage calls this the "Geometry" geometry type (for which GDAL uses "Unknown"), see http://www.geopackage.org/spec121/#r32 (so it is explicitly supported to write multiple geometry types into a single column this way).

For a format like GeoJSON, which doesn't store geometry type information in the file itself, it doesn't actually matter which geometry_type is passed to GDAL.

@jorisvandenbossche
Copy link
Member

Now, in my last comment I am maybe mixing the discussions of "mixed but same 'type' geometry types" (single/multi of the same type) vs "really mixed geometry types".
My comment above is arguing that at least for the "really mixed geometry types", I think we should just make this work (as expected for me), if the format supports it (Shapefile is one example that does not support it, but eg GPKG, GeoJSON, FBG all support mixed geometries).

But I think the discussion about "should we promote to multi by default or not" is mainly for the "single/multi of same type mixed geometry types"?

@brendan-ward
Copy link
Member

the discussion about "should we promote to multi by default or not" is mainly for the "single/multi of same type mixed geometry types"

Yes. We should take up proper default handling of truly mixed geometries in a separate issue.

@theroggy theroggy marked this pull request as ready for review May 15, 2022 23:25
@theroggy
Copy link
Member Author

I had to move/duplicate/rewrite/... some code to be able to decouple both keywords, but I think it does what was agreed on now...

Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for continuing to work on this @theroggy , it is shaping up nicely.

pyogrio/geopandas.py Outdated Show resolved Hide resolved
pyogrio/geopandas.py Outdated Show resolved Hide resolved
pyogrio/geopandas.py Outdated Show resolved Hide resolved
pyogrio/geopandas.py Outdated Show resolved Hide resolved
pyogrio/geopandas.py Outdated Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another round of comments!

pyogrio/_io.pyx Outdated Show resolved Hide resolved
pyogrio/geopandas.py Outdated Show resolved Hide resolved
pyogrio/geopandas.py Outdated Show resolved Hide resolved
pyogrio/geopandas.py Outdated Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
(".geojson", None, "Point", ["MultiPolygon", "Polygon"], "Unknown"),
(".geojson", True, "Unknown", ["MultiPolygon"], "MultiPolygon"),
(".gpkg", False, "Unknown", ["MultiPolygon", "Polygon"], "Unknown"),
(".gpkg", None, "Unknown", ["MultiPolygon"], "Unknown"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this line result in the same as the one above? (so if you explicitly set the type to Unknown, do not automatic promote_to_multi)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so:

  • This would break the idea that the layer_geometry_type doesn't change/influence the geometry type of the data
  • This would create a kind of odd interference between both parameters while it is already easy to accomplish this behaviour: if you just use promote_to_multi=False, this will evade promotion + will result in layer geometry type "Unknown". Check out the test test_write_dataframe_promote_to_multi where this is demonstrated.

pyogrio/tests/test_geopandas_io.py Show resolved Hide resolved
pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
(".geojson", True, "Unknown", ["MultiPolygon"], "MultiPolygon"),
(".gpkg", False, "Unknown", ["MultiPolygon", "Polygon"], "Unknown"),
(".gpkg", None, "Unknown", ["MultiPolygon"], "Unknown"),
(".gpkg", None, "Polygon", ["MultiPolygon"], "Polygon"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also for this case, should we infer this as promote_to_multi=False?

Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your hard work on this @theroggy and putting up with us as we tried to get our thoughts in order about how best to proceed.

I think the only thing outstanding is the comment from @jorisvandenbossche about whether or not GPKG get autopromoted by default (if I interpreted it correctly). I don't have a strong opinion about it, but the idea of producing valid GPKG by default and not having the warning appeals to me (current implementation).

pyogrio/tests/test_geopandas_io.py Outdated Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented May 20, 2022

I think the only thing outstanding is the comment from @jorisvandenbossche about whether or not GPKG get autopromoted by default (if I interpreted it correctly).

To be precise (assuming it is about #75 (comment)), it's not about the completely default behaviour, but about the interplay between layer_geometry_type and promote_to_multi if one of both is specified.

More specifically, consider the case of DataFrame with a mixture of polygons and multipolygons (like the naturalearth countries) and doing pyogrio.write_dataframe(df, "data.gpkg", layer_geometry_type="Unknown") (and promote_to_multi not being specified, so using the default None).

There are two options for how to infer the default "promote" behaviour:

  • Since the geometry type is set to "Unknown" by the user (i.e. not a strict Multi type), it is not needed to infer promote_to_multi=True and do promotion
  • We infer the default promote_to_multi solely on the input geometry types (being Polygon and MultiPolygon in this case), and thus for GPKG default to True, regardless of the explicit layer geometry type specified by the user.

The current test ensures the second behaviour, and I was wondering if we wouldn't rather want the first (so using the information of the geometry type the user specified).
Now, to be clear, also with the second behaviour, the user can get the first behaviour by doing pyogrio.write_dataframe(..., layer_geometry_type="Unknown", promote_to_multi=False), so by specifying promote_to_multi specifically as well.

I think the general question is: for inferring the default promote_to_multi, do we solely look at the input dataframe's geometry types, or do we also take the user-specified layer_geometry_type into account.

I don't have a strong opinion about it, but the idea of producing valid GPKG by default and not having the warning appeals to me (current implementation).

I can also live with both options, but to be clear: both options would produce a valid GPKG and not give a warning. A GPKG with "Unknown" ("Geometry") set as geometry is valid and can contain both Polygons/MultiPolygons (so there is no strict need to promote to only MultiPolygons).

On the one hand, I think the option 1 that I suggested gives the more expected behaviour (if the user sets the geometry type explicitly to "Unknown", I think they typically don't want to promote to multi). On the other hand, the current option 2 keeps the logic of both keywords a bit more separated (promote_to_multi already influences geometry_type, but not yet the other way around), which will be a bit simpler.

@brendan-ward
Copy link
Member

Thanks for clarifying! I understand better now the tradeoffs you were getting at.

I like option 2: no interaction between the two parameters (for now), and promote_to_multi has higher precedence even if default. Setting layer_geometry_type seems a little bit more of a power user control, so it seems reasonable that the user would need to provide both in order to disable auto-promotion for specific drivers.

If we get feedback that this is causing issues for users, it seems like we can revisit the interaction of these parameters in a future PR.

@theroggy
Copy link
Member Author

theroggy commented May 21, 2022

Thanks for clarifying! I understand better now the tradeoffs you were getting at.

I like option 2: no interaction between the two parameters (for now), and promote_to_multi has higher precedence even if default. Setting layer_geometry_type seems a little bit more of a power user control, so it seems reasonable that the user would need to provide both in order to disable auto-promotion for specific drivers.
If we get feedback that this is causing issues for users, it seems like we can revisit the interaction of these parameters in a future PR.

To be clear, as I mentioned in the comment above (#75 (comment)), it's not needed to specify both, just specifying promote_to_multi=False will:

  1. disable promotion, obviously
  2. because promotion is disabled, the data in the example will keep containing multiple geometry types (Polygon and MultiPolygon), so the default resulting layer geometry type will be "Unknown".

This is the test that tests and demonstrates this behaviour: the combination of promote_to_multi=False and layer_geometry_type not being specified (= None) already results in expected_layer_geometry_type being "Unknown":

@pytest.mark.parametrize(
    "ext, promote_to_multi, expected_geometry_types, expected_layer_geometry_type",
    [   
        ...
        (".fgb", False, ["MultiPolygon", "Polygon"], "Unknown"),
        ...
    ],
)
def test_write_dataframe_promote_to_multi(
    tmp_path,
    naturalearth_lowres,
    ext,
    promote_to_multi,
    expected_geometry_types,
    expected_layer_geometry_type,
):
    input_gdf = read_dataframe(naturalearth_lowres)

    output_path = tmp_path / f"test_promote{ext}"
    write_dataframe(
        input_gdf,
        output_path,
        promote_to_multi=promote_to_multi,
    )

    assert output_path.exists()
    output_gdf = read_dataframe(output_path)
    geometry_types = sorted(output_gdf.geometry.type.unique())
    assert geometry_types == expected_geometry_types
    assert read_info(output_path)["geometry_type"] == expected_layer_geometry_type

@brendan-ward
Copy link
Member

I'm going to go ahead and merge so that we can start getting closer to 0.4 release (want to create an alpha release soon, to test our wheels), assuming consent from @jorisvandenbossche , but if need be we can revisit the interaction between layer_geometry_type and promote_to_multi before we cut our final 0.4 release.

@brendan-ward brendan-ward merged commit 5fd6b51 into geopandas:main May 26, 2022
@brendan-ward
Copy link
Member

Thanks again @theroggy for all your work on this PR! 🎉

@brendan-ward
Copy link
Member

This might have already been covered above, but just discovered in further testing that GDAL seems to ignore "Unkown" when set by default when writing naturalearth_lowres to shapefile, as well as when it is intentionally set via layer_geometry_type. read_info always returns the type as Polygon. This isn't a problem, just something we may want to note where GDAL ignores "Unknown" for some drivers. (truly mixed types not allowed for shapefile)

@theroggy
Copy link
Member Author

This might have already been covered above, but just discovered in further testing that GDAL seems to ignore "Unkown" when set by default when writing naturalearth_lowres to shapefile, as well as when it is intentionally set via layer_geometry_type. read_info always returns the type as Polygon. This isn't a problem, just something we may want to note where GDAL ignores "Unknown" for some drivers. (truly mixed types not allowed for shapefile)

Yes, it is also covered in a test.

@theroggy theroggy deleted the to_multitype-option-in-write_dataframe branch May 26, 2022 19:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants