Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

geojson: demarcate collections #35

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

michaelkirk
Copy link
Member

@michaelkirk michaelkirk commented May 13, 2022

I'm not sure if this is controversial - and it's definitely a bit nuanced. Let me know what you think!

The goal:
Allow WKT to emit a single top level item - multiple top level WKT items like POINT(1 1), POINT(2 2) are not parseable by e.g. postgis or georust/wkt (we removed support in georust/wkt#72)

And this change also allows for lossless roundtripping of geojson->fgb->geojson, see https://github.com/flatgeobuf/flatgeobuf/compare/master...michaelkirk:mkirk/rust-lossless-geojson-collections?expand=1

Ok(())
}

/// Process top-level GeoJSON items (geometry only)
fn process_geojson_geom<P: GeomProcessor>(gj: &GeoGeoJson, processor: &mut P) -> Result<()> {
match *gj {
GeoGeoJson::FeatureCollection(ref collection) => {
processor.geometrycollection_begin(collection.features.len(), 0)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look right to me. A GeoJSON feature collection can contain any geometry type, not only GeometryCollections.

@@ -372,7 +365,7 @@ mod test {
let mut wkt_data: Vec<u8> = Vec::new();
assert!(read_geojson_fc(geojson.as_bytes(), &mut WktWriter::new(&mut wkt_data)).is_ok());
let wkt = std::str::from_utf8(&wkt_data).unwrap();
assert_eq!(wkt, "MULTIPOLYGON(((173.020375 -40.919052,173.247234 -41.331999,173.958405 -40.926701,174.247587 -41.349155,174.248517 -41.770008,173.876447 -42.233184,173.22274 -42.970038,172.711246 -43.372288,173.080113 -43.853344,172.308584 -43.865694,171.452925 -44.242519,171.185138 -44.897104,170.616697 -45.908929,169.831422 -46.355775,169.332331 -46.641235,168.411354 -46.619945,167.763745 -46.290197,166.676886 -46.219917,166.509144 -45.852705,167.046424 -45.110941,168.303763 -44.123973,168.949409 -43.935819,169.667815 -43.555326,170.52492 -43.031688,171.12509 -42.512754,171.569714 -41.767424,171.948709 -41.514417,172.097227 -40.956104,172.79858 -40.493962,173.020375 -40.919052)),((174.612009 -36.156397,175.336616 -37.209098,175.357596 -36.526194,175.808887 -36.798942,175.95849 -37.555382,176.763195 -37.881253,177.438813 -37.961248,178.010354 -37.579825,178.517094 -37.695373,178.274731 -38.582813,177.97046 -39.166343,177.206993 -39.145776,176.939981 -39.449736,177.032946 -39.879943,176.885824 -40.065978,176.508017 -40.604808,176.01244 -41.289624,175.239567 -41.688308,175.067898 -41.425895,174.650973 -41.281821,175.22763 -40.459236,174.900157 -39.908933,173.824047 -39.508854,173.852262 -39.146602,174.574802 -38.797683,174.743474 -38.027808,174.697017 -37.381129,174.292028 -36.711092,174.319004 -36.534824,173.840997 -36.121981,173.054171 -35.237125,172.636005 -34.529107,173.007042 -34.450662,173.551298 -35.006183,174.32939 -35.265496,174.612009 -36.156397)))");
assert_eq!(wkt, "GEOMETRYCOLLECTION(MULTIPOLYGON(((173.020375 -40.919052,173.247234 -41.331999,173.958405 -40.926701,174.247587 -41.349155,174.248517 -41.770008,173.876447 -42.233184,173.22274 -42.970038,172.711246 -43.372288,173.080113 -43.853344,172.308584 -43.865694,171.452925 -44.242519,171.185138 -44.897104,170.616697 -45.908929,169.831422 -46.355775,169.332331 -46.641235,168.411354 -46.619945,167.763745 -46.290197,166.676886 -46.219917,166.509144 -45.852705,167.046424 -45.110941,168.303763 -44.123973,168.949409 -43.935819,169.667815 -43.555326,170.52492 -43.031688,171.12509 -42.512754,171.569714 -41.767424,171.948709 -41.514417,172.097227 -40.956104,172.79858 -40.493962,173.020375 -40.919052)),((174.612009 -36.156397,175.336616 -37.209098,175.357596 -36.526194,175.808887 -36.798942,175.95849 -37.555382,176.763195 -37.881253,177.438813 -37.961248,178.010354 -37.579825,178.517094 -37.695373,178.274731 -38.582813,177.97046 -39.166343,177.206993 -39.145776,176.939981 -39.449736,177.032946 -39.879943,176.885824 -40.065978,176.508017 -40.604808,176.01244 -41.289624,175.239567 -41.688308,175.067898 -41.425895,174.650973 -41.281821,175.22763 -40.459236,174.900157 -39.908933,173.824047 -39.508854,173.852262 -39.146602,174.574802 -38.797683,174.743474 -38.027808,174.697017 -37.381129,174.292028 -36.711092,174.319004 -36.534824,173.840997 -36.121981,173.054171 -35.237125,172.636005 -34.529107,173.007042 -34.450662,173.551298 -35.006183,174.32939 -35.265496,174.612009 -36.156397))))");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect a MULTIPOLYGON here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you expect if the input FeatureCollection had a MultiPolygon and then a LineString?

// Feature Collections may contain multiple geometries, so we need to wrap
// them in a GeometryCollection to ensure we output valid geometry.
fn dataset_begin(&mut self, _name: Option<&str>) -> Result<()> {
self.geometrycollection_begin(0, 0)
Copy link
Member

@pka pka May 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here's a confusion between FeatureCollections and GeometryCollections. The latter is a special geometry type for one feature (https://datatracker.ietf.org/doc/html/rfc7946#appendix-A.7).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll respond just here - since I think all three of your comments are essentially about the same issue.

I think here's a confusion between FeatureCollections and GeometryCollections

This was intentional - I am proposing that we explicitly interpret a feature collection as a geometry collection in WKT.

You're right that in geojson GeometryCollection and FeatureCollection are different things. But WKT, being concerned only with Geometry, has no concept of Feature or FeatureCollection.

So then, if we get a FeatureCollection from geojson and want to convert it to WKT, how should we do that?

Our current approach is to output a series of comma separated WKT geometries. Unfortunately, this output cannot be parsed by postgis, gdal, jts, nor georust::wkt (thus not by geozero itself!).

A corollary exists with the geo-types geometry writer. Similar to WKT, geo-types is strictly Geometry - no Feature/FeatureCollections. And in the same way, a geojson FeatureCollection with multiple features will be output as a top level geo_types::GeometryCollection.

Do you see a practical downside with the approach I'm proposing here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, strange, but if it's fixes WKT, then I can live with a hacky solution. But I would still expect, that my first remark in geojson_reader.rs would also affect other output formats?

Copy link
Member Author

@michaelkirk michaelkirk May 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that this affects other output formats. My goal is to make sure all the "geometry only" formats — those which don't allow feature data: geo-types, WKT, WKB — behave sanely.

For example, here's something I consider a problem with the current to_wkb() implementation:

As input, consider this geojson feature collection with two Point features.

let geojson = r#"{        
    "type": "FeatureCollection",                
    "features": [                
        {                
            "type": "Feature",                
            "properties": {                
                "population": 100                
            },                
            "geometry": {                
                "type": "Point",                
                "coordinates": [10.0, -20.0]                
            }                
        },                
        {                
            "type": "Feature",                
            "properties": {                
                "population": 100                
            },                
            "geometry": {                
                "type": "Point",                
                "coordinates": [30.0, -40.0]                
            }                
        },                
    ]                
} "#;

Then this contrived use case:

let sql = format!("SELECT ST_Centroid({}::geometry)", geojson.to_wkb());

Does that seem like a reasonable use case? What centroid would you expect to be computed?

Currently it's:

$ SELECT ST_AsText(wkb_geometry) FROM( SELECT
    ST_Centroid('0101000000000000000000244000000000000034c001010000000000000000003e4000000000000044c0'::geometry) 
    AS wkb_geometry
) AS f;

   st_astext   
---------------
 POINT(10 -20)    <---- 🚨🚨🚨This is not what I'd expect!

Taking a closer look at the value of `geojson.to_wkb():

0101000000000000000000244000000000000034c001010000000000000000003e4000000000000044c0    

To break that down, the first part is:

// WKB of POINT(10.0, -20.0)
0101000000000000000000244000000000000034c0                                              

concatenated directly to:

// WKB of POINT(30.0, -40.0)
01010000000000000000003e4000000000000044c0    

I don't know if it's valid WKB to simply concatenate multiple WKB's together, but for better or worse, postgis actually seems to parse it:

$ SELECT ST_AsText(wkb_geometry)  FROM ( SELECT
    '0101000000000000000000244000000000000034c001010000000000000000003e4000000000000044c0'::geometry 
    AS wkb_geometry
) AS f;
   st_astext   
---------------
 POINT(10 -20)

The problem is that the second point has apparently just been ignored!

With the changes proposed in this PR, just like with the WKT example, the output of converting a FeatureCollection to WKB has changed to be wrapped in a GeometryCollection.

So now geojson.to_wkb():

$ SELECT ST_AsText(wkb_geometry) FROM (  SELECT
    '0107000000020000000101000000000000000000244000000000000034C001010000000000000000003E4000000000000044C0'::geometry 
    AS wkb_geometry
) AS f;
                    st_astext                    
-------------------------------------------------
 GEOMETRYCOLLECTION(POINT(10 -20),POINT(30 -40))

I think this is much more likely to behave how the user expects. The centroid selected from the example will be the centroid of all the geometries from all the features.

$ SELECT ST_AsText(wkb_geometry)  FROM ( SELECT
    ST_Centroid('0107000000020000000101000000000000000000244000000000000034C001010000000000000000003E4000000000000044C0'::geometry)
    AS wkb_geometry
) AS f;

   st_astext   
---------------
 POINT(20 -30)   <--- 👍 that's better

Do you agree that the current situation is problematic, or is my use case overly contrived?

Do you think there's a different solution?

FYI I pushed up another commit with some tests that show the new wkb behavior.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this approach generally makes sense to you, I can add some more thorough testing to the other output formats.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation and the good example. The main problem is IMO, that we currently do not distinct between multi-feature (dataset) capable formats like GeoJSON and single feature formats like WKB. It's a good moment to find a proper solution for that, since I'm replicating the GeozeroDatasource and FeatureProcessor traits based on the new GeomEventProcessor in the coming days. What you're describing is collecting geometries from multiple features into a single feature GeometryCollection. I also think that's a valid use case, but maybe it should be made explict for the user. My current idea for the event API is that we'll have in addtion to to_wkb() an extended method to_wkb_with(v: dyn GeomVisitor), which will allow to pass a processing chain e.g. including a reprojection. Maybe a GeometryCollector could also be a part of a GeomVisitor chain, or it could be a visitor flag, or a special method like collect_to_wkb().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collecting geometries from multiple features into a single feature GeometryCollection. I also think that's a valid use case, but maybe it should be made explict for the user.

It seems like you have some reservations about this - so I want to make sure we're on the same page.

Setting aside implementation details for now, and just focusing on behavior: I think it's a good expectation that all format conversions (e.g. to_wkb()) should produce valid output by default without needing to pass in any kind of special options. Why would someone want invalid wkb? Do you agree with that part?

Test cases and examples can be helpful. Can you help think of some code that behaves surprisingly with the proposed behavior - I think that'd help me understand your reservations about making this default behavior. And maybe then I could be able to address them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any questions I can answer or changes I can make to move this forward?

(Also, I just now rebased against latest.)

This allows WKT to emit a single top level item
And allows for lossless roundtripping of geojson->fgb->geojson
@pka
Copy link
Member

pka commented Sep 15, 2022

I think, my first review comment is still relevant. Forcing the output of any FeatureCollection to a GeometryCollection seems not right to me.

@michaelkirk
Copy link
Member Author

michaelkirk commented Sep 15, 2022

I think, my first review comment is still relevant. Forcing the output of any FeatureCollection to a GeometryCollection seems not right to me.

What output are you expecting when mapping a feature collection to a geometry-only format like WKT?

Here's the current behavior of geo-types:

   let geojson_string = serde_json::json!({
            "type": "FeatureCollection",
            "features": [
                {
                    "type": "Feature",
                    "properties": {},
                    "geometry": {
                        "type": "Point",
                        "coordinates": [1,2]
                    }
                },
                {
                    "type": "Feature",
                    "properties": {},
                    "geometry": {
                        "type": "LineString",
                        "coordinates": [[1,2], [3, 4]]
                    }
                },
            ]
        }).to_string();

let geojson = crate::geojson::GeoJson(&geojson_string);
let expected = geo_types::Geometry::GeometryCollection(geo_types::GeometryCollection(vec![
    point! { x: 1.0, y: 2.0 }.into(),
    line_string![(x: 1.0, y: 2.0), (x: 3.0, y: 4.0)].into()
]));
assert_eq!(expected, geojson.to_geo().unwrap());

It seems reasonable to me. geo-types similarly used to behave incorrectly for multiple geometries. It confused people (#11). For geo-types, the fix was: since multiple geometries cannot exist at the top-level they must be wrapped in a GeometryCollection. I'm proposing the same fix here.

@pka
Copy link
Member

pka commented Sep 15, 2022

The question is not whether the output format is geometry only, but whether multiple features are supported. If you e.g. use the GDAL CSV driver with WKT geometries (ogr2ogr -f CSV -lco "GEOMETRY=AS_WKT" fc.csv fc.json) you get two records:

WKT,
"POINT (1 2)"
"LINESTRING (1 2,3 4)"

If you want to collect the geometries of all features, then a GeometryCollection is the right thing. But when you convert a GeoJSON FeatureCollection to e.g. FlatGeoBuf you expect the original geometry types and not a GeometryCollection.

Copy link
Member Author

@michaelkirk michaelkirk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple years has passed... I keep stumbling into this from time to time. It surprised me again today.

I still think we should address this - it's a real problem that geozero-cli outputs invalid WKT and WKB.

If we're outputting geometry (as opposed to features), we need some kind of container to match the fact that there could be potentially multiple features represented in the output geometry.

@@ -372,7 +365,7 @@ mod test {
let mut wkt_data: Vec<u8> = Vec::new();
assert!(read_geojson_fc(geojson.as_bytes(), &mut WktWriter::new(&mut wkt_data)).is_ok());
let wkt = std::str::from_utf8(&wkt_data).unwrap();
assert_eq!(wkt, "MULTIPOLYGON(((173.020375 -40.919052,173.247234 -41.331999,173.958405 -40.926701,174.247587 -41.349155,174.248517 -41.770008,173.876447 -42.233184,173.22274 -42.970038,172.711246 -43.372288,173.080113 -43.853344,172.308584 -43.865694,171.452925 -44.242519,171.185138 -44.897104,170.616697 -45.908929,169.831422 -46.355775,169.332331 -46.641235,168.411354 -46.619945,167.763745 -46.290197,166.676886 -46.219917,166.509144 -45.852705,167.046424 -45.110941,168.303763 -44.123973,168.949409 -43.935819,169.667815 -43.555326,170.52492 -43.031688,171.12509 -42.512754,171.569714 -41.767424,171.948709 -41.514417,172.097227 -40.956104,172.79858 -40.493962,173.020375 -40.919052)),((174.612009 -36.156397,175.336616 -37.209098,175.357596 -36.526194,175.808887 -36.798942,175.95849 -37.555382,176.763195 -37.881253,177.438813 -37.961248,178.010354 -37.579825,178.517094 -37.695373,178.274731 -38.582813,177.97046 -39.166343,177.206993 -39.145776,176.939981 -39.449736,177.032946 -39.879943,176.885824 -40.065978,176.508017 -40.604808,176.01244 -41.289624,175.239567 -41.688308,175.067898 -41.425895,174.650973 -41.281821,175.22763 -40.459236,174.900157 -39.908933,173.824047 -39.508854,173.852262 -39.146602,174.574802 -38.797683,174.743474 -38.027808,174.697017 -37.381129,174.292028 -36.711092,174.319004 -36.534824,173.840997 -36.121981,173.054171 -35.237125,172.636005 -34.529107,173.007042 -34.450662,173.551298 -35.006183,174.32939 -35.265496,174.612009 -36.156397)))");
assert_eq!(wkt, "GEOMETRYCOLLECTION(MULTIPOLYGON(((173.020375 -40.919052,173.247234 -41.331999,173.958405 -40.926701,174.247587 -41.349155,174.248517 -41.770008,173.876447 -42.233184,173.22274 -42.970038,172.711246 -43.372288,173.080113 -43.853344,172.308584 -43.865694,171.452925 -44.242519,171.185138 -44.897104,170.616697 -45.908929,169.831422 -46.355775,169.332331 -46.641235,168.411354 -46.619945,167.763745 -46.290197,166.676886 -46.219917,166.509144 -45.852705,167.046424 -45.110941,168.303763 -44.123973,168.949409 -43.935819,169.667815 -43.555326,170.52492 -43.031688,171.12509 -42.512754,171.569714 -41.767424,171.948709 -41.514417,172.097227 -40.956104,172.79858 -40.493962,173.020375 -40.919052)),((174.612009 -36.156397,175.336616 -37.209098,175.357596 -36.526194,175.808887 -36.798942,175.95849 -37.555382,176.763195 -37.881253,177.438813 -37.961248,178.010354 -37.579825,178.517094 -37.695373,178.274731 -38.582813,177.97046 -39.166343,177.206993 -39.145776,176.939981 -39.449736,177.032946 -39.879943,176.885824 -40.065978,176.508017 -40.604808,176.01244 -41.289624,175.239567 -41.688308,175.067898 -41.425895,174.650973 -41.281821,175.22763 -40.459236,174.900157 -39.908933,173.824047 -39.508854,173.852262 -39.146602,174.574802 -38.797683,174.743474 -38.027808,174.697017 -37.381129,174.292028 -36.711092,174.319004 -36.534824,173.840997 -36.121981,173.054171 -35.237125,172.636005 -34.529107,173.007042 -34.450662,173.551298 -35.006183,174.32939 -35.265496,174.612009 -36.156397))))");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you expect if the input FeatureCollection had a MultiPolygon and then a LineString?

@michaelkirk
Copy link
Member Author

michaelkirk commented Oct 31, 2024

The question is not whether the output format is geometry only, but whether multiple features are supported. If you e.g. use the GDAL CSV driver with WKT geometries (ogr2ogr -f CSV -lco "GEOMETRY=AS_WKT" fc.csv fc.json) you get two records:

Does the comparison to ogr2ogr make sense here? I'm not talking about compressing multiple features into a single feature in the output. I'm talking about having valid geometry output when converting to format that doesn't support features at all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants