Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GeoArrow WKB reader #39

Merged
merged 5 commits into from
Jul 20, 2022
Merged

Conversation

kylebarron
Copy link
Member

@kylebarron kylebarron commented Jun 2, 2022

For #37. As a disclaimer, this is my first Rust PR to a non-personal project, so suggestions welcomed 🙂. I'm new to the geozero API as well, so I tried to model after existing code. I figured I'd put up a draft PR to start some discussion.

Most likely GeoArrow/GeoParquet will end up having two geometry formats: WKB for universal compatibility and a "native" Arrow encoding for zero-copy performance. (For context, GeoArrow is the memory format, GeoParquet is the file format; GeoParquet decodes into GeoArrow). Is I mentioned in #37, the arrow-native encoding is faster because it's constant time access to any individual coordinate. (Though note that the arrow-native encoding is still provisional).

So it seems like there are four parts:

  • GeoArrow WKB reader
  • GeoArrow WKB writer
  • GeoArrow Arrow-native reader
  • GeoArrow Arrow-native writer

The WKB reader seems to be the easiest, so this PR starts there.

Open questions:

  • I'm not sure the right way to handle a vector of geometries. I suppose the current use of geometrycollection_begin and geometrycollection_end exposes the collection of geometries as a geo GeometryCollection?

    Or should this implement GeozeroDatasource instead? The issue with that is that that seems to need to implement a feature processor. And it's unclear whether the struct should require a single array or a chunk with multiple arrays. Should there be different structs for just the geometry array and for the entire table?

  • Since the struct is defined as

    pub struct GeoArrowWkb(pub BinaryArray<i32>);

    this takes ownership of the arrow array right? Seems like it would be preferable to take a reference here, but I don't really understand lifetimes yet 😅 .

  • There are two different rust Arrow implementations: arrow and arrow2. They're both pretty feature complete, so ideally geozero wouldn't enforce one or the other. I'm not sure the best way to do this... feature flags?

@pka
Copy link
Member

pka commented Jun 2, 2022

Wow, this was quick! Could you add some test data? Then I can play around and maybe answer your questions with code.

@kylebarron
Copy link
Member Author

kylebarron commented Jun 2, 2022

Could you add some test data?

I've been doing simple tests with a point dataset of capital cities, which is the result of

import geopandas as gpd
gdf = gpd.read_file(gpd.datasets.get_path('naturalearth_cities'))
gdf.to_feather('cities.arrow', compression='uncompressed')

Alternatively, you should be able to convert any file to a WKB-encoded GeoArrow file with GDAL 3.5.0. Note that it defaults to the provisional arrow-native encoding, so you need to force output as WKB:

docker run --rm -v $(pwd):/data osgeo/gdal:latest \
  ogr2ogr \
  /data/output.arrow \
  /data/input.shp \
  -lco COMPRESSION=NONE \
  -lco GEOMETRY_ENCODING=WKB

And then you can use the arrow2 IPC FileReader to iterate over the Arrow record batches in the file.

@michaelkirk
Copy link
Member

It'd be great if you could include some small sample data and parsing tests like that directly in your PR. e.g. we have some sample data for the other formats here: https://github.com/georust/geozero/tree/master/geozero/tests/data

@pka
Copy link
Member

pka commented Jun 2, 2022

My favorite dataset is

docker run --rm -v $(pwd):/data osgeo/gdal:latest \
  ogr2ogr \
  /data/countries.arrow \
  /data/countries.fgb \
  -lco COMPRESSION=NONE \
  -lco GEOMETRY_ENCODING=WKB

@pka
Copy link
Member

pka commented Jun 2, 2022

Wrote a first test:

#[cfg(test)]
mod test {
    use super::*;
    use crate::wkt::conversion::ToWkt;
    use arrow2::io::ipc::read;
    use std::fs::File;

    #[test]
    fn multipoly_file() -> arrow2::error::Result<()> {
        let mut file = File::open("tests/data/countries.arrow")?;
        let metadata = read::read_file_metadata(&mut file)?;
        let mut reader = read::FileReader::new(file, metadata, None);

        let columns = reader.next().unwrap()?;

        let array = &columns.arrays()[2];
        let wkbarr = array.as_any().downcast_ref::<BinaryArray<i32>>().unwrap();
        let wkt = wkbarr.to_wkt().unwrap();
        assert_eq!(
            &wkt[0..100],
            "GEOMETRYCOLLECTION(MULTIPOLYGON(((-59.572095 -80.040179,-59.865849 -80.549657,-60.159656 -81.000327,"
        );
        assert_eq!(
            &wkt[wkt.len()-100..],
            "-51.5,-58.55 -51.1,-57.75 -51.55,-58.05 -51.9,-59.4 -52.2,-59.85 -51.85,-60.7 -52.3,-61.2 -51.85))))"
        );
        Ok(())
    }
}

Did the following changes:

--- a/geozero/Cargo.toml
+++ b/geozero/Cargo.toml
@@ -58,6 +58,7 @@ flatgeobuf = "0.8.0"
 postgres = "0.19"
 sqlx = { version = "0.5", default-features = false, features = [ "runtime-tokio-native-tls", "macros", "time", "postgres", "sqlite" ] }
 tokio = { version = "1.17.0", default-features = false, features = ["rt-multi-thread"] }
+arrow2 = { version = "0.11.2", features = ["io_ipc"] }
 
 [build-dependencies]
 prost-build = "0.10"
diff --git a/geozero/src/arrow/geoarrow_reader.rs b/geozero/src/arrow/geoarrow_reader.rs
index 47608bf..a61e04c 100644
--- a/geozero/src/arrow/geoarrow_reader.rs
+++ b/geozero/src/arrow/geoarrow_reader.rs
@@ -9,11 +9,11 @@ use arrow2::array::BinaryArray;
 #[derive(Debug)]
 pub struct GeoArrowWkb(pub BinaryArray<i32>);
 
-impl GeozeroGeometry for GeoArrowWkb {
+impl GeozeroGeometry for BinaryArray<i32> {
     fn process_geom<P: GeomProcessor>(&self, processor: &mut P) -> Result<()> {
         // TODO: how does this fn know which index to pick?
         // Or is this fn intended to work on a single geometry?
-        process_geoarrow_wkb_geom(&self.0, processor)
+        process_geoarrow_wkb_geom(&self, processor)
     }
 }

@kylebarron
Copy link
Member Author

-impl GeozeroGeometry for GeoArrowWkb {
+impl GeozeroGeometry for BinaryArray<i32> {

Interesting! So we don't need to define our own struct at all? We can just implement our traits for the arrow types directly.

Along the same lines, seems like there should be a way then to implement traits on both arrow2's and arrow's BinaryArray. Then it seems it would work out of the box for whichever package the user is already using.

@pka
Copy link
Member

pka commented Jun 2, 2022

Interesting! So we don't need to define our own struct at all? We can just implement our traits for the arrow types directly.

Right, we can directly implement the trait for the arrow types.

Along the same lines, seems like there should be a way then to implement traits on both arrow2's and arrow's BinaryArray. Then it seems it would work out of the box for whichever package the user is already using.

Maybe we have to repeat the code for the different types or write a macro it there are too many similar implementations.

@pka
Copy link
Member

pka commented Jun 2, 2022

Found a generic way:

impl GeozeroGeometry for BinaryArray<i64> {
    fn process_geom<P: GeomProcessor>(&self, processor: &mut P) -> Result<()> {
        process_geoarrow_wkb_geom(&self, processor)
    }
}

fn process_geoarrow_wkb_geom<T: Offset>(
    array: &BinaryArray<T>,
    processor: &mut impl GeomProcessor,
) -> Result<()> {
  // ...
}

@kylebarron kylebarron marked this pull request as ready for review July 17, 2022 18:42
@kylebarron
Copy link
Member Author

I added a countries.arrow test data file (with geometries in WKB) generated from the above command. I made the suggested changes and put the impl directly on BinaryArray for i32 and i64.

In the future, at least once geoarrow/geoarrow#22 is merged, we could also put an impl on the arrow ExtensionType, so that the conversion also works on arrays of data type ExtensionType("geoarrow.wkb", BinaryArray<T>).

Let me know if you'd like more tests than the one @pka suggested above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants