-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GeoArrow WKB reader #39
Conversation
Wow, this was quick! Could you add some test data? Then I can play around and maybe answer your questions with code. |
I've been doing simple tests with a point dataset of capital cities, which is the result of import geopandas as gpd
gdf = gpd.read_file(gpd.datasets.get_path('naturalearth_cities'))
gdf.to_feather('cities.arrow', compression='uncompressed') Alternatively, you should be able to convert any file to a WKB-encoded GeoArrow file with GDAL 3.5.0. Note that it defaults to the provisional arrow-native encoding, so you need to force output as WKB: docker run --rm -v $(pwd):/data osgeo/gdal:latest \
ogr2ogr \
/data/output.arrow \
/data/input.shp \
-lco COMPRESSION=NONE \
-lco GEOMETRY_ENCODING=WKB And then you can use the |
It'd be great if you could include some small sample data and parsing tests like that directly in your PR. e.g. we have some sample data for the other formats here: https://github.com/georust/geozero/tree/master/geozero/tests/data |
My favorite dataset is
|
Wrote a first test: #[cfg(test)]
mod test {
use super::*;
use crate::wkt::conversion::ToWkt;
use arrow2::io::ipc::read;
use std::fs::File;
#[test]
fn multipoly_file() -> arrow2::error::Result<()> {
let mut file = File::open("tests/data/countries.arrow")?;
let metadata = read::read_file_metadata(&mut file)?;
let mut reader = read::FileReader::new(file, metadata, None);
let columns = reader.next().unwrap()?;
let array = &columns.arrays()[2];
let wkbarr = array.as_any().downcast_ref::<BinaryArray<i32>>().unwrap();
let wkt = wkbarr.to_wkt().unwrap();
assert_eq!(
&wkt[0..100],
"GEOMETRYCOLLECTION(MULTIPOLYGON(((-59.572095 -80.040179,-59.865849 -80.549657,-60.159656 -81.000327,"
);
assert_eq!(
&wkt[wkt.len()-100..],
"-51.5,-58.55 -51.1,-57.75 -51.55,-58.05 -51.9,-59.4 -52.2,-59.85 -51.85,-60.7 -52.3,-61.2 -51.85))))"
);
Ok(())
}
} Did the following changes: --- a/geozero/Cargo.toml
+++ b/geozero/Cargo.toml
@@ -58,6 +58,7 @@ flatgeobuf = "0.8.0"
postgres = "0.19"
sqlx = { version = "0.5", default-features = false, features = [ "runtime-tokio-native-tls", "macros", "time", "postgres", "sqlite" ] }
tokio = { version = "1.17.0", default-features = false, features = ["rt-multi-thread"] }
+arrow2 = { version = "0.11.2", features = ["io_ipc"] }
[build-dependencies]
prost-build = "0.10"
diff --git a/geozero/src/arrow/geoarrow_reader.rs b/geozero/src/arrow/geoarrow_reader.rs
index 47608bf..a61e04c 100644
--- a/geozero/src/arrow/geoarrow_reader.rs
+++ b/geozero/src/arrow/geoarrow_reader.rs
@@ -9,11 +9,11 @@ use arrow2::array::BinaryArray;
#[derive(Debug)]
pub struct GeoArrowWkb(pub BinaryArray<i32>);
-impl GeozeroGeometry for GeoArrowWkb {
+impl GeozeroGeometry for BinaryArray<i32> {
fn process_geom<P: GeomProcessor>(&self, processor: &mut P) -> Result<()> {
// TODO: how does this fn know which index to pick?
// Or is this fn intended to work on a single geometry?
- process_geoarrow_wkb_geom(&self.0, processor)
+ process_geoarrow_wkb_geom(&self, processor)
}
} |
-impl GeozeroGeometry for GeoArrowWkb {
+impl GeozeroGeometry for BinaryArray<i32> { Interesting! So we don't need to define our own struct at all? We can just implement our traits for the arrow types directly. Along the same lines, seems like there should be a way then to implement traits on both |
Right, we can directly implement the trait for the arrow types.
Maybe we have to repeat the code for the different types or write a macro it there are too many similar implementations. |
Found a generic way: impl GeozeroGeometry for BinaryArray<i64> {
fn process_geom<P: GeomProcessor>(&self, processor: &mut P) -> Result<()> {
process_geoarrow_wkb_geom(&self, processor)
}
}
fn process_geoarrow_wkb_geom<T: Offset>(
array: &BinaryArray<T>,
processor: &mut impl GeomProcessor,
) -> Result<()> {
// ...
} |
I added a In the future, at least once geoarrow/geoarrow#22 is merged, we could also put an Let me know if you'd like more tests than the one @pka suggested above. |
For #37. As a disclaimer, this is my first Rust PR to a non-personal project, so suggestions welcomed 🙂. I'm new to the geozero API as well, so I tried to model after existing code. I figured I'd put up a draft PR to start some discussion.
Most likely GeoArrow/GeoParquet will end up having two geometry formats: WKB for universal compatibility and a "native" Arrow encoding for zero-copy performance. (For context, GeoArrow is the memory format, GeoParquet is the file format; GeoParquet decodes into GeoArrow). Is I mentioned in #37, the arrow-native encoding is faster because it's constant time access to any individual coordinate. (Though note that the arrow-native encoding is still provisional).
So it seems like there are four parts:
The WKB reader seems to be the easiest, so this PR starts there.
Open questions:
I'm not sure the right way to handle a vector of geometries. I suppose the current use of
geometrycollection_begin
andgeometrycollection_end
exposes the collection of geometries as ageo
GeometryCollection
?Or should this implement
GeozeroDatasource
instead? The issue with that is that that seems to need to implement a feature processor. And it's unclear whether the struct should require a single array or a chunk with multiple arrays. Should there be different structs for just the geometry array and for the entire table?Since the struct is defined as
this takes ownership of the arrow array right? Seems like it would be preferable to take a reference here, but I don't really understand lifetimes yet 😅 .
There are two different rust Arrow implementations:
arrow
andarrow2
. They're both pretty feature complete, so ideallygeozero
wouldn't enforce one or the other. I'm not sure the best way to do this... feature flags?