saving and reading vector data cubes #26
I'm interested in following this. Do you have an example of an existing NetCDF file that has the sort of encoded geometries you'd like to produce? That would be a very useful place to start. Presumably Xarray can open that file--it just doesn't decode the data into anything useful for computation. I think the general pattern we're looking for is to create an xarray accessor that manages the encoding / decoding of the geometry data from this "raw" state (which is serializable) to something useful, which is not. The pint-xarray package provides a template of what this might look like.

```python
# Open the raw data
ds = xr.open_dataset("encoded.nc")
# convert the geometry data to something useful
ds_decoded = ds.xvec.decode_geometries()
# the inverse operation
ds_encoded = ds_decoded.xvec.encode_geometries()
xr.testing.assert_equal(ds, ds_encoded)
ds_encoded.to_netcdf("round_trip.nc")
```

To make this more convenient for the user, this could be wrapped in a pluggable xarray backend for I/O. One problem with our ecosystem is that there is no one central package that manages this sort of decoding for geospatial data. Rioxarray does a bunch, leveraging GDAL. Odc-geo does things a little differently. MetPy has its own handling of CRS and other geo stuff. So does cf-xarray. Makes me wish geo-xarray had not turned into a zombie package. It's really needed. Anyway, I recommend following the path I sketched above to prototype this functionality. |
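As a rough sketch of that pluggable-backend idea: xarray's `BackendEntrypoint` plugin API is real, but `xvec.decode_geometries` here is the hypothetical accessor method from the snippet above, not an existing function.

```python
# Sketch only: assumes the hypothetical ds.xvec.decode_geometries() accessor
# from the snippet above; the BackendEntrypoint plugin API itself is real.
import xarray as xr
from xarray.backends import BackendEntrypoint


class VectorCubeBackendEntrypoint(BackendEntrypoint):
    """Open a NetCDF file and decode its vector geometries in one step."""

    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        # delegate the raw file I/O to an existing backend
        ds = xr.open_dataset(
            filename_or_obj, engine="netcdf4", drop_variables=drop_variables
        )
        # decode the serializable geometry representation into shapely objects
        return ds.xvec.decode_geometries()
```

Registered under the `xarray.backends` entry-point group, this would let users write something like `xr.open_dataset("encoded.nc", engine="vector-cube")` (the engine name is made up) and get a decoded cube directly.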
@aulemahal already taught cf-xarray to do this, but only for "point geometries". IMO we should extend it to support other things (PR welcome!). Granted, it's a helper function right now, but we could easily add it. Part of the point of cf-xarray is to have a single place for these kinds of utilities that can be used in many places. |
Just for reference, the other day I had a very similar question to @edzer in the |
@e-kotov no, that is not what I said; this only applies to the case where more than one dimension is associated with vector geometries. For the (most common) use case of a single vector dimension, CDL as described in the CF conventions @martinfleis refers to can be used perfectly well, leading to NetCDF and Zarr implementations (they both follow CDL / CF). Special cases such as weird raster cells specified by their vertices - think hexagons, DGGS - can also be covered by that. If we want to write and read more than one vector geometry dimension, we can still use CDL (NetCDF, Zarr) but would have to go beyond the CF conventions. Nothing wrong with that; if we have a strong use case we can propose to extend CF. In R, I implemented |
Thanks all! We can split the issue into two problems:

1. reading and writing cubes with a single geometry dimension, which the CF geometry conventions already cover, and
2. reading and writing cubes with multiple geometry dimensions, which goes beyond CF.
For the first one, I have created an example NetCDF encoding polygons using this R code (in `stars`):

```r
library(stars)
nc = st_read(system.file("gpkg/nc.gpkg", package = "sf"))
location = st_geometry(nc)
mode = c("car", "bike", "foot")
day = 1:100
hour = 0:23
dims = st_dimensions(location = location, mode = mode, day = day, hour = hour)
n = dim(dims)
traffic = array(rpois(prod(n), 10), dim = n) # simulated traffic counts
ds = st_as_stars(list(traffic = traffic), dimensions = dims)
write_mdim(ds, "stars_cube.nc")
```

The file is attached (had to manually zip it to upload) -> stars_cube.nc.zip

Xarray does read it, and it seems that we should fairly easily be able to convert that information into an array of shapely geometries with a pyproj.CRS assigned via xvec.

```python
ds_stars = xr.open_dataset("stars_cube.nc")
print(ds_stars)
```
```
<xarray.Dataset>
Dimensions:          (hour: 24, day: 100, mode: 3, location: 100, node: 2529,
                      part: 108)
Coordinates:
  * location         (location) float32 1.0 2.0 3.0 4.0 ... 97.0 98.0 99.0 100.0
  * mode             (mode) object 'car' 'bike' 'foot'
  * day              (day) float32 1.0 2.0 3.0 4.0 5.0 ... 97.0 98.0 99.0 100.0
  * hour             (hour) float32 0.0 1.0 2.0 3.0 4.0 ... 20.0 21.0 22.0 23.0
  * node             (node) float32 1.0 2.0 3.0 ... 2.528e+03 2.529e+03
    x                (node) float32 ...
    y                (node) float32 ...
Dimensions without coordinates: part
Data variables:
    traffic          (hour, day, mode, location) float32 ...
    crs              |S1 ...
    node_count       (location) float32 ...
    part_node_count  (part) float32 ...
    interior_ring    (part) float32 ...
    geometry         float32 ...
Attributes:
    Conventions:  CF-1.6
```

where `print(ds_stars.crs)`
```
<xarray.DataArray 'crs' ()>
[1 values with dtype=|S1]
Attributes:
    grid_mapping_name:            latitude_longitude
    long_name:                    CRS definition
    longitude_of_prime_meridian:  0.0
    semi_major_axis:              6378206.4
    inverse_flattening:           294.978698213898
    spatial_ref:                  GEOGCS["NAD27",DATUM["North_American_Datum_...
    crs_wkt:                      GEOGCS["NAD27",DATUM["North_American_Datum_...
```

can be interpreted by pyproj. And then do the same the other way.

For the second one (multiple geometry dimensions in a file), I believe that should be discussed separately and probably somewhere else. We can then implement here, or in other packages, whatever spec a community agrees on. |
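For a sense of what that conversion could look like, here is a rough, untested sketch (my addition, based on the CF geometries layout visible in the repr above: `part_node_count` splits the flat node arrays into rings, `node_count` groups rings per location, and `interior_ring` flags holes):

```python
import numpy as np
import shapely
import xarray as xr

ds = xr.open_dataset("stars_cube.nc")

x = ds.x.values
y = ds.y.values
part_sizes = ds.part_node_count.values.astype(int)
is_hole = ds.interior_ring.values.astype(int)

# split the flat node arrays into one coordinate array per ring
splits = np.cumsum(part_sizes)[:-1]
rings = [
    np.column_stack(xy) for xy in zip(np.split(x, splits), np.split(y, splits))
]

geoms, part = [], 0
for n in ds.node_count.values.astype(int):
    shells, used = [], 0  # each shell: [exterior, hole, hole, ...]
    while used < n:
        if is_hole[part]:
            shells[-1].append(rings[part])  # hole of the previous exterior
        else:
            shells.append([rings[part]])  # a new exterior ring
        used += part_sizes[part]
        part += 1
    polys = [shapely.Polygon(s[0], s[1:]) for s in shells]
    geoms.append(polys[0] if len(polys) == 1 else shapely.MultiPolygon(polys))

geometry = np.array(geoms, dtype=object)
```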
What about a custom codec (filter) for encoding / decoding the 1-d geometry coordinates of the vector data cubes? Regarding Zarr (Python), this can be registered dynamically via numcodecs. For HDF5 / netCDF4 I'm less sure, but it looks like custom filters are supported too (e.g., https://github.com/silx-kit/hdf5plugin)? The GeoParquet format currently encodes geometries in the WKB format, which is used by geopandas to read/write Parquet and Feather files. For a Python/Zarr prototype, I guess we could adapt some logic from geopandas and register a custom numcodecs codec that encodes (decodes) to (from) WKB? Such a codec would fit nicely in Xvec, actually. Since it is handled by Zarr, there wouldn't be anything specific to do on the Xarray side to convert the raw vector data. Perhaps the same can be done for h5py (netCDF4)? I'm not sure how best to deal with metadata, though. The GeoZarr spec seems a nice place to discuss / address this. For vector-specific metadata, it might be worth looking at the GeoArrow specs. I'm not familiar with the CF conventions for geometries, but at a first glance it looks a bit more convoluted (and maybe less flexible?) than the custom codec approach. More generally, it would be nice if we could leverage the best from both Arrow/Parquet and Zarr (HDF5/NetCDF). Those formats are de facto pretty standard or on their way to becoming so. If that's possible, it looks like an option that would be both efficient and reasonably portable for vector data cube I/O. |
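To illustrate, a minimal sketch of such a codec, assuming shapely 2.x and numcodecs; the class name, `codec_id`, and the length-prefix framing are all invented for the example:

```python
import numpy as np
import shapely
from numcodecs import register_codec
from numcodecs.abc import Codec


class WKBGeometryCodec(Codec):
    """Encode a 1-d object array of shapely geometries as length-prefixed WKB."""

    codec_id = "wkb-geometry"  # invented identifier

    def encode(self, buf):
        parts = []
        for geom in np.asarray(buf, dtype=object).ravel():
            wkb = shapely.to_wkb(geom)
            parts.append(len(wkb).to_bytes(8, "little"))  # frame each blob
            parts.append(wkb)
        return b"".join(parts)

    def decode(self, buf, out=None):
        data, geoms, pos = bytes(buf), [], 0
        while pos < len(data):
            n = int.from_bytes(data[pos : pos + 8], "little")
            geoms.append(shapely.from_wkb(data[pos + 8 : pos + 8 + n]))
            pos += 8 + n
        result = np.array(geoms, dtype=object)
        if out is not None:
            out[...] = result
            return out
        return result


register_codec(WKBGeometryCodec)
```

With Zarr v2, such a codec would be passed as the `object_codec` of an array created with `dtype=object`.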
I agree interoperability is a great goal to strive for. I have been assuming that data cubes cannot be serialized to Parquet / Arrow because those formats are fundamentally aimed at dataframes (and not ND-arrays). Maybe vector data are different, however. If your data fit into Parquet, Pandas is probably the right choice, rather than Xarray. Xarray's internal data model is based on NetCDF. However, we can serialize this data model to many different formats. Something like this:

```mermaid
graph TD;
    CF-conventions-->NetCDF-Data-Model;
    NetCDF-Data-Model-->NetCDF3-Binary-Format;
    NetCDF-Data-Model-->HDF5;
    NetCDF-Data-Model-->Zarr;
    NetCDF-Data-Model-->Arrow?;
```
That said, I'd love for someone to show me how to serialize the NetCDF data model to Arrow / Parquet.
This is an interesting idea. Should all of Xarray's CF decoding be re-implemented as custom codecs? I tend to think of codecs as decompression-related. Packing too much logic into the codec concept feels a bit fragile. On the other hand, Xarray's convention encoding / decoding logic is remarkably similar to how codecs work, so maybe this is worth pursuing 🤔 . |
One additional point... If a standard already exists for this problem, we should absolutely try to use it rather than inventing a new one! I don't need to share the XKCD cartoon, which we all know by heart by now. Edit: I read @edzer's comment more carefully:
...and now I understand that indeed a new standard is needed. My experience with the CF conventions is that the timescale for something new to be included is about 2 years. I assume we want to move quicker here, so coming up with an ad-hoc extension to those conventions and demonstrating interoperability between R and Python seems like a great place to start. |
I think the strength of NetCDF and Zarr is that you only need the NetCDF library to read everything, nothing else. Introducing something like WKB into it (assuming you can: with a bit array? and how would you deal with the ragged nature?) would mean that someone using just the NetCDF library could no longer interpret the geometries. With the CF geometry conventions, you can. I wouldn't start thinking about giving up this level of interoperability: NetCDF will soon turn 50. The CF geometry conventions are really, AFAICT, the simplest you can get. I believe that Arrow takes the same approach for speed reasons (right @paleolimbot?). The advantage over WKB is that with explicit indexes you can easily (directly) jump to the last sub-polygon, for instance; with WKB you'd have to go through the full set of rings before finding the last one. These CF geometries give up on the difference between LINESTRING and MULTILINESTRING, and between POLYGON and MULTIPOLYGON, so there are no clean roundtrips of simple features, but you could argue that those distinctions are a luxury (and often a hassle) anyway. @martinfleis great example; for testing, it would also make sense to have a polygon with holes, which this dataset doesn't have.
I'll see if I can think of an approach for how this could be done with CDL / NetCDF, building on the CF approach. Maybe adding one (top) level of nesting, indicating the dimension involved, would be enough. @rabernat I also understand that Parquet/Arrow cater for columnar data, not arrays. You can put an array in a column, but you'll lose all its advantages. |
I think that the only thing we could borrow from Arrow is the GeoArrow geometry encoding, but it is incompatible with the CF encoding. Arrow uses interleaved coordinate values ([x, y, x, y, ...]). Btw, this discussion on saving rasters as Arrow may be relevant as well: geoarrow/geoarrow#24. I would also try to stick as much as we can to existing standards. The CF geometry encoding looks straightforward to work with, and if we can find a way of saving multiple dimensions, we can implement it in R and Python as a de facto standard and try to get it into the CF conventions later, if going through CF first would slow things down. We just need to ensure that our extension does not break reading of the same file elsewhere. |
Fair enough. You have more experience than me with WKB, CF (geometry), etc. to know what would be the best option.
Is it settled? There seems to be a long unresolved discussion here: geoarrow/geoarrow#26. I was also naively wondering how compatible the CF conventions and the GeoArrow specs are for encoding geometries, and how concerning poor compatibility would be regarding performance and/or other aspects? Especially if we expect many back-and-forth conversions between data cubes <-> data frames in workflows that involve vector data (usually more than for gridded/raster data)? |
Probably not entirely (I guess that is the main reason why GeoParquet 1.0 still uses WKB), but I lost track of that a bit. I think that at least GeoPolars and cuSpatial already use the interleaved option.
We have three sides here, not two. One is the CF representation, another is GeoArrow, but we also need to keep in mind that in most cases we need GEOS geometries to work with (that applies to both Python and R). So IMHO fast conversion between CF and GEOS (shapely) is more important in our ecosystem than CF to GeoArrow. And for shapely, having separate x and y arrays is better than an interleaved array that would need to be split into separate arrays anyway (or a 2D array, but that still involves reshuffling). |
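A tiny illustration of that point with shapely 2.x (my example, not from the thread):

```python
import numpy as np
import shapely

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# separate x/y arrays feed the vectorized constructors directly
pts = shapely.points(x, y)

# an interleaved buffer has to be reshaped (reshuffled) first
interleaved = np.array([1.0, 4.0, 2.0, 5.0, 3.0, 6.0])
pts_again = shapely.points(interleaved.reshape(-1, 2))

assert shapely.equals(pts, pts_again).all()
```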
Yes, that's true for shapely. For S2 (s2geography / spherely) I think it is possible to reuse external, in-memory coordinate data. I vaguely remember @paleolimbot sharing some ideas about using S2 directly with Arrow structures (https://dewey.dunnington.ca/post/2021/prototyping-an-apache-arrow-representation-of-geometry/). |
I just wanted to chime in that if interleaved coordinates become a blocker for a lot of people adopting GeoArrow, that's a strong reason to drop that requirement. |
Other than having just read them now, I'm not all that familiar with the CF conventions for geometry... it looks like the only difference is that they store "lengths" instead of "cumulative lengths", and, as @edzer noted, there is no distinction between multi and non-multi types and no support for mixed types. I think Edzer's point about interoperability is huge: no need for a WKB parser! And nothing does ndarrays like NetCDF. From reading this thread I can't help but side with NetCDF + CF conventions (but I'm very new to this discussion).
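A one-line illustration of that lengths-vs-offsets difference (my addition):

```python
import numpy as np

lengths = np.array([2, 3, 4])                        # CF-style per-part lengths
offsets = np.concatenate([[0], np.cumsum(lengths)])  # Arrow-style: [0 2 5 9]
assert np.array_equal(np.diff(offsets), lengths)     # and back again
```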
Yes, S2 is very flexible about its data structure (basically every edge is a virtual method call...it makes the bet that many of those edges will never need to be resolved). One could write an S2Geography subclass that lazily reads from the file. That said, I don't think it's a good use of anybody's time to actually do that.
My current prediction is that the GeoArrow specification will allow for both, and there will be a small vendorable library that will make it trivial to translate between WKB, WKT, and both forms of coordinate representation. That small vendorable library ( https://github.com/geoarrow/geoarrow-cpp ) is progressing as fast as my kids' capacity for napping/sleeping allows, which is at present very, very slow. |
Hi All. I think the discussion here is largely on target. We designed the CF geometry spec with simplicity and the 90% level of use-case support in mind. @edzer is totally on point with regard to the design theory and accessibility of the encoding here: https://github.com/martinfleis/xvec/issues/26#issuecomment-1422599146, but not quite correct that multi types aren't supported. See the examples in the vignette linked below for details. A couple of resources:

- A poster that Tim and I prepared that talks about the use cases and implementation strategy.
- A vignette for my R package that implements the core CF geometry spec. The package is built into the {stars} pure-NetCDF reader. The package also writes the NetCDF geometry format with time series. This vignette and the package test data might be a good place for someone to start. I think the same tests are in Tim's package.
- Tim's Python reference implementation docs.

Happy to help clarify things or cross-validate tests. Note that I also worked with a student a few years back to get the core of the CF geometry into GDAL (for better or worse) 😅 The format really hasn't caught hold, but I still think it has a place in the mix. It's dead simple and pretty easy to read / write from WKB/WKT structures. Noting some comments about adding further dimensionality: no reason not to. Just call it a custom extension and move on. If it catches on, bring it to CF. I'd be happy to help evolve that part of the spec. Cheers! |
I'm still working on wrapping my head around the specifics of vector data cubes (and in general rather unfamiliar with CF etc), but naively it seems like you might still be able to use Arrow. In a simple case, where the geometry column maps against the first axis of the data cube, you could store the data cube in a nested fixed size list column (essentially the raster/tensor type). Then as I understand it, the entire column's data buffer would be equivalent to a C-order multidimensional array. You could store the geometries in any format you'd like.
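A quick check of that claim with pyarrow (my example; a toy 4 x 3 cube slice):

```python
import numpy as np
import pyarrow as pa

data = np.arange(12, dtype="float64").reshape(4, 3)  # 4 geometries x 3 values

# one fixed-size list per geometry; values are stored flat in C order
col = pa.FixedSizeListArray.from_arrays(data.ravel(order="C"), 3)

# the flat values buffer round-trips to the original 2-D array
back = col.flatten().to_numpy().reshape(4, 3)
assert np.array_equal(data, back)
```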
Right now geopolars uses a separated/struct array because polars doesn't support the |
I played around a bit with cf_xarray and xvec and made the following example work. The sample file I'm using is from the ncdfgeom R package: https://github.com/DOI-USGS/ncdfgeom/tree/main/inst/extdata. Note that this file doesn't actually conform to the CF geometries spec, so it has to be augmented before it can be used.

```python
import xarray as xr
import cf_xarray
import xvec

# to download:
# wget https://github.com/DOI-USGS/ncdfgeom/raw/main/inst/extdata/example_huc_eta.nc
ds = xr.open_dataset('example_huc_eta.nc')

# assign missing geometry_container variable
# Note: we omit node_count because these are point data
ds.coords['geometry_container'] = 0
ds.geometry_container.attrs['geometry_type'] = 'point'
ds.geometry_container.attrs['node_coordinates'] = 'lat lon'

# use cf_xarray to convert to shapely geometries
# (because node_count was missing, the dimension name gets auto-assigned to "features")
geometries = cf_xarray.cf_to_shapely(ds)

# now make this a coordinate variable on the original dataset
ds.coords['geometry'] = geometries.rename({"features": "station"})

# now make this a dimension, which creates a new pandas index
# (It doesn't make sense to me to first create the pandas index. Is this efficient?)
ds_swapped = ds.swap_dims({"station": "geometry"})

# now set the geometry index
ds_cube = ds_swapped.xvec.set_geom_indexes("geometry", crs="4326")
```

This shows how it is plausible to decode CF geometries from any Xarray dataset into an xvec
Once those two are done, we would want to decide where to create the xvec indexes. It could possibly be done by cf_xarray during the decoding step:

```python
ds = xr.open_dataset("file_with_cf_geometries.nc")
ds = ds.cf.decode_geometries()
ds = ds.xvec.set_geom_indexes()
```
|
I think geoarrow can currently get you a

```
python -m pip install "https://github.com/geoarrow/geoarrow-c/archive/refs/heads/main.zip#egg=geoarrow&subdirectory=python"
```

```python
>>> import geoarrow.pyarrow as ga
>>> import numpy as np
>>> ga.linestring().from_geobuffers(
...     None,
...     np.array([0, 2, 5], dtype=np.int32),
...     np.array([1, 2, 3, 4, 5], dtype=np.double),
...     np.array([6, 7, 8, 9, 10], dtype=np.double),
... )
LinestringArray:LinestringType(geoarrow.linestring)[2]
<LINESTRING (1 6, 2 7)>
<LINESTRING (3 8, 4 9, 5 10)>
```

Recent versions of |
Currently the way this works (sans any arrow) is something like:
It's unclear to me how best to plug in Arrow, since Xarray is not fundamentally a dataframe library. Presumably Arrow could help by reducing the number of memory copies? Or make step 4 faster? |
I'm rather new to the Python universe, so I'm not sure what the best way to plug in Arrow is either. It is true that the

If the ultimate endpoint always has to be a numpy array with dtype=object, implementing `from_ragged_array` with non-interleaved coordinates will probably be a lot simpler than invoking geoarrow/arrow. If a `pandas.Series` is acceptable, there may be some optimizations possible to avoid copies. |
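For reference, `shapely.from_ragged_array` already exists in shapely 2.x and takes a coordinate array plus offset arrays (my example):

```python
import numpy as np
import shapely
from shapely import GeometryType

coords = np.array(
    [[1.0, 6.0], [2.0, 7.0], [3.0, 8.0], [4.0, 9.0], [5.0, 10.0]]
)
offsets = (np.array([0, 2, 5]),)  # part offsets into coords

# produces the same two linestrings as the geoarrow example above
lines = shapely.from_ragged_array(GeometryType.LINESTRING, coords, offsets)
```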
AFAIK there aren't many constraints on the Xarray side regarding the types of arrays supported. I think any type implementing the Python (NumPy) array API could in principle be wrapped in an Xarray variable. If the goal is to use it with an Xarray index, there is even less constraint: Xarray internally relies on some adapters to reuse Pandas (multi-)index objects within both Xarray variables and indexes so that they share the same data. Likewise, it should also be possible to reuse a |
@rabernat I don't think steps 5 and 6 do any copying of the underlying shapely objects, but I might be wrong. I agree that Arrow will probably not bring a lot of added value if at some point (step 4) we need to create new shapely (GEOS) objects that can't reuse the same Arrow buffers. |
For Geographic indexes built on top of S2Geometry, I think @paleolimbot has shown that it could be possible to use S2 with Arrow without copying the data. So I guess it would make more sense to plug in Arrow in this case. However, there is still a lot of work to put all the pieces together, at least on the Python side. |
I don't think so. We should ensure that the way we encode geometries follows a similar logic to what geoarrow implements, but I don't think there's any reason to plug Arrow into this process when the source is NetCDF or Zarr.
That depends on what
I'm not an expert on pyarrow, but I am afraid that shapely's implementation of the GEOS bindings may not be compatible with this idea, as it assumes a numpy array. But I may be wrong here. |
Do you construct the index from an array of shapely objects anyway? In which case, it seems preferable to have |
Yes, we do expect an array of shapely objects. I am fine depending on |
With pyarrow's `FixedShapeTensorArray` I can get a tensor column out of a `DataArray`:

```python
import numpy as np
import pyarrow as pa
import xarray as xr


def data_array_to_pyarrow(arr: xr.DataArray) -> pa.FixedShapeTensorArray:
    value_type = pa.from_numpy_dtype(arr.dtype)
    shape = arr.shape[1:]
    size = arr.size // arr.shape[0]  # list size must be an integer
    arrow_type = pa.fixed_shape_tensor(value_type, shape, dim_names=arr.dims[1:])
    pa_arr = pa.FixedSizeListArray.from_arrays(np.ravel(arr, order="C"), size)
    return pa.ExtensionArray.from_storage(arrow_type, pa_arr)
```
return pa.ExtensionArray.from_storage(arrow_type, pa_arr) But this seems to omit:
It doesn't seem possible to round trip to round trip the other dimensions' coordinates after the first? Testing with some example data from the xvec docs: import geopandas as gpd
import numpy as np
import pandas as pd
import pyarrow as pa
import xarray as xr
import xvec
from geodatasets import get_path
counties = gpd.read_file(get_path("geoda.natregimes"))
pop1960 = xr.DataArray(counties.PO60, coords=[counties.geometry], dims=["county"])
pop1960 = pop1960.xvec.set_geom_indexes("county", crs=counties.crs)
population = xr.DataArray(
counties[["PO60", "PO70", "PO80", "PO90"]],
coords=(counties.geometry, [1960, 1970, 1980, 1990]),
dims=("county", "year"),
).xvec.set_geom_indexes("county", crs=counties.crs) the |
Interesting! If we could save geometries as GeoArrow, could we potentially load this directly to lonboard for visualisation? Assuming we resolve the coordinate issue. |
Also, this is limited to DataArray, right? So no flexibility to save a Dataset so far. |
Yeah, that's the idea.
I'm still pretty new to xarray, so still wrapping my head around some concepts, but you can have multiple |
I'm not yet very familiar with Arrow, but reading your comment @kylebarron it looks like reusing FixedShapeTensorArray in Xarray (or converting it to/from Xarray objects) would be better addressed at the Variable level. An Xarray Variable may wrap all sorts of array-like objects beyond numpy and dask arrays. You can see, for example, a number of wrapper classes in https://github.com/pydata/xarray/blob/main/xarray/core/indexing.py. Xarray DataArray and Dataset are essentially collections of aligned Variable objects (a DataArray can be viewed as a special case of a Dataset with a unique data variable). Both are closer to the

Mid or long-term, interoperability with Arrow is something that we'll need to address in Xarray itself if we want to keep the same level of interoperability with Pandas. |
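To make the Variable-level idea concrete, here is a rough, hypothetical sketch of wrapping a `FixedShapeTensorArray` behind a numpy-like interface (whether this is enough to satisfy xarray's duck-array checks depends on the xarray version):

```python
import numpy as np
import pyarrow as pa


class TensorArrayWrapper:
    """Hypothetical numpy-like view over a pyarrow FixedShapeTensorArray."""

    def __init__(self, arr: pa.FixedShapeTensorArray):
        self._arr = arr
        self.shape = (len(arr), *arr.type.shape)
        self.dtype = np.dtype(arr.type.value_type.to_pandas_dtype())
        self.ndim = len(self.shape)

    def __array__(self, dtype=None):
        # materializes the data; a real implementation would try to stay zero-copy
        return np.asarray(self._arr.to_numpy_ndarray(), dtype=dtype)

    def __getitem__(self, key):
        return self._arr.to_numpy_ndarray()[key]
```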
At the moment, we have a decent set of functionality to work with in-memory vector data cubes. What we do not have is a way of saving them to disk and reading them back again, apart from using `pickle`, which is not very interoperable.

We should find a way of saving to a file format that gives us interoperability with @edzer's `stars` in RSpatial. @edzer, you were mentioning that NetCDF is able to handle VDC. Does `stars` implement that? Can we create a data cube indexed by geometry and save it via GDAL to NetCDF? I tried only very briefly to play with the `write_stars` function, to no avail[^1].

The spec for the NetCDF way of storing geometries is here, but as someone who has never worked with the file format, it is not super clear to me how it all works together.

Over in Python, there is some work @dcherian posted on conversion of shapely geometries to CF geometries and back (https://cf-xarray.readthedocs.io/en/latest/generated/cf_xarray.shapely_to_cf.html)[^2]. There is also https://github.com/twhiteaker/cfgeom, which may potentially be used (although it is only a reference implementation that is not maintained).

So it seems that NetCDF should be able to store vector data cubes, and we may just need to figure out some wrappers for convenient IO from xvec-backed xarray objects. But I'd welcome some help, or at least guidance, from someone more familiar with the CF conventions.

The other option seems to be Zarr, as discussed here, but that is only an idea at this stage, and the (Geo)Zarr spec is not ready as of now.

The last option is to convert the cube to a long-form dataframe and save it as GeoParquet, but that kind of defeats the point of having a cube in the first place.

Footnotes

[^1]: Getting `Error in !all.equal(match(xydims, names(d)), 1:2) : invalid argument type` when trying to use `write_stars(st, 'test.nc')` on the example cube from https://r-spatial.github.io/stars/articles/stars1.html#vector-data-cube-example. ↩

[^2]: Only points are implemented so far, but that may change. ↩