Exploring the Foundations and Goals of the GeoZarr Format #3

Closed
christophenoel opened this issue Jan 21, 2023 · 6 comments

@christophenoel

christophenoel commented Jan 21, 2023

Here is a series of ideas to initiate a discussion on the foundations of the GeoZarr format and to exchange views on what its goals should be.

GeoZarr was based on three fundamental principles:

  1. Provide cloud-native (optimised) access (i.e. an HTTP API which does not require an intermediate service)
  2. Support multidimensional data (hyperspectral, altitude, etc.)
  3. Provide valuable geospatial data description (not restricted to 2D!)

In my opinion, this implies certain assumptions:

  • NetCDF already has its NCZarr project, which did not meet our concerns. GeoZarr reuses the CF conventions to describe the data, but does not pursue the same goals (and certainly aims to be simpler).
  • 2D rasters already have Cloud-Optimised GeoTIFF, so GeoZarr must address multidimensional aspects.
  • GeoZarr should provide guidance (at least) for typical geospatial data types: multispectral, hyperspectral, ARD, UAV data, etc.
  • GeoZarr should address fundamental aspects for use on S3 storage (e.g. rechunking).
  • In the case of fine-grained properties, GeoZarr should define "Requirements Classes" in order to specify compatibility levels.
  • GeoZarr should address known needs/requested features such as symbology, multi-scales, etc.
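
To illustrate why rechunking matters on object storage such as S3, where each chunk is typically fetched with its own HTTP request, here is a minimal sketch that counts how many chunks a read touches under two chunk layouts. The array shape, the chunk shapes, and the function name `chunks_touched` are hypothetical values chosen for illustration; nothing here is part of any GeoZarr draft.

```python
def chunks_touched(selection, chunk_shape):
    """Count the chunks intersecting a hyper-rectangular selection.

    selection: list of (start, stop) index ranges, one per dimension.
    chunk_shape: chunk length along each dimension.
    """
    count = 1
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size        # index of the first chunk hit
        last = (stop - 1) // size    # index of the last chunk hit
        count *= last - first + 1
    return count


# Hypothetical (time, y, x) array of shape (365, 4096, 4096).
map_chunks = (1, 512, 512)      # layout favouring single-date maps
series_chunks = (365, 32, 32)   # layout favouring per-pixel time series

one_pixel_series = [(0, 365), (100, 101), (200, 201)]
one_full_map = [(0, 1), (0, 4096), (0, 4096)]

for name, chunks in [("map layout", map_chunks), ("series layout", series_chunks)]:
    print(name,
          "| time series:", chunks_touched(one_pixel_series, chunks),
          "| full map:", chunks_touched(one_full_map, chunks))
```

With these assumed numbers, each layout is cheap for one access pattern and expensive for the other, which is the usual argument for storing (or describing) more than one chunking of the same data.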

Regards,

Christophe

@benbovy

benbovy commented Jan 23, 2023

Great to see discussions and ideas about a GeoZarr format happening here!

I landed here from this thread https://twitter.com/EvenRouault/status/1614053240508936192 about the CRS (grid mapping vs. WKT/PROJJSON), and I was wondering what the scope of GeoZarr is: is it specific to gridded (raster) geospatial data or does it aim at covering all kinds of geospatial datacubes?

Besides the data types mentioned in the current draft, another one is vector datacubes, although applications are rather limited compared to gridded datasets and I'm not sure at all what would be the best format to store vector datacubes (use arrow/parquet - https://github.com/geoarrow/geoarrow - with flattened data? create a zarr codec for geometry coordinates?).

More context on vector datacubes:

cc @edzer @martinfleis

@christophenoel
Author

christophenoel commented Jan 23, 2023

I was wondering what the scope of GeoZarr is: is it specific to gridded (raster) geospatial data or does it aim at covering all kinds of geospatial datacubes?

Hi @benbovy! All doors are open, but the underlying Zarr is limited to multidimensional arrays. So all kinds of data that might be provided as an n-D array.

@edzer

edzer commented Jan 23, 2023

So all kinds of data that might be provided as an n-D array.

I guess you mean "as a collection of n-D arrays"?

Section 7.5 of the CF conventions points out that vector geometries (points, lines, polygons) can be associated with a dimension of a data cube.

@christophenoel
Author

I guess a collection of n-D arrays is an (n+1)-D array. :)

@christophenoel
Author

christophenoel commented Feb 15, 2023

One of the key objectives of GeoZarr is to provide a standard format for all kinds of multi-dimensional EO data. This requires defining conventions for at least the following aspects:

  • How to identify (semantically) the various dimensions of the arrays (CF standard names might help)
  • How to describe/access multiple related variables, with heterogeneous coordinates (e.g. children Datasets)
  • How to describe/access multiple resolutions of the data (the multiscales draft may help)
  • How to encode/describe the data for optimised map tiling support
  • How to describe/access subsets only available in some resolutions (e.g. an index of the dimensions / resolution)
  • How to describe/access multiple projections (index ?)
  • How to describe/access multiple dimensional optimisations (rechunking)
  • How to describe/access typical EO products (e.g. multispectral band recommended as a dimension of the array)
  • How to describe/access time series that have not been normalized (e.g. footprints not aligned)
  • How to describe/access symbology of the corresponding data
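
As a rough illustration of the first point (identifying dimensions semantically), here is a minimal sketch using plain dictionaries in place of Zarr attribute files. It assumes the `_ARRAY_DIMENSIONS` attribute that xarray writes for Zarr arrays and CF `standard_name` attributes on coordinate arrays; the variable name `reflectance` and the helper `dimension_semantics` are hypothetical, and whether GeoZarr should adopt exactly this layout is an open question in this thread.

```python
# Hypothetical attribute dictionaries, standing in for the attributes
# stored alongside each array of a Zarr group.
group = {
    "reflectance": {"_ARRAY_DIMENSIONS": ["time", "y", "x"]},
    "time": {"_ARRAY_DIMENSIONS": ["time"], "standard_name": "time"},
    "y": {"_ARRAY_DIMENSIONS": ["y"],
          "standard_name": "projection_y_coordinate"},
    "x": {"_ARRAY_DIMENSIONS": ["x"],
          "standard_name": "projection_x_coordinate"},
}


def dimension_semantics(group, variable):
    """Map each dimension of `variable` to the CF standard_name of the
    coordinate array sharing the dimension's name (None if absent)."""
    dims = group[variable]["_ARRAY_DIMENSIONS"]
    return {d: group.get(d, {}).get("standard_name") for d in dims}


print(dimension_semantics(group, "reflectance"))
```

A client could use such a mapping to find the spatial and temporal axes without relying on dimension names alone. A dimension with no matching coordinate array simply maps to `None`.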

@dblodgett-usgs

@christophenoel can you expand on:

NetCDF already has its NCZarr project which did not meet our concerns. GeoZarr reuses the CF conventions to describe the data, but does not pursue the same goals (and aims to be certainly simpler).

Are there specific NCZarr details you can point to? We were discussing NCZarr on the call just now and wanted to understand better how the current GeoZarr spec relates.
