Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question regarding hazard data subsetting given asset's lat/lon pairs #263

Closed
emileten opened this issue Apr 4, 2024 · 7 comments
Closed

Comments

@emileten
Copy link

emileten commented Apr 4, 2024

Hi @joemoorhouse,

This is a question about how the code works, not a feature request or a bug report. Let me know if there is a better place for this (I didn't find a discussion section in the repo).

I've been diving in physrisk yesterday. I am trying to understand how the logic of selecting pixels in the hazard data for given assets locations provided by the client works. In other words, I am trying to find where the mapping : assets lat/lons -> hazard data pixels lives and understand it.

I believe I've identified the location : in zarr_reader, specifically in the get_curves method. The precise line of code where the spatial subsetting happens seems to be this one where we make use of the zarr-py get_coordinate_selection method. We pass index values (iz,iy,ix) corresponding to the pixels we're selecting in the data.

So, the mapping logic must happen right before. I am not able though to understand how the mapping works. Where exactly are the lat/lons mapped to the zarr indices and how does that logic work in get_curves ? I first thought it's in the call to the _get_coordinates method, but this one doesn't access the zarr store at all.

Could you shed a little bit of light on this ? Note that I may be looking at the wrong place entirely, let me know if so !

Thank you.

cc @geohacker

@tomalrussell
Copy link

Hi @emileten - I was curious about this recently too (but haven't worked directly on physrisk so Joe or colleagues might want to jump in).

The key metadata is pulled from the store a few lines earlier in get_curves:

# OSC-specific attributes contain transform and return periods
t = z.attrs["transform_mat3x3"] # type: ignore
transform = Affine(t[0], t[1], t[2], t[3], t[4], t[5])
crs = z.attrs.get("crs", "epsg:4326")

  • the Affine transform is a 3x3 matrix that transforms zarr indices to coordinates. _get_coordinates finds the inverse and uses it to transform the coordinates to zarr indices. #L348-L350
  • the crs defaults to EPSG:4326 latitude/longitude, but could be another coordinate reference system. _get_coordinates transforms the lat/lons (for example into eastings/northings in a Molleweide projection) if necessary. #L342-L346

@emileten
Copy link
Author

emileten commented Apr 4, 2024

Thanks @tomalrussell that's very helpful !

So, if I am understanding correctly, we're not storing the coordinate labels in the zarr store, but only this transform_mat3x3 matrix that is used to convert indices to coordinate labels.

My deeper question sort of remains. What actually happens if the input lat/lon wasn't in the hazard zarr store in the first place (i.e there is no exact match) ? I would have expected the behavior of the code to be equivalent to a 'nearest neighbor match' -- i.e, we pick the zarr coordinate that's the closest to the provided lat/lon.

The answer to my question, I guess, is in how this matrix is defined?

Edit: or rather, is this the role of the interpolation method ?

@joemoorhouse
Copy link
Collaborator

Hi @emileten and thanks for the help @tomalrussell!

The affine matrix lets us transform from the geographical space e.g. latitudes and longitudes to pixel space directly, so no labels needed. In general the result of the transform is fractional pixel indices so then we can decide whether to calculate the point by bilinear interpolation or take the containing pixel or other variations.

See, e.g.
https://rasterio.readthedocs.io/en/stable/topics/transforms.html

Geotiffs support affine matrices so nothing new under the sun here actually (i.e. it is pretty widely adopted). NetCDF and XArrays do have coordinate labels of course. In the hazard repo we have an OscZarr class that reads the Zarr arrays as Xarrays (e.g. for generating map tiles) - that is, we can convert between representations as required; coordinate labels can be inferred from affine matrix and array dimensions.

Xarray has its own format for saving arrays in Zarr format but there we end up with separate arrays for the coordinates. Downsides of this are performance (need always to load up coordinates before doing anything and cache that in certain circumstances) and having a nested structure. Hence we went more the Geotiff way. Zarr may set a standard for such things in the future that we would presumably want to align to but has none yet as far as I know.

Hope that helps!
Joe

@emileten
Copy link
Author

emileten commented Apr 5, 2024

Ok great I think I understand what's happening now, thanks for the additional input @joemoorhouse. So, when we try to subset a zarr to a latitude/longitude pair that don't have an exact match in the zarr coordinates (per the transform matrix definition), this lat/lon pair results in fractional pixel indices. No data point corresponds to that of course, and we have to make a decision. We allow for the following decisions :

  • interpolation="floor" : we take the 'previous' pixel data. This is the default.
  • interpolation="linear": we take the average of the surrounding pixels data (weighted by fractional distance)
  • interpolation="max": we take the max of the surrounding pixels data
  • ... "min" ... same as above but minimum

Does this sound correct?

@joemoorhouse
Copy link
Collaborator

Hi @emileten, yes that's right. It turns out that how to do linear interpolation depends on how pixels are interpreted, i.e. pixel-is-area or pixel-is-point. We assume the former (pixel-is-area) - I think that is true for all data sets we are going to need. But actually the difference is only 0.5 pixels. I mention just in case you see 'pixel-is-area' and are wondering what that means.

@emileten
Copy link
Author

emileten commented Apr 8, 2024

Thanks @joemoorhouse that helps enough I think, I'll close this conversation

@tomalrussell
Copy link

Hi both - in case of interest, there's some discussion around standardising this way of storing the affine geotransform as metadata (rather than the full coordinate dimension arrays) over at zarr-developers/geozarr-spec#17

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants