-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regridding API design #9
Comments
Another issue with |
You are right on with this idea. Having some sort of |
@JiaweiZhuang - I think you want to do both. A
This is a good point although we may be able to get around it by adding a kwarg or two to All in all, I think you are on the right track.
Does it always do this or is this a feature that can be controlled by xESMF? |
On this topic, we are discussing how to best implement cell distance / area / volume data (generically called "grid metrics") in xgcm/xgcm#81. One possibility is that xgcm will take care of those geometric questions. In this case, a regridding package could map between xgcm grids, rather than xarray datasets. |
I think having a step that can compute and cache the weights is a great idea, and very much like the sklearn-style approach. There are a few standard formats that I'm aware of for caching weights - in particular, SCRIP uses a format that both NCL and CDO are able to take in turn. Now, if everything is done in the framework of xgcm, and there's some notion of "standard" grids for a set of known models, then it may be possible to create a data server/archive which caches the weights for well-known re-gridding operations. Think about how Cartopy has the nifty utility to grab shapefiles from NaturalEarth... what if as part of pangeo-data there was a bucket on EC2 or GCP that catalogued and archived these re-gridding weights and could be directly downloaded by a client? The whole catalogue could be entirely automated, or could even receive requests and create the re-gridding weights on a VM that AWS or GCP spins up if the archive receives a request for an unknown re-gridding operation. |
ESMPy directly writes the weights to disk. So behind
The entire workflow is available in the notebooks in #6. I don't know if it is possible to directly retrieve weights as numpy arrays without creating a new NetCDF file (@bekozi ?). Because ESMPy calls Fortran to compute the weights, I guess simply dumping the weights to a file is easier than retrieving them as in-memory numpy arrays... Having the weights available on the disk would be helpful in general because the next time a user is very likely to remap data between the same pair of source and target grids. Instead of spending >10s recomputing them, it is much better to read a ~10MB file. The idea is similar to caching the regriddier in iris. I do have some concerns about how to keep track of those weight files, though. My plan is to follow the weight file naming in NASA GEOS5 model, so the default weight file name would be something like
where the first word is the regridding algorithm, The weight file is created when the first time a user calls A potential issue is naming conflicts. Two grids can have the same size but cover different regions. But I believe that can be solved by checking coordinate values internally and allow users to specify file name |
Yes I agree that a one-step convenience function is quite useful for occasional users. But the standard&recommended API would be two-step since that's crucial for performance. |
ESMPy does dump SCRIP-like format (#4), although currently the metadata is not complete (only weights, no grid information). I believe @rokuingh is working on that.
This is exactly how we provide weight files to GEOS-Chem HP users. But that's because our current regridding software is not easy enough for them to use, not because that's more efficient😅. Given that a ~10MB weight file takes ~10s to compute and also ~10s to download (assuming 1MB/s bandwidth), it's hard to see a speed gain by retrieving weights remotely. Indeed, there's a benefit if the entire workflow is done on EC2 or GCP. |
Hi, I have a couple comments/questions:
I'm somewhat new to this domain, so please forgive any ignorance on my part. |
It is computing the weight matrix i.e. the regridding operation itself. It involves searching and computing the overlapping areas of the two grids, handling various edge cases... A nice overview of regridding operation is available in the SCRIP userguide, Chapter 3~4 (SCRIP is a legacy regridding package, older than ESMF). Rewriting such code from scratch is beyond my capability and might take forever... |
Exactly. I think in the short term, using a temporary netcdf file will be okay, provided we can establish a road map for doing away with the weights file. That may just require working with the ESMF developers to establish a Python API for returning arrays of weights. |
The new API is implemented in the just-released v0.1.1. Let's use #11 for the weight file issue. |
I was looking at Pangeo's Regridding Design Document, which suggests an interface like
da.remap.remap_like(da_target, how='bilinear')
. It looks clean but doesn't match the "two-step" procedure for most regridding algorithms.Regridding weight calculation and weight application should be done separately, as reviewed by #2. The weights only depend on the source and target grids, not the input data. As long as the grids are not changed, users only need to calculate the weights once and can apply them to any data. Applying weights is often orders of magnitude faster than calculating weights (see the timing in #6 as an example), so separating the two steps has a huge impact on performance, especially when the task is to regrid a lot of data between a fixed pair of grids.
xESMF's current design that
dr_out = xe.regrid(ds_in, ds_out, dr_in)
basically re-computes the weights every time. Using two steps should boost the performance by 10~100x.Thus I am thinking about sklearn-like API, which is also two-step:
In sklearn you train a model by
model.fit(x_train, y_train)
then make new predictions by
y_pred = model.predict(x_test)
xESMF can do similar things:
Calculate regridding weights by
weights = xe.compute_weights(ds_in, ds_out, method='bilinear')
where
ds_in
andds_out
are xarray DataSet containing input and output grid information.then apply weights to data by
dr_out = weights.apply(dr_in)
where
dr_in
is the input DataArrayBecause ESMPy writes the weights to a file, the next time you can read it from file instead of computing it again.
weights = xe.read_weights("weights.nc")
The IO time is negligible compared to re-computing the weights.
Here
weights
is a tiny class that holds the weights and knows how to apply them to data. Alternatively, weights can be simply an xarray DataSet read byraw_weights = xr.open_dataset("weights.nc")
. Then step 2 is changed todr_out = xe.apply_weights(raw_weights, dr_in)
. I prefer the first approach because it feels more like sklearn and people might feel it more familiar.Any comments are welcome, otherwise I'll proceed this way. @rabernat @jhamman @spencerahill. Please also @ anyone who are interested in regridding with xarray.
The text was updated successfully, but these errors were encountered: