New deep copy behavior in 2022.9.0 causes maximum recursion error #7111
Comments
CC @headtr1ck any idea if this is supposed to work with your new #7089?
I get a similar error for different structures, and also if I do something like this:
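A minimal sketch of the kind of structure that hits this, assuming a self-referencing attrs entry (the names here are hypothetical):

```python
import copy

import xarray as xr

da = xr.DataArray([1, 2, 3], name="a")
da.attrs["self"] = da  # attrs refer back to the array itself

copy.deepcopy(da)  # RecursionError on xarray 2022.9.0
```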
I basically copied the behavior of the existing copy method. I would claim that the new behavior is correct, but maybe other devs can confirm this. Coming from netCDF, it does not really make sense to put complex objects in attrs, but I guess for in-memory-only use it works.
I'd have to check, but I think this structure was originally produced by xarray reading a CF-compliant NetCDF file. That is my memory at least. It could be that our library (Satpy) is doing this as a convenience, replacing the name of an ancillary variable with the DataArray of that ancillary variable. My other new issue seems to be related to this as well.
Hmmm, python seems to deal with this reasonably for its builtins:

```python
In [1]: a = [1]

In [2]: b = [a]

In [3]: a.append(b)

In [4]: import copy

In [5]: copy.deepcopy(a)
Out[5]: [1, [[...]]]
```

I doubt this is getting hit that much given it requires a recursive data structure, but it does seem like a gnarly error. Is there some feature that python uses to check whether a data structure is recursive when it's copying, which we're not taking advantage of? I can look more later.
Yes, `copy.deepcopy` takes a `memo` dict that maps the `id` of every object it has already copied to its copy; that is how it avoids infinite recursion on cyclic structures.
I think our implementations of `__deepcopy__` need to accept the `memo` dict and pass it along when copying children. This will lead to a bit of duplicate code between `copy` and `__deepcopy__`.
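For illustration, a minimal sketch of that pattern on a toy container (not xarray's actual classes): register the new object in `memo` before recursing into children, so a cycle back to `self` reuses the copy instead of recursing forever.

```python
import copy


class Node:
    """Toy stand-in for a DataArray carrying attrs."""

    def __init__(self, data, attrs=None):
        self.data = data
        self.attrs = {} if attrs is None else attrs

    def __deepcopy__(self, memo):
        new = Node.__new__(Node)
        memo[id(self)] = new  # register *before* copying children
        new.data = copy.deepcopy(self.data, memo)
        new.attrs = copy.deepcopy(self.attrs, memo)
        return new


a = Node([1])
b = Node([2], attrs={"ancillary_variables": [a]})
a.attrs["ancillary_variables"] = [b]  # circular reference

a2 = copy.deepcopy(a)  # terminates; the cycle is reproduced in the copy
assert a2.attrs["ancillary_variables"][0].attrs["ancillary_variables"][0] is a2
```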
To avoid code duplication you may consider moving all of the shared logic into a single helper that both `copy` and `__deepcopy__` call, with the `memo` dict threaded through.
I will set up a PR for that.
Ok, even the repr is broken by these recursive structures.
I'm a little torn on this. Obviously I'm not an xarray maintainer, so I'm not the one who would have to maintain it or answer support questions about it.

We actually had the user-side of this discussion in the Satpy library group a while ago, which is leading to this whole problem for us now. In Satpy we don't typically use or deal with xarray Datasets (the new DataTree library is likely what we'll move to), so when we have relationships between DataArrays we'll use something like ancillary variables to connect them. For example, a data quality flag that is used by the other variables in a file. Our users don't usually care about the DQF, but we don't want to stop them from being able to easily access it. I was never a huge fan of putting a DataArray in the attrs of another DataArray, but nothing seemed to disallow it, so I ultimately lost that argument.

So on one hand I agree it seems like there shouldn't be a need in most cases to have a DataArray inside a DataArray, especially a circular dependency. On the other hand, I'm not looking forward to the updates I'll need to make to Satpy to fix this. Note, we don't do this everywhere in Satpy, just something we use for a few formats we read.
Also note the other important change in this new behavior, which is that dask arrays are now copied on deep copy (previously a deep copy left them untouched).
I added a PR that fixes the broken reprs and deepcopies.
It looks like that PR fixes all of my Satpy unit tests. I'm not sure how that is possible if it doesn't also change when dask arrays are copied.
Sorry, false alarm. I was running with an old environment. With this new PR it seems the dask arrays are still being copied.

Edit: I hacked the copy logic locally to leave dask arrays un-copied, and that fixed a lot of my dask-related tests, but it also seems to have introduced two new failures from what I can tell. So 🤷‍♂️
Out of curiosity, why do you need to store a DataArray object as opposed to merely the values in one?
@TomNicholas Do you mean the "name" of the sub-DataArray? Or the numpy/dask array of the sub-DataArray? This is what I was trying to describe in #7111 (comment). In Satpy we have our own Dataset-like/DataTree-like object where the user explicitly says "I want to load X from input files". As a convenience we put any ancillary variables (ex. data quality flags) in the DataArray's attrs. @mraspaud was one of the original people who proposed our current design, so maybe he can provide more context.
The ancillary variables stuff doesn't really fit the DataArray data model, so you have to do something. Here's an example with cf-xarray.
@dcherian Thanks for the feedback. When these decisions were made in Satpy, xarray was not able to contain dask arrays as coordinates and we depend heavily on dask for our use cases, so putting some of these arrays in coordinates was not an option at the time. Note that ancillary_variables are not the only case of "embedded" DataArrays in our code. We also needed something for CRS + bounds or other geolocation information. As you know I'm very much interested in CRS and geolocation handling in xarray, but for backwards compatibility we also have pyresample AreaDefinition and SwathDefinition objects in our DataArray attrs. We have a monthly Pytroll/Satpy meeting tomorrow, so if you have any other suggestions or points for or against our usage please comment here and we'll see what we can do.
Thanks for pinging me. Regarding the ancillary variables, this comes from the CF conventions, allowing to "link" two or more arrays together. For example, we might have a data array and a corresponding quality-flags array that describes it.
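In CF terms that link is just a blank-separated list of variable names stored in the `ancillary_variables` attribute; a small sketch with hypothetical variable names:

```python
import xarray as xr

ds = xr.Dataset(
    {
        "sea_surface_temperature": ("x", [280.0, 281.5, 279.9]),
        "sst_quality_flag": ("x", [0, 1, 0]),
    }
)
# CF conventions: ancillary_variables holds variable *names*, not objects
ds["sea_surface_temperature"].attrs["ancillary_variables"] = "sst_quality_flag"
```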
I think the behavior of deepcopy in #7112 is correct.

```python
memo = {id(da.attrs["ancillary_variables"]): da.attrs["ancillary_variables"]}
da_new = deepcopy(da, memo)
```

(untested!)
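For context on the snippet above: pre-seeding `memo` with an `id(obj): obj` entry tells `deepcopy` that the object has already been "copied" (to itself), so it is reused as-is instead of recursed into. This only takes effect once `__deepcopy__` actually forwards the `memo` dict, which is presumably what the fix enables.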
@mraspaud See the cf-xarray link from Deepak. We could make them coordinates. Or we could reference them by name:
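A minimal sketch of the by-name option, assuming hypothetical variable names:

```python
import xarray as xr

dqf = xr.DataArray([0, 1, 0], dims="x", name="dqf")
data = xr.DataArray(
    [1.0, 2.0, 3.0],
    dims="x",
    name="data",
    attrs={"ancillary_variables": "dqf"},  # store the name, not the object
)
ds = xr.Dataset({"data": data, "dqf": dqf})

# Resolve the ancillary variable by name only when it is actually needed
ancillary = ds[ds["data"].attrs["ancillary_variables"]]
```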
Edit: Let's talk more in the pytroll meeting today.
We talked about this today in our pytroll/satpy meeting. We're not sure we agree with cf-xarray putting ancillary variables as coordinates or that it will work for us, so we think we could eventually remove any "automatic" ancillary variable loading and require that the user explicitly request any ancillary variables they want from Satpy's readers. That said, this will take a lot of work to change. Since it seems like #7112 fixes a majority of our issues, I'm hoping that it can still be merged. I'd hope that the dask array copying behavior can be reconsidered as well.

Side note: I feel like there is a difference between the NetCDF model and serializing/saving to a NetCDF file.
What happened?
I have a case where a Dataset to be written to a NetCDF file has "ancillary_variables" that have a circular dependence. For example, variable A has `.attrs["ancillary_variables"]` that contains variable B, and B has `.attrs["ancillary_variables"]` that contains A.

What did you expect to happen?
Circular dependencies are detected and avoided. No maximum recursion error.
Minimal Complete Verifiable Example
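A minimal reproduction consistent with the description above, as an assumed reconstruction (the variable names are hypothetical):

```python
import copy

import xarray as xr

a = xr.DataArray([1.0, 2.0], name="a")
b = xr.DataArray([3.0, 4.0], name="b")
# circular dependence: each array lists the other as an ancillary variable
a.attrs["ancillary_variables"] = [b]
b.attrs["ancillary_variables"] = [a]

copy.deepcopy(a)  # RecursionError: maximum recursion depth exceeded (xarray 2022.9.0)
```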
MVCE confirmation
Relevant log output
No response
Anything else we need to know?
I have at least one other issue related to the new xarray release but I'm still tracking it down. I think it is also related to the deep copy behavior change which was merged a day before the release so our CI didn't have time to test the "unstable" version of xarray.
Environment