-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
keep_attrs for Dataset.resample and DataArray.resample #829
Conversation
resample now has a keep_attrs argument that is passed on to reduction operations. However, it is not being applied in all cases. Some tests have been added that do not pass.
The funcs passed in to DataArrayGroupBy.apply did not respect the keep_attrs argument, and seem to keep attributes by default. Now DataArrayGroupBy.apply does the work of clearing result.attrs if keep_attrs is False, and does nothing to the attributes otherwise. This still needs to be reflected in the docstring.
DataArrayGroupBy._concat_shortcut copied metadata from itself to the applied arrays, when it should not. Metadata copying should be, and is, handled elsewhere. With this fix, keep_attrs for resample methods is implemented and passes the written tests. No other tests are broken.
Tests have been added to make sure that keep_attrs applies to Dataset attributes as well as the attributes of variables in that Dataset. resample now has keep_attrs=False as its default regardless of the operation being completed (previously it was True for 'first' and 'last')
|
||
resampled_array = array.resample('1D', dim='time', how='first', keep_attrs=False) | ||
try: | ||
resampled_array.meta |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead, use something like: assert 'meta' not in resampled_array.attrs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option, slightly better still: assert resampled_array.attrs == {}
Thanks for diving into this! I think you can write the tests a little bit more succinctly, but otherwise this looks good. Also, please add a note to the "What's New" giving yourself credit. |
I know I'm late to the party here. Are there many cases where |
The idea is that if a dataset has an attribute, it is making a claim about that data. xarray can't guarantee that claims the attributes make about the data remain valid after operating on that data, so it shouldn't retain those attributes unless the user says it can. I may have ceilometer data that tells me whether a cloud base is detected at any point in time, with an attribute saying that 0 means no cloud detected, another attribute saying that 1 means cloud detected, and another saying that nan means some kind of error. If I resample that data using a mean or median, those attributes are no longer valid. Or my Dataset may have an attribute saying that it was output by a certain instrument. If I save that Dataset after doing some analysis, it may give the impression to someone reading the netCDF that they're reading unprocessed instrument data, when they aren't. Or I may want the hourly variance of a dataset, and do dataset.resample('1H', how='var'). In this case, the units are no longer valid. It may seem like these are edge cases, but it's better to make no claims most of the time than to make bad claims some of the time. |
@mcgibbon That makes sense to me |
@shoyer I've done the clean-ups you suggested, apart from for-looping tests for the reasons I mentioned in the line note. I hope my "what's new" additions are appropriate. |
~~~~~~~~~ | ||
|
||
- Attributes were being retained by default for some resampling | ||
operations when they should not. With the ``keep_attrs=False`` option, they |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we call out first
and last
as these specific methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really. It was also a problem with "mean" and I don't know what other operations on DataArray objects, probably all of the numpy operations. AFAIK for Dataset the only problems were "first" and "last", but I only tested "first" and "mean" - it was a problem for "first" and not for "mean".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you'd like another message that takes that into account, let me know and I can put it in. It just gets verbose quickly to explain the difference in behavior for DataArray and Dataset objects, with different "how" options (do you call those different resampling operations?), and the fact that not all "how"s were tested.
Appveyor build failed for some reason when trying to set up Miniconda on Windows 32-bit with Python 2.7. The 64-bit build of Python 3.4 passed.
|
def test_resample_first_keep_attrs(self): | ||
times = pd.date_range('2000-01-01', freq='6H', periods=10) | ||
array = DataArray(np.arange(10), [('time', times)]) | ||
array.attrs['meta'] = 'data' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you may consider using the fixtures we create here. You could get away with:
array = self.dv
array.attrs = self.attrs
# later on...
assert array.attrs == self.attrs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was copying the style of the existing test_resample methods. I think a new array/dataset is created for that method because the fixtures don't include a time axis, which is important for resampling.
Code duplication in tests isn't something to be feared, though. This person on stack overflow summarizes it pretty well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Readability still counts in test code. I find unit tests that copy all but a line or two from a related test difficult to distinguish between (e.g. test_resample_first_keep_attrs
vs. test_resample_mean_keep_attrs
).
I don't entirely buy the SO post's perspective but he does contradict your point to some extent:
If you can factor something in test code and you're sure it's ok, then do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, readability does count, and simple code is more readable. When I look at the line "assert array.attrs=self.attrs", my first thought is "why am I comparing an array to a Unittest.TestCase?". I have to go somewhere else in the code (to the setup function) to figure out what's going on. There are similar questions like "what is self.dv and how is it different from self.v" that would be more readable if the code was not factored out. I'm questioning "is it really OK to factor this?". We're talking about a very small difference in readability, if any, so if you like elegance I would see why you'd rather have it factored. And it's not that bad when the variables that are set up explain themselves.
You can immediately distinguish between test_resample_first_keep_attrs and test_resample_mean_keep_attrs by their name. If you can't, I should make the name clearer or add a docstring.
For this particular case, since you want the initialization code factored out I'll edit the fixtures to have a time axis, and then use them. I don't think it's a good idea to have two sets of fixtures (one with and one without a time axis). If adding a time axis breaks other tests I'll keep it how it is now.
@mcgibbon thanks. I agree that there is a careful balance when writing test methods between maintainability and readability. I do think fixtures that can be helpful, but we definitely do over-use them right now with our giant test cases -- those should really be cleaned up and split up. I just haven't had the energy to get around to that. |
Closes #825 and #828
This update might break some scripts, because of bugs that existed in the code before. resample was inconsistent as to when it would and would not keep attributes. In particular, Dataset resampling usually threw out attributes, but not when how='first' or how='last', and DataArray preserved attributes for everything I tested ('first', 'last', and 'mean').