
keep_attrs for Dataset.resample and DataArray.resample #829

Merged (6 commits) on Apr 19, 2016

Conversation

mcgibbon (Contributor):

Closes #825 and #828

This update might break some scripts, because of pre-existing bugs: resample was inconsistent about when it kept attributes. In particular, Dataset resampling usually discarded attributes, except when how='first' or how='last', while DataArray resampling preserved attributes for everything I tested ('first', 'last', and 'mean').

resample now has a keep_attrs argument that is passed on to reduction
operations. However, it is not being applied in all cases. Some tests
have been added that do not pass.
The funcs passed in to DataArrayGroupBy.apply did not respect the
keep_attrs argument, and seemed to keep attributes by default. Now
DataArrayGroupBy.apply does the work of clearing result.attrs if
keep_attrs is False, and leaves the attributes untouched otherwise.

This still needs to be reflected in the docstring.
DataArrayGroupBy._concat_shortcut copied metadata from itself to the
applied arrays, when it should not have. Metadata copying should be, and
is, handled elsewhere.

With this fix, keep_attrs for resample methods is implemented and passes
the written tests. No other tests are broken.
Tests have been added to make sure that keep_attrs applies to
Dataset attributes as well as the attributes of variables in that
Dataset.

resample now has keep_attrs=False as its default regardless of the
operation being completed (previously it was True for 'first' and
'last')
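The keep_attrs contract described above can be sketched in plain Python. This is not xarray's actual implementation; ToyArray and its mean method are hypothetical stand-ins, showing only the pattern that a reduction drops the attrs dict unless the caller opts in:

```python
# Minimal sketch (hypothetical ToyArray, not xarray itself) of the
# keep_attrs pattern: reductions return a result with empty attrs
# unless keep_attrs=True is passed.

class ToyArray:
    """Hypothetical stand-in for a labelled array with an .attrs dict."""
    def __init__(self, values, attrs=None):
        self.values = list(values)
        self.attrs = dict(attrs or {})

    def mean(self, keep_attrs=False):
        # The reduction itself; attrs survive only on request.
        result = ToyArray([sum(self.values) / len(self.values)])
        if keep_attrs:
            result.attrs = dict(self.attrs)
        return result

arr = ToyArray([1, 2, 3], attrs={"units": "m"})
assert arr.mean().attrs == {}                              # default: dropped
assert arr.mean(keep_attrs=True).attrs == {"units": "m"}   # opt-in: kept
```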

resampled_array = array.resample('1D', dim='time', how='first', keep_attrs=False)
try:
    resampled_array.meta
except AttributeError:
    pass  # expected: the 'meta' attribute was dropped
Member:
Instead, use something like: assert 'meta' not in resampled_array.attrs

Member:
Another option, slightly better still: assert resampled_array.attrs == {}

shoyer (Member) commented Apr 17, 2016:

Thanks for diving into this!

I think you can write the tests a little bit more succinctly, but otherwise this looks good. Also, please add a note to the "What's New" giving yourself credit.

max-sixty (Collaborator):

I know I'm late to the party here. Are there many cases where keep_attrs would be expected to be False? I would have thought that it's generally expected that attrs would stick around when doing operations, either as a default or without an option to drop them. (Of course, it's easy to drop them if a user explicitly sets them to {} in a separate operation)

mcgibbon (Contributor Author) commented Apr 17, 2016:

The idea is that if a dataset has an attribute, it is making a claim about that data. xarray can't guarantee that claims the attributes make about the data remain valid after operating on that data, so it shouldn't retain those attributes unless the user says it can.

I may have ceilometer data that tells me whether a cloud base is detected at any point in time, with an attribute saying that 0 means no cloud detected, another attribute saying that 1 means cloud detected, and another saying that nan means some kind of error. If I resample that data using a mean or median, those attributes are no longer valid.

Or my Dataset may have an attribute saying that it was output by a certain instrument. If I save that Dataset after doing some analysis, it may give the impression to someone reading the netCDF that they're reading unprocessed instrument data, when they aren't.

Or I may want the hourly variance of a dataset, and do dataset.resample('1H', how='var'). In this case, the units are no longer valid.

It may seem like these are edge cases, but it's better to make no claims most of the time than to make bad claims some of the time.
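The 'var' case above can be made concrete with a small stand-alone sketch (plain Python, no xarray; the values and attrs here are invented for illustration): after a variance reduction, a units attribute copied across unchanged would make a false claim about the result.

```python
# Plain-Python sketch of the 'var' example: a variance reduction changes
# the physical units (metres -> metres squared), so an attrs dict carried
# across the reduction would mislabel the result.
values = [0.0, 2.0]               # pretend these are lengths in metres
attrs = {"units": "m"}            # valid for `values`

mean = sum(values) / len(values)
var = sum((v - mean) ** 2 for v in values) / len(values)

assert var == 1.0
# `var` is 1.0 square metre, not 1.0 metre: keeping `attrs` unchanged
# would label it wrongly, which is why attrs are dropped by default.
```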

max-sixty (Collaborator):

@mcgibbon That makes sense to me

mcgibbon (Contributor Author):

@shoyer I've done the clean-ups you suggested, apart from for-looping tests for the reasons I mentioned in the line note. I hope my "what's new" additions are appropriate.


- Attributes were being retained by default for some resampling
operations when they should not. With the ``keep_attrs=False`` option, they
Member:
can we call out first and last as these specific methods?

mcgibbon (Contributor Author):

Not really. It was also a problem with "mean", and possibly other operations on DataArray objects, probably all of the numpy operations. AFAIK for Dataset the only problems were "first" and "last", but I only tested "first" and "mean": it was a problem for "first" and not for "mean".

mcgibbon (Contributor Author):

If you'd like another message that takes that into account, let me know and I can put it in. It just gets verbose quickly to explain the difference in behavior for DataArray and Dataset objects, with different "how" options (do you call those different resampling operations?), and the fact that not all "how"s were tested.

mcgibbon (Contributor Author):

Appveyor build failed for some reason when trying to set up Miniconda on Windows 32-bit with Python 2.7. The 64-bit build of Python 3.4 passed.

Downloading Miniconda-3.7.3-Windows-x86.exe from http://repo.continuum.io/miniconda/Miniconda-3.7.3-Windows-x86.exe
File saved at C:\projects\xray\Miniconda-3.7.3-Windows-x86.exe
Installing C:\projects\xray\Miniconda-3.7.3-Windows-x86.exe to C:\Python27-conda32
C:\projects\xray\Miniconda-3.7.3-Windows-x86.exe /S /D=C:\Python27-conda32
Start-Process : This command cannot be run due to the error: The specified
executable is not a valid application for this OS platform..

def test_resample_first_keep_attrs(self):
    times = pd.date_range('2000-01-01', freq='6H', periods=10)
    array = DataArray(np.arange(10), [('time', times)])
    array.attrs['meta'] = 'data'
Member:
you may consider using the fixtures we create here. You could get away with:

array = self.dv
array.attrs = self.attrs

# later on...
assert array.attrs == self.attrs

mcgibbon (Contributor Author) commented Apr 19, 2016:

I was copying the style of the existing test_resample methods. I think a new array/dataset is created for that method because the fixtures don't include a time axis, which is important for resampling.

Code duplication in tests isn't something to be feared, though. This person on Stack Overflow summarizes it pretty well.

Member:

Readability still counts in test code. I find unit tests that copy all but a line or two from a related test difficult to tell apart (e.g. test_resample_first_keep_attrs vs. test_resample_mean_keep_attrs).

I don't entirely buy the SO post's perspective but he does contradict your point to some extent:

If you can factor something in test code and you're sure it's ok, then do it.

mcgibbon (Contributor Author):

Yes, readability does count, and simple code is more readable. When I look at the line assert array.attrs == self.attrs, my first thought is "why am I comparing an array to a unittest.TestCase?". I have to go somewhere else in the code (the setup function) to figure out what's going on. There are similar questions, like "what is self.dv and how is it different from self.v", that would not arise if the code were not factored out. So I'm questioning whether it really is OK to factor this. We're talking about a very small difference in readability, if any, so if you value elegance I can see why you'd rather have it factored. And it's not that bad when the variables that are set up explain themselves.

You can immediately distinguish between test_resample_first_keep_attrs and test_resample_mean_keep_attrs by their name. If you can't, I should make the name clearer or add a docstring.

For this particular case, since you want the initialization code factored out I'll edit the fixtures to have a time axis, and then use them. I don't think it's a good idea to have two sets of fixtures (one with and one without a time axis). If adding a time axis breaks other tests I'll keep it how it is now.

shoyer merged commit ecc4420 into pydata:master on Apr 19, 2016
shoyer (Member) commented Apr 19, 2016:

@mcgibbon thanks.

I agree that there is a careful balance when writing test methods between maintainability and readability.

I do think fixtures can be helpful, but we definitely over-use them right now with our giant test cases -- those should really be cleaned up and split up. I just haven't had the energy to get around to that.
