
keep_attrs for Dataset.resample and DataArray.resample #829

Merged (6 commits) on Apr 19, 2016

Conversation

mcgibbon (Contributor):

Closes #825 and #828

This update might break some scripts, because of pre-existing bugs: resample was inconsistent about when it kept attributes. In particular, Dataset resampling usually discarded attributes, except when how='first' or how='last', while DataArray resampling preserved attributes for everything I tested ('first', 'last', and 'mean').

resample now has a keep_attrs argument that is passed on to reduction
operations. However, it is not being applied in all cases. Some tests
have been added that do not pass.
The funcs passed in to DataArrayGroupBy.apply did not respect the
keep_attrs argument, and seemed to keep attributes by default. Now
DataArrayGroupBy.apply does the work of clearing result.attrs if
keep_attrs is False, and leaves the attributes untouched otherwise.

This still needs to be reflected in the docstring.
DataArrayGroupBy._concat_shortcut copied metadata from itself to the
applied arrays, when it should not have. Metadata copying should be, and
is, handled elsewhere.

With this fix, keep_attrs for resample methods is implemented and passes
the written tests. No other tests are broken.
Tests have been added to make sure that keep_attrs applies to
Dataset attributes as well as the attributes of variables in that
Dataset.

resample now has keep_attrs=False as its default regardless of the
operation being completed (previously it was True for 'first' and
'last')
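The keep_attrs contract described above can be sketched in plain Python. This is not xarray's actual implementation; ToyArray and its mean method are hypothetical stand-ins, showing only the pattern that a reduction drops the attrs dict unless the caller opts in:

```python
# Minimal sketch (hypothetical ToyArray, not xarray itself) of the
# keep_attrs pattern: reductions return a result with empty attrs
# unless keep_attrs=True is passed.

class ToyArray:
    """Hypothetical stand-in for a labelled array with an .attrs dict."""
    def __init__(self, values, attrs=None):
        self.values = list(values)
        self.attrs = dict(attrs or {})

    def mean(self, keep_attrs=False):
        # The reduction itself; attrs survive only on request.
        result = ToyArray([sum(self.values) / len(self.values)])
        if keep_attrs:
            result.attrs = dict(self.attrs)
        return result

arr = ToyArray([1, 2, 3], attrs={"units": "m"})
assert arr.mean().attrs == {}                              # default: dropped
assert arr.mean(keep_attrs=True).attrs == {"units": "m"}   # opt-in: kept
```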

resampled_array = array.resample('1D', dim='time', how='first', keep_attrs=False)
try:
    resampled_array.meta
except AttributeError:
    pass  # expected: the 'meta' attribute was dropped
Member:
Instead, use something like: assert 'meta' not in resampled_array.attrs

Member:
Another option, slightly better still: assert resampled_array.attrs == {}

shoyer (Member) commented Apr 17, 2016:

Thanks for diving into this!

I think you can write the tests a little bit more succinctly, but otherwise this looks good. Also, please add a note to the "What's New" giving yourself credit.

max-sixty (Collaborator):

I know I'm late to the party here. Are there many cases where keep_attrs would be expected to be False? I would have thought that it's generally expected that attrs would stick around when doing operations, either as a default or without an option to drop them. (Of course, it's easy to drop them if a user explicitly sets them to {} in a separate operation)

mcgibbon (Contributor Author) commented Apr 17, 2016:

The idea is that if a dataset has an attribute, it is making a claim about that data. xarray can't guarantee that claims the attributes make about the data remain valid after operating on that data, so it shouldn't retain those attributes unless the user says it can.

I may have ceilometer data that tells me whether a cloud base is detected at any point in time, with an attribute saying that 0 means no cloud detected, another attribute saying that 1 means cloud detected, and another saying that nan means some kind of error. If I resample that data using a mean or median, those attributes are no longer valid.

Or my Dataset may have an attribute saying that it was output by a certain instrument. If I save that Dataset after doing some analysis, it may give the impression to someone reading the netCDF that they're reading unprocessed instrument data, when they aren't.

Or I may want the hourly variance of a dataset, and do dataset.resample('1H', how='var'). In this case, the units are no longer valid.

It may seem like these are edge cases, but it's better to make no claims most of the time than to make bad claims some of the time.
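The 'var' case above can be made concrete with a small stand-alone sketch (plain Python, no xarray; the values and attrs here are invented for illustration): after a variance reduction, a units attribute copied across unchanged would make a false claim about the result.

```python
# Plain-Python sketch of the 'var' example: a variance reduction changes
# the physical units (metres -> metres squared), so an attrs dict carried
# across the reduction would mislabel the result.
values = [0.0, 2.0]               # pretend these are lengths in metres
attrs = {"units": "m"}            # valid for `values`

mean = sum(values) / len(values)
var = sum((v - mean) ** 2 for v in values) / len(values)

assert var == 1.0
# `var` is 1.0 square metre, not 1.0 metre: keeping `attrs` unchanged
# would label it wrongly, which is why attrs are dropped by default.
```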

max-sixty (Collaborator):

@mcgibbon That makes sense to me

mcgibbon (Contributor Author):

@shoyer I've done the clean-ups you suggested, apart from for-looping tests for the reasons I mentioned in the line note. I hope my "what's new" additions are appropriate.


- Attributes were being retained by default for some resampling
operations when they should not. With the ``keep_attrs=False`` option, they
Member:
can we call out first and last as these specific methods?

mcgibbon (Contributor Author):

Not really. It was also a problem with "mean", and possibly other operations on DataArray objects, probably all of the numpy operations. AFAIK for Dataset the only problems were "first" and "last", but I only tested "first" and "mean": it was a problem for "first" and not for "mean".

mcgibbon (Contributor Author):

If you'd like another message that takes that into account, let me know and I can put it in. It just gets verbose quickly to explain the difference in behavior for DataArray and Dataset objects, with different "how" options (do you call those different resampling operations?), and the fact that not all "how"s were tested.

mcgibbon (Contributor Author):

Appveyor build failed for some reason when trying to set up Miniconda on Windows 32-bit with Python 2.7. The 64-bit build of Python 3.4 passed.

Downloading Miniconda-3.7.3-Windows-x86.exe from http://repo.continuum.io/miniconda/Miniconda-3.7.3-Windows-x86.exe
File saved at C:\projects\xray\Miniconda-3.7.3-Windows-x86.exe
Installing C:\projects\xray\Miniconda-3.7.3-Windows-x86.exe to C:\Python27-conda32
C:\projects\xray\Miniconda-3.7.3-Windows-x86.exe /S /D=C:\Python27-conda32
Start-Process : This command cannot be run due to the error: The specified
executable is not a valid application for this OS platform..

def test_resample_first_keep_attrs(self):
    times = pd.date_range('2000-01-01', freq='6H', periods=10)
    array = DataArray(np.arange(10), [('time', times)])
    array.attrs['meta'] = 'data'
Member:
you may consider using the fixtures we create here. You could get away with:

array = self.dv
array.attrs = self.attrs

# later on...
assert array.attrs == self.attrs

mcgibbon (Contributor Author) commented Apr 19, 2016:

I was copying the style of the existing test_resample methods. I think a new array/dataset is created for that method because the fixtures don't include a time axis, which is important for resampling.

Code duplication in tests isn't something to be feared, though. This person on Stack Overflow summarizes it pretty well.

Member:

Readability still counts in test code. I find unit tests that copy all but a line or two from a related test difficult to tell apart (e.g. test_resample_first_keep_attrs vs. test_resample_mean_keep_attrs).

I don't entirely buy the SO post's perspective but he does contradict your point to some extent:

If you can factor something in test code and you're sure it's ok, then do it.

mcgibbon (Contributor Author):

Yes, readability does count, and simple code is more readable. When I look at the line assert array.attrs == self.attrs, my first thought is "why am I comparing an array to a unittest.TestCase?". I have to go somewhere else in the code (the setup function) to figure out what's going on. There are similar questions, like "what is self.dv and how is it different from self.v", that would not arise if the code were not factored out. So I'm questioning whether it really is OK to factor this. We're talking about a very small difference in readability, if any, so if you value elegance I can see why you'd rather have it factored. And it's not that bad when the variables that are set up explain themselves.

You can immediately distinguish between test_resample_first_keep_attrs and test_resample_mean_keep_attrs by their name. If you can't, I should make the name clearer or add a docstring.

For this particular case, since you want the initialization code factored out I'll edit the fixtures to have a time axis, and then use them. I don't think it's a good idea to have two sets of fixtures (one with and one without a time axis). If adding a time axis breaks other tests I'll keep it how it is now.

shoyer merged commit ecc4420 into pydata:master on Apr 19, 2016
shoyer (Member) commented Apr 19, 2016:

@mcgibbon thanks.

I agree that there is a careful balance when writing test methods between maintainability and readability.

I do think fixtures can be helpful, but we definitely over-use them right now with our giant test cases -- those should really be cleaned up and split up. I just haven't had the energy to get around to that.
