Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

keep_attrs for Dataset.resample and DataArray.resample #829

Merged
merged 6 commits into from
Apr 19, 2016
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 16 additions & 0 deletions doc/whats-new.rst
Original file line number Diff line number Diff line change
Expand Up @@ -22,10 +22,26 @@ This release includes

Enhancements
~~~~~~~~~~~~

- DataArray and Dataset method :py:meth:`where` now supports a ``drop=True``
option that clips coordinate elements that are fully masked. By
`Phillip J. Wolfram <https://github.com/pwolfram>`_.

- DataArray and Dataset method :py:meth:`resample` now supports the
``keep_attrs=False`` option that determines whether variable and dataset
attributes are retained in the resampled object. By
`Jeremy McGibbon <https://github.com/mcgibbon>`_.

Bug fixes
~~~~~~~~~

- Attributes were being retained by default for some resampling
operations when they should not. With the ``keep_attrs=False`` option, they
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we call out first and last as these specific methods?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really. It was also a problem with "mean" and I don't know what other operations on DataArray objects, probably all of the numpy operations. AFAIK for Dataset the only problems were "first" and "last", but I only tested "first" and "mean" - it was a problem for "first" and not for "mean".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you'd like another message that takes that into account, let me know and I can put it in. It just gets verbose quickly to explain the difference in behavior for DataArray and Dataset objects, with different "how" options (do you call those different resampling operations?), and the fact that not all "how"s were tested.

will no longer be retained by default. This may be backwards-incompatible
with some scripts, but the attributes may be kept by adding the
``keep_attrs=True`` option. By
`Jeremy McGibbon <https://github.com/mcgibbon>`_.

.. _whats-new.0.7.2:

v0.7.2 (13 March 2016)
Expand Down
12 changes: 8 additions & 4 deletions xarray/core/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -374,7 +374,7 @@ def rolling(self, min_periods=None, center=False, **windows):
center=center, **windows)

def resample(self, freq, dim, how='mean', skipna=None, closed=None,
label=None, base=0):
label=None, base=0, keep_attrs=False):
"""Resample this object to a new temporal resolution.

Handles both downsampling and upsampling. Upsampling with filling is
Expand Down Expand Up @@ -418,6 +418,10 @@ def resample(self, freq, dim, how='mean', skipna=None, closed=None,
For frequencies that evenly subdivide 1 day, the "origin" of the
aggregated intervals. For example, for '24H' frequency, base could
range from 0 through 23.
keep_attrs : bool, optional
If True, the object's attributes (`attrs`) will be copied from
the original object to the new one. If False (default), the new
object will be returned without attributes.

Returns
-------
Expand All @@ -441,11 +445,11 @@ def resample(self, freq, dim, how='mean', skipna=None, closed=None,
if isinstance(how, basestring):
f = getattr(gb, how)
if how in ['first', 'last']:
result = f(skipna=skipna)
result = f(skipna=skipna, keep_attrs=keep_attrs)
else:
result = f(dim=dim.name, skipna=skipna)
result = f(dim=dim.name, skipna=skipna, keep_attrs=keep_attrs)
else:
result = gb.reduce(how, dim=dim.name)
result = gb.reduce(how, dim=dim.name, keep_attrs=keep_attrs)
result = result.rename({RESAMPLE_DIM: dim.name})
return result

Expand Down
1 change: 0 additions & 1 deletion xarray/core/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -320,7 +320,6 @@ def _concat_shortcut(self, applied, concat_dim, positions):
# compiled language)
stacked = Variable.concat(
applied, concat_dim, positions, shortcut=True)
stacked.attrs.update(self.obj.attrs)
result = self.obj._replace_maybe_drop_dims(stacked)
result._coords[concat_dim.name] = as_variable(concat_dim, copy=True)
return result
Expand Down
26 changes: 26 additions & 0 deletions xarray/test/test_dataarray.py
Original file line number Diff line number Diff line change
Expand Up @@ -1407,6 +1407,32 @@ def test_resample_first(self):
name='time')
self.assertDataArrayIdentical(expected, actual)

def test_resample_first_keep_attrs(self):
times = pd.date_range('2000-01-01', freq='6H', periods=10)
array = DataArray(np.arange(10), [('time', times)])
array.attrs['meta'] = 'data'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you may consider using the fixtures we create here. You could get away with:

array = self.dv
array.attrs = self.attrs

# later on...
assert array.attrs == self.attrs

Copy link
Contributor Author

@mcgibbon mcgibbon Apr 19, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was copying the style of the existing test_resample methods. I think a new array/dataset is created for that method because the fixtures don't include a time axis, which is important for resampling.

Code duplication in tests isn't something to be feared, though. This person on stack overflow summarizes it pretty well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Readability still counts in test code. I find unit tests that copy all but a line or two from a related test difficult to distinguish between (e.g. test_resample_first_keep_attrs vs. test_resample_mean_keep_attrs).

I don't entirely buy the SO post's perspective but he does contradict your point to some extent:

If you can factor something in test code and you're sure it's ok, then do it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, readability does count, and simple code is more readable. When I look at the line "assert array.attrs=self.attrs", my first thought is "why am I comparing an array to a Unittest.TestCase?". I have to go somewhere else in the code (to the setup function) to figure out what's going on. There are similar questions like "what is self.dv and how is it different from self.v" that would be more readable if the code was not factored out. I'm questioning "is it really OK to factor this?". We're talking about a very small difference in readability, if any, so if you like elegance I would see why you'd rather have it factored. And it's not that bad when the variables that are set up explain themselves.

You can immediately distinguish between test_resample_first_keep_attrs and test_resample_mean_keep_attrs by their name. If you can't, I should make the name clearer or add a docstring.

For this particular case, since you want the initialization code factored out I'll edit the fixtures to have a time axis, and then use them. I don't think it's a good idea to have two sets of fixtures (one with and one without a time axis). If adding a time axis breaks other tests I'll keep it how it is now.


resampled_array = array.resample('1D', dim='time', how='first', keep_attrs=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of writing this whole method twice, why don't you use a loop? e.g.,

for how  in ['first', 'mean']:
   ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple reasons not to - it may be unclear which "how" caused the test to fail (though that can usually be fixed with good error messages), and it will often be unclear whether multiple "how"s would cause the test to fail. For "first" and "mean" in particular, different bugs may cause different ones (or both) to fail.

I'll consider this though as I clean the tests up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair point -- this is always a tradeoff, especially with tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, I thought the whole point here was that they the different methods (e.g. mean, first, max, ...) are supposed to act in the same way. I think this is a great place to loop through a series of options:

for keep_attrs in [True, False]:
    for how in ['mean', 'max', 'min', 'std', 'first', 'last']:
        resampled_array = array.resample('1D', dim='time', how=how, keep_attrs=keep_attrs)

        if keep_attrs:
            # check that attrs are present
        else:
            # check that attrs are not present

Copy link
Contributor Author

@mcgibbon mcgibbon Apr 19, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory yes, but "first" and "last" are treated differently from the rest, so I tested "first" and one random other option ("mean"), and put them in two separate tests. If I ran the test you just wrote, and it failed for how="mean", keep_attrs=True, I wouldn't know anything about any of the other cases until I fixed the code for that case so that it could run the other cases. It's best practice to split different cases you want to test into separate tests.

It might make sense to have one test for ("first", "last") and another for the rest, but it shouldn't be necessary to test the other cases (and you'd never get a fully exhaustive list). If you were to add any new case it should probably be a lambda function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get what you are saying but I disagree with the premise. Remember, the reason you are writing this test in the first place was that how = {first, last} were behaving differently. So the point here should be to make sure that they don't do that anymore. It is because they are treated differently that we should test them together. The way I see it is that this is a single case.

In practice, we make multiple assertions in single unit test all the time (browse the xarray code base for plenty of examples). If the first assertion fails in a test, it doesn't continue and that isn't a problem because the test failed.

Lastly, no, we'll never get an exhaustive list but that shouldn't stop us from thinking about the known edge cases. The known edge case in question is that how = {first, mean} treat the attributes differently. If we are concerned about the resample.reduce treatment of attributes, maybe add a how=np.mean along with the others.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue wasn't that how=first behaves differently from how=mean, the problem was that how=first and how=mean were both failing, but the causes of the failures were different because the code that handles them is different. Because the failure of one doesn't tell you whether the other will fail, they're two different tests.

It makes sense to have a test that loops over how=('first', 'last') and a test that loops over ('mean', 'max', 'min', 'std'), but not a test that loops over the union. And you can have multiple assertions in one test (so long as they're all testing the same thing), but not multiple tests in one test function. If you want to know when they fail independently of each other they should be split into multiple tests.

I treated the known edge case by having a test for how=first and a test for how=mean. I'm not really concerned about the treatment of other functions based on what I've seen of the code, but if you want I can add some more functions in the form of a for loop for completeness. I should at least add a case for a function argument like np.mean.

actual = resampled_array.attrs
expected = array.attrs
self.assertEqual(expected, actual)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use assert expected.attrs == array.attrs or self.assertEqual(expected.attrs, array.attrs)


resampled_array = array.resample('1D', dim='time', how='first', keep_attrs=False)
assert resampled_array.attrs == {}

def test_resample_mean_keep_attrs(self):
times = pd.date_range('2000-01-01', freq='6H', periods=10)
array = DataArray(np.arange(10), [('time', times)])
array.attrs['meta'] = 'data'

resampled_array = array.resample('1D', dim='time', how='mean', keep_attrs=True)
actual = resampled_array.attrs
expected = array.attrs
self.assertEqual(expected, actual)

resampled_array = array.resample('1D', dim='time', how='mean', keep_attrs=False)
assert resampled_array.attrs == {}

def test_resample_skipna(self):
times = pd.date_range('2000-01-01', freq='6H', periods=10)
array = DataArray(np.ones(10), [('time', times)])
Expand Down
42 changes: 41 additions & 1 deletion xarray/test/test_dataset.py
Original file line number Diff line number Diff line change
Expand Up @@ -1647,7 +1647,7 @@ def test_resample_and_first(self):
'bar': ('time', np.random.randn(10), {'meta': 'data'}),
'time': times})

actual = ds.resample('1D', dim='time', how='first')
actual = ds.resample('1D', dim='time', how='first', keep_attrs=True)
expected = ds.isel(time=[0, 4, 8])
self.assertDatasetIdentical(expected, actual)

Expand All @@ -1658,6 +1658,46 @@ def test_resample_and_first(self):
actual = ds.resample('3H', 'time', how=how)
self.assertDatasetEqual(expected, actual)

def test_resample_by_mean_with_keep_attrs(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same thoughts on these methods as the other -- you can consolidate them into a single test with a loop over how

times = pd.date_range('2000-01-01', freq='6H', periods=10)
ds = Dataset({'foo': (['time', 'x', 'y'], np.random.randn(10, 5, 3)),
'bar': ('time', np.random.randn(10), {'meta': 'data'}),
'time': times})
ds.attrs['dsmeta'] = 'dsdata'

resampled_ds = ds.resample('1D', dim='time', how='mean', keep_attrs=True)
actual = resampled_ds['bar'].attrs
expected = ds['bar'].attrs
self.assertEqual(expected, actual)

actual = resampled_ds.attrs
expected = ds.attrs
self.assertEqual(expected, actual)

def test_resample_by_mean_discarding_attrs(self):
times = pd.date_range('2000-01-01', freq='6H', periods=10)
ds = Dataset({'foo': (['time', 'x', 'y'], np.random.randn(10, 5, 3)),
'bar': ('time', np.random.randn(10), {'meta': 'data'}),
'time': times})
ds.attrs['dsmeta'] = 'dsdata'

resampled_ds = ds.resample('1D', dim='time', how='mean', keep_attrs=False)

assert resampled_ds['bar'].attrs == {}
assert resampled_ds.attrs == {}

def test_resample_by_last_discarding_attrs(self):
times = pd.date_range('2000-01-01', freq='6H', periods=10)
ds = Dataset({'foo': (['time', 'x', 'y'], np.random.randn(10, 5, 3)),
'bar': ('time', np.random.randn(10), {'meta': 'data'}),
'time': times})
ds.attrs['dsmeta'] = 'dsdata'

resampled_ds = ds.resample('1D', dim='time', how='last', keep_attrs=False)

assert resampled_ds['bar'].attrs == {}
assert resampled_ds.attrs == {}

def test_to_array(self):
ds = Dataset(OrderedDict([('a', 1), ('b', ('x', [1, 2, 3]))]),
coords={'c': 42}, attrs={'Conventions': 'None'})
Expand Down