-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
keep_attrs for Dataset.resample and DataArray.resample #829
Changes from all commits
e2a30b2
720cfd9
d4f9daf
d241fb9
e289e5e
e1ec92c
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1407,6 +1407,32 @@ def test_resample_first(self): | |
name='time') | ||
self.assertDataArrayIdentical(expected, actual) | ||
|
||
def test_resample_first_keep_attrs(self): | ||
times = pd.date_range('2000-01-01', freq='6H', periods=10) | ||
array = DataArray(np.arange(10), [('time', times)]) | ||
array.attrs['meta'] = 'data' | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. you may consider using the fixtures we create here. You could get away with: array = self.dv
array.attrs = self.attrs
# later on...
assert array.attrs == self.attrs There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I was copying the style of the existing test_resample methods. I think a new array/dataset is created for that method because the fixtures don't include a time axis, which is important for resampling. Code duplication in tests isn't something to be feared, though. This person on stack overflow summarizes it pretty well. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Readability still counts in test code. I find unit tests that copy all but a line or two from a related test difficult to distinguish between (e.g. I don't entirely buy the SO post's perspective but he does contradict your point to some extent:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, readability does count, and simple code is more readable. When I look at the line "assert array.attrs=self.attrs", my first thought is "why am I comparing an array to a Unittest.TestCase?". I have to go somewhere else in the code (to the setup function) to figure out what's going on. There are similar questions like "what is self.dv and how is it different from self.v" that would be more readable if the code was not factored out. I'm questioning "is it really OK to factor this?". We're talking about a very small difference in readability, if any, so if you like elegance I would see why you'd rather have it factored. And it's not that bad when the variables that are set up explain themselves. You can immediately distinguish between test_resample_first_keep_attrs and test_resample_mean_keep_attrs by their name. If you can't, I should make the name clearer or add a docstring. For this particular case, since you want the initialization code factored out I'll edit the fixtures to have a time axis, and then use them. I don't think it's a good idea to have two sets of fixtures (one with and one without a time axis). If adding a time axis breaks other tests I'll keep it how it is now. |
||
|
||
resampled_array = array.resample('1D', dim='time', how='first', keep_attrs=True) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Instead of writing this whole method twice, why don't you use a loop? e.g.,
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There are a couple reasons not to - it may be unclear which "how" caused the test to fail (though that can usually be fixed with good error messages), and it will often be unclear whether multiple "how"s would cause the test to fail. For "first" and "mean" in particular, different bugs may cause different ones (or both) to fail. I'll consider this though as I clean the tests up. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That's a fair point -- this is always a tradeoff, especially with tests. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Wait, I thought the whole point here was that they the different methods (e.g. for keep_attrs in [True, False]:
for how in ['mean', 'max', 'min', 'std', 'first', 'last']:
resampled_array = array.resample('1D', dim='time', how=how, keep_attrs=keep_attrs)
if keep_attrs:
# check that attrs are present
else:
# check that attrs are not present There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In theory yes, but "first" and "last" are treated differently from the rest, so I tested "first" and one random other option ("mean"), and put them in two separate tests. If I ran the test you just wrote, and it failed for how="mean", keep_attrs=True, I wouldn't know anything about any of the other cases until I fixed the code for that case so that it could run the other cases. It's best practice to split different cases you want to test into separate tests. It might make sense to have one test for ("first", "last") and another for the rest, but it shouldn't be necessary to test the other cases (and you'd never get a fully exhaustive list). If you were to add any new case it should probably be a lambda function. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I get what you are saying but I disagree with the premise. Remember, the reason you are writing this test in the first place was that In practice, we make multiple assertions in single unit test all the time (browse the xarray code base for plenty of examples). If the first assertion fails in a test, it doesn't continue and that isn't a problem because the test failed. Lastly, no, we'll never get an exhaustive list but that shouldn't stop us from thinking about the known edge cases. The known edge case in question is that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The issue wasn't that how=first behaves differently from how=mean, the problem was that how=first and how=mean were both failing, but the causes of the failures were different because the code that handles them is different. Because the failure of one doesn't tell you whether the other will fail, they're two different tests. It makes sense to have a test that loops over how=('first', 'last') and a test that loops over ('mean', 'max', 'min', 'std'), but not a test that loops over the union. And you can have multiple assertions in one test (so long as they're all testing the same thing), but not multiple tests in one test function. If you want to know when they fail independently of each other they should be split into multiple tests. I treated the known edge case by having a test for how=first and a test for how=mean. I'm not really concerned about the treatment of other functions based on what I've seen of the code, but if you want I can add some more functions in the form of a for loop for completeness. I should at least add a case for a function argument like np.mean. |
||
actual = resampled_array.attrs | ||
expected = array.attrs | ||
self.assertEqual(expected, actual) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. use |
||
|
||
resampled_array = array.resample('1D', dim='time', how='first', keep_attrs=False) | ||
assert resampled_array.attrs == {} | ||
|
||
def test_resample_mean_keep_attrs(self): | ||
times = pd.date_range('2000-01-01', freq='6H', periods=10) | ||
array = DataArray(np.arange(10), [('time', times)]) | ||
array.attrs['meta'] = 'data' | ||
|
||
resampled_array = array.resample('1D', dim='time', how='mean', keep_attrs=True) | ||
actual = resampled_array.attrs | ||
expected = array.attrs | ||
self.assertEqual(expected, actual) | ||
|
||
resampled_array = array.resample('1D', dim='time', how='mean', keep_attrs=False) | ||
assert resampled_array.attrs == {} | ||
|
||
def test_resample_skipna(self): | ||
times = pd.date_range('2000-01-01', freq='6H', periods=10) | ||
array = DataArray(np.ones(10), [('time', times)]) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1647,7 +1647,7 @@ def test_resample_and_first(self): | |
'bar': ('time', np.random.randn(10), {'meta': 'data'}), | ||
'time': times}) | ||
|
||
actual = ds.resample('1D', dim='time', how='first') | ||
actual = ds.resample('1D', dim='time', how='first', keep_attrs=True) | ||
expected = ds.isel(time=[0, 4, 8]) | ||
self.assertDatasetIdentical(expected, actual) | ||
|
||
|
@@ -1658,6 +1658,46 @@ def test_resample_and_first(self): | |
actual = ds.resample('3H', 'time', how=how) | ||
self.assertDatasetEqual(expected, actual) | ||
|
||
def test_resample_by_mean_with_keep_attrs(self): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have the same thoughts on these methods as the other -- you can consolidate them into a single test with a loop over |
||
times = pd.date_range('2000-01-01', freq='6H', periods=10) | ||
ds = Dataset({'foo': (['time', 'x', 'y'], np.random.randn(10, 5, 3)), | ||
'bar': ('time', np.random.randn(10), {'meta': 'data'}), | ||
'time': times}) | ||
ds.attrs['dsmeta'] = 'dsdata' | ||
|
||
resampled_ds = ds.resample('1D', dim='time', how='mean', keep_attrs=True) | ||
actual = resampled_ds['bar'].attrs | ||
expected = ds['bar'].attrs | ||
self.assertEqual(expected, actual) | ||
|
||
actual = resampled_ds.attrs | ||
expected = ds.attrs | ||
self.assertEqual(expected, actual) | ||
|
||
def test_resample_by_mean_discarding_attrs(self): | ||
times = pd.date_range('2000-01-01', freq='6H', periods=10) | ||
ds = Dataset({'foo': (['time', 'x', 'y'], np.random.randn(10, 5, 3)), | ||
'bar': ('time', np.random.randn(10), {'meta': 'data'}), | ||
'time': times}) | ||
ds.attrs['dsmeta'] = 'dsdata' | ||
|
||
resampled_ds = ds.resample('1D', dim='time', how='mean', keep_attrs=False) | ||
|
||
assert resampled_ds['bar'].attrs == {} | ||
assert resampled_ds.attrs == {} | ||
|
||
def test_resample_by_last_discarding_attrs(self): | ||
times = pd.date_range('2000-01-01', freq='6H', periods=10) | ||
ds = Dataset({'foo': (['time', 'x', 'y'], np.random.randn(10, 5, 3)), | ||
'bar': ('time', np.random.randn(10), {'meta': 'data'}), | ||
'time': times}) | ||
ds.attrs['dsmeta'] = 'dsdata' | ||
|
||
resampled_ds = ds.resample('1D', dim='time', how='last', keep_attrs=False) | ||
|
||
assert resampled_ds['bar'].attrs == {} | ||
assert resampled_ds.attrs == {} | ||
|
||
def test_to_array(self): | ||
ds = Dataset(OrderedDict([('a', 1), ('b', ('x', [1, 2, 3]))]), | ||
coords={'c': 42}, attrs={'Conventions': 'None'}) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we call out
first
andlast
as these specific methods?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really. It was also a problem with "mean" and I don't know what other operations on DataArray objects, probably all of the numpy operations. AFAIK for Dataset the only problems were "first" and "last", but I only tested "first" and "mean" - it was a problem for "first" and not for "mean".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you'd like another message that takes that into account, let me know and I can put it in. It just gets verbose quickly to explain the difference in behavior for DataArray and Dataset objects, with different "how" options (do you call those different resampling operations?), and the fact that not all "how"s were tested.