
nd-rolling #4219
Merged: 27 commits, Aug 8, 2020

Conversation

@fujiisoup (Member) commented Jul 12, 2020

  • Closes Convolution operation #4196
  • Tests added
  • Passes isort -rc . && black . && mypy . && flake8
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

The implementation of nd-rolling turned out to be straightforward.
The core part is implemented, but I am wondering what the best API is while keeping it backward compatible.

Obviously, it should basically look like

da.rolling(x=3, y=3).mean()

A problem is the other parameters, center and min_periods.
In principle, they can depend on the dimension;
for example, we could have center=True for x but not for y.

So maybe we should allow a dictionary for them?

da.rolling(x=3, y=3, center={'x': True, 'y': False}, min_periods={'x': 1, 'y': None}).mean()

The same thing happens for the .construct method.

da.rolling(x=3, y=3).construct(x='x_window', y='y_window', stride={'x': 2, 'y': 1})

I'm afraid this dictionary argument may be a bit too verbose.

Does anyone have another idea?

@pytest.mark.parametrize("center", (True, False))
@pytest.mark.parametrize("min_periods", (None, 1))
def test_ndrolling_reduce(da, center, min_periods):
    rolling_obj = da.rolling(time=3, a=2, center=center, min_periods=min_periods)
@fujiisoup (Member Author):

This is a basic test for the new nd-rolling

@max-sixty (Collaborator):

This looks very promising; I'm surprised it was possible without more code. I'm being slow, but where is the nd-rolling algorithm? I had thought bottleneck didn't support more than one dimension (https://bottleneck.readthedocs.io/en/latest/bottleneck.move.html), and that we'd have to implement our own in numbagg (which would be very possible).

@max-sixty (Collaborator) commented Jul 12, 2020

Re the API, I think the dict is probably the best option, although it does add complexity: the arguments become differently typed depending on one vs. multiple dimensions.

One alternative is to allow fluent args, like:

(
    da
    .rolling(x=3, center=True)
    .rolling(y=5, min_periods=2)
    .mean()
)

...but does that then seem like the second rolling is operating on the result of the first?

@fujiisoup (Member Author):

Hi @max-sixty

One alternative is to allow fluent args, like:
...but does that then seem like the second rolling is operating on the result of the first?

I hadn't thought of that until just now. But yes, it does sound like a repeated rolling operation.

I'm being slow, but where is the nd-rolling algo? I had thought bottleneck didn't support more than one dimension?

No, bottleneck doesn't support it. With nd-rolling, we need to use numpy reductions.
The skipna=True path is currently slow, but it can be improved by replacing nan values before applying the stride trick.
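A minimal numpy sketch of this idea (not the PR's code; `sliding_window_view`, available in numpy >= 1.20, stands in for the stride trick here):

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0, 4.0])

# Slow path: build the strided windows first, then apply a nan-aware
# reduction, which has to mask/copy the much larger windowed view.
windows = np.lib.stride_tricks.sliding_window_view(a, 2)
slow = np.nansum(windows, axis=-1)

# Faster path: replace nan *before* the stride trick (0 is the identity
# of sum), then apply the plain reduction directly on the view.
filled = np.where(np.isnan(a), 0.0, a)
fast = np.lib.stride_tricks.sliding_window_view(filled, 2).sum(axis=-1)

print(slow, fast)  # identical results
```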

@fujiisoup (Member Author):

Another API concern: we currently use min_periods, which implicitly assumes the one-dimensional case.

For n dimensions, I think a min_counts argument, as in bottleneck, is more appropriate: it sets a lower bound on the number of non-missing entries in the n-dimensional window.

Even if we keep min_periods, we could disallow a per-dimension argument and keep it a scalar.

@pep8speaks commented Jul 12, 2020

Hello @fujiisoup! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-07 22:59:00 UTC

if hasattr(axis, "__len__"):
    for ax, win, cen in zip(axis, window, center):
        a = rolling_window(a, ax, win, cen, fill_value)
    return a
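The snippet above builds the n-d window by applying the 1-d routine once per axis; the effect can be sketched with numpy's `sliding_window_view` (illustration only, not the PR's code):

```python
import numpy as np

a = np.arange(12.0).reshape(3, 4)

# Each pass windows one axis and appends one window dimension at the
# end, so iterating over axes yields the full n-d rolling window.
v = np.lib.stride_tricks.sliding_window_view(a, 2, axis=0)  # (2, 4, 2)
v = np.lib.stride_tricks.sliding_window_view(v, 2, axis=1)  # (2, 3, 2, 2)

# v[i, j] is the 2x2 block of `a` whose top-left corner is (i, j).
print(v[0, 0])
```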
@fujiisoup (Member Author):

I hope this is true, though I didn't check it.
A suspicious part is da.concatenate, which I expect does not copy the original (strided) array.

@fujiisoup (Member Author):

I noticed that ghosting breaks my assumption. I need to update the algorithm here.

@dcherian (Contributor):

I think the dictionary is OK. We can allow scalars for center, stride, etc., in which case the scalar applies to all dimensions; this would preserve backward compatibility.

A scalar for min_counts also seems OK to me. We can update it if someone requests it.
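The scalar-or-mapping convention could be normalized with a small helper like this (hypothetical; `per_dim` is not part of the PR):

```python
# Hypothetical helper: broadcast a scalar argument to every rolling
# dimension, or fill in missing keys when a mapping is given.
def per_dim(arg, dims, default):
    if isinstance(arg, dict):
        return {d: arg.get(d, default) for d in dims}
    return {d: arg for d in dims}

dims = ["x", "y"]
print(per_dim(True, dims, False))         # scalar broadcasts to all dims
print(per_dim({"x": True}, dims, False))  # mapping keeps per-dim values
```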

@fujiisoup fujiisoup marked this pull request as ready for review July 14, 2020 00:26
@fujiisoup (Member Author):

I think it is now ready for review, though I'm sure the tests miss a lot of edge cases.
We can fix those as they are pointed out.

@fujiisoup (Member Author):

A possible future improvement is better nan-reduction methods for nd-rolling.
Currently, we just use numpy's nan-reductions, which are memory-consuming for strided arrays.

This can be solved by replacing nan with an appropriate value and then applying a non-nan reduction;
e.g.,

da.rolling(x=2, y=3).construct(x='xw', y='yw').sum(['xw', 'yw'])

should give the same result as

da.rolling(x=2, y=3).construct(x='xw', y='yw', fill_value=0).sum(['xw', 'yw'], skipna=False)

and the latter is much more memory efficient.

I'd like to leave this improvement to a future PR.
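A quick check of the equivalence claimed above (a sketch; note that for data which itself contains nan, those values would also need replacing first, e.g. with fillna, which is the proposed improvement):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("x", "y"))

# Default construct pads the window edges with nan, so the skipna sum
# must scan for nan across the whole windowed (strided) array ...
a = da.rolling(x=2, y=3).construct(x="xw", y="yw").sum(["xw", "yw"])

# ... while padding with 0 (the identity of sum) allows skipna=False,
# which avoids the nan-masking copy of the windowed array.
b = da.rolling(x=2, y=3).construct(x="xw", y="yw", fill_value=0).sum(
    ["xw", "yw"], skipna=False
)

print(bool((a == b).all()))
```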

@fujiisoup (Member Author):

I got a type-checking error, only in CI and not locally, from code that I didn't change.

@fujiisoup (Member Author):

Could anyone kindly review this?

@max-sixty (Collaborator) left a review:

Looks really good, great implementation imo

@@ -802,7 +802,7 @@ def rolling(
     Minimum number of observations in window required to have a value
     (otherwise result is NA). The default, None, is equivalent to
     setting min_periods equal to the size of the window.
-    center : boolean, default False
+    center : boolean, or a mapping, default False
@max-sixty (Collaborator):

Types on the signature should change too (indeed, I'm surprised mypy doesn't fail)

for k in self._attributes
if getattr(self, k, None) is not None
for k in list(self.dim) + list(self.window)
@max-sixty (Collaborator):

Isn't this a list of ints? Is it going to do getattr(da, 3)?

xarray/core/rolling.py (resolved)
 a = np.pad(a, pads, mode="constant", constant_values=fill_value)
-return _rolling_window(a, window, axis)
+for ax, win in zip(axis, window):
+    a = _rolling_window(a, win, ax)
@max-sixty (Collaborator):

This is very clever! I spent a while trying to figure out how it works...

@pytest.mark.parametrize("ds", (1,), indirect=True)
@pytest.mark.parametrize("center", (True, False))
@pytest.mark.parametrize("min_periods", (None, 1))
@pytest.mark.parametrize("name", ("sum", "mean", "max"))
@max-sixty (Collaborator):

Will this test fail for something like std?
IIUC, iteratively applying the function to the array for each dim only works for some functions.

Which is probably fine, no need to write more tests imo, though it may be worth making a note.
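The suspicion here can be demonstrated directly with plain numpy: max decomposes over axes, but std does not.

```python
import numpy as np

a = np.array([[0.0, 10.0], [0.0, 10.0]])

# max over the whole block equals max over rows, then over columns ...
assert a.max(axis=0).max() == a.max()

# ... but std does not decompose: each column is constant (std 0), so
# the second pass sees [0, 0], while the joint std of {0, 10} is 5.
print(a.std(axis=0).std(), a.std())  # 0.0 vs 5.0
```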

@fujiisoup (Member Author):

In this case, many other functions should give the same result.
With nans at random positions, the results would differ.
I added a few more tests, including std, but I don't think it is necessary to test all the nanops.

@fujiisoup (Member Author):

Thanks @max-sixty for the review ;)
I'll work on the updates over the next few days.

assert_allclose(actual, expected)
assert actual.dims == expected.dims

# Do it in the oposite order
@max-sixty (Collaborator):

Suggested change
# Do it in the oposite order
# Do it in the opposite order

@max-sixty (Collaborator):

I'm still a bit confused. How does the "roll over each dimension in turn" approach equal the "roll over both dimensions together" approach with a function like std? Here's a proposed counterexample:

import xarray as xr
import numpy as np

da = xr.DataArray(np.asarray([[0,10,0],[0,10,0], [0,10,0]]), dims=list('xy'))
print(da)

<xarray.DataArray (x: 3, y: 3)>
array([[ 0, 10,  0],
       [ 0, 10,  0],
       [ 0, 10,  0]])
Dimensions without coordinates: x, y

x_std = da.rolling(dict(x=2)).std()
print(x_std)

<xarray.DataArray (x: 3, y: 3)>
array([[nan, nan, nan],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
Dimensions without coordinates: x, y

x_then_y_std = x_std.rolling(dict(y=2)).std()
print(x_then_y_std)

<xarray.DataArray (x: 3, y: 3)>
array([[nan, nan, nan],
       [nan,  0.,  0.],
       [nan,  0.,  0.]])
Dimensions without coordinates: x, y

combined_std = da.rolling(dict(x=2, y=2)).std()
print(combined_std)

<xarray.DataArray (x: 3, y: 3)>
array([[nan, nan, nan],
       [nan,  5.,  5.],
       [nan,  5.,  5.]])
Dimensions without coordinates: x, y

@dcherian (Contributor) commented Aug 7, 2020

I think it's more like

da.rolling(x=2).construct("xr").rolling(y=2).construct("yr").std(["xr", "yr"])

but that gives

array([[0., 5., 5.],
       [0., 5., 5.],
       [0., 5., 5.]])

@fujiisoup (Member Author):

Thanks @max-sixty.

You are completely correct. Since the tests passed, I was fooling myself.

The reason was that the dataset I used for the test did not have time and x simultaneously,
so I was testing only 1d-rolling, not 2d-rolling.

Fixed. Now the test correctly fails for mean and std, but passes with max and sum.

@max-sixty (Collaborator):

Great! Thanks @fujiisoup !

Ready to go from my POV

@fujiisoup fujiisoup merged commit 1d3dee0 into pydata:master Aug 8, 2020
@fujiisoup (Member Author):

@max-sixty, thanks for the review. Merged.

@fujiisoup fujiisoup deleted the nd_rolling branch August 8, 2020 07:23
@keewis keewis mentioned this pull request Aug 8, 2020
dcherian added a commit to rpgoldman/xarray that referenced this pull request Aug 14, 2020
* 'master' of github.com:pydata/xarray: (260 commits)
  Increase support window of all dependencies (pydata#4296)
  Implement interp for interpolating between chunks of data (dask) (pydata#4155)
  Add @mathause to current core developers. (pydata#4335)
  install sphinx-autosummary-accessors from conda-forge (pydata#4332)
  Use sphinx-accessors-autosummary (pydata#4323)
  ndrolling fixes (pydata#4329)
  DOC: fix typo argmin -> argmax in DataArray.argmax docstring (pydata#4327)
  pin sphinx to 3.1(pydata#4326)
  nd-rolling (pydata#4219)
  Implicit dask import 4164 (pydata#4318)
  allow customizing the inline repr of a duck array (pydata#4248)
  silence the known docs CI issues (pydata#4316)
  enh: fixed pydata#4302 (pydata#4315)
  Remove all unused and warn-raising methods from AbstractDataStore (pydata#4310)
  Fix map_blocks example (pydata#4305)
  Fix docstring for missing_dims argument to isel methods (pydata#4298)
  Support for PyCharm remote deployment (pydata#4299)
  Update map_blocks and map_overlap docstrings (pydata#4303)
  Lazily load resource files (pydata#4297)
  warn about the removal of the ufuncs (pydata#4268)
  ...
dcherian added a commit to dcherian/xarray that referenced this pull request Aug 15, 2020
dcherian added a commit to dcherian/xarray that referenced this pull request Aug 16, 2020
@max-sixty max-sixty mentioned this pull request Apr 17, 2022
Successfully merging this pull request may close these issues.

Convolution operation
4 participants