
Slow performance of rolling.reduce #1831

Closed
fujiisoup opened this issue Jan 15, 2018 · 4 comments
@fujiisoup
Member

Code Sample, a copy-pastable example if possible

In [1]: import numpy as np
   ...: import xarray as xr
   ...: 
   ...: da = xr.DataArray(np.random.randn(1000, 100), dims=['x', 'y'],
   ...:                   coords={'x': np.arange(1000)})
   ...: 

In [2]: %%timeit
   ...: da.rolling(x=10).reduce(np.sum)
   ...: 
2.04 s ± 8.25 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Problem description

In DataArray.rolling, we index with the .isel method for every window, constructing a huge number of xr.DataArray instances. This is very inefficient.

Of course, we can use bottleneck methods if available, but bottleneck provides only a limited set of functions.
(This also limits possible extensions of rolling, such as ND-rolling (#819), window types (#1142), and strides (#819).)

I am wondering if we could skip the sanity checks in our DataArray.isel -> Variable.isel indexing path.
Or could we directly construct a single large DataArray instead of many small DataArrays?
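For illustration, the per-window pattern can be mimicked in plain NumPy. This is only a sketch of the access pattern, not xarray's actual implementation (in xarray the overhead is larger still, because each step also constructs a full DataArray):

```python
import numpy as np

a = np.random.randn(1000, 100)
window = 10

# One slicing operation per window position -- analogous to calling
# .isel once per window. The Python-level loop dominates the runtime.
out = np.full(a.shape, np.nan)
for i in range(window - 1, a.shape[0]):
    out[i] = a[i - window + 1:i + 1].sum(axis=0)
```

Even in pure NumPy this loops 991 times; wrapping each slice in a DataArray multiplies the per-iteration cost.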


jhamman commented Jan 15, 2018

@fujiisoup - I think this is a great idea. As you've noted, the reduce implementation is about as naive as it comes (#668). While the __iter__ method should continue to return DataArray objects, the reduce method can be optimized.

@fujiisoup
Member Author

I'm thinking of using fancy indexing to speed up rolling.reduce,
by temporarily constructing a single large array with da.ndim + 1 dimensions.

E.g. for the following DataArray

In [2]: da
Out [2]: 
<xarray.DataArray (x: 4)>
array([0, 1, 2, 3])
Dimensions without coordinates: x

a larger array with an additional dimension (x_rolling) would be created:

<xarray.DataArray (x: 4, x_rolling: 3)>
array([[ nan,   0.,   1.],
       [  0.,   1.,   2.],
       [  1.,   2.,   3.],
       [  2.,   3.,  nan]])
Dimensions without coordinates: x, x_rolling

then reduce along x_rolling dimension.
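Concretely, a sketch in plain NumPy of how that windowed array could be built with a single fancy-indexing call (assuming a centered window; `pad`, `idx`, and `rolled` are illustrative names, not proposed API):

```python
import numpy as np

a = np.arange(4, dtype=float)
window = 3

# Pad with NaN so out-of-range positions show up as NaN, then gather
# every window at once with one fancy-indexing operation.
pad = window // 2
padded = np.concatenate([np.full(pad, np.nan), a, np.full(pad, np.nan)])
idx = np.arange(len(a))[:, None] + np.arange(window)[None, :]
rolled = padded[idx]          # shape (x, x_rolling) == (4, 3)
result = np.nansum(rolled, axis=1)  # reduce along x_rolling
```

`rolled` matches the (x: 4, x_rolling: 3) array shown above, and the reduction is a single vectorized call instead of a Python loop.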

The advantages would be

  • Indexing occurs only once.
  • Reducing operation can be easily vectorized.

The disadvantages would be

  • It constructs a huge array, (window_size - 1) * da.size elements larger than the original, consuming a lot of memory.

I think this disadvantage could be solved if we used np.lib.stride_tricks.as_strided for the np.ndarray backend case.
(how about with dask?)
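For the plain np.ndarray case, a sketch of the stride-tricks idea (no NaN padding here, so only the complete windows are produced; `windows` is an illustrative name):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(10, dtype=float)
window = 3
n = a.shape[0]

# A read-only *view* with shape (n - window + 1, window): no data is
# copied, so memory use does not grow with the window size.
windows = as_strided(a, shape=(n - window + 1, window),
                     strides=(a.strides[0], a.strides[0]),
                     writeable=False)
result = windows.sum(axis=1)
```

Note that as_strided is easy to misuse (wrong strides silently read out-of-bounds memory); NumPy 1.20+ provides np.lib.stride_tricks.sliding_window_view as a safer wrapper around the same idea.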


shoyer commented Jan 16, 2018

Yes, I think the stride tricks version would be a significant improvement. See this numpy PR for discussion/examples: numpy/numpy#31


fujiisoup commented Jan 16, 2018

Thanks for the information.
I will look into the issue.

I think the sliding-and-stack method itself would also be handy.
I will start from this.
