Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataArrayRolling.mean() ignores skipna=True kwarg #6772

Open
4 tasks done
corakingdon opened this issue Jul 11, 2022 · 6 comments
Open
4 tasks done

DataArrayRolling.mean() ignores skipna=True kwarg #6772

corakingdon opened this issue Jul 11, 2022 · 6 comments

Comments

@corakingdon
Copy link

corakingdon commented Jul 11, 2022

What happened?

In the following example, it appears that the skipna=True argument is being ignored, unless you call construct before taking the mean.

What did you expect to happen?

I expected:

da.rolling(time=3).mean("time", skipna=True) == da.rolling(time=3).construct("window").mean("window", skipna=True)

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np

da = xr.DataArray([0, 1, 2, np.NAN, 4, 5, 6], dims="time")

# Results 1 and 2 produce the same array, without skipping NANs
result1 = da.rolling(time=3).mean("time", skipna=True)
result2 = da.rolling(time=3).mean("time", skipna=False)

# Results 3 and 4 produce the expected (different) arrays
result3 = da.rolling(time=3).construct("window").mean("window", skipna=True)
result4 = da.rolling(time=3).construct("window").mean("window", skipna=False)

print(result1)
print(result2)
print(result3)
print(result4)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

<xarray.DataArray (time: 7)>
array([nan, nan,  1., nan, nan, nan,  5.])
Dimensions without coordinates: time
<xarray.DataArray (time: 7)>
array([nan, nan,  1., nan, nan, nan,  5.])
Dimensions without coordinates: time
<xarray.DataArray (time: 7)>
array([0. , 0.5, 1. , 1.5, 3. , 4.5, 5. ])
Dimensions without coordinates: time
<xarray.DataArray (time: 7)>
array([nan, nan,  1., nan, nan, nan,  5.])
Dimensions without coordinates: time

Anything else we need to know?

Note: the first two NaNs are expected due to the min_periods argument to rolling(), but the middle NaNs are what are unexpected when skipna = True

Environment

INSTALLED VERSIONS

commit: None
python: 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:50:36) [MSC v.1929 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: AMD64 Family 23 Model 17 Stepping 0, AuthenticAMD
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('English_United States', '1252')
libhdf5: 1.12.1
libnetcdf: 4.8.1

xarray: 0.19.0
pandas: 1.3.4
numpy: 1.21.2
scipy: 1.7.1
netCDF4: 1.6.0
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.12.0
cftime: 1.6.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: None
iris: None
bottleneck: None
dask: 2022.6.1
distributed: 2022.6.1
matplotlib: 3.4.3
cartopy: 0.20.1
seaborn: None
numbagg: None
pint: 0.19.2
setuptools: 59.8.0
pip: 22.1.2
conda: None
pytest: 7.1.2
IPython: 8.4.0
sphinx: None

@corakingdon corakingdon added bug needs triage Issue that has not been reviewed by xarray team member labels Jul 11, 2022
@corakingdon corakingdon reopened this Jul 11, 2022
@andersy005 andersy005 added topic-rolling and removed needs triage Issue that has not been reviewed by xarray team member labels Jul 11, 2022
@keewis
Copy link
Collaborator

keewis commented Feb 21, 2023

Apologies for letting this sit for so long.

The reason for the unexpected behavior seems to be that mean is implemented using sum / count:

mean = _reduce_method("mean", None, _mean)

result = self.sum(keep_attrs=False, **kwargs) / self.count(
keep_attrs=False
).astype(self.obj.dtype, copy=False)

where min_periods is applied in the sum (by masking values where count < min_periods).

However, sum on rolling objects will fill any missing values with 0 before doing anything else, so when the actual sum is computed skipna does not have any effect.

So if you were to set min_periods=1 you'd get the same result as what you'd expect while min_periods=3 is what you're seeing.

@pydata/xarray, any idea what to do here? Should we document that passing skipna does not have any effect on rolling window sums / means, or would it be better to change the implementation? Or maybe I'm missing something else?

@dcherian
Copy link
Contributor

dcherian commented Feb 21, 2023

I think the reason it's like this is the conflict min_periods vs skipna issue.

We could just raise and ask the user to provide an appropriate min_periods (min_periods=window for skipna=False and min_periods=1 for skipna=True)

@lukasbrunner
Copy link

I hope its okay to comment this here, but it seems to be a related issue. I'm working with .construct (without directing reducing again) quite a bit and find the behavior confusing:

da.rolling(time=3).construct("window")

gives

array([[nan, nan,  0.],
       [nan,  0.,  1.],
       [ 0.,  1.,  2.],
       [ 1.,  2., nan],
       [ 2., nan,  4.],
       [nan,  4.,  5.],
       [ 4.,  5.,  6.]])

but what I'd expect is rather

array([[nan, nan,  nan],
       [nan,  nan,  nan],
       [ 0.,  1.,  2.],
       [ nan,  nan, nan],
       [ nan, nan,  nan],
       [nan,  nan,  nan],
       [ 4.,  5.,  6.]])

since min_periods should be equal to the window size by default?

@dcherian
Copy link
Contributor

construct does not apply the mask. We should clarify the docs to say so.

@lukasbrunner
Copy link

From a user perspective I'd prefer if it does as I can still set min_periods=1 if I don't want that or do I overlook something? Its quite cumbersome to do it afterwards.

@dcherian
Copy link
Contributor

Now I remember why we can't do this. The output of .construct is a higher-dimensional and much bigger array than the input. This is cheap with numpy because it's a view. If we apply the mask, it will copy this big array and use up a much larger amount of memory. It's a lot better to run the operation you need and filter the end result, which is what we do for internal ops.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants