Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error using reduce(): percentile() got an unexpected keyword argument 'axis' #3701

Closed
mikebyrne6 opened this issue Jan 17, 2020 · 12 comments
Closed

Comments

@mikebyrne6
Copy link

MCVE Code Sample

import os
import numpy as np
import pandas as pd
import xarray as xr
import gcsfs

# Search for available CMIP6 data:
df = pd.read_csv('https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores.csv')

# and daily 'tas' data:
df_daily_tas  = df[(df.table_id == 'day') & (df.variable_id == 'tas')]

# get data for a single model and experiment:
source_id = 'GFDL-CM4'
expt_id = 'historical'

# define a func to load 'tas' data
def load_tas_data(source_id, expt_id):
    """
    Load daily tas data for given source and expt ids
    """
    uri = df_daily_tas[(df_daily_tas.source_id == source_id) &
                         (df_daily_tas.experiment_id == expt_id)].zstore.values[0]
    
    gcs = gcsfs.GCSFileSystem(token='anon')
    ds = xr.open_zarr(gcs.get_mapper(uri), consolidated=True)
    return ds

# Specify the percentile to look at:
percentile = 99

# Load the data:
ds = load_tas_data(source_id, expt_id).sel(time=slice('2000', '2000'))

# Get the percentiles at each lat/lon:
percentiles = ds.reduce(np.percentile, q=percentile, dim='time')

Code above gives following error:
TypeError: percentile() got an unexpected keyword argument 'axis'

Expected Output

Expected code to return percentiles of daily temperature for each latitude and longitude.

Problem Description

Code works fine on a different system with the following modules installed:

INSTALLED VERSIONS ------------------ commit: None python: 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 12:04:33) [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] python-bits: 64 OS: Darwin OS-release: 14.5.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.1 libnetcdf: 4.4.1.1

xarray: 0.14.1
pandas: 0.25.3
numpy: 1.16.2
scipy: 1.2.1
netCDF4: 1.3.1
pydap: installed
h5netcdf: None
h5py: 2.7.1
Nio: None
zarr: 2.3.2
cftime: 1.0.4.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 0.16.1
distributed: 1.20.2
matplotlib: 2.2.2
cartopy: 0.16.0
seaborn: 0.8.1
numbagg: None
setuptools: 38.4.0
pip: 9.0.1
conda: 4.4.10
pytest: 3.3.2
IPython: 6.2.1
sphinx: 1.6.6

Output of xr.show_versions()

Issue occurs on system with the following installed modules:

INSTALLED VERSIONS ------------------ commit: None python: 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 22:33:48) [GCC 7.3.0] python-bits: 64 OS: Linux OS-release: 4.19.76+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.5 libnetcdf: 4.6.2

xarray: 0.14.1
pandas: 0.25.3
numpy: 1.17.3
scipy: 1.3.2
netCDF4: 1.5.1.2
pydap: installed
h5netcdf: 0.7.4
h5py: 2.10.0
Nio: None
zarr: 2.3.2
cftime: 1.0.4.2
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.0.25
cfgrib: None
iris: 2.2.0
bottleneck: 1.3.1
dask: 2.9.0
distributed: 2.9.0
matplotlib: 3.1.2
cartopy: 0.17.0
seaborn: 0.9.0
numbagg: None
setuptools: 44.0.0.post20200102
pip: 19.3.1
conda: None
pytest: 5.3.1
IPython: 7.11.1
sphinx: None

@keewis
Copy link
Collaborator

keewis commented Jan 17, 2020

the reason for this is that the percentile implementation of dask arrays does not support axis. I think you can achieve the same thing by either using compute() (which is probably not desirable) or by using quantile:

In [13]: # Get the percentiles at each lat/lon: 
    ...: ds = ds.compute() 
    ...: percentiles = ds.reduce(np.percentile, q=percentile, dim='time') 
    ...: quantiles = ds.quantile(q=percentile / 100, dim="time") 
    ...: xr.testing.assert_identical(percentiles.drop_vars("height"), quantiles.drop_vars("quantile"))

the only difference is that the unused coordinate height disappears and a quantile dimension is added.

@mikebyrne6
Copy link
Author

Thanks for that but ideally I wouldn't use compute(). The code worked using a previous version of dask (0.16.1) so I'm hoping there might be an easy fix for dask 2.9.0...

@keewis
Copy link
Collaborator

keewis commented Jan 17, 2020

you should be able to just use quantile. I called compute to demonstrate it returns the same values as np.percentile, but you don't need it with quantile.

@mikebyrne6
Copy link
Author

Seems like I do need it:

percentiles = ds.quantile(q=percentile / 100, dim="time")

returns

TypeError: quantile does not work for arrays stored as dask arrays. Load the data via .compute() or .load() prior to calling this method.

@keewis
Copy link
Collaborator

keewis commented Jan 17, 2020

sorry, my mistake. I've been testing on master which has #3559 merged and thus works with dask arrays. Unless someone has a better idea, I'd suggest copying that approach for now: using apply_ufunc to apply np.nanpercentile. Ideally, that would change the quantile call to

percentiles = quantile(ds, q=percentile / 100, dim="time")

I'll try to write a working quantile function.
Edit: see below, with that you should be able to use

ds.quantile(...)

@keewis
Copy link
Collaborator

keewis commented Jan 17, 2020

This

def quantile(self, q, dim=None, interpolation="linear", keep_attrs=None): 
    from xarray.core.computation import apply_ufunc 
    from xarray.core import utils 

    scalar = utils.is_scalar(q) 
    q = np.atleast_1d(np.asarray(q, dtype=np.float64)) 
    if dim is None: 
        dim = self.dims 

    if utils.is_scalar(dim): 
        dim = [dim] 

    def _wrapper(npa, **kwargs): 
        # move quantile axis to end. required for apply_ufunc 
        return np.moveaxis(np.nanpercentile(npa, **kwargs), 0, -1) 

    axis = np.arange(-1, -1 * len(dim) - 1, -1) 
    result = apply_ufunc( 
        _wrapper, 
        self, 
        input_core_dims=[dim], 
        exclude_dims=set(dim), 
        output_core_dims=[["quantile"]], 
        output_dtypes=[np.float64], 
        output_sizes={"quantile": len(q)}, 
        dask="parallelized", 
        kwargs={"q": q * 100, "axis": axis, "interpolation": interpolation}, 
    ) 

    # for backward compatibility 
    result = result.transpose("quantile", ...) 
    if scalar: 
        result = result.squeeze("quantile") 
    if keep_attrs: 
        result.attrs = self._attrs 
 
    return result 

xr.Variable.quantile = quantile 

seems to make quantile work, but it might be a bad idea to monkeypatch. Thoughts, @dcherian?

@mikebyrne6
Copy link
Author

Thanks a lot for this!

@dcherian
Copy link
Contributor

seems to make quantile work, but it might be a bad idea to monkeypatch. T

Personally I would add this quantile function and then do quantile(ds, q=0.99, dim="SOMETHING"). We should issue a release soon though.

@keewis
Copy link
Collaborator

keewis commented Jan 17, 2020

that would work if we would be working with variables, but Dataset.quantile does a lot more that I didn't want to copy.

We should issue a release soon though.

I agree

@mikebyrne6
Copy link
Author

mikebyrne6 commented Jan 17, 2020

keewis' quantile function works fine on 1 year of data (as in the MVCE code sample) but does not like working on 10 years of data:

ValueError: dimension 'time' on 0th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. To fix, rechunk into a single dask array chunk along this dimension, i.e., ``.chunk({'time': -1})``, but beware that this may significantly increase memory usage.

@dcherian
Copy link
Contributor

yes. you'll need to have 1 chunk along the dimension being reduced (here 'time').

The reason is that there is no approximate quantile algorithm implemented in dask. So this implementation is just calling np.nanpercentile on each block of the array. If you're reducing along time, each block must have the entire time axis for this to work.

@keewis
Copy link
Collaborator

keewis commented Jan 17, 2020

here's the dask issue about np.percentile's missing axis parameter: dask/dask#2824

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants