Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apply_ufunc gives wrong dtype with dask=parallelized and vectorized=True #4015

Closed
ulijh opened this issue Apr 29, 2020 · 2 comments · Fixed by #4060
Closed

apply_ufunc gives wrong dtype with dask=parallelized and vectorized=True #4015

ulijh opened this issue Apr 29, 2020 · 2 comments · Fixed by #4060
Labels

Comments

@ulijh
Copy link
Contributor

ulijh commented Apr 29, 2020

Applying a function to a data array with dtype = complex returns one with dtype = float. It seems to work before commit 17b70ca.

MCVE Code Sample

import numpy as np
import xarray as xr

def func(x):
    return np.sum(x ** 2)

da = xr.DataArray(np.arange(2*3*4).reshape(2,3,4))
da = da + 1j * da
da = da.chunk(dict(dim_1=1))

da2 = xr.apply_ufunc(
    func,
    da,
    vectorize=True,
    dask="parallelized",
    output_dtypes=[da.dtype],
)

assert da2.dtype == da.dtype, "wrong dtype"

Expected Output

da and da2 should both have the same dtype=complex.

Problem Description

To me it seems to me that the kwarg meta=None somehow causes dask to allocate a float array and ignore the dtype kwargs (which seems to be carried through correctly down to dask.array.blockwise.blockwise()) . I'm not familiar with the apply_ufing and the dask code, so I can't tell on which end the bug sits.

Versions

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.2 (default, Apr 8 2020, 14:31:25)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.6.5-arch3-1
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: de_DE.utf8
LOCALE: de_DE.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.3

xarray: 0.15.2.dev47+g33a66d63
pandas: 1.0.3
numpy: 1.18.3
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.7.4
h5py: 2.10.0
Nio: None
zarr: None
cftime: 1.1.1.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2.12.0
distributed: 2.14.0
matplotlib: 3.2.1
cartopy: 0.17.0
seaborn: 0.10.0
numbagg: None
pint: None
setuptools: 46.1.3
pip: 20.0.2
conda: None
pytest: 5.4.1
IPython: 7.13.0
sphinx: 3.0.2

@mathause
Copy link
Collaborator

mathause commented May 1, 2020

It works when you set vectorize=False

da2 = xr.apply_ufunc( 
    func, 
    da, 
    vectorize=False, 
    dask="parallelized", 
    output_dtypes=[da.dtype], 
     
)                                                                                                                                                                                       
assert da2.dtype == da.dtype, "wrong dtype"        

or when you pass your own meta:

da2 = xr.apply_ufunc( 
    func, 
    da, 
    vectorize=True, 
    dask="parallelized", 
    output_dtypes=[da.dtype], 
    meta=da 
) 

assert da2.dtype == da.dtype, "wrong dtype"

This also goes wrong if the DataArray has another dtype, e.g. int:

da = xr.DataArray(np.arange(2*3*4).reshape(2,3,4)
da = da.chunk(dict(dim_1=1))
da2 = xr.apply_ufunc(
    func,
    da,
    vectorize=True,
    dask="parallelized",
    output_dtypes=[da.dtype],
)

assert da2.dtype == da.dtype, "wrong dtype"

Indeed, the dtype of meta takes precedence over the dtype (https://github.com/dask/dask/blob/25005e19cc30e8b2877d4dadbaef378ee912bdc0/dask/array/core.py#L1022):

meta : empty ndarray
    empty ndarray created with same NumPy backend, ndim and dtype as the
    Dask Array being created (overrides dtype)

@dcherian
Copy link
Contributor

dcherian commented May 1, 2020

@mathuse is right.

The solution is to use dtype when we create meta for vectorized functions here:

if meta is None:
# set meta=np.ndarray by default for numpy vectorized functions
# work around dask bug computing meta with vectorized functions: GH5642
meta = np.ndarray

@ulijh or @mathause , are either of you up for sending in a PR?

For now the workaround is to pass your own meta = np.ndarray((0,0), dtype=da.dtype)

@dcherian dcherian added the bug label May 1, 2020
kmuehlbauer added a commit to kmuehlbauer/xarray that referenced this issue May 20, 2020
mathause added a commit that referenced this issue Aug 19, 2020
* ENH: use `dask.array.apply_gufunc` in `xr.apply_ufunc` for multiple outputs when `dask='parallelized'`, add/fix tests

* DOC: Update docstring and whats-new.rst

* WIP: apply_gufunc

* WIP: apply_gufunc -> reinstate dask='allowed' as per @mathause, adapting tests

* WIP: apply_gufunc -> add test for GH #4015, fix test for sparse meta checking

* WIP: apply_gufunc -> remove unused `input_dims`

* Update xarray/core/computation.py

Co-authored-by: Mathias Hauser <[email protected]>

* Update xarray/core/computation.py

Co-authored-by: Mathias Hauser <[email protected]>

* Update xarray/core/computation.py

Co-authored-by: Mathias Hauser <[email protected]>

* WIP: use dask_gufunc_kwargs, keep vectorize first but only for non-dask-gufunc, rework docstrings, adapt tests

* DOC: add reference to internal changes in whats-new.rst

* FIX: mypy

* FIX: vectorize inside `apply_variable_ufunc`

* TST: add tests from #4022 from @mathause

* FIX: address black issue

* FIX: xfail test for dask < 2.3

* WIP: apply changes in response to @mathause's review comments

* WIP: remove line

* WIP: catch different chunksize error and allow_rechunk, docstring fixes

* WIP: remove comment

* WIP: style issues

* WIP: revert catch, revert test, add tests without output_dtypes

* WIP: fix signature in apply_ufunc->apply_gufunc, handle output_sizes, handle dask version, fix tests

* WIP: fix tuple

* WIP: add dims_map to _UFuncSignature, adapt output_sizes to fit for apply_gufunc

* WIP: black

* WIP: raise ValueError if output_sizes dimension mismatch

* WIP: raise ValueError if output_sizes is missing for given output_core_dims

* WIP: simplify if/else

* FIX: resolve conflicts prior merge with master

* FIX: combine if's as per review

* FIX: pass `vectorize` and `output_dtypes` kwargs explicitely into `apply_variable_ufunc` as per review suggestion

* FIX: pass `vectorize` and `output_dtypes` kwargs explicitely into `da.apply_gufunc`

* FIX: address review comments of @keewis and @mathause

* FIX: black

* FIX: `vectorize` not needed in if-clause

* FIX: set DeprecationWarning and stacklevel=2

* FIX: use FutureWarning for user visibility

* FIX: remove comment as suggested

Co-authored-by: Deepak Cherian <[email protected]>

Co-authored-by: Mathias Hauser <[email protected]>
Co-authored-by: Deepak Cherian <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants