apply_ufunc gives wrong dtype with dask=parallelized and vectorized=True #4015

ulijh · 2020-04-29T11:17:48Z

Applying a function to a data array with dtype = complex returns one with dtype = float. It seems to work before commit 17b70ca.

MCVE Code Sample

import numpy as np
import xarray as xr

def func(x):
    return np.sum(x ** 2)

da = xr.DataArray(np.arange(2*3*4).reshape(2,3,4))
da = da + 1j * da
da = da.chunk(dict(dim_1=1))

da2 = xr.apply_ufunc(
    func,
    da,
    vectorize=True,
    dask="parallelized",
    output_dtypes=[da.dtype],
)

assert da2.dtype == da.dtype, "wrong dtype"

Expected Output

da and da2 should both have the same dtype=complex.

Problem Description

To me it seems to me that the kwarg meta=None somehow causes dask to allocate a float array and ignore the dtype kwargs (which seems to be carried through correctly down to dask.array.blockwise.blockwise()) . I'm not familiar with the apply_ufing and the dask code, so I can't tell on which end the bug sits.

Versions

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.2 (default, Apr 8 2020, 14:31:25)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.6.5-arch3-1
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: de_DE.utf8
LOCALE: de_DE.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.3

xarray: 0.15.2.dev47+g33a66d63
pandas: 1.0.3
numpy: 1.18.3
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.7.4
h5py: 2.10.0
Nio: None
zarr: None
cftime: 1.1.1.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2.12.0
distributed: 2.14.0
matplotlib: 3.2.1
cartopy: 0.17.0
seaborn: 0.10.0
numbagg: None
pint: None
setuptools: 46.1.3
pip: 20.0.2
conda: None
pytest: 5.4.1
IPython: 7.13.0
sphinx: 3.0.2

The text was updated successfully, but these errors were encountered:

mathause · 2020-05-01T12:49:42Z

It works when you set vectorize=False

da2 = xr.apply_ufunc( 
    func, 
    da, 
    vectorize=False, 
    dask="parallelized", 
    output_dtypes=[da.dtype], 
     
)                                                                                                                                                                                       
assert da2.dtype == da.dtype, "wrong dtype"

or when you pass your own meta:

da2 = xr.apply_ufunc( 
    func, 
    da, 
    vectorize=True, 
    dask="parallelized", 
    output_dtypes=[da.dtype], 
    meta=da 
) 

assert da2.dtype == da.dtype, "wrong dtype"

This also goes wrong if the DataArray has another dtype, e.g. int:

da = xr.DataArray(np.arange(2*3*4).reshape(2,3,4)
da = da.chunk(dict(dim_1=1))
da2 = xr.apply_ufunc(
    func,
    da,
    vectorize=True,
    dask="parallelized",
    output_dtypes=[da.dtype],
)

assert da2.dtype == da.dtype, "wrong dtype"

Indeed, the dtype of meta takes precedence over the dtype (https://github.com/dask/dask/blob/25005e19cc30e8b2877d4dadbaef378ee912bdc0/dask/array/core.py#L1022):

meta : empty ndarray
    empty ndarray created with same NumPy backend, ndim and dtype as the
    Dask Array being created (overrides dtype)

dcherian · 2020-05-01T22:47:36Z

@mathuse is right.

The solution is to use dtype when we create meta for vectorized functions here:

xarray/xarray/core/computation.py

Lines 1008 to 1011 in 3820fb7

    
           if meta is None: 
        
               # set meta=np.ndarray by default for numpy vectorized functions 
        
               # work around dask bug computing meta with vectorized functions: GH5642 
        
               meta = np.ndarray

@ulijh or @mathause , are either of you up for sending in a PR?

For now the workaround is to pass your own meta = np.ndarray((0,0), dtype=da.dtype)

… meta checking

@mathause

* ENH: use `dask.array.apply_gufunc` in `xr.apply_ufunc` for multiple outputs when `dask='parallelized'`, add/fix tests * DOC: Update docstring and whats-new.rst * WIP: apply_gufunc * WIP: apply_gufunc -> reinstate dask='allowed' as per @mathause, adapting tests * WIP: apply_gufunc -> add test for GH #4015, fix test for sparse meta checking * WIP: apply_gufunc -> remove unused `input_dims` * Update xarray/core/computation.py Co-authored-by: Mathias Hauser <[email protected]> * Update xarray/core/computation.py Co-authored-by: Mathias Hauser <[email protected]> * Update xarray/core/computation.py Co-authored-by: Mathias Hauser <[email protected]> * WIP: use dask_gufunc_kwargs, keep vectorize first but only for non-dask-gufunc, rework docstrings, adapt tests * DOC: add reference to internal changes in whats-new.rst * FIX: mypy * FIX: vectorize inside `apply_variable_ufunc` * TST: add tests from #4022 from @mathause * FIX: address black issue * FIX: xfail test for dask < 2.3 * WIP: apply changes in response to @mathause's review comments * WIP: remove line * WIP: catch different chunksize error and allow_rechunk, docstring fixes * WIP: remove comment * WIP: style issues * WIP: revert catch, revert test, add tests without output_dtypes * WIP: fix signature in apply_ufunc->apply_gufunc, handle output_sizes, handle dask version, fix tests * WIP: fix tuple * WIP: add dims_map to _UFuncSignature, adapt output_sizes to fit for apply_gufunc * WIP: black * WIP: raise ValueError if output_sizes dimension mismatch * WIP: raise ValueError if output_sizes is missing for given output_core_dims * WIP: simplify if/else * FIX: resolve conflicts prior merge with master * FIX: combine if's as per review * FIX: pass `vectorize` and `output_dtypes` kwargs explicitely into `apply_variable_ufunc` as per review suggestion * FIX: pass `vectorize` and `output_dtypes` kwargs explicitely into `da.apply_gufunc` * FIX: address review comments of @keewis and @mathause * FIX: black * FIX: `vectorize` not needed in if-clause * FIX: set DeprecationWarning and stacklevel=2 * FIX: use FutureWarning for user visibility * FIX: remove comment as suggested Co-authored-by: Deepak Cherian <[email protected]> Co-authored-by: Mathias Hauser <[email protected]> Co-authored-by: Deepak Cherian <[email protected]>

dcherian added the bug label May 1, 2020

mathause mentioned this issue May 3, 2020

Fix/apply ufunc meta dtype #4022

Closed

4 tasks

mathause mentioned this issue May 15, 2020

ENH: use dask.array.apply_gufunc in xr.apply_ufunc #4060

Merged

7 tasks

kmuehlbauer added a commit to kmuehlbauer/xarray that referenced this issue May 20, 2020

WIP: apply_gufunc -> add test for GH pydata#4015, fix test for sparse…

a17ca32

… meta checking

mathause closed this as completed in #4060 Aug 19, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

apply_ufunc gives wrong dtype with dask=parallelized and vectorized=True #4015

apply_ufunc gives wrong dtype with dask=parallelized and vectorized=True #4015

ulijh commented Apr 29, 2020

INSTALLED VERSIONS

mathause commented May 1, 2020

dcherian commented May 1, 2020

apply_ufunc gives wrong dtype with dask=parallelized and vectorized=True #4015

apply_ufunc gives wrong dtype with dask=parallelized and vectorized=True #4015

Comments

ulijh commented Apr 29, 2020

MCVE Code Sample

Expected Output

Problem Description

Versions

INSTALLED VERSIONS

mathause commented May 1, 2020

dcherian commented May 1, 2020