ENH: use `dask.array.apply_gufunc` in `xr.apply_ufunc` #4060

kmuehlbauer · 2020-05-14T09:38:15Z

use dask.array.apply_gufunc in xr.apply_ufunc for multiple outputs when dask='parallelized', add/fix tests

Closes apply_ufunc(dask='parallelized') with multiple outputs #1815, closes apply_ufunc gives wrong dtype with dask=parallelized and vectorized=True #4015
Tests added
Passes isort -rc . && black . && mypy . && flake8
Fully documented, including whats-new.rst for all changes and api.rst for new API

Remaining Issues:

fitting name for current dask_gufunc_kwargs
rephrase dask docs to fit new behaviour
combine output_core_dims and output_sizes, e.g. xr.apply_ufunc(..., output_core_dims=[{"abc": 2}])
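For orientation, the multiple-output case this PR targets can be sketched roughly as follows. This is only an illustration, assuming an xarray version that includes this change and an installed dask; the function `mean_and_std` and the dimension names are made up for the example.

```python
import numpy as np
import xarray as xr


def mean_and_std(arr):
    # apply_ufunc moves the core dimension ("time") to the last axis.
    return arr.mean(axis=-1), arr.std(axis=-1)


# Dask-backed input; "time" stays in a single chunk.
data = xr.DataArray(
    np.arange(12.0).reshape(3, 4), dims=("x", "time")
).chunk({"x": 1})

mean, std = xr.apply_ufunc(
    mean_and_std,
    data,
    input_core_dims=[["time"]],
    output_core_dims=[[], []],  # two outputs, neither keeps a core dimension
    dask="parallelized",
    output_dtypes=[float, float],
)
```

Both results stay lazy until `.compute()` or `.values` is requested.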

kmuehlbauer · 2020-05-14T10:07:17Z

This would need some docstring changes too, but I first want to check whether I've missed anything vital in the implementation.

kmuehlbauer · 2020-05-14T14:57:23Z

This is ready for review from my side.

mathause · 2020-05-15T13:02:03Z

It might be good to add a test with a reduction and one with vectorize=True.
Would it be possible to replace the call to dask.array.blockwise (for one output variable) with dask.array.apply_gufunc? Do you know why blockwise is used further below and not dask.array.apply_gufunc? I assume it's due to historical reasons but I am not sure.
dask.array.apply_gufunc does all sorts of stuff - e.g. infer meta. This could potentially solve apply_ufunc gives wrong dtype with dask=parallelized and vectorized=True #4015 (pull Fix/apply ufunc meta dtype #4022) and simplify the call signature of apply_ufunc?

https://github.com/dask/dask/blob/3573b2ddca81aeb41a7def6dd4194020f853ab18/dask/array/gufunc.py#L175
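A minimal sketch of what calling `dask.array.apply_gufunc` directly for a single output looks like, with the output dtype left for dask to infer. The function `row_norm` is invented for illustration, and the dtype inference shown assumes a reasonably recent dask.

```python
import dask.array as da
import numpy as np


def row_norm(block):
    # Reduce over the core dimension, which sits on the last axis.
    return np.sqrt((block ** 2).sum(axis=-1))


x = da.ones((4, 3), chunks=(2, 3), dtype="float32")

# No output_dtypes given: dask works out the result dtype on its own.
norm = da.apply_gufunc(row_norm, "(n)->()", x)
```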

kmuehlbauer · 2020-05-15T13:15:19Z

Thanks @mathause for your comments and raising those questions. JFTR, I was taking the road from #1815, so my explicit use-case was the multiple (dask) outputs.

It might be good to add a test with a reduction and one with vectorize=True.

I'll try to add some tests for the multiple output using dask.

Would it be possible to replace the call to dask.array.blockwise (for one output variable) with dask.array.apply_gufunc? Do you know why blockwise is used further below and not dask.array.apply_gufunc? I assume it's due to historical reasons but I am not sure.

AFAIK, apply_gufunc wasn't available at the time these functions were introduced. Good chance, that apply_gufunc can be used for handling single output dask too.

dask.array.apply_gufunc does all sorts of stuff - e.g. infer meta. This could potentially solve apply_ufunc gives wrong dtype with dask=parallelized and vectorized=True #4015 (pull Fix/apply ufunc meta dtype #4022) and simplify the call signature of apply_ufunc?

That's a good question. If you want me to go the long way, please be aware, that I'm a novice in xarray as well as in dask. A complete refactor of apply_ufunc would be quite some challenge.

mathause · 2020-05-15T14:36:56Z

Ah yes I see (#1815 (comment)). dask.array.apply_gufunc should also be able to handle one output only.

A complete refactor of apply_ufunc would be quite some challenge.

Indeed - I think it could simplify _apply_blockwise (and might make the meta keyword obsolete) but it would be good if someone with more experience of dask could weigh in.

@dcherian @shoyer

shoyer · 2020-05-15T16:56:04Z

Would it be possible to replace the call to dask.array.blockwise (for one output variable) with dask.array.apply_gufunc? Do you know why blockwise is used further below and not dask.array.apply_gufunc? I assume it's due to historical reasons but I am not sure.

AFAIK, apply_gufunc wasn't available at the time these functions were introduced. Good chance, that apply_gufunc can be used for handling single output dask too.

Exactly. It would be nice to remove the use of blockwise entirely in favor of apply_gufunc.

kmuehlbauer · 2020-05-19T14:03:42Z

I've given this a try, but this will need some design decisions.

currently vectorize is handled in any case if requested, before falling through the if/else.
dask.array.apply_gufunc takes vectorize as parameter, so for dask we do not need to apply vectorization. We would need to apply vectorize only for non-dask cases (maybe just before calling the final function).
currently dask is only handled, if dask keyword is issued ('allowed' and parallelized). From my perspective the dask keyword is not needed any more. We could just divert to the apply_gufunc when dask backed arrays are detected.

This will have a big impact on the code and tests. I'll come up with an updated pull request shortly, but any thoughts or remarks are very much appreciated.

mathause · 2020-05-19T14:13:38Z

Thanks! That would be quite cool!

dask.array.apply_gufunc does some fancy stuff with np.vectorize (to determine the output_dtypes) so I would not vectorize the function ourselves. For the other I don't have a qualified opinion.
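For readers unfamiliar with the signature-based form of `np.vectorize` mentioned here, a small NumPy-only sketch; `mean_and_std` is a made-up example function.

```python
import numpy as np


def mean_and_std(row):
    # Written for a single 1-D row; knows nothing about batching.
    return row.mean(), row.std()


# The gufunc signature says: one core dimension in, two scalars out.
vectorized = np.vectorize(mean_and_std, signature="(n)->(),()")

data = np.arange(12.0).reshape(3, 4)
mean, std = vectorized(data)  # loops over the leading axis
```

`dask.array.apply_gufunc` accepts the same kind of signature string, which is why vectorizing a second time on the xarray side would be redundant.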

pep8speaks · 2020-05-20T07:28:08Z

Hello @kmuehlbauer! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-19 05:41:15 UTC

kmuehlbauer · 2020-05-20T07:40:24Z

@mathause @shoyer

First attempt at using dask.array.apply_gufunc in xr.apply_ufunc. I've added a list of problems to the topmost comment of this PR, so as not to lose track of them. Please extend that list if needed.

Most problematic issue now: xr.dot doesn't work well with apply_gufunc with regard to core dimensions and chunking.

kmuehlbauer · 2020-05-20T07:48:01Z

From looking at the tests, it seems that setting allow_rechunk=True would solve many issues. I've no idea about the implications on memory usage (docstring: "Warning: enabling this can increase memory usage significantly"). Might apply_gufunc not be suitable for processing of eg. dot?
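The chunking restriction behind this question can be reproduced with dask alone. The following is a sketch, assuming current `dask.array.apply_gufunc` behaviour; `row_sum` is a hypothetical helper.

```python
import dask.array as da


def row_sum(block):
    return block.sum(axis=-1)


# The core dimension (last axis, length 6) is split into two chunks of 3.
x = da.ones((4, 6), chunks=(2, 3))

try:
    da.apply_gufunc(row_sum, "(n)->()", x, output_dtypes=float)
    raised = False
except ValueError:
    # dask refuses core dimensions that span several chunks by default.
    raised = True

# allow_rechunk=True lifts that restriction, at the cost of more memory.
total = da.apply_gufunc(
    row_sum, "(n)->()", x, output_dtypes=float, allow_rechunk=True
)
```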

mathause · 2020-05-20T09:20:54Z

Should you keep the dask="allowed" branch? That might solve some of the issues (instead of using allow_rechunk=True). For example dask has its own einsum implementation (used in xr.dot) so it may not be necessary to pipe that through dask.array.apply_gufunc (from the docstring: this function is like np.vectorize, but for the blocks of dask arrays). However, I am only speculating.

kmuehlbauer · 2020-05-20T09:50:16Z

@mathause Great! Seems like this works better, thanks! Will update the PR after some more tests etc.

kmuehlbauer · 2020-05-20T12:40:24Z

@mathause Only one breaking test which is connected to the meta stuff you handle in #4022. Any suggestions on that topic? I've removed meta completely since it was only needed in dask.array.blockwise but not in dask.array.apply_gufunc.

kmuehlbauer · 2020-05-20T14:02:07Z

@mathause All tests green, good starting point for review. Please notify other people who should have a look at this.

There are still things to discuss:

keywords of dask.array.apply_gufunc
how to properly handle deprecation (e.g. meta)
docstring
adding/revision of tests

shoyer · 2020-05-20T19:50:23Z

The original motivation for requiring dask='allowed' is that I was concerned that users would put a function that coerces its arguments into NumPy arrays into apply_ufunc (e.g., like many functions from SciPy), which could have surprisingly bad performance when called on dask arrays due to automatic coercion.

Maybe this is too defensive/surprising, and could be relaxed. We don't really have any guard-rails like this elsewhere in xarray.

mathause · 2020-05-20T20:12:25Z

Maybe this is too defensive/surprising, and could be relaxed.

You would remove the dask="forbidden" branch and not the dask="parallelized"?

For the functions that don't handle dask arrays gracefully, dask="parallelized" would be the better option?

Very cool - good progress.

I guess you'll have to properly deprecate meta, something along the lines of: "meta is no longer necessary and has no effect. It will be removed in a future version."
I think it would be good to pass allow_rechunk.

I'll only be able to look at it properly next week.

shoyer · 2020-05-20T20:16:09Z

Maybe this is too defensive/surprising, and could be relaxed.

You would remove the dask="forbidden" branch and not the dask="parallelized"?

For the functions that don't handle dask arrays gracefully, dask="parallelized" would be the better option?

This is probably another good motivation: defaulting to dask='forbidden' forces users to make an explicit choice about whether or not use dask='parallelized'.

The problem is that we don't have any way to detect ahead of time whether the applied function already supports dask arrays (e.g., if it is built-up out of functions from dask.array). If it does, we don't want to set dask='parallelized' but rather let the function handle dask arrays itself.

dcherian · 2020-05-20T20:24:32Z

If it does, we don't want to set dask='parallelized' but rather let the function handle dask arrays itself.

I think we still need all the current options for the dask kwarg. There can be disastrous consequences so it's good to make users explicitly choose the behaviour they want.

howto properly handle deprecation (eg. meta)

I don't think we should deprecate meta. Not all user functions can deal with zero shaped inputs, so automatically inferring meta need not always work. We've had to add a similar feature for map_blocks (#3575) so I think meta should stay.

keywords of dask.array.apply_gufunc

Shall we add a new dask_gufunc_kwargs and pass that down to apply_gufunc?

mathause · 2020-05-20T21:13:03Z

I don't see meta listed in the docs, which is why I thought it was not needed. But if it is handled in dask.array.apply_gufunc it can of course stay.

I only realised the exact distinction between "allowed" and "parallelized" today - i.e. that "parallelized" is kind of the dask equivalent of np.vectorize. I can suggest something for the docstring (e.g. prefer "allowed" if ``func`` natively handles dask arrays or so)
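The distinction between the three `dask` settings can be sketched as below. This is an illustration only, assuming xarray with dask installed; `numpy_only` is a made-up function standing in for code that coerces its input to NumPy.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.arange(6.0), dims="x").chunk({"x": 3})

# Default dask="forbidden": dask-backed input is rejected outright.
try:
    xr.apply_ufunc(np.sqrt, arr)
    forbidden_raised = False
except ValueError:
    forbidden_raised = True

# dask="allowed": the function itself receives the dask array.
# np.sqrt dispatches to dask, so the result stays lazy.
native = xr.apply_ufunc(np.sqrt, arr, dask="allowed")


def numpy_only(block):
    # Coerces to NumPy, so it must only ever see in-memory blocks.
    return np.sqrt(np.asarray(block))


# dask="parallelized": xarray applies the function block by block.
wrapped = xr.apply_ufunc(
    numpy_only, arr, dask="parallelized", output_dtypes=[float]
)
```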

dcherian · 2020-05-20T23:11:41Z

good point @mathause. Looks like apply_gufunc tries to make blockwise infer meta and then does its own thing when that fails:
https://github.com/dask/dask/blob/3b92efff2e779f59e95e05af9b8f371d56227d02/dask/array/gufunc.py#L417-L440 . I don't understand how this can work in all possible cases.

kmuehlbauer · 2020-05-22T07:41:49Z

I'll only be able to look at it properly next week.

@mathause I'll leave the PR unchanged and catch up with you next week.

@shoyer @dcherian Thanks for your comments. Please let me know, which tests should be added to check for any possible surprises with this change to apply_gufunc.

The problem is that we don't have any way to detect ahead of time whether the applied function already supports dask arrays (e.g., if it is built-up out of functions from dask.array). If it does, we don't want to set dask='parallelized' but rather let the function handle dask arrays itself.

(Att: no native english speaker here, so bear with me, if something sounds clunky or not exactly matching)
Then we would have to keep the dask='forbidden' as default, as well as parallelized and allowed to force the decision to the user. Maybe the keyword settings itself could be a bit more clear. In the allowed-case the function in question has to natively support dask-arrays. So I would use dask='native' in that case. For the parallelized-case this PR proposes to use dask.array.apply_gufunc (generalized ufunc). So either we stick to parallelized or we try to find a better fitting name (eg. dask='gufunc').

For the keywords I think @dcherian 's proposal of something like dask_gufunc_kwargs (or gufunc_kwargs) is useful (would match with dask='gufunc'), although only two keywords seem to be worth feeding through (keepdims, allow_rechunk).
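As a rough sketch of how such a pass-through keyword can be used, here is the `dask_gufunc_kwargs` spelling that appears later in this thread, carrying `output_sizes` for a new output dimension. It assumes an xarray version that accepts `dask_gufunc_kwargs`; `min_max` and the dimension name `stat` are invented for the example.

```python
import numpy as np
import xarray as xr


def min_max(arr):
    # Collapse "time" (last axis) into a new length-2 axis: [min, max].
    return np.stack([arr.min(axis=-1), arr.max(axis=-1)], axis=-1)


data = xr.DataArray(
    np.arange(12.0).reshape(3, 4), dims=("x", "time")
).chunk({"x": 1})

result = xr.apply_ufunc(
    min_max,
    data,
    input_core_dims=[["time"]],
    output_core_dims=[["stat"]],
    dask="parallelized",
    output_dtypes=[float],
    # Forwarded to dask; the new dimension's size cannot be inferred lazily.
    dask_gufunc_kwargs={"output_sizes": {"stat": 2}},
)
```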

kmuehlbauer · 2020-08-17T10:35:16Z

@keewis @mathause Thanks for the review. I've added a checklist in the first post with "open issues" with this PR, which might be solved in a follow up PR. Would be good to know which need to go in here, so I can add this.

keewis · 2020-08-17T10:40:42Z

I think all of these can be done in a new PR, we just have to make sure to include them in the next release (which might need to be soon so we regain compatibility with the most recent pandas).

kmuehlbauer · 2020-08-17T10:43:26Z

Great, then it looks like it's finally done. 😃

kmuehlbauer · 2020-08-17T11:21:38Z

While having a last review I've found another small glitch. I'll come back the next days to see, if anything needs to be done from reviewers side.

kmuehlbauer · 2020-08-18T05:55:06Z

@mathause I've merged latest master into this PR to hopefully get all tests green. The former had some problems with a conda error in MinimumVersions job.

Please let me know, if there is anything for me to do, to get this merged.

shoyer

Thanks for patience here with the slow reviews. Looking this over, I have a suggestion for how to improve the warnings, but otherwise this looks good!

shoyer · 2020-08-18T06:40:14Z

xarray/core/computation.py

+            warnings.warn(
+                "``meta`` should be given in the ``dask_gufunc_kwargs`` parameter."
+                " It will be removed as direct parameter in a future version."
+            )


Could you please set a class (DeprecationWarning) and stacklevel=2 on these warnings? That results in better messages for users.

Sorry to nitpick - shouldn't that be a FutureWarning so that users actually get to see it?

@mathause At least in the tests the warnings are issued.

What's the actual difference between DeprecationWarning and FutureWarning (update: just found PendingDeprecationWarning)? And when should they be used? Just to know for future contributions.

FutureWarning would be fine, too. We should probably try to come to consensus on a general policy for xarray.

The Python docs have some guidance but the overall recommendation is not really clear to me: https://docs.python.org/3/library/warnings.html#warning-categories

FutureWarning is for users and DeprecationWarning for library authors (https://docs.python.org/3/library/warnings.html#warning-categories). Which is why you see DeprecationWarning in the test but won't when you execute the code. Took me a while to figure this out when I wanted to deprecate some stuff in my package.

    import warnings

    def test():
        warnings.warn("DeprecationWarning", DeprecationWarning)
        warnings.warn("FutureWarning", FutureWarning)

If you try this in ipython test() will raise both warnings. But if you save to a file and try

    from test_warnings import test
    test()

only FutureWarning will appear (I did not know this detail either https://www.python.org/dev/peps/pep-0565/).

@mathause @shoyer I'll switch to FutureWarning since this seems to be the only user-visible warning, See https://www.python.org/dev/peps/pep-0565/#additional-use-case-for-futurewarning

And, thanks for the pointers and explanations.
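The outcome of this exchange (a `FutureWarning` with `stacklevel=2`) can be sketched with the standard library alone. `apply_sketch` is a hypothetical stand-in, not xarray's actual implementation; only the warning text is taken from the diff above.

```python
import warnings


def apply_sketch(func, *args, dask_gufunc_kwargs=None, meta=None):
    """Hypothetical function that deprecates a direct ``meta`` keyword."""
    kwargs = {} if dask_gufunc_kwargs is None else dict(dask_gufunc_kwargs)
    if meta is not None:
        warnings.warn(
            "``meta`` should be given in the ``dask_gufunc_kwargs`` parameter."
            " It will be removed as direct parameter in a future version.",
            FutureWarning,  # shown to end users by default
            stacklevel=2,  # attribute the warning to the caller's line
        )
        kwargs.setdefault("meta", meta)
    return func(*args), kwargs


# Record the warning instead of printing it, to inspect what was raised.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result, forwarded = apply_sketch(len, [1, 2, 3], meta="m")
```

With `stacklevel=2` the reported file and line are those of the code calling `apply_sketch`, which is usually what a user needs in order to find the call to update.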

mathause · 2020-08-18T09:02:58Z

Nice! Unless @dcherian has any additional comments I'll merge in a few days

dcherian

Looks great! Thanks @kmuehlbauer this is a great improvement!

xarray/tests/test_sparse.py

Co-authored-by: Deepak Cherian <[email protected]>

mathause · 2020-08-19T06:57:29Z

ok then - let's do this. Thanks a lot @kmuehlbauer

kmuehlbauer · 2020-08-19T08:52:40Z

Thanks to all reviewers! Great job!

…ternal use of `apply_ufunc` (follow-up to pydata#4060, fixes pydata#4385)

#4391) * move kwarg's `output_sizes` and `meta` to `dask_gufunc_kwargs` for internal use of `apply_ufunc` (follow-up to #4060, fixes #4385) * add pull request referenz to `whats-new.rst`

ENH: use dask.array.apply_gufunc in xr.apply_ufunc for multiple o…

c18655f

…utputs when `dask='parallelized'`, add/fix tests

DOC: Update docstring and whats-new.rst

3bf1d75

mathause mentioned this pull request May 15, 2020

Fix/apply ufunc meta dtype #4022

Closed

4 tasks

WIP: apply_gufunc

fff7660

WIP: apply_gufunc -> reinstate dask='allowed' as per @mathause, adapt…

d8bcb15

…ing tests

kmuehlbauer added 2 commits May 20, 2020 15:14

WIP: apply_gufunc -> add test for GH pydata#4015, fix test for sparse…

a17ca32

… meta checking

WIP: apply_gufunc -> remove unused input_dims

5f3f847

FIX: black

5a1f15e

FIX: vectorize not needed in if-clause

a05bd18

Merge remote-tracking branch 'origin/master' into fix-1815

35ae2a9

shoyer approved these changes Aug 18, 2020

kmuehlbauer added 2 commits August 18, 2020 08:46

FIX: set DeprecationWarning and stacklevel=2

2fc6272

FIX: use FutureWarning for user visibility

4cb059e

dcherian approved these changes Aug 19, 2020

FIX: remove comment as suggested

4a48acd

Co-authored-by: Deepak Cherian <[email protected]>

mathause merged commit a7fb5a9 into pydata:master Aug 19, 2020

kmuehlbauer mentioned this pull request Aug 25, 2020

Set allow_rechunk=True in apply_ufunc #4372

Closed

mathause mentioned this pull request Aug 28, 2020

warnings from internal use of apply_ufunc #4385

Closed

kmuehlbauer added a commit to kmuehlbauer/xarray that referenced this pull request Aug 30, 2020

move kwarg's output_sizes and meta to dask_gufunc_kwargs for in…

7def983

…ternal use of `apply_ufunc` (follow-up to pydata#4060, fixes pydata#4385)

kmuehlbauer mentioned this pull request Aug 30, 2020

move kwarg's output_sizes and meta to dask_gufunc_kwargs for in… #4391

Merged

5 tasks

This was referenced Aug 31, 2020

FIX: handle dask ValueErrors in apply_ufunc (set allow_rechunk=True) #4392

Merged

Dask gufunc kwarg "output_sizes" is not deep copied #4399

Closed

slevang mentioned this pull request Oct 2, 2020

regrid_dataset broken with xarray=0.16.1 pangeo-data/xESMF#36

Closed

TomNicholas mentioned this pull request May 17, 2021

apply_ufunc support for chunks on input_core_dims #1995

Open

TomNicholas mentioned this pull request May 26, 2021

Corrected reference to blockwise to refer to apply_gufunc instead #5383

Merged

2 tasks

slevang mentioned this pull request Jul 21, 2021

FutureWarning in apply_ufunc pangeo-data/xESMF#103

Merged

kmuehlbauer deleted the fix-1815 branch May 25, 2023 07:12
