Dask friendly check in .weighted() #4559

Merged · 23 commits · Nov 9, 2020

Conversation

@jbusecke (Contributor) commented Oct 31, 2020

@pep8speaks commented Oct 31, 2020

Hello @jbusecke! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-11-09 15:24:48 UTC

@jbusecke marked this pull request as draft on October 31, 2020 20:14
@max-sixty (Collaborator) left a comment

Looks good! Thanks @jbusecke . I'll let @dcherian or @mathause comment too.

Review thread on xarray/core/weighted.py (outdated, resolved)
@jbusecke (Contributor, Author) commented Nov 1, 2020

The CI environments without dask are failing. Should I add some pytest skip logic, or what is the best way to handle this?

@max-sixty (Collaborator):

Yes, @requires_dask on the test

@keewis (Collaborator) commented Nov 1, 2020

Since the test you added requires dask, it should definitely get a @requires_dask.

That won't fix all the failing tests, though: map_blocks is only valid if we operate on dask arrays. I guess you either have to apply the function directly on other arrays, or we have to make DataArray.map_blocks on non-dask arrays fall back to DataArray.map.

@mathause (Collaborator) left a comment

For simplicity I would use `if dask_duck_array(weights):`.


weights = DataArray(weights).chunk({"dim_0": -1})

weighted = data.weighted(weights)
Review comment (Collaborator) on the lines above:

You can test that dask does not compute:

with raise_if_dask_computes():
    actual = v.argmax(dim="x")
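
For reference, a fuller sketch of such a laziness test for this PR might look like the following (the test name, the example values, and the xarray.tests import locations are assumptions, not the code that was merged):

import numpy as np

from xarray import DataArray
from xarray.tests import raise_if_dask_computes, requires_dask


@requires_dask
def test_weighted_mean_is_lazy():
    data = DataArray(np.arange(4.0), dims="dim_0").chunk({"dim_0": 2})
    weights = DataArray(np.ones(4), dims="dim_0").chunk({"dim_0": -1})

    # Constructing the weighted object and reducing it must not compute anything.
    with raise_if_dask_computes():
        result = data.weighted(weights).mean("dim_0")

    # The graph is only evaluated here; all weights are equal, so the weighted
    # mean matches the plain mean of [0, 1, 2, 3].
    assert float(result) == 1.5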

Review thread on xarray/core/weighted.py (outdated, resolved)
@mathause (Collaborator) commented Nov 1, 2020

I think you need to do something along the lines of:

if dask_duck_array(weights):
    import dask.array as dsa
    dsa.map_blocks(_weight_check, weights.data, dtype=weights.dtype)
else:
    _weight_check(weights.data)
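
For reference, a self-contained version of that helper could look roughly like this (assembled from the snippet quoted further down in this thread; returning the block unchanged is an assumption so the same function works with dask.array.map_blocks):

import numpy as np


def _weight_check(w):
    # Raise if any weight is NaN (eagerly for numpy, at compute time for dask).
    if np.isnan(w).any():
        raise ValueError(
            "`weights` cannot contain missing values. "
            "Missing values can be replaced by `weights.fillna(0)`."
        )
    return w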

@jbusecke (Contributor, Author) commented Nov 1, 2020

I had to fiddle with this a bit. I changed .isnull() to np.isnan() and replaced the data of weights, since otherwise the error would not be raised for the dask array at all.
This does not look terribly elegant to me, but it passes the tests locally. Waiting for the CI to see if the others pass as well. Happy to make further changes, and thanks for all the input so far.

"`weights` cannot contain missing values. "
"Missing values can be replaced by `weights.fillna(0)`."
def _weight_check(w):
    if np.isnan(w).any():
Review comment (Collaborator):

.isnull() does a bit more than that: np.isnan won't detect NaT. @mathause, how likely is it to get datetime-like arrays here? They don't make much sense as weights, but as far as I can tell we don't check (I might be missing something, though)

Reply (Collaborator):

There is no check for that. A TimeDelta may make some sense as weights. DateTime not so much. I think we can get away with using np.isnan. A Date* array as weights containing NaT should be super uncommon.

Reply (Contributor, Author):

We could still operate on the DataArray instead of the dask/numpy array, but as @dcherian suggested, that would be less efficient. I would be curious what penalties actually occur when we use weights.map_blocks compared to dask.array.map_blocks.

Reply (Contributor):

1. Could use duck_array_ops.isnull to account for timedelta64? It is weird to have it as a weight, though. Does that work? (A quick demonstration follows below.)
2. Re map_blocks: the xarray version adds tasks that create xarray objects wrapping every block in a dask array. That adds overhead which is totally unnecessary here.
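
Regarding point 1, a quick demonstration of the difference being pointed at (note that xarray.core.duck_array_ops is an internal module, so the import path is an assumption and may change between versions):

import numpy as np

from xarray.core import duck_array_ops

# Unlike np.isnan, isnull also flags NaT in datetime64/timedelta64 arrays.
td = np.array([1, "NaT"], dtype="timedelta64[ns]")
print(duck_array_ops.isnull(td))  # [False  True]

fl = np.array([1.0, np.nan])
print(duck_array_ops.isnull(fl))  # [False  True]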

@jbusecke marked this pull request as ready for review on November 2, 2020 19:17
@jbusecke (Contributor, Author) commented Nov 3, 2020

Do you think this works or are further changes needed? Many thanks for the guidance so far!

@max-sixty (Collaborator) left a comment

Looks great to me. Want to add a whatsnew?

Review thread on xarray/core/weighted.py (outdated, resolved)
if is_duck_dask_array(weights.data):
    import dask.array as dsa

    weights.data = dsa.map_blocks(
@dcherian (Contributor) commented Nov 3, 2020:

Suggested change:
-    weights.data = dsa.map_blocks(
+    weights = weights.copy(data=dsa.map_blocks(

so we don't modify the original object. Could even do weights.data.map_blocks(...) to save some typing...
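
Put together, the non-mutating variant being suggested would look roughly like this (a sketch based on the snippet above, not the exact merged code; _weight_check is the NaN-checking helper discussed earlier):

if is_duck_dask_array(weights.data):
    import dask.array as dsa

    # Attach the check to the dask graph on a copy so the caller's weights
    # object is not modified in place.
    weights = weights.copy(
        data=dsa.map_blocks(_weight_check, weights.data, dtype=weights.dtype)
    )
else:
    _weight_check(weights.data)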

@jbusecke changed the title from "Dask friendly check for nans in .weighted()" to "Dask friendly check in .weighted()" on Nov 4, 2020
@jbusecke (Contributor, Author) commented Nov 4, 2020

I am getting some failures for py38-upstream (TestDataset.test_polyfit_warnings) that seem unrelated?

@jbusecke (Contributor, Author) commented Nov 4, 2020

Similarly on MacOSX py38 this fails: TestDask.test_save_mfdataset_compute_false_roundtrip

@keewis (Collaborator) commented Nov 4, 2020

the TestDask.test_save_mfdataset_compute_false_roundtrip failure should be #4539, and the failing upstream-dev CI shouldn't be related, either (my guess is that a dependency – most likely numpy – added a new warning). I'll investigate and open a new issue for that (edit: see #4565).

@jbusecke (Contributor, Author) commented Nov 5, 2020

Ok I think this should be good to go. I have implemented all the requested changes. The remaining failures are related to other problems upstream (I think). Anything else I should add here?

Two review threads on xarray/core/weighted.py (outdated, resolved)
@jbusecke (Contributor, Author) commented Nov 6, 2020

I don't understand why that MinimumVersionPolicy test is failing (or what it does, haha). Is this something upstream, or should I fix it?

@keewis (Collaborator) commented Nov 6, 2020

This CI job is sometimes flaky, but it's usually enough to just rerun it. I'll do that for you once the other CI has finished.

@jbusecke (Contributor, Author) commented Nov 6, 2020

Seems like all the other tests are passing (minus the two upstream problems discussed before).

@dcherian (Contributor) left a comment

Thanks @jbusecke, great contribution!

Review thread on xarray/core/weighted.py (outdated, resolved)
@mathause merged commit 5a60354 into pydata:master on Nov 9, 2020
@mathause (Collaborator) commented Nov 9, 2020

Thanks @jbusecke

AndrewILWilliams added a commit to AndrewILWilliams/xarray that referenced this pull request May 16, 2021
max-sixty pushed a commit that referenced this pull request May 27, 2021
* initial changes

* Using map_blocks to lazily mask input arrays, following #4559

* Adding lazy corr cov test with `raise_if_dask_computes`

* adding test for one da without nans

* checking ordering of arrays doesn't matter

Co-authored-by: Deepak Cherian <[email protected]>

* adjust inputs to test

* add test for array with no missing values

* added whatsnew

* fixing format issues

Co-authored-by: Deepak Cherian <[email protected]>
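
As a rough illustration of the lazy-masking idea mentioned in those commit messages (this sketch uses dask's elementwise isnan/where rather than map_blocks, and the variable names are illustrative; it is not the code from that PR):

import numpy as np
import dask.array as dsa

a = dsa.from_array(np.array([1.0, np.nan, 3.0]), chunks=2)
b = dsa.from_array(np.array([np.nan, 2.0, 3.0]), chunks=2)

# Mask each array wherever either input is missing; everything stays lazy
# until the result is explicitly computed.
valid = ~(dsa.isnan(a) | dsa.isnan(b))
a_masked = dsa.where(valid, a, np.nan)
b_masked = dsa.where(valid, b, np.nan)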

Successfully merging this pull request may close these issues.

Option to skip tests in weighted()
6 participants