Dask friendly check in .weighted()
#4559
The CI environments without dask are failing. Should I add some pytest skip logic, or what is the best way to handle this?
Yes, |
since the test you added requires That won't fix all the failing tests, though: |
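The skip logic asked about above could be sketched roughly as follows. This is a hypothetical illustration, not the code merged in this PR: the names `has_dask`, `requires_dask_sketch`, and `test_weighted_lazy_sketch` are made up here, and the sketch relies only on `importlib.util.find_spec` and `pytest.mark.skipif`.

```python
import importlib.util

import pytest

# True when dask can be imported in the current environment.
has_dask = importlib.util.find_spec("dask") is not None

# Hypothetical marker: skip the decorated test when dask is missing.
requires_dask_sketch = pytest.mark.skipif(not has_dask, reason="requires dask")


@requires_dask_sketch
def test_weighted_lazy_sketch():
    # Import inside the test so that collecting this module never needs dask.
    import dask.array as dsa

    weights = dsa.ones((4,), chunks=2)
    # Building the array only creates a task graph; nothing is computed here.
    assert hasattr(weights, "dask")
```

With a marker like this, environments without dask report the test as skipped instead of failing at import time.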
For simplicity I would use `if dask_duck_array(weights):`.
xarray/tests/test_weighted.py (outdated)
    weights = DataArray(weights).chunk({"dim_0": -1})

    weighted = data.weighted(weights)
You can test that dask does not compute:
xarray/xarray/tests/test_dask.py, lines 189 to 190 in 83884a1:
    with raise_if_dask_computes():
        actual = v.argmax(dim="x")
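To illustrate how a guard of this kind can work, here is a minimal sketch that routes every compute through a counting scheduler. It is an assumption-laden illustration rather than xarray's own helper: `CountingScheduler` and `fail_on_compute` are names chosen for this example, and the sketch relies on `dask.config.set(scheduler=...)` accepting a callable.

```python
import dask
import dask.array as dsa


class CountingScheduler:
    """Scheduler that raises once more than ``max_computes`` computes happen."""

    def __init__(self, max_computes=0):
        self.total_computes = 0
        self.max_computes = max_computes

    def __call__(self, dsk, keys, **kwargs):
        self.total_computes += 1
        if self.total_computes > self.max_computes:
            raise RuntimeError(
                f"Too many computes: {self.total_computes} > {self.max_computes}"
            )
        # Within the allowed budget, fall back to dask's synchronous scheduler.
        return dask.get(dsk, keys, **kwargs)


def fail_on_compute(max_computes=0):
    """Context manager (hypothetical name) that counts and limits computes."""
    return dask.config.set(scheduler=CountingScheduler(max_computes))


with fail_on_compute():
    # Only a task graph is built here, so the guard stays silent.
    lazy_total = dsa.ones((6,), chunks=3).sum()
```

Any call to `.compute()` inside the `with` block would raise `RuntimeError`, which is what makes an accidental eager evaluation visible in a test.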
Co-authored-by: Deepak Cherian <[email protected]>
Co-authored-by: Maximilian Roos <[email protected]>
I think you need to do something along the lines of:
    if dask_duck_array(weights):
        import dask.array as dsa
        dsa.map_blocks(_weight_check, weights.data, dtype=weights.dtype)
    else:
        _weight_check(weights.data)
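The suggestion above can be sketched as a small runnable example. It is a simplified illustration under stated assumptions: `check_weights_sketch` is a hypothetical name, it works on raw arrays rather than on a `DataArray`, and it uses a plain `isinstance` test where the PR uses a duck-array check. dask is treated as optional.

```python
import numpy as np


def _weight_check(w):
    """Raise if a block holds NaN; otherwise return the block unchanged."""
    if np.isnan(w).any():
        raise ValueError("`weights` cannot contain missing values.")
    return w


def check_weights_sketch(data):
    """Validate eagerly for NumPy input, lazily for dask input."""
    try:
        import dask.array as dsa
    except ImportError:  # dask is optional
        dsa = None

    if dsa is not None and isinstance(data, dsa.Array):
        # The check becomes part of the graph and runs per block on compute.
        return dsa.map_blocks(_weight_check, data, dtype=data.dtype)
    return _weight_check(data)
```

For dask input, the `ValueError` is deferred until the result is computed, so creating the weighted object stays lazy.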
I did have to fiddle with this a bit. I did change |
xarray/core/weighted.py (outdated)
        "`weights` cannot contain missing values. "
        "Missing values can be replaced by `weights.fillna(0)`."
    def _weight_check(w):
        if np.isnan(w).any():
`.isnull()` does a bit more than that: `np.isnan` won't detect `NaT`. @mathause, how likely is it to get datetime-like arrays here? They don't make much sense as weights, but as far as I can tell we don't check (I might be missing something, though).
There is no check for that. A `TimeDelta` may make some sense as weights, `DateTime` not so much. I think we can get away with using `np.isnan`. A `Date*` array as weights containing `NaT` should be super uncommon.
We could still operate on the DataArray instead of the dask/numpy array, but as @dcherian suggested, that would be less efficient. I would be curious what penalties would actually occur when we use `weights.map_blocks` compared to `dask.array.map_blocks`.
- Could use `duck_array_ops.isnull` to account for `timedelta64`? It is weird to have it as a weight though. Does that work?
- Re map_blocks: the xarray version adds tasks that create xarray objects wrapping every block in a dask array. That adds overhead which is totally unnecessary here.
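As a rough illustration of the first point, a dtype-aware missing-value check can be sketched with NumPy alone. This is a simplified stand-in, not the implementation of `duck_array_ops.isnull`; the name `isnull_sketch` is hypothetical.

```python
import numpy as np


def isnull_sketch(values):
    """Element-wise missing-value mask that also understands NaT."""
    values = np.asarray(values)
    if values.dtype.kind in "mM":  # timedelta64 / datetime64
        return np.isnat(values)
    if values.dtype.kind in "fc":  # float / complex
        return np.isnan(values)
    # Integer, boolean, and similar dtypes cannot hold missing values.
    return np.zeros(values.shape, dtype=bool)


deltas = np.array([np.timedelta64(1, "D"), np.timedelta64("NaT")])
floats = np.array([1.0, np.nan])
```

Dispatching on `dtype.kind` keeps the check cheap for the common float case while still covering `timedelta64` weights.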
Do you think this works or are further changes needed? Many thanks for the guidance so far!
Looks great to me. Want to add a whatsnew?
xarray/core/weighted.py (outdated)
    if is_duck_dask_array(weights.data):
        import dask.array as dsa

        weights.data = dsa.map_blocks(
Suggested change:
    - weights.data = dsa.map_blocks(
    + weights = weights.copy(data=dsa.map_blocks(
so we don't modify the original object. Could even do `weights.data.map_blocks(...)` to save some typing...
Co-authored-by: Maximilian Roos <[email protected]>
I am getting some failures for |
Similarly on |
OK, I think this should be good to go. I have implemented all the requested changes. The remaining failures are related to other problems upstream (I think). Anything else I should add here?
Co-authored-by: Mathias Hauser <[email protected]>
I am not understanding why that |
This CI is sometimes flaky, but it's usually enough to just rerun it. I'll do that for you once the other CI has finished.
Seems like all the other tests are passing (minus the two upstream problems discussed before).
Thanks @jbusecke great contribution!
Thanks @jbusecke
* initial changes
* Using map_blocks to lazily mask input arrays, following #4559
* Adding lazy corr cov test with `raise_if_dask_computes`
* adding test for one da without nans
* checking ordering of arrays doesnt matter
Co-authored-by: Deepak Cherian <[email protected]>
* adjust inputs to test
* add test for array with no missing values
* added whatsnew
* fixing format issues
Co-authored-by: Deepak Cherian <[email protected]>
weighted()
#4541
isort . && black . && mypy . && flake8
whats-new.rst