Initial minimal working Cubed example for "map-reduce" #352
Conversation
```python
if not has_dask:
    if has_cubed:
        if method is None:
            method = "map-reduce"
```
Will also need an `assert reindex is True` in `_validate_reindex`.
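The guard being suggested might look roughly like the sketch below. Note this is a simplified, standalone illustration: flox's real `_validate_reindex` has a different signature and handles more cases, so the function name's reuse and its arguments here are assumptions.

```python
# Hypothetical, simplified sketch of the validation discussed above.
# The real flox._validate_reindex differs; this only illustrates the idea
# that the Cubed "map-reduce" path requires reindexed intermediates.

def validate_reindex(reindex, is_cubed):
    """Return a validated reindex flag for the chosen backend."""
    if is_cubed:
        if reindex is None:
            reindex = True  # default for the Cubed path
        # Cubed's combine step assumes intermediates are reindexed
        # to the full set of expected groups.
        assert reindex is True, "the Cubed path requires reindex=True"
    return reindex
```

For example, `validate_reindex(None, is_cubed=True)` would default the flag to `True`, while passing `reindex=False` with a Cubed array would fail the assertion.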
Yeah, not sure where that would go exactly.
Thanks @tomwhite! Did you have any thoughts on how to refactor this so that we can share code between the Dask and Cubed paths? Is it feasible to have just the
Co-authored-by: Deepak Cherian <[email protected]>
for more information, see https://pre-commit.ci
Thanks for reviewing @dcherian! The
This might be possible, but it would obviously be more work. The logic for map-reduce, blockwise, and cohorts is combined in
I made a release of Cubed with the changes needed by this PR. I've also tried adding the Cubed tests to CI, so we'll see if that works. I had a look at the grouping by multiple variables (2D) case, but it's not trivial, so I'd rather do it as a follow-up (#353).
I'll take a look soon, I promise. In the meantime it would be good to add that notebook as documentation.
Thanks! Where do you think would be a good place to add it?
Under tricks and stories is fine; it's got a collection of notebooks. Longer term we can update the "Duck Array Support" page.
Also, I took a look and this looks good to me. I experimented with some refactoring, but I agree that it'd be good to see what's needed for blockwise/cohorts before we actually refactor things.
* main: (64 commits)
  - import `normalize_axis_index` from `numpy.lib` on `numpy>=2` (#364)
  - Optimize `min_count` when `expected_groups` is not provided. (#236)
  - Use threadpool for finding labels in chunk (#327)
  - Manually fuse reindexing intermediates with blockwise reduction for cohorts. (#300)
  - Bump codecov/codecov-action from 4.1.1 to 4.3.1 (#362)
  - Add cubed notebook for hourly climatology example using "map-reduce" method (#356)
  - Optimize bitmask finding for chunk size 1 and single chunk cases (#360)
  - Edits to climatology doc (#361)
  - Fix benchmarks (#358)
  - Trim CI (#355)
  - [pre-commit.ci] pre-commit autoupdate (#350)
  - Initial minimal working Cubed example for "map-reduce" (#352)
  - Bump codecov/codecov-action from 4.1.0 to 4.1.1 (#349)
  - `method` heuristics: Avoid dot product as much as possible (#347)
  - Fix nanlen with strings (#344)
  - Fix direct quantile reduction (#343)
  - Fix upstream-dev CI, silence warnings (#341)
  - Bump codecov/codecov-action from 4.0.0 to 4.1.0 (#338)
  - Fix direct reductions of Xarray objects (#339)
  - Test with py3.12 (#336)
  - ...
This is a first step to implementing #224.
I added a separate code path to the Dask one, in `cubed_groupby_agg`, since it is sufficiently different (for example, the combine step in Cubed manages memory in a different way). This PR relies on cubed-dev/cubed#442, which adds the ability to specify the size of the grouping axis.
I added a unit test based on the Dask one, which passes for a few cases, but there are plenty more to support (NaNs, fill values, sorting, etc.).
I have included a Jupyter notebook in this PR, which shouldn't be merged, but shows the code working with a cut-down version of the example in https://flox.readthedocs.io/en/latest/user-stories/climatology-hourly.html. (I haven't tried running anything at scale yet.)
Interested to get your feedback @dcherian and @TomNicholas!
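As a rough illustration of the "separate code path" structure described above, a backend dispatch could look like the following sketch. The helper names (`detect_backend`, the `*_groupby_agg` stand-ins) are assumptions for illustration, not flox's actual internals:

```python
# Hypothetical sketch of dispatching between Dask and Cubed code paths.
# All names below are illustrative stand-ins, not flox's real API.

def detect_backend(array):
    """Pick a backend label from the array type's module name."""
    mod = type(array).__module__
    if mod.startswith("dask."):
        return "dask"
    if mod.startswith("cubed"):
        return "cubed"
    return "numpy"

def groupby_agg(array, by, method=None):
    """Route to a backend-specific aggregation path."""
    backend = detect_backend(array)
    if backend == "cubed":
        # Only "map-reduce" is supported on the Cubed path for now.
        return ("cubed_groupby_agg", method or "map-reduce")
    if backend == "dask":
        return ("dask_groupby_agg", method or "map-reduce")
    return ("numpy_groupby_agg", method)
```

Keeping the Cubed branch separate, as this PR does, avoids entangling Cubed's memory-aware combine step with the existing Dask logic until the blockwise/cohorts requirements are clearer.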