Initial minimal working Cubed example for "map-reduce" #352
Conversation
```python
if not has_dask:
    if has_cubed:
        if method is None:
            method = "map-reduce"
```
Will also need an `assert reindex is True` in `_validate_reindex`.
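The guard being suggested might look roughly like the sketch below. Note this is a simplified, standalone illustration: flox's real `_validate_reindex` has a different signature and handles more cases, so the function name's reuse and its arguments here are assumptions.

```python
# Hypothetical, simplified sketch of the validation discussed above.
# The real flox._validate_reindex differs; this only illustrates the idea
# that the Cubed "map-reduce" path requires reindexed intermediates.

def validate_reindex(reindex, is_cubed):
    """Return a validated reindex flag for the chosen backend."""
    if is_cubed:
        if reindex is None:
            reindex = True  # default for the Cubed path
        # Cubed's combine step assumes intermediates are reindexed
        # to the full set of expected groups.
        assert reindex is True, "the Cubed path requires reindex=True"
    return reindex
```

For example, `validate_reindex(None, is_cubed=True)` would default the flag to `True`, while passing `reindex=False` with a Cubed array would fail the assertion.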
Yeah, not sure where that would go exactly.
Thanks @tomwhite! Did you have any thoughts on how to refactor this so that we can share code between the Dask and Cubed paths? Is it feasible to have just the
Co-authored-by: Deepak Cherian <[email protected]>
for more information, see https://pre-commit.ci
Thanks for reviewing @dcherian! The
This might be possible, but it would obviously be more work. The logic for map-reduce, blockwise, and cohorts is combined in
I made a release of Cubed with the changes needed by this PR. I've also tried adding the Cubed tests to CI, so we'll see if that works. I had a look at the grouping by multiple variables (2D) case, but it's not trivial, so I'd rather do it as a follow-up (#353).
I'll take a look soon, I promise. In the meantime it would be good to add that notebook as documentation.
Thanks! Where do you think would be a good place to add it?
Under tricks and stories is fine; it's got a collection of notebooks. Longer term we can update the "Duck Array Support" page.
Also, I took a look and this looks good to me. I experimented with some refactoring, but I agree that it'd be good to see what's needed for blockwise/cohorts before we actually refactor things.
* main: (64 commits)
  - import `normalize_axis_index` from `numpy.lib` on `numpy>=2` (#364)
  - Optimize `min_count` when `expected_groups` is not provided. (#236)
  - Use threadpool for finding labels in chunk (#327)
  - Manually fuse reindexing intermediates with blockwise reduction for cohorts. (#300)
  - Bump codecov/codecov-action from 4.1.1 to 4.3.1 (#362)
  - Add cubed notebook for hourly climatology example using "map-reduce" method (#356)
  - Optimize bitmask finding for chunk size 1 and single chunk cases (#360)
  - Edits to climatology doc (#361)
  - Fix benchmarks (#358)
  - Trim CI (#355)
  - [pre-commit.ci] pre-commit autoupdate (#350)
  - Initial minimal working Cubed example for "map-reduce" (#352)
  - Bump codecov/codecov-action from 4.1.0 to 4.1.1 (#349)
  - `method` heuristics: Avoid dot product as much as possible (#347)
  - Fix nanlen with strings (#344)
  - Fix direct quantile reduction (#343)
  - Fix upstream-dev CI, silence warnings (#341)
  - Bump codecov/codecov-action from 4.0.0 to 4.1.0 (#338)
  - Fix direct reductions of Xarray objects (#339)
  - Test with py3.12 (#336)
  - ...
This is a first step to implementing #224.
I added a separate code path to the Dask one, in `cubed_groupby_agg`, since it is sufficiently different (for example, the combine step in Cubed manages memory in a different way). This PR relies on cubed-dev/cubed#442, which adds the ability to specify the size of the grouping axis.
I added a unit test based on the Dask one, which passes for a few cases, but there are plenty more to support (NaNs, fill values, sorting, etc.).
I have included a Jupyter notebook in this PR, which shouldn't be merged, but shows the code working with a cut-down version of the example in https://flox.readthedocs.io/en/latest/user-stories/climatology-hourly.html. (I haven't tried running anything at scale yet.)
Interested to get your feedback @dcherian and @TomNicholas!
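As a rough illustration of the "separate code path" structure described above, a backend dispatch could look like the following sketch. The helper names (`detect_backend`, the `*_groupby_agg` stand-ins) are assumptions for illustration, not flox's actual internals:

```python
# Hypothetical sketch of dispatching between Dask and Cubed code paths.
# All names below are illustrative stand-ins, not flox's real API.

def detect_backend(array):
    """Pick a backend label from the array type's module name."""
    mod = type(array).__module__
    if mod.startswith("dask."):
        return "dask"
    if mod.startswith("cubed"):
        return "cubed"
    return "numpy"

def groupby_agg(array, by, method=None):
    """Route to a backend-specific aggregation path."""
    backend = detect_backend(array)
    if backend == "cubed":
        # Only "map-reduce" is supported on the Cubed path for now.
        return ("cubed_groupby_agg", method or "map-reduce")
    if backend == "dask":
        return ("dask_groupby_agg", method or "map-reduce")
    return ("numpy_groupby_agg", method)
```

Keeping the Cubed branch separate, as this PR does, avoids entangling Cubed's memory-aware combine step with the existing Dask logic until the blockwise/cohorts requirements are clearer.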