KeyError when 'dask_histogram.boost.Histogram().Fill()' with dask dataframe #161

Open
RobinTimTom opened this issue Jan 8, 2025 · 5 comments

Comments

@RobinTimTom

Dear experts,
I am starting to use dask and dask_histogram, and I get an error when I try to fill a dask_histogram.boost histogram from a dataframe, as shown below:

import numpy as np
import dask.dataframe as dd
import dask_histogram.boost as dhb

# this is reproducible
d = {
    'A': np.random.normal(0., 1., 100000),
    'W': np.random.uniform(0.2, 0.8, 100000),
}
ddf = dd.from_dict(d, npartitions=10)

h = dhb.Histogram(
    dhb.axis.Regular(10, -3, 3),
    storage=dhb.storage.Weight()
).fill(ddf['A'], weight=ddf['W']).compute()
print(h)

This example gives me:

Traceback (most recent call last):
  File "/gpfs/home/belle2/rlebouch/darkphotontodimuons/background_rejection/testdask.py", line 15, in <module>
    ).fill(ddf['A'], weight=ddf['W']).compute()
                                      ^^^^^^^^^
  File "/home/belle2/rlebouch/.local/lib/python3.11/site-packages/dask/base.py", line 372, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/belle2/rlebouch/.local/lib/python3.11/site-packages/dask/base.py", line 653, in compute
    dsk = collections_to_dsk(collections, optimize_graph, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/belle2/rlebouch/.local/lib/python3.11/site-packages/dask/base.py", line 422, in collections_to_dsk
    dsk = opt(dsk, keys, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/belle2/rlebouch/.local/lib/python3.11/site-packages/dask_histogram/core.py", line 514, in optimize
    dsk = fuse_roots(dsk, keys=keys)  # type: ignore
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/belle2/rlebouch/.local/lib/python3.11/site-packages/dask/blockwise.py", line 1564, in fuse_roots
    new = toolz.merge(layer, *[layers[dep] for dep in deps])
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/belle2/rlebouch/.local/lib/python3.11/site-packages/toolz/dicttoolz.py", line 39, in merge
    rv.update(d)
  File "<frozen _collections_abc>", line 836, in __iter__
  File "/home/belle2/rlebouch/.local/lib/python3.11/site-packages/dask/blockwise.py", line 641, in __iter__
    return iter(self._dict)
                ^^^^^^^^^^
  File "/home/belle2/rlebouch/.local/lib/python3.11/site-packages/dask/blockwise.py", line 607, in _dict
    dsk = _make_blockwise_graph(
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/belle2/rlebouch/.local/lib/python3.11/site-packages/dask/blockwise.py", line 958, in _make_blockwise_graph
    itertools.product(*[range(dims[i]) for i in out_indices])
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/belle2/rlebouch/.local/lib/python3.11/site-packages/dask/blockwise.py", line 958, in <listcomp>
    itertools.product(*[range(dims[i]) for i in out_indices])
                              ~~~~^^^
KeyError: '.0'

Is it actually possible to fill a histogram from a dataframe?

I currently use:
Name: dask-histogram
Version: 2024.12.1

Name: dask
Version: 2024.12.1

Name: boost_histogram
Version: 1.4.1

@douglasdavis
Collaborator

This problem stems from the new dask.dataframe backend that is based on dask-expr; dask-histogram isn't compatible at this time. More info here: #130

The code will work with the Dask config environment variable DASK_DATAFRAME__QUERY_PLANNING=False or with dask.config.set({"dataframe.query-planning": False}) in Python code.

@RobinTimTom
Author

I added your suggestion to my code, but it did not help; I still get the same error message.

@douglasdavis
Collaborator

Can you share more details? Did you export the environment variable or use the dask.config API?

@RobinTimTom
Author

I tried with the dask.config API.

@douglasdavis
Collaborator

Hmm, yeah, I can only make it work with the environment variable, not with the config; maybe it's an artifact of mixing dask-histogram and dask.dataframe, I'm not sure. That's probably a separate issue. Anyway, this is the workaround for now:

~/software/repos/dask-histogram main ❯ DASK_DATAFRAME__QUERY_PLANNING=False ipython
Python 3.12.8 (main, Dec  3 2024, 18:42:41) [Clang 16.0.0 (clang-1600.0.26.4)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.30.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np
   ...: import dask.dataframe as dd
   ...: import dask_histogram.boost as dhb
   ...:
   ...: # this is reproducible
   ...: d = {
   ...:     'A': np.random.normal(0., 1., 100000),
   ...:     'W': np.random.uniform(0.2, 0.8, 100000),
   ...: }
   ...: ddf = dd.from_dict(d, npartitions=10)
   ...:
   ...: h = dhb.Histogram(
   ...:     dhb.axis.Regular(10, -3, 3),
   ...:     storage=dhb.storage.Weight()
   ...: ).fill(ddf['A'], weight=ddf['W']).compute()
   ...: print(h)
/Users/ddavis/software/repos/dask-histogram/.venv/lib/python3.12/site-packages/dask/dataframe/__init__.py:31: FutureWarning: The legacy Dask DataFrame implementation is deprecated and will be removed in a future version. Set the configuration option `dataframe.query-planning` to `True` or None to enable the new Dask Dataframe implementation and silence this warning.
  warnings.warn(
                       ┌─────────────────────────────────────────────────────┐
[-inf,   -3) 66.92     │▎                                                    │
[  -3, -2.4) 357.9     │█▋                                                   │
[-2.4, -1.8) 1391      │██████▍                                              │
[-1.8, -1.2) 3997      │██████████████████▎                                  │
[-1.2, -0.6) 7929      │████████████████████████████████████▎                │
[-0.6,    0) 1.139e+04 │████████████████████████████████████████████████████ │
[   0,  0.6) 1.111e+04 │██████████████████████████████████████████████████▊  │
[ 0.6,  1.2) 8052      │████████████████████████████████████▊                │
[ 1.2,  1.8) 3914      │█████████████████▉                                   │
[ 1.8,  2.4) 1368      │██████▎                                              │
[ 2.4,    3) 324.1     │█▌                                                   │
[   3,  inf) 63.99     │▎                                                    │
                       └─────────────────────────────────────────────────────┘
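
When exporting the variable in the shell is inconvenient (for example in a batch job script), one possible variant is to set it from Python before dask is first imported. The sketch below assumes that dask collects `DASK_*` environment variables when the package is first imported, so the assignment has to happen before any `import dask` in the process. The helper name `disable_query_planning` is hypothetical and not part of dask or dask-histogram, and this has not been checked against the versions reported in this thread. The FutureWarning above also says the legacy implementation will be removed, so this may stop working in later dask releases.

```python
import os
import sys
import warnings

# Same variable the workaround above exports in the shell.
ENV_VAR = "DASK_DATAFRAME__QUERY_PLANNING"


def disable_query_planning() -> bool:
    """Set the environment variable from inside the Python process.

    Hypothetical helper, not part of dask or dask-histogram.
    Returns True if dask had not been imported yet when the variable was set.
    """
    already_imported = "dask" in sys.modules
    if already_imported:
        # Assumption: dask reads DASK_* variables on import, so setting the
        # variable afterwards may have no effect on the dataframe backend.
        warnings.warn(
            f"dask is already imported; setting {ENV_VAR} now may be too late"
        )
    os.environ[ENV_VAR] = "False"
    return not already_imported


set_before_import = disable_query_planning()
```

Called at the very top of the script, before `import dask.dataframe as dd` and `import dask_histogram.boost as dhb`, this is meant to mirror the shell-level `DASK_DATAFRAME__QUERY_PLANNING=False` prefix used above.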
