Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dask.base.tokenize() fails when one of the arguments is a Quantity #1313

Closed
mnlevy1981 opened this issue May 21, 2021 · 7 comments · Fixed by #1314
Closed

dask.base.tokenize() fails when one of the arguments is a Quantity #1313

mnlevy1981 opened this issue May 21, 2021 · 7 comments · Fixed by #1314

Comments

@mnlevy1981
Copy link

Something changed between 0.14 and 0.15 (#1151 seems like a likely culprit based on the issue it fixed), and now passing Pint.Quantity variables to dask.base.tokenize() fails.

Here's a simple script:

#!/usr/bin/env python

import dask
import pint
import platform

def get_token(func, *args, **kwargs):
    return dask.base.tokenize(func, args, kwargs)

print(f"python version: {platform.python_version()}")
print(f"dask version: {dask.__version__}")
print(f"pint version: {pint.__version__}")

print(f"Token with integer: {get_token(get_token, 1)}")
meter = pint.UnitRegistry().meter
print(f"Token with {type(meter)}: {get_token(get_token, meter)}")
one_meter = 1*meter
print(f"Token with {type(one_meter)}: {get_token(get_token, one_meter)}")

It runs fine with 0.14:

python version: 3.7.10
dask version: 2021.05.0
pint version: 0.14
Token with integer: b42aee4397f136cc5ad5d47dfec43b4a
Token with <class 'pint.unit.build_unit_class.<locals>.Unit'>: ccc2ff854733bde08b7d3673fe8c0831
Token with <class 'pint.quantity.build_quantity_class.<locals>.Quantity'>: 6859cfdc6ac62f5ec8597128645644f7

but not 0.15:

python version: 3.7.10
dask version: 2021.05.0
pint version: 0.15
Token with integer: b42aee4397f136cc5ad5d47dfec43b4a
Token with <class 'pint.unit.build_unit_class.<locals>.Unit'>: c666c4697bd230072ee1d38fa4f0355e
Traceback (most recent call last):
  File "./dask_and_pint.py", line 18, in <module>
    print(f"Token with {type(one_meter)}: {get_token(get_token, one_meter)}")
  File "./dask_and_pint.py", line 8, in get_token
    return dask.base.tokenize(func, args, kwargs)
  File "/glade/work/mlevy/miniconda3/envs/daskbug/lib/python3.7/site-packages/dask/base.py", line 795, in tokenize
    return md5(str(tuple(map(normalize_token, args))).encode()).hexdigest()
  File "/glade/work/mlevy/miniconda3/envs/daskbug/lib/python3.7/site-packages/dask/utils.py", line 512, in __call__
    return meth(arg, *args, **kwargs)
  File "/glade/work/mlevy/miniconda3/envs/daskbug/lib/python3.7/site-packages/dask/base.py", line 827, in normalize_seq
    return type(seq).__name__, func(seq)
  File "/glade/work/mlevy/miniconda3/envs/daskbug/lib/python3.7/site-packages/dask/base.py", line 823, in func
    return list(map(normalize_token, seq))
  File "/glade/work/mlevy/miniconda3/envs/daskbug/lib/python3.7/site-packages/dask/utils.py", line 512, in __call__
    return meth(arg, *args, **kwargs)
  File "/glade/work/mlevy/miniconda3/envs/daskbug/lib/python3.7/site-packages/dask/base.py", line 844, in normalize_object
    return method()
  File "/glade/work/mlevy/miniconda3/envs/daskbug/lib/python3.7/site-packages/pint/quantity.py", line 1971, in __dask_tokenize__
    return (Quantity, self._magnitude.name, self.units)
AttributeError: 'int' object has no attribute 'name'

It looks like #1151 introduced a test_dask_tokenize() test that is presumably passing, so there's a pretty good chance that I'm just not using pint correctly in a dask setting :)

@keewis
Copy link
Contributor

keewis commented May 21, 2021

that indeed seems like a bug, the code currently assumes that if you're calling tokenize on a pint array it must be wrapping a dask array, which is obviously not the case in your example (and I would guess the same happens for other dask protocols).

Although it doesn't change much in this case, please note that the most recent version of pint is 0.17.

@jthielen
Copy link
Contributor

Thanks for raising this issue!

This brings up an interesting point: is __dask_tokenize__ solely part of the "duck dask Array" API (in which case failing when not wrapping a dask array seems like the right behavior, but perhaps a more informative error is in order), or should tokenization work in general (in which case we should be falling back to some unique identifier other than the name attribute of the magnitude when it isn't there). I'd definitely lean towards the latter. @keewis and @rpmanser do you have thoughts here?

@mnlevy1981 as far as your point on "not using pint correctly in a dask setting", without seeing your usage, all I can say is that the main rule is that Dask Arrays are intended to go inside Pint Quantities and not the other way around. https://pint.readthedocs.io/en/stable/numpy.html#Array-Type-Support and dask/dask#6393 may provide some informative reading if you'd like to dig deeper.

@mnlevy1981
Copy link
Author

Although it doesn't change much in this case, please note that the most recent version of pint is 0.17.

This error does still appear in the latest version, but I had a notebook that used to work so I just backed up to like 0.10 and then stepped forward until I found the last working version. I haven't actually run it in 0.16, but it's definitely broken in 0.15 and 0.17

@mnlevy1981 as far as your point on "not using pint correctly in a dask setting", without seeing your usage, all I can say is that the main rule is that Dask Arrays are intended to go inside Pint Quantities and not the other way around.

It's been a while since I wrote the code, but I believe I have a dictionary of various unit conversion factors that are all [scalar] Pint Quantities and then I'm multiplying several dask arrays by the appropriate dictionary entry. The dask arrays are actually xr.DataArray.values but I haven't had a chance to explore pint-xarray yet

@keewis
Copy link
Contributor

keewis commented May 21, 2021

well, if dask.base.tokenize(1) seems like something valid (and it seems it is) then I'd argue that dask.base.tokenize(1 * ureg.m) should also be valid.

As for the unique identifier: if it's not too expensive should we just delegate to the magnitude (i.e. dask.base.tokenize(self._magnitude))?

Edit: that way, __dask_tokenize__ would return (Quantity, dask.base.tokenize(self._magnitude), self.units)

@keewis
Copy link
Contributor

keewis commented May 21, 2021

I haven't had a chance to explore pint-xarray yet

Once you do, I'd love to see how you use it and where the pain points currently are (if any).

@rpmanser
Copy link
Contributor

I see you already opened a PR @keewis, thanks. I'll still chime in briefly.

As for the unique identifier: if it's not too expensive should we just delegate to the magnitude (i.e. dask.base.tokenize(self._magnitude))?

Edit: that way, __dask_tokenize__ would return (Quantity, dask.base.tokenize(self._magnitude), self.units)

Would a particularly large, say 5 GB or larger, numpy array be too expensive? I imagine tokenizing doesn't read the whole array, just its memory location, but I'm not completely familiar with the process.

@keewis
Copy link
Contributor

keewis commented May 24, 2021

If I read the code for dask.base.register_numpy correctly, it shouldn't be too expensive. This seems to support that view (I can't check 5GB arrays, I don't have that much memory):

In [2]: from dask.base import tokenize
   ...: import numpy as np
   ...: 
   ...: for size in (100, 10000, 100000000):
   ...:     a = np.linspace(0, 1, size)
   ...:     print("size:", a.nbytes)
   ...:     %timeit tokenize(a)
   ...: 
size: 800
23.8 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
size: 80000
34.4 µs ± 193 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
size: 800000000
105 ms ± 8.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants