Consider setting name=False in Variable.chunk() #1525

Open
shoyer opened this issue Aug 24, 2017 · 4 comments

@shoyer
Member

shoyer commented Aug 24, 2017

@mrocklin writes:

The following will be slower:

b = (a.chunk(...) + 1) + (a.chunk(...) + 1)

In current operation this will be optimized to

tmp = a.chunk(...) + 1
b = tmp + tmp

So you'll lose that, but I suspect that in your case chunking the same dataset many times is somewhat rare.
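The effect described above can be illustrated with a toy task graph. This is only a sketch and not Dask's implementation: `token`, `chunk`, `add_one`, `build`, and the dict-based graph are hypothetical stand-ins. The idea is that tasks keyed by a hash of their inputs collapse into a single entry, while tasks keyed by a random name stay separate.

```python
import hashlib
import uuid


def token(*parts):
    """Deterministic name component: a hash of the repr of the inputs."""
    return hashlib.sha1(repr(parts).encode()).hexdigest()[:12]


def chunk(graph, data, deterministic=True):
    """Add a hypothetical 'chunk' task for `data` and return its key."""
    suffix = token(data) if deterministic else uuid.uuid4().hex[:12]
    key = "chunk-" + suffix
    graph[key] = ("chunk", data)
    return key


def add_one(graph, dep):
    """Add a '+ 1' task that depends on `dep`; its key derives from the dependency."""
    key = "add1-" + token(dep)
    graph[key] = ("add", dep, 1)
    return key


def build(deterministic):
    """Build the graph for b = (a.chunk() + 1) + (a.chunk() + 1)."""
    graph = {}
    data = (1, 2, 3)
    left = add_one(graph, chunk(graph, data, deterministic))
    right = add_one(graph, chunk(graph, data, deterministic))
    graph["b"] = ("add", left, right)
    return graph


print(len(build(deterministic=True)))   # 3 tasks: the duplicate branches merge
print(len(build(deterministic=False)))  # 5 tasks: each branch is kept separately
```

With deterministic names the two `chunk` calls produce the same key, so everything downstream is shared; with random names the shared work is no longer visible to the graph.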

See here for discussion: #1517 (comment)

Whether this is worth doing really depends on what people would find most useful, and on what the most intuitive behavior is.

@mrocklin
Contributor

To be explicit, da.from_array currently names arrays by default by hashing all of the data within them. This can be somewhat slow depending on what hashing libraries you have on your machine, generally something like 500-1000 MB/s. This buys you a deterministic name for your array. If someone else with the exact same data does the exact same operations, then Dask can track that and avoid repeated work.

So you have to choose:

  1. Avoid repeated work
  2. Avoid hashing data

The choice really depends on how often you plan to repeat the same computation on the same data that comes from the same numpy array. If you only ever call a.chunk(...) once per array then there is no reason to hash.
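The trade-off can be sketched with the standard library alone. This is an illustration under assumptions, not what da.from_array does internally: `name_by_hash` and `name_random` are hypothetical helpers, and the choice of hash function here is arbitrary. Hashing reads every byte of the data but yields the same name each time; a random name costs almost nothing but is different on every call.

```python
import hashlib
import uuid


def name_by_hash(buf: bytes) -> str:
    """Deterministic name: must read every byte of `buf` (cost grows with size)."""
    return "array-" + hashlib.blake2b(buf, digest_size=16).hexdigest()


def name_random() -> str:
    """Random name: constant cost, but never repeats for the same data."""
    return "array-" + uuid.uuid4().hex


data = bytes(10_000_000)  # 10 MB of zeros standing in for array data

# Same data, same name: repeated work can be detected.
hashed_names_match = name_by_hash(data) == name_by_hash(data)

# Same data, different names: no hashing cost, but no deduplication either.
random_names_match = name_random() == name_random()

print(hashed_names_match, random_names_match)
```

If each array is only chunked once, the deterministic name is never compared against anything, so the time spent hashing buys nothing.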

@stale

stale bot commented Jul 25, 2019

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

@stale stale bot added the stale label Jul 25, 2019
@jhamman
Member

jhamman commented Jul 25, 2019

I think we should leave this open as we never came to any consensus on the best path forward.

@stale stale bot removed the stale label Jul 25, 2019
@stale

stale bot commented Jul 13, 2021

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

@stale stale bot added the stale label Jul 13, 2021
dcherian added a commit to dcherian/xarray that referenced this issue Dec 16, 2024