Reduce dask tokenization time #8339
Conversation
This is a great catch @martindurant! I've seen the tokenization slowness as well, and this will be a great improvement! Hmm, I was expecting this benchmark to get better: xarray/asv_bench/benchmarks/dataset_io.py, line 527 in 1e8f618.
Something to ponder for another day.
Done.
It would have an effect if `n_chunk * n_variables` is large (and only on the dask path).
Well, this benchmark didn't change: xarray/asv_bench/benchmarks/dataset.py, lines 23 to 32 in 8dddbca.
Can you propose an improvement?
That's for rechunk rather than open? I'm not sure it calls this code.
Hmmm, I think I missed it earlier; now I see
and
which is under the 50% reporting threshold we set, but still great.
When using dask (e.g., `chunks={}` with a zarr dataset), each dask array gets a token. Calculating this token currently hits a recursive path within dask and is relatively slow (~10 ms), which adds up when there are many variables. This PR generates a simpler but still unique token. An example profile of `open_dataset` before:

and after:
