Reduce dask tokenization time #8339

Merged 4 commits on Oct 20, 2023

Conversation

martindurant
Contributor

When using dask (e.g., chunks={} with a zarr dataset), each dask.array gets a token. Calculating this token currently hits a recursive path within dask and is relatively slow (~10 ms), which adds up across many variables. This PR constructs a simpler, but still unique, token.
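
A minimal sketch of the idea, assuming the goal is to avoid dask.base.tokenize walking the underlying array; this is not the code in this PR, and cheap_dask_name is a hypothetical helper:

    import uuid

    import dask.array as da
    import numpy as np

    def cheap_dask_name(prefix: str) -> str:
        # Hypothetical helper: build a unique graph-key prefix without hashing
        # the underlying array via dask.base.tokenize (the slow recursive path).
        return f"{prefix}-{uuid.uuid4().hex}"

    arr = np.ones((1000, 1000))
    # Supplying name= means dask does not tokenize `arr` to derive the key.
    chunked = da.from_array(arr, chunks=(100, 100), name=cheap_dask_name("var"))

The trade-off with a purely random name is that identical inputs no longer share a graph key; the PR's token only needs to stay unique per variable.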

An example profile of open_dataset before:
[Screenshot: profile of open_dataset before this change, 2023-10-19]

and after:
[Screenshot: profile of open_dataset after this change, 2023-10-19]
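
For reference, a rough way to reproduce such a profile; the store path "store.zarr" and the zarr engine are assumptions matching the chunks={} example above:

    import cProfile

    import xarray as xr

    # Profile opening a zarr store lazily; chunks={} turns every variable into
    # a dask array, so each one pays the tokenization cost once.
    cProfile.run(
        'xr.open_dataset("store.zarr", engine="zarr", chunks={})',
        sort="cumtime",
    )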

@Illviljan
Contributor

This is a great catch, @martindurant! I've seen the tokenization slowness as well, and this will be a great improvement!

Hmm, I was expecting this benchmark to get better:

class IOReadCustomEngine:

Something to ponder for another day.

@Illviljan added the run-benchmark (Run the ASV benchmark workflow) label on Oct 19, 2023
@martindurant
Contributor Author

Can you add a comment

Done.

I was expecting this benchmark to get better:

It would have an effect if n_chunk * n_variables is large (and only on the dask path)
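
For illustration, a minimal sketch (not one of the existing ASV benchmarks) of a case where that product is large:

    import numpy as np
    import xarray as xr

    # 250 variables, each split into 1000 single-element chunks: the
    # per-variable tokenization cost is paid 250 times, so a cheaper token
    # should be visible here.
    ds = xr.Dataset({f"var{i}": ("x", np.ones(1000)) for i in range(250)})
    chunked = ds.chunk(x=1)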

@dcherian
Contributor

Well this benchmark didn't change.

import numpy as np

from xarray import Dataset

from . import requires_dask  # ASV benchmark helper that skips if dask is missing


class DatasetChunk:
    def setup(self):
        requires_dask()
        self.ds = Dataset()
        array = np.ones(1000)
        for i in range(250):
            self.ds[f"var{i}"] = ("x", array)

    def time_chunk(self):
        # 1000 single-element chunks along x for each of the 250 variables
        self.ds.chunk(x=(1,) * 1000)

Can you propose an improvement?

@martindurant
Contributor Author

Can you propose an improvement?

That's for rechunk rather than open? I'm not sure it calls this code.

@dcherian
Contributor

_maybe_chunk is shared between the two code paths AFAICT.

@dcherian
Contributor

dcherian commented Oct 20, 2023

Hmmm, I think I missed it earlier; now I see

[52.71%] ··· dataset.DatasetChunk.time_chunk                            269±1ms # HEAD
[77.71%] ··· dataset.DatasetChunk.time_chunk                            402±1ms # main

and

[52.71%] ··· dataset.DatasetChunk.time_chunk                            375±2ms # HEAD
[77.71%] ··· dataset.DatasetChunk.time_chunk                            539±4ms # main

which is under the 50% reporting threshold we set, but still great.
