Reduce dask tokenization time #8339

Merged 4 commits on Oct 20, 2023

Conversation

martindurant
Contributor

When using dask (e.g., chunks={} with a zarr dataset), each dask.array gets a token. Calculating this token currently hits a recursive path within dask and is relatively slow (~10 ms), which adds up across many variables. This PR constructs a simpler, but still unique, token.
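
A minimal sketch of the idea, assuming the goal is to avoid dask.base.tokenize walking the underlying array; this is not the code in this PR, and cheap_dask_name is a hypothetical helper:

    import uuid

    import dask.array as da
    import numpy as np

    def cheap_dask_name(prefix: str) -> str:
        # Hypothetical helper: build a unique graph-key prefix without hashing
        # the underlying array via dask.base.tokenize (the slow recursive path).
        return f"{prefix}-{uuid.uuid4().hex}"

    arr = np.ones((1000, 1000))
    # Supplying name= means dask does not tokenize `arr` to derive the key.
    chunked = da.from_array(arr, chunks=(100, 100), name=cheap_dask_name("var"))

The trade-off with a purely random name is that identical inputs no longer share a graph key; the PR's token only needs to stay unique per variable.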

An example profile of open_dataset before:
[Screenshot: profile of open_dataset before this change, 2023-10-19]

and after:
[Screenshot: profile of open_dataset after this change, 2023-10-19]
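
For reference, a rough way to reproduce such a profile; the store path "store.zarr" and the zarr engine are assumptions matching the chunks={} example above:

    import cProfile

    import xarray as xr

    # Profile opening a zarr store lazily; chunks={} turns every variable into
    # a dask array, so each one pays the tokenization cost once.
    cProfile.run(
        'xr.open_dataset("store.zarr", engine="zarr", chunks={})',
        sort="cumtime",
    )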

@Illviljan
Contributor

This is a great catch, @martindurant! I've seen the tokenization slowness as well, and this will be a great improvement!

Hmm, I was expecting this benchmark to get better:

class IOReadCustomEngine:

Something to ponder for another day.

@Illviljan added the run-benchmark (Run the ASV benchmark workflow) label on Oct 19, 2023
@martindurant
Contributor Author

Can you add a comment

Done.

I was expecting this benchmark to get better:

It would have an effect if n_chunk * n_variables is large (and only on the dask path)
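
For illustration, a minimal sketch (not one of the existing ASV benchmarks) of a case where that product is large:

    import numpy as np
    import xarray as xr

    # 250 variables, each split into 1000 single-element chunks: the
    # per-variable tokenization cost is paid 250 times, so a cheaper token
    # should be visible here.
    ds = xr.Dataset({f"var{i}": ("x", np.ones(1000)) for i in range(250)})
    chunked = ds.chunk(x=1)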

@dcherian
Contributor

Well this benchmark didn't change.

import numpy as np

from xarray import Dataset

from . import requires_dask  # ASV benchmark helper that skips if dask is missing


class DatasetChunk:
    def setup(self):
        requires_dask()
        self.ds = Dataset()
        array = np.ones(1000)
        for i in range(250):
            self.ds[f"var{i}"] = ("x", array)

    def time_chunk(self):
        # 1000 single-element chunks along x for each of the 250 variables
        self.ds.chunk(x=(1,) * 1000)

Can you propose an improvement?

@martindurant
Contributor Author

Can you propose an improvement?

That's for rechunk rather than open? I'm not sure it calls this code.

@dcherian
Contributor

_maybe_chunk is shared between the two code paths AFAICT.

@dcherian
Contributor

dcherian commented Oct 20, 2023

Hmmm, I think I missed it earlier; now I see

[52.71%] ··· dataset.DatasetChunk.time_chunk                            269±1ms # HEAD
[77.71%] ··· dataset.DatasetChunk.time_chunk                            402±1ms # main

and

[52.71%] ··· dataset.DatasetChunk.time_chunk                            375±2ms # HEAD
[77.71%] ··· dataset.DatasetChunk.time_chunk                            539±4ms # main

which is under the 50% reporting threshold we set, but still great.
