Avoid in-memory broadcasting when converting to_dask_dataframe #7472

Merged: 8 commits merged into pydata:main on Jan 26, 2023

Conversation

@Illviljan Illviljan commented Jan 24, 2023

Turns out that there's a call to .set_dims that forces a broadcast on the numpy coordinates.

Debugging script:

import dask.array as da
import xarray as xr
import numpy as np

chunks = 5000

# I have to restart the PC if running with these sizes:
# dim1_sz = 100_000
# dim2_sz = 100_000

# Does not crash with the following sizes, but RAM usage still increases by >5 GB:
dim1_sz = 40_000
dim2_sz = 40_000

x = da.random.random((dim1_sz, dim2_sz), chunks=chunks)

ds = xr.Dataset(
    {
        "x": xr.DataArray(
            data=x,
            dims=["dim1", "dim2"],
            coords={"dim1": np.arange(0, dim1_sz), "dim2": np.arange(0, dim2_sz)},
        )
    }
)

# with dask.config.set(**{"array.slicing.split_large_chunks": True}):
df = ds.to_dask_dataframe()
print(df)
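
Not the exact change made in this PR, but a minimal sketch (using xarray's Variable API, with small sizes so it runs quickly) of where the eager broadcast comes from and how making the coordinate dask-backed before set_dims keeps the expansion lazy:

import numpy as np
import xarray as xr

n = 1_000  # kept small here; the repro above uses 40_000 x 40_000

coord = xr.Variable("dim1", np.arange(n))

# Broadcasting the numpy coordinate against the other dimension first means
# every later step (chunk, reshape) works on a full n x n numpy source:
eager = coord.set_dims({"dim1": n, "dim2": n}).chunk(500)
print(type(eager.data))  # dask array wrapping an in-memory n x n source

# Converting the coordinate to dask before broadcasting keeps it lazy,
# so no n x n block has to exist in memory up front:
lazy = coord.chunk(500).set_dims({"dim1": n, "dim2": n})
print(type(lazy.data), lazy.data.chunks)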

@Illviljan Illviljan changed the title from "Avoid in-memory broadcasting when converting to dask_dataframe" to "Avoid in-memory broadcasting when converting to_dask_dataframe" on Jan 24, 2023
dask_array = var.set_dims(ordered_dims).chunk(self.chunks).data
series = dd.from_array(dask_array.reshape(-1), columns=[name])
dask_array_raveled = ravel_chunks(dask_array)
Contributor:
Suggested change (remove this line):
dask_array_raveled = ravel_chunks(dask_array)

Unfortunately we can't do this, at least not by default.

We could ask dask to add this behaviour as an opt-in kwarg for dask.dataframe.from_array

Illviljan (PR author):

How come?

If we go back to using .reshape(-1) or .ravel() we will continue getting this warning:

PerformanceWarning: Reshaping is producing a large chunk. To accept the large
chunk and silence this warning, set the option
    with dask.config.set(**{'array.slicing.split_large_chunks': False}):
        array.reshape(shape)
To avoid creating the large chunks, set the option
    with dask.config.set(**{'array.slicing.split_large_chunks': True}):
        array.reshape(shape)
Explicitly passing ``limit`` to ``reshape`` will also silence this warning
    array.reshape(shape, limit='128 MiB')

Contributor:

reshape/ravel have an implied order. With this change the ordering of rows in the output dataframe depends on the chunking of the input array, which would be confusing as default behaviour

I think the warning is fine. Users can override with the dask context manager as suggested in the warning.
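
A small illustration (not from the PR) of why the order would differ: reshape(-1) always gives C order, while raveling block by block and concatenating follows the chunk layout:

import dask.array as da

# A 3 x 4 array split into two column blocks.
x = da.arange(12).reshape(3, 4).rechunk((3, 2))

# C-order ravel: element order does not depend on the chunk layout.
print(x.reshape(-1).compute())
# [ 0  1  2  3  4  5  6  7  8  9 10 11]

# Raveling each block and concatenating (a chunk-wise ravel) emits all of the
# first column block before the second, so the order now depends on the chunks.
blocks = [
    x.blocks[i, j]
    for i in range(x.numblocks[0])
    for j in range(x.numblocks[1])
]
print(da.concatenate([b.reshape(-1) for b in blocks]).compute())
# [ 0  1  4  5  8  9  2  3  6  7 10 11]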

Illviljan (PR author):

OK, I'll undo it; it wasn't necessary for the real fix anyway.

For comparison, here's the output of df.visualize():

With reshape:
(task graph image)

With reshape under the context {'array.slicing.split_large_chunks': True}:
(task graph image)

With ravel_chunks:
(task graph image)
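
(Graphs like these can be produced with dask's built-in visualize method, assuming graphviz is installed, continuing from the debugging script above:)

df = ds.to_dask_dataframe()
df.visualize(filename="to_dask_dataframe.png")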

Contributor:

Yup, it's a big improvement. If you're dying to add it someplace, polyfit would be a good candidate (and a very impactful PR).

@Illviljan Illviljan added the run-benchmark Run the ASV benchmark workflow label Jan 24, 2023
Review thread on xarray/core/dataset.py (outdated, resolved)
Illviljan commented Jan 24, 2023

I like these kinds of improvements :)

With ravel_chunks:

       before           after         ratio
     [3ee7b5a6]       [e549724e]
-            983M             183M     0.19  pandas.ToDataFrameDask.peakmem_to_dataframe
-         2.76±0s      7.76±0.08ms     0.00  pandas.ToDataFrameDask.time_to_dataframe

With reshape:

        before           after         ratio
     [3ee7b5a6]       [02a4e97f]
-            983M             183M     0.19  pandas.ToDataFrameDask.peakmem_to_dataframe
-         2.78±0s       9.20±0.1ms     0.00  pandas.ToDataFrameDask.time_to_dataframe

dcherian (Contributor) left a comment:

Thanks. Great PR!

@dcherian dcherian added the plan to merge Final call for comments label Jan 24, 2023
@dcherian dcherian merged commit d385e20 into pydata:main Jan 26, 2023
Labels: plan to merge (Final call for comments), run-benchmark (Run the ASV benchmark workflow)

Linked issue (may be closed by this pull request): Notebook crashes after calling .to_dask_dataframe

2 participants