Avoid in-memory broadcasting when converting to_dask_dataframe #7472

Merged: 8 commits merged into pydata:main on Jan 26, 2023

Conversation

@Illviljan Illviljan commented Jan 24, 2023

Turns out that there's a call to .set_dims that forces a broadcast on the numpy coordinates.

Debugging script:

import dask.array as da
import xarray as xr
import numpy as np

chunks = 5000

# I have to restart the PC if running with these sizes:
# dim1_sz = 100_000
# dim2_sz = 100_000

# Does not crash with the following sizes, but RAM usage still increases by >5 GB:
dim1_sz = 40_000
dim2_sz = 40_000

x = da.random.random((dim1_sz, dim2_sz), chunks=chunks)

ds = xr.Dataset(
    {
        "x": xr.DataArray(
            data=x,
            dims=["dim1", "dim2"],
            coords={"dim1": np.arange(0, dim1_sz), "dim2": np.arange(0, dim2_sz)},
        )
    }
)

# with dask.config.set(**{"array.slicing.split_large_chunks": True}):
df = ds.to_dask_dataframe()
print(df)
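
Not the exact change made in this PR, but a minimal sketch (using xarray's Variable API, with small sizes so it runs quickly) of where the eager broadcast comes from and how making the coordinate dask-backed before set_dims keeps the expansion lazy:

import numpy as np
import xarray as xr

n = 1_000  # kept small here; the repro above uses 40_000 x 40_000

coord = xr.Variable("dim1", np.arange(n))

# Broadcasting the numpy coordinate against the other dimension first means
# every later step (chunk, reshape) works on a full n x n numpy source:
eager = coord.set_dims({"dim1": n, "dim2": n}).chunk(500)
print(type(eager.data))  # dask array wrapping an in-memory n x n source

# Converting the coordinate to dask before broadcasting keeps it lazy,
# so no n x n block has to exist in memory up front:
lazy = coord.chunk(500).set_dims({"dim1": n, "dim2": n})
print(type(lazy.data), lazy.data.chunks)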

@Illviljan Illviljan changed the title from "Avoid in-memory broadcasting when converting to dask_dataframe" to "Avoid in-memory broadcasting when converting to_dask_dataframe" on Jan 24, 2023
dask_array = var.set_dims(ordered_dims).chunk(self.chunks).data
series = dd.from_array(dask_array.reshape(-1), columns=[name])
dask_array_raveled = ravel_chunks(dask_array)
Contributor:
Suggested change (remove this line):
dask_array_raveled = ravel_chunks(dask_array)

Unfortunately we can't do this, at least not by default.

We could ask dask to add this behaviour as an opt-in kwarg for dask.dataframe.from_array

Illviljan (PR author):

How come?

If we go back to using .reshape(-1) or .ravel() we will continue getting this warning:

PerformanceWarning: Reshaping is producing a large chunk. To accept the large
chunk and silence this warning, set the option
    with dask.config.set(**{'array.slicing.split_large_chunks': False}):
        array.reshape(shape)
To avoid creating the large chunks, set the option
    with dask.config.set(**{'array.slicing.split_large_chunks': True}):
        array.reshape(shape)
Explicitly passing ``limit`` to ``reshape`` will also silence this warning
    array.reshape(shape, limit='128 MiB')

Contributor:

reshape/ravel have an implied order. With this change the ordering of rows in the output dataframe depends on the chunking of the input array, which would be confusing as default behaviour

I think the warning is fine. Users can override with the dask context manager as suggested in the warning.
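
A small illustration (not from the PR) of why the order would differ: reshape(-1) always gives C order, while raveling block by block and concatenating follows the chunk layout:

import dask.array as da

# A 3 x 4 array split into two column blocks.
x = da.arange(12).reshape(3, 4).rechunk((3, 2))

# C-order ravel: element order does not depend on the chunk layout.
print(x.reshape(-1).compute())
# [ 0  1  2  3  4  5  6  7  8  9 10 11]

# Raveling each block and concatenating (a chunk-wise ravel) emits all of the
# first column block before the second, so the order now depends on the chunks.
blocks = [
    x.blocks[i, j]
    for i in range(x.numblocks[0])
    for j in range(x.numblocks[1])
]
print(da.concatenate([b.reshape(-1) for b in blocks]).compute())
# [ 0  1  4  5  8  9  2  3  6  7 10 11]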

Illviljan (PR author):

OK, I'll undo it; it wasn't necessary for the real fix anyway.

For comparison, here's the output of df.visualize():

With reshape:
(task graph image)

With reshape under the context {'array.slicing.split_large_chunks': True}:
(task graph image)

With ravel_chunks:
(task graph image)
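
(Graphs like these can be produced with dask's built-in visualize method, assuming graphviz is installed, continuing from the debugging script above:)

df = ds.to_dask_dataframe()
df.visualize(filename="to_dask_dataframe.png")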

Contributor:

Yup, it's a big improvement. If you're dying to add it someplace, polyfit would be a good candidate (and a very impactful PR).

@Illviljan Illviljan added the run-benchmark Run the ASV benchmark workflow label Jan 24, 2023
Review thread on xarray/core/dataset.py (outdated, resolved)
Illviljan commented Jan 24, 2023

I like these kinds of improvements :)

With ravel_chunks:

       before           after         ratio
     [3ee7b5a6]       [e549724e]
-            983M             183M     0.19  pandas.ToDataFrameDask.peakmem_to_dataframe
-         2.76±0s      7.76±0.08ms     0.00  pandas.ToDataFrameDask.time_to_dataframe

With reshape:

        before           after         ratio
     [3ee7b5a6]       [02a4e97f]
-            983M             183M     0.19  pandas.ToDataFrameDask.peakmem_to_dataframe
-         2.78±0s       9.20±0.1ms     0.00  pandas.ToDataFrameDask.time_to_dataframe

dcherian (Contributor) left a comment:

Thanks. Great PR!

@dcherian dcherian added the plan to merge Final call for comments label Jan 24, 2023
@dcherian dcherian merged commit d385e20 into pydata:main Jan 26, 2023
Labels: plan to merge (Final call for comments), run-benchmark (Run the ASV benchmark workflow)

Linked issue (may be closed by this pull request): Notebook crashes after calling .to_dask_dataframe

2 participants