[BUG] dask_cudf.repartition fails with OOM very often #178

Closed
randerzander opened this issue Nov 20, 2019 · 6 comments

@randerzander (Contributor) commented Nov 20, 2019

Repartitioning about 350 GB of CSV files (50 uncompressed files, each about 7 GB) causes my dask-cuda cluster of eight 16 GB GPUs to fail with an OOM.

I'm attempting to use npartitions=100 (3.5 GB/partition), which I wouldn't expect to tax individual workers to the point of causing OOMs.

Am I thinking about this correctly?
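
For context, the setup is roughly the following (a sketch only; the file glob and cluster options here are placeholders, not the exact job):

```python
# Sketch of the reported workflow: read ~350 GB of CSVs with dask_cudf's
# default chunked reader, then repartition to 100 partitions.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

cluster = LocalCUDACluster()  # 8 workers, one per 16 GB GPU in this report
client = Client(cluster)

# Hypothetical path; the default chunksize splits each ~7 GB file into chunks.
ddf = dask_cudf.read_csv("/data/csvs/*.csv")
ddf = ddf.repartition(npartitions=100)  # ~3.5 GB of on-disk data per partition
ddf = ddf.persist()  # the OOM reportedly surfaces once the graph executes
```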

@randerzander (Contributor, Author)

Counter-intuitively, setting chunksize=None on read_csv eliminates the problem for some smaller datasets.

Concatenating a few large DataFrames seems to be easier on GPU memory than performing many smaller concatenations (which is what happens with the default parallel CSV reader).

Even with chunksize=None I'm failing to repartition the table described above.
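
For reference, the chunksize=None variant looks roughly like this (the path is a placeholder); with chunksize=None each input CSV becomes a single partition, so the repartition step concatenates fewer, larger frames:

```python
import dask_cudf

# chunksize=None: one partition per input file (50 large partitions here),
# instead of many small chunks per file from the default chunked reader.
ddf = dask_cudf.read_csv("/data/csvs/*.csv", chunksize=None)
ddf = ddf.repartition(npartitions=100)
```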

@mrocklin (Contributor)

> 3.5 GB/partition

Is this the on-disk size or the in-memory size? These may differ considerably; CSV is actually a decently space-efficient format relative to common in-memory representations.
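
One way to compare the two is to measure the in-memory footprint per partition directly; a sketch, assuming the ddf from the snippets above:

```python
# Per-partition in-memory size in bytes (cuDF's memory_usage mirrors pandas');
# compare the largest partitions against the 16 GB per-GPU budget.
sizes = ddf.map_partitions(
    lambda part: part.memory_usage(index=True).sum()
).compute()
print(sizes.max(), sizes.mean())
```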

> Even with chunksize=None I'm failing to repartition the table described above.

My first step here would be to watch the dashboard to see what is going on.

@pentschev (Member)

I would also suggest setting device_memory_limit, if you haven't already, to allow spilling of Dask-managed memory from device to host. One more thing: Dask doesn't control any of the memory used by cuDF itself, so if cuDF turns out to be consuming too much memory in this situation, the only alternative we have is to configure RMM to use managed memory; an example of that can be seen in #57 (comment).
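
A sketch of what that setup could look like with dask_cuda's LocalCUDACluster; the values are illustrative, and rmm_managed_memory assumes a dask_cuda version that exposes that flag (otherwise RMM can be reconfigured on each worker as in the linked #57 comment):

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(
    device_memory_limit="12GB",  # spill from device to host above this threshold
    rmm_managed_memory=True,     # have RMM allocate CUDA managed (unified) memory
)
client = Client(cluster)
```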

@pentschev (Member)

Is this still relevant, @randerzander?

@jakirkham (Member)

Friendly nudge @randerzander 😉

@pentschev (Member)

Given that the TPCx-BB effort was successful, I'm assuming this has been resolved or improved. I'm tentatively closing this, but feel free to reopen if it's observed again.
