[BUG] dask_cudf.repartition fails with OOM very often #178

Closed
randerzander opened this issue Nov 20, 2019 · 6 comments

@randerzander (Contributor) commented Nov 20, 2019

Repartitioning about 350 GB of CSV files (50 uncompressed files, each about 7 GB) causes my dask-cuda cluster of eight 16 GB GPUs to fail with an OOM.

I'm attempting to use npartitions=100 (3.5 GB/partition), which I wouldn't expect to tax individual workers to the point of causing OOMs.

Am I thinking about this correctly?
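
For context, the setup is roughly the following (a sketch only; the file glob and cluster options here are placeholders, not the exact job):

```python
# Sketch of the reported workflow: read ~350 GB of CSVs with dask_cudf's
# default chunked reader, then repartition to 100 partitions.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

cluster = LocalCUDACluster()  # 8 workers, one per 16 GB GPU in this report
client = Client(cluster)

# Hypothetical path; the default chunksize splits each ~7 GB file into chunks.
ddf = dask_cudf.read_csv("/data/csvs/*.csv")
ddf = ddf.repartition(npartitions=100)  # ~3.5 GB of on-disk data per partition
ddf = ddf.persist()  # the OOM reportedly surfaces once the graph executes
```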

@randerzander (Contributor, Author)

Counter-intuitively, setting chunksize=None on read_csv eliminates the problem for some smaller datasets.

Concatenating a few large DataFrames seems to be easier on GPU memory than performing many smaller concatenations (which is what happens with the default parallel CSV reader).

Even with chunksize=None I'm failing to repartition the table described above.
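
For reference, the chunksize=None variant looks roughly like this (the path is a placeholder); with chunksize=None each input CSV becomes a single partition, so the repartition step concatenates fewer, larger frames:

```python
import dask_cudf

# chunksize=None: one partition per input file (50 large partitions here),
# instead of many small chunks per file from the default chunked reader.
ddf = dask_cudf.read_csv("/data/csvs/*.csv", chunksize=None)
ddf = ddf.repartition(npartitions=100)
```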

@mrocklin (Contributor)

> 3.5 GB/partition

Is this the on-disk size or the in-memory size? These may differ considerably; CSV is actually a decently space-efficient format relative to common in-memory representations.
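
One way to compare the two is to measure the in-memory footprint per partition directly; a sketch, assuming the ddf from the snippets above:

```python
# Per-partition in-memory size in bytes (cuDF's memory_usage mirrors pandas');
# compare the largest partitions against the 16 GB per-GPU budget.
sizes = ddf.map_partitions(
    lambda part: part.memory_usage(index=True).sum()
).compute()
print(sizes.max(), sizes.mean())
```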

> Even with chunksize=None I'm failing to repartition the table described above.

My first step here would be to watch the dashboard to see what is going on.

@pentschev (Member)

I would also suggest setting device_memory_limit, if you haven't already, to allow spilling of Dask-managed memory from device to host. One more thing: Dask doesn't control any of the memory used by cuDF itself, so if cuDF turns out to be consuming too much memory in this situation, the only alternative we have is to configure RMM to use managed memory; an example of that can be seen in #57 (comment).
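
A sketch of what that setup could look like with dask_cuda's LocalCUDACluster; the values are illustrative, and rmm_managed_memory assumes a dask_cuda version that exposes that flag (otherwise RMM can be reconfigured on each worker as in the linked #57 comment):

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(
    device_memory_limit="12GB",  # spill from device to host above this threshold
    rmm_managed_memory=True,     # have RMM allocate CUDA managed (unified) memory
)
client = Client(cluster)
```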

@pentschev (Member)

Is this still relevant, @randerzander?

@jakirkham (Member)

Friendly nudge @randerzander 😉

@pentschev (Member)

Given that the TPCx-BB effort was successful, I'm assuming this has been resolved or improved. I'm tentatively closing this, but feel free to reopen if it's observed again.
