
[DISCUSSION] Consider increasing default host memory limit per dask-cuda-worker #169

Open · randerzander opened this issue Nov 6, 2019 · 9 comments
Labels: inactive-30d, inactive-90d, proposal (a code change suggestion to be considered or discussed)

Comments

@randerzander
Contributor

Several users have reported problems where dask-cuda-worker processes die in unexpected ways. After some debugging they find it's due to exceeding host memory limits, particularly when loading large training sets into GPU memory.

This is surprising for users, as it's not clear when or how a significant amount of host memory might be used, especially considering RAPIDS projects are focused on running as much as possible on GPUs.
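For reference, a minimal sketch of overriding the defaults explicitly, assuming the `memory_limit` and `device_memory_limit` options exposed by `LocalCUDACluster` (the equivalent `dask-cuda-worker` CLI flags should behave similarly); the values shown are illustrative placeholders, not recommendations:

```python
# Hypothetical example: explicitly setting per-worker memory limits instead of
# relying on the defaults discussed in this issue. Values are placeholders.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(
    memory_limit="64GB",          # host (system) memory limit per worker
    device_memory_limit="30GB",   # GPU memory threshold before spilling to host
)
client = Client(cluster)
```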

@pentschev
Member

@randerzander we started the conversation about this offline; could you add a bit more context and examples of when things fail for you? Also, based on your experience, how did you set up the memory limit that generally worked?

cc @mrocklin for visibility

@mrocklin
Contributor

cc @quasiben for visibility

@pentschev
Member

@randerzander @beckernick @VibhuJawa is this still relevant? Is there any additional information you could share on what better defaults would look like?

@jakirkham
Member

Friendly nudge @randerzander @beckernick @VibhuJawa 😉

@beckernick
Member

Thanks for the bump, John. Anecdotally, we find that the most effective setup is to set the host memory limit to the maximum available system memory (`free -m | awk '/^Mem:/{print $2}'`). I'm interested to hear if folks think the system maximum is too high for a default.
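For context, a sketch of the setup described above, assuming `psutil` is available to query total host RAM (the programmatic equivalent of the `free -m` snippet); this mirrors the setup that worked in that workflow, not a proposed default:

```python
# Sketch: set the per-worker host memory limit to the machine's total RAM.
import psutil
from dask_cuda import LocalCUDACluster

total_host_bytes = psutil.virtual_memory().total  # the figure `free` reports as Mem total

cluster = LocalCUDACluster(memory_limit=total_host_bytes)
```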

@pentschev
Member

IMO, this would be too dangerous for a default. It seems this was the best setup for TPCx-BB, which was running in an exclusive environment, but that won't be the case for every dask-cuda user. For instance, running such a setup on a desktop shared with other running applications may render the system very unstable once main memory fills up completely.

@beckernick
Member

IMO, this would be too dangerous for a default. ...

I agree with Peter here. What's most effective for a given workflow doesn't necessarily translate to what's most effective for a default. A quick thought, though:

Naively, I'd expect Dask to start spilling at 60/70% of host memory capacity, and then terminate at 95%. This feels to me like a good default for termination. We've made a lot of changes since last November. Is exceeding host memory while reading large files still as big of an issue? Is it possible this was related to spilling issues rather than host memory capacity issues?
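For reference, the fractions above map onto Dask's worker memory management settings; a sketch assuming the `distributed.worker.memory.*` configuration keys (defaults around 0.6/0.7/0.8/0.95 of the worker's memory limit, which may vary across distributed versions):

```python
# Illustrative override of the spill/pause/terminate thresholds discussed above.
import dask

dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively
    "distributed.worker.memory.pause": 0.80,      # pause accepting new tasks
    "distributed.worker.memory.terminate": 0.95,  # nanny terminates the worker
})
```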

@github-actions

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
