
[DISCUSSION] Consider increasing default host memory limit per dask-cuda-worker #169

Open · randerzander opened this issue Nov 6, 2019 · 9 comments
Labels: inactive-30d, inactive-90d, proposal (a code change suggestion to be considered or discussed)

Comments

@randerzander
Contributor

Several users have reported problems where dask-cuda-worker processes die in unexpected ways. After some debugging they find it's due to exceeding host memory limits, particularly when loading large training sets into GPU memory.

This is surprising for users, as it's not clear when or how a significant amount of host memory might be used, especially considering RAPIDS projects are focused on running as much as possible on GPUs.
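For reference, a minimal sketch of overriding the defaults explicitly, assuming the `memory_limit` and `device_memory_limit` options exposed by `LocalCUDACluster` (the equivalent `dask-cuda-worker` CLI flags should behave similarly); the values shown are illustrative placeholders, not recommendations:

```python
# Hypothetical example: explicitly setting per-worker memory limits instead of
# relying on the defaults discussed in this issue. Values are placeholders.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(
    memory_limit="64GB",          # host (system) memory limit per worker
    device_memory_limit="30GB",   # GPU memory threshold before spilling to host
)
client = Client(cluster)
```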

@pentschev
Member

@randerzander we started the conversation about this offline; could you add a bit more context and examples of when things fail for you? Also, based on your experience, how did you set up the memory limit that generally worked?

cc @mrocklin for visibility

@mrocklin
Contributor

cc @quasiben for visibility

@pentschev
Member

@randerzander @beckernick @VibhuJawa is this still relevant? Is there any additional information you could share on what better defaults would look like?

@jakirkham
Member

Friendly nudge @randerzander @beckernick @VibhuJawa 😉

@beckernick
Member

Thanks for the bump, John. Anecdotally, we find that the most effective setup is to set the host memory limit to the maximum available system memory (`free -m | awk '/^Mem:/{print $2}'`). I'm interested to hear if folks think the system maximum is too high for a default.
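For context, a sketch of the setup described above, assuming `psutil` is available to query total host RAM (the programmatic equivalent of the `free -m` snippet); this mirrors the setup that worked in that workflow, not a proposed default:

```python
# Sketch: set the per-worker host memory limit to the machine's total RAM.
import psutil
from dask_cuda import LocalCUDACluster

total_host_bytes = psutil.virtual_memory().total  # the figure `free` reports as Mem total

cluster = LocalCUDACluster(memory_limit=total_host_bytes)
```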

@pentschev
Member

IMO, this would be too dangerous for a default. It seems this was the best setup for TPCx-BB, which was running in an exclusive environment, but that won't be the case for every dask-cuda user. For instance, running such a setup on a desktop shared with other running applications may render the system very unstable once main memory fills up completely.

@beckernick
Member

IMO, this would be too dangerous for a default. ...

I agree with Peter here. What's most effective for a given workflow doesn't necessarily translate to what's most effective for a default. A quick thought, though:

Naively, I'd expect Dask to start spilling at 60/70% of host memory capacity, and then terminate at 95%. This feels to me like a good default for termination. We've made a lot of changes since last November. Is exceeding host memory while reading large files still as big of an issue? Is it possible this was related to spilling issues rather than host memory capacity issues?
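For reference, the fractions above map onto Dask's worker memory management settings; a sketch assuming the `distributed.worker.memory.*` configuration keys (defaults around 0.6/0.7/0.8/0.95 of the worker's memory limit, which may vary across distributed versions):

```python
# Illustrative override of the spill/pause/terminate thresholds discussed above.
import dask

dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively
    "distributed.worker.memory.pause": 0.80,      # pause accepting new tasks
    "distributed.worker.memory.terminate": 0.95,  # nanny terminates the worker
})
```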

@github-actions

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
