[DISCUSSION] Consider increasing default host memory limit per dask-cuda-worker #169
Comments
@randerzander we started the conversation about this offline, could you add a bit more context/examples of when things fail for you? Also, based on the experience you had, how did you set up the memory limit that generally worked? cc @mrocklin for visibility
cc @quasiben for visibility
@randerzander @beckernick @VibhuJawa is this still relevant? Is there some additional information you could share as to what better defaults would look like?
Friendly nudge @randerzander @beckernick @VibhuJawa 😉
Thanks for the bump, John. Anecdotally, we find that the most effective setup includes setting the host memory limit to the maximum available system memory (
IMO, this would be too dangerous as a default. It seems this was the best setup for TPCx-BB, which was running in an exclusive environment, but that won't be the case for every dask-cuda user; for instance, running such a setup on a desktop shared with other running applications may render the system very unstable due to main memory filling up completely.
I agree with Peter here. What's most effective for a given workflow doesn't necessarily translate to what's most effective as a default. A quick thought, though: naively, I'd expect Dask to start spilling at 60/70% of host memory capacity and then terminate at 95%. That feels to me like a good default for termination. We've made a lot of changes since last November. Is exceeding host memory while reading large files still as big of an issue? Is it possible this was related to spilling issues rather than host memory capacity issues?
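For reference, the 60/70/95% figures mentioned above correspond to Dask's worker memory management fractions. Below is a minimal sketch (not part of the original thread) of how those thresholds can be overridden via `dask.config` before starting workers; the values shown are the documented `dask.distributed` defaults, included only to illustrate where the knobs live.

```python
import dask

# Fractions of the per-worker host memory limit at which dask.distributed
# takes action. These are the library defaults, set explicitly here only
# to illustrate; adjust them before creating workers.
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # begin spilling managed data to disk
    "distributed.worker.memory.spill": 0.70,      # spill based on process memory usage
    "distributed.worker.memory.pause": 0.80,      # pause accepting new tasks
    "distributed.worker.memory.terminate": 0.95,  # the nanny terminates the worker
})

# Inspect the effective settings.
print(dask.config.get("distributed.worker.memory"))
```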
This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.
This issue has been labeled
Several users have reported problems where `dask-cuda-worker` processes die in unexpected ways. After some debugging, they find it's due to exceeding host memory limits, particularly when loading large training sets into GPU memory. This is surprising for users, as it's not clear when or how a significant amount of host memory might be used, especially considering that RAPIDS projects are focused on running as much as possible on GPUs.
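As a point of comparison for the discussion above, here is a hedged sketch of how a user can raise the per-worker host memory limit today when creating a cluster programmatically (the CLI equivalent is `dask-cuda-worker`'s `--memory-limit` option). The `48GB`/`16GB` figures are arbitrary illustrative values, not recommended defaults.

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# memory_limit is the per-worker *host* memory limit discussed in this issue;
# device_memory_limit caps GPU memory before dask-cuda spills device -> host.
# Both values are illustrative only.
cluster = LocalCUDACluster(
    memory_limit="48GB",
    device_memory_limit="16GB",
)
client = Client(cluster)
```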