[DOC] Replace OutOfMemory exception with UnsupportedGPU exception #4620
Comments
We have similarly poor error messages for old CUDA versions, old driver versions, etc. that we should handle in one swoop. We should also strive to maintain allowing `import cudf` to work on a machine without a GPU.
@lmeyerov what version are you on?
We were getting reports from 0.7 through 0.11. We're shipping 0.12 next week and then switching internally to 0.13 (0.13/0.14 should be faster because the 0.11 and 0.12 upgrades involved 100-200 unit tests and further automation around how we use it).
Also, we always ship as Docker, and in cloud cases (but not on-prem) we get to control the host as Ubuntu 18 plus whatever the AWS/Azure NVIDIA drivers are at that time.
If overhead on the check is a concern, another option that is fine for us as software devs is an explicit opt-in call, e.g., something like a dedicated validation function. That wouldn't help direct cudf users like data scientists, though.
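A minimal sketch of what such an opt-in call could look like, assuming a Pascal-or-newer (compute capability 6.0) floor; the function and exception names here are hypothetical, not existing cudf API:

```python
# Hypothetical sketch only: neither validate_gpu() nor UnsupportedDeviceError
# exists in cudf today; the (6, 0) floor is an assumption (Pascal or newer).
from numba import cuda


class UnsupportedDeviceError(RuntimeError):
    """Raised when the attached GPU is too old to be supported."""


def validate_gpu(min_cc=(6, 0)):
    cc = cuda.get_current_device().compute_capability  # e.g. (3, 7) on a K80
    if cc < min_cc:
        raise UnsupportedDeviceError(
            f"GPU compute capability {cc[0]}.{cc[1]} is below the minimum "
            f"{min_cc[0]}.{min_cc[1]} required; allocations would otherwise "
            "fail later with a misleading out-of-memory error."
        )
```

An application could call this once up front and surface the clearer message before any allocation happens.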
This is pretty straightforward. At import time we can query the device's compute capability and raise an informative error if it is below the minimum cudf supports.
Does numba or cupy already wrap the appropriate APIs? Or would we need to do so in cuDF Cython? To support @kkraus14's comment of being able to do `import cudf` on a machine without a GPU, the check would also need to skip gracefully when no device is present.
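For reference, both numba and cupy already expose the device properties such a check would need, so no new Cython should be required; a quick sketch:

```python
# Both libraries expose compute capability without any extra Cython:
from numba import cuda
import cupy

cc_numba = cuda.get_current_device().compute_capability  # tuple, e.g. (3, 7) on a K80
props = cupy.cuda.runtime.getDeviceProperties(0)          # dict of raw device properties
cc_cupy = (props["major"], props["minor"])

print(cc_numba, cc_cupy)
```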
Overhead on the check should be pretty low so I'm not too concerned.
Note that 0.12 has a bunch of non-trivial memory overhead for strings, where you may just want to go from 0.11 to 0.13, which has that overhead completely removed plus additional memory usage improvements.
We're stuck near-term on 0.12. But yeah, the 0.12/0.13/0.14 upgrades seem to be battling the overhead and memory issues we're seeing, so definitely excited!
@jrhemstad Sanity check re: not importing: will doing submodule imports (e.g., importing a cudf submodule directly rather than the top-level package) still trigger a check placed in the top-level `__init__`?
Yes it still will: https://docs.python.org/3/reference/import.html#regular-packages |
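A throwaway illustration of that behavior (a scratch package, not cudf itself): importing a submodule still executes the parent package's `__init__.py`, so a check placed there would run either way.

```python
# Scratch demo: importing pkg.sub still runs pkg/__init__.py first.
import importlib
import pathlib
import sys

pkg = pathlib.Path("pkg")
pkg.mkdir(exist_ok=True)
(pkg / "__init__.py").write_text("print('pkg/__init__.py ran')\n")
(pkg / "sub.py").write_text("print('pkg.sub ran')\n")

sys.path.insert(0, str(pathlib.Path(".").resolve()))
importlib.import_module("pkg.sub")
# Prints:
#   pkg/__init__.py ran
#   pkg.sub ran
```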
Describe the bug
We end up getting deployed in scenarios where cudf, on initialization, throws an RMM out-of-memory exception during load, when in reality it is rejecting the available hardware.
This typically impacts our new-to-GPU users, and even advanced ones during a config mistake. E.g., on Azure, the default-available GPUs are K80s (until users jump through quota hoops), so the typical Azure first-use experience is to spin up and get this misleading error. It's quite a tough experience for most people until they've been burnt enough.
We end up doing all sorts of things to try to get users to pick the right env etc. beforehand, but invariably mistakes happen, even with advanced users (misconfig, ...).
Not sure if this is better in cudf or rmm.

Steps/Code to reproduce bug

`import cudf ; cudf.DataFrame({'x': [1]})` on an old popular GPU like the K80.

And/or on common other init next steps like `set_alloc`.
Expected behavior

Fail with `UnsupportedDeviceError` or something similarly indicative.

Environment overview (please complete the following information)
Everywhere. We happen to hit it in Docker.
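For concreteness, a hedged sketch of the expected behavior: if the check lived near the top of cudf's `__init__.py` (guarded so that import still succeeds on machines with no GPU at all, per the earlier comment), both `import cudf` and submodule imports would fail with the indicative error before any RMM allocation is attempted. The exception name and the 6.0 threshold are illustrative, not current cudf behavior.

```python
# Illustrative placement only, not current cudf code. UnsupportedDeviceError
# and the (6, 0) floor are assumptions for the sake of the example.
from numba import cuda


class UnsupportedDeviceError(RuntimeError):
    pass


if cuda.is_available():  # keep `import cudf` working on GPU-less machines
    _cc = cuda.get_current_device().compute_capability
    if _cc < (6, 0):
        raise UnsupportedDeviceError(
            f"A GPU with compute capability >= 6.0 is required; found "
            f"{_cc[0]}.{_cc[1]} (a K80 is 3.7)."
        )
# ... the package's normal imports would follow here ...
```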