[DOC] Replace OutOfMemory exception with UnsupportedGPU exception #4620

Closed
lmeyerov opened this issue Mar 20, 2020 · 10 comments · Fixed by #4692
Labels
bug (Something isn't working) · Python (Affects Python cuDF API)

Comments

@lmeyerov

Describe the bug

We end up getting deployed in scenarios where cudf, on initialization, throws an RMM out-of-memory exception during load, when in reality it is rejecting the available hardware.

This typically hits our new-to-GPU users, and even advanced ones after a config mistake. E.g., on Azure, the default-available GPUs are K80s (until users jump through quota hoops), so the typical Azure first-use experience is to spin up and get this misleading error. It's a tough and confusing experience for most people until they've been burnt enough times.

We do all sorts of things to steer users toward the right environment beforehand, but invariably, mistakes will happen, even for advanced users (misconfiguration, ...).

Not sure if this is better in cudf or rmm.

Steps/Code to reproduce bug

On an old but still popular GPU like the K80, run:

```python
import cudf
cudf.DataFrame({'x': [1]})
```

And/or on other common init-time next steps, like set_alloc.

Expected behavior

Fail with UnsupportedDeviceError or something similarly indicative
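
For illustration only, the desired failure mode might look like the snippet below. The exception name and message are hypothetical; the compute-capability figures (K80 = 3.7, cuDF's floor = 6.0/Pascal) reflect RAPIDS' documented Pascal-or-newer requirement at the time.

```python
# Hypothetical desired behavior, not an existing cuDF API:
>>> import cudf  # on a Kepler-class GPU such as the K80 (compute capability 3.7)
Traceback (most recent call last):
  ...
UnsupportedDeviceError: Tesla K80 has compute capability 3.7; cuDF requires
a GPU with compute capability 6.0 (Pascal) or newer.
```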

Environment overview (please complete the following information)

Everywhere. We happen to hit it in Docker.

lmeyerov added the Needs Triage (Need team to review and classify) and bug (Something isn't working) labels on Mar 20, 2020
kkraus14 added the Python (Affects Python cuDF API) label and removed the Needs Triage (Need team to review and classify) label on Mar 20, 2020
@kkraus14
Collaborator

We have similarly poor error messages for old CUDA versions, old driver versions, etc., which we should handle in one fell swoop.

We should also strive to keep cudf importable on a machine with no GPU, for things like API enumeration and whatnot.

@harrism
Member

harrism commented Mar 20, 2020

@lmeyerov what version are you on?

@lmeyerov
Author

We were getting reports from 0.7 through 0.11. We're shipping 0.12 next week and will then switch internally to 0.13.

(The 0.13/0.14 switches should be faster, because the 0.11 & 0.12 upgrades involved adding 100-200 unit tests & further automation around how we use it.)

@lmeyerov
Author

Also, we always ship as Docker, and in cloud cases (but not on-prem) we get to control the host: Ubuntu 18 plus whatever the AWS/Azure NVIDIA drivers are at the time.

@lmeyerov
Author

If overhead on the check is a concern, another option that works fine for us as software devs is an explicit opt-in call.

E.g., something like a healthcheck() or validity() call. In OpenCL, you get back the set of valid devices & their specs, and can even pick which one you're using (=> cooperatively schedulable).

That wouldn't help direct cudf users like data scientists, though.
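
For illustration, a minimal sketch of such an opt-in check built on numba's device enumeration; the name healthcheck(), the returned fields, and the (6, 0) compute-capability floor are assumptions here, not an existing cuDF API:

```python
from numba import cuda

def healthcheck():
    """Enumerate visible GPUs, OpenCL-style, so callers can decide
    up front which devices are usable."""
    if not cuda.is_available():
        return []  # no CUDA driver / no devices: nothing to report
    return [
        {
            "id": dev.id,
            "name": dev.name.decode(),
            "compute_capability": dev.compute_capability,
            # Assumed floor: Pascal (6.0) and newer are supported.
            "supported": dev.compute_capability >= (6, 0),
        }
        for dev in cuda.gpus
    ]
```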

@jrhemstad
Contributor

jrhemstad commented Mar 20, 2020

This is pretty straightforward.

At import cudf: check the device's compute capability and raise a descriptive error if it is below what cuDF supports.

Does numba or cupy already wrap the appropriate APIs? Or would we need to do so in cuDF cython?

To support @kkraus14's point about being able to do import cudf on machines without GPUs, you can first call cudaGetDeviceCount() and only run the above check if the number of devices is greater than zero.
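
A minimal sketch of what that import-time check could look like via numba (which cuDF already depends on); the exception name and the Pascal (6.0) floor are assumptions here, not necessarily what the eventual fix in #4692 does:

```python
from numba import cuda

class UnsupportedGPUError(Exception):
    """Raised when the visible GPU is too old for cuDF."""

def _validate_gpu():
    # Keep plain `import cudf` working on GPU-less machines
    # (API enumeration etc.): skip the check when no device is visible.
    if not cuda.is_available():
        return
    dev = cuda.get_current_device()
    major, minor = dev.compute_capability
    if (major, minor) < (6, 0):  # assumed minimum: Pascal
        raise UnsupportedGPUError(
            f"{dev.name.decode()} has compute capability {major}.{minor}; "
            "cuDF requires compute capability 6.0 or newer."
        )
```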

@kkraus14
Collaborator

kkraus14 commented Mar 20, 2020

> If overhead on the check is a concern, another option that works fine for us as software devs is an explicit opt-in call.

Overhead on the check should be pretty low so I'm not too concerned.

> We were getting reports from 0.7 through 0.11. We're shipping 0.12 next week and will then switch internally to 0.13.

Note that 0.12 has a bunch of non-trivial memory overhead for Strings, so you may just want to go straight from 0.11 to 0.13, which has that overhead completely removed plus additional memory usage improvements.

@lmeyerov
Author

lmeyerov commented Mar 21, 2020

We're stuck near-term on 0.12 b/c blazing isn't blessed for 0.13 afaict: https://anaconda.org/blazingsql/blazingsql

But yeah, the 0.12/0.13/0.14 upgrades seem to be battling overhead & memory issues we're seeing, so def excited!

@lmeyerov
Author

@jrhemstad Sanity check regarding not importing: will submodule imports (from cudf.io.parquet ...) still trigger running cudf/__init__.py? I'm not up on Python module semantics, so I'm not sure whether putting the check at module import time will still allow GPU-less module reflection.

@kkraus14
Collaborator

> @jrhemstad Sanity check regarding not importing: will submodule imports (from cudf.io.parquet ...) still trigger running cudf/__init__.py? I'm not up on Python module semantics, so I'm not sure whether putting the check at module import time will still allow GPU-less module reflection.

Yes it still will: https://docs.python.org/3/reference/import.html#regular-packages
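
A quick way to see those semantics in action, using a hypothetical package layout: any import of pkg.io.parquet runs every parent __init__.py first, so a check placed in cudf/__init__.py would also cover submodule imports.

```python
# pkg/__init__.py
print("pkg/__init__.py executed")

# pkg/io/__init__.py
print("pkg/io/__init__.py executed")

# pkg/io/parquet.py
print("pkg/io/parquet.py executed")

# Then, in a fresh interpreter:
#   >>> from pkg.io import parquet
#   pkg/__init__.py executed
#   pkg/io/__init__.py executed
#   pkg/io/parquet.py executed
```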
