[DOC] Replace OutOfMemory exception with UnsupportedGPU exception #4620

Closed
lmeyerov opened this issue Mar 20, 2020 · 10 comments · Fixed by #4692
Labels
bug (Something isn't working) · Python (Affects Python cuDF API)

Comments

@lmeyerov

Describe the bug

We end up getting deployed in scenarios where cudf, on initialization, throws an RMM out-of-memory exception during load, when in reality it is rejecting the available hardware.

This typically hits our new-to-GPU users, and even advanced ones after a config mistake. E.g., on Azure, the default-available GPUs are K80s (until users jump through quota hoops), so the typical Azure first-use experience is to spin up and get this misleading error. It's a tough and confusing experience for most people until they've been burnt enough times.

We do all sorts of things to steer users toward the right environment beforehand, but invariably, mistakes will happen, even for advanced users (misconfiguration, ...).

Not sure if this is better in cudf or rmm.

Steps/Code to reproduce bug

On an old but still popular GPU like the K80, run:

```python
import cudf
cudf.DataFrame({'x': [1]})
```

And/or on other common init-time next steps, like set_alloc.

Expected behavior

Fail with UnsupportedDeviceError or something similarly indicative
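
For illustration only, the desired failure mode might look like the snippet below. The exception name and message are hypothetical; the compute-capability figures (K80 = 3.7, cuDF's floor = 6.0/Pascal) reflect RAPIDS' documented Pascal-or-newer requirement at the time.

```python
# Hypothetical desired behavior, not an existing cuDF API:
>>> import cudf  # on a Kepler-class GPU such as the K80 (compute capability 3.7)
Traceback (most recent call last):
  ...
UnsupportedDeviceError: Tesla K80 has compute capability 3.7; cuDF requires
a GPU with compute capability 6.0 (Pascal) or newer.
```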

Environment overview (please complete the following information)

Everywhere. We happen to hit it in Docker.

lmeyerov added the Needs Triage (Need team to review and classify) and bug (Something isn't working) labels on Mar 20, 2020
kkraus14 added the Python (Affects Python cuDF API) label and removed the Needs Triage (Need team to review and classify) label on Mar 20, 2020
@kkraus14
Collaborator

We have similarly poor error messages for old CUDA versions, old driver versions, etc., which we should handle in one fell swoop.

We should also strive to keep cudf importable on a machine with no GPU, for things like API enumeration and whatnot.

@harrism
Member

harrism commented Mar 20, 2020

@lmeyerov what version are you on?

@lmeyerov
Author

We were getting reports from 0.7 through 0.11. We're shipping 0.12 next week and will then switch internally to 0.13.

(The 0.13/0.14 switches should be faster, because the 0.11 & 0.12 upgrades involved adding 100-200 unit tests & further automation around how we use it.)

@lmeyerov
Author

Also, we always ship as Docker, and in cloud cases (but not on-prem) we get to control the host: Ubuntu 18 plus whatever the AWS/Azure NVIDIA drivers are at the time.

@lmeyerov
Author

If overhead on the check is a concern, another option that works fine for us as software devs is an explicit opt-in call.

E.g., something like a healthcheck() or validity() call. In OpenCL, you get back the set of valid devices & their specs, and can even pick which one you're using (=> cooperatively schedulable).

That wouldn't help direct cudf users like data scientists, though.
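
For illustration, a minimal sketch of such an opt-in check built on numba's device enumeration; the name healthcheck(), the returned fields, and the (6, 0) compute-capability floor are assumptions here, not an existing cuDF API:

```python
from numba import cuda

def healthcheck():
    """Enumerate visible GPUs, OpenCL-style, so callers can decide
    up front which devices are usable."""
    if not cuda.is_available():
        return []  # no CUDA driver / no devices: nothing to report
    return [
        {
            "id": dev.id,
            "name": dev.name.decode(),
            "compute_capability": dev.compute_capability,
            # Assumed floor: Pascal (6.0) and newer are supported.
            "supported": dev.compute_capability >= (6, 0),
        }
        for dev in cuda.gpus
    ]
```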

@jrhemstad
Contributor

jrhemstad commented Mar 20, 2020

This is pretty straightforward.

At import cudf: check the device's compute capability and raise a descriptive error if it is below what cuDF supports.

Does numba or cupy already wrap the appropriate APIs? Or would we need to do so in cuDF cython?

To support @kkraus14's point about being able to do import cudf on machines without GPUs, you can first call cudaGetDeviceCount() and only run the above check if the number of devices is greater than zero.
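
A minimal sketch of what that import-time check could look like via numba (which cuDF already depends on); the exception name and the Pascal (6.0) floor are assumptions here, not necessarily what the eventual fix in #4692 does:

```python
from numba import cuda

class UnsupportedGPUError(Exception):
    """Raised when the visible GPU is too old for cuDF."""

def _validate_gpu():
    # Keep plain `import cudf` working on GPU-less machines
    # (API enumeration etc.): skip the check when no device is visible.
    if not cuda.is_available():
        return
    dev = cuda.get_current_device()
    major, minor = dev.compute_capability
    if (major, minor) < (6, 0):  # assumed minimum: Pascal
        raise UnsupportedGPUError(
            f"{dev.name.decode()} has compute capability {major}.{minor}; "
            "cuDF requires compute capability 6.0 or newer."
        )
```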

@kkraus14
Collaborator

kkraus14 commented Mar 20, 2020

> If overhead on the check is a concern, another option that works fine for us as software devs is an explicit opt-in call.

Overhead on the check should be pretty low so I'm not too concerned.

> We were getting reports from 0.7 through 0.11. We're shipping 0.12 next week and will then switch internally to 0.13.

Note that 0.12 has a bunch of non-trivial memory overhead for Strings, so you may just want to go straight from 0.11 to 0.13, which has that overhead completely removed plus additional memory usage improvements.

@lmeyerov
Author

lmeyerov commented Mar 21, 2020

We're stuck near-term on 0.12 b/c blazing isn't blessed for 0.13 afaict: https://anaconda.org/blazingsql/blazingsql

But yeah, the 0.12/0.13/0.14 upgrades seem to be battling overhead & memory issues we're seeing, so def excited!

@lmeyerov
Author

@jrhemstad Sanity check regarding not importing: will submodule imports (from cudf.io.parquet ...) still trigger running cudf/__init__.py? I'm not up on Python module semantics, so I'm not sure whether putting the check at module import time will still allow GPU-less module reflection.

@kkraus14
Collaborator

> @jrhemstad Sanity check regarding not importing: will submodule imports (from cudf.io.parquet ...) still trigger running cudf/__init__.py? I'm not up on Python module semantics, so I'm not sure whether putting the check at module import time will still allow GPU-less module reflection.

Yes it still will: https://docs.python.org/3/reference/import.html#regular-packages
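
A quick way to see those semantics in action, using a hypothetical package layout: any import of pkg.io.parquet runs every parent __init__.py first, so a check placed in cudf/__init__.py would also cover submodule imports.

```python
# pkg/__init__.py
print("pkg/__init__.py executed")

# pkg/io/__init__.py
print("pkg/io/__init__.py executed")

# pkg/io/parquet.py
print("pkg/io/parquet.py executed")

# Then, in a fresh interpreter:
#   >>> from pkg.io import parquet
#   pkg/__init__.py executed
#   pkg/io/__init__.py executed
#   pkg/io/parquet.py executed
```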
