
Make nvml access safer & more compatible across devices #2790

Closed
wants to merge 1 commit

Conversation

@joeyballentine (Member) commented Apr 12, 2024

To prevent stuff like this when someone has old drivers or semi-unsupported GPUs: https://discord.com/channels/930865462852591648/930875396277280788/1228445525540339842

Since these bindings link against the driver, old drivers may or may not have any of these functions, so we can't be sure about any of them.

Yes, it's ugly. But it prevents crashing.

@RunDevelopment (Member)

Oh yeah, absolutely not. You're just making up garbage data.

That user even gave you the exact function that failed, so why are you even try-catching anything that is called before that function? All you needed was a wrapper around nv.nvmlDeviceGetArchitecture to return nv.NVML_DEVICE_ARCH_UNKNOWN and maybe something for get_current_vram_usage and supports_fp16(self, ...). That would have been enough.
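
A minimal sketch of such a wrapper, assuming nv is the pynvml module (safe_get_architecture is a hypothetical name, not code from the PR):

import pynvml as nv

def safe_get_architecture(handle) -> int:
    # Fall back to the UNKNOWN sentinel whenever nvml raises, e.g. because
    # the driver's library doesn't export the underlying C function.
    try:
        return nv.nvmlDeviceGetArchitecture(handle)
    except nv.NVMLError:
        return nv.NVML_DEVICE_ARCH_UNKNOWN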

@joeyballentine (Member, Author)

I just wanted to be extra safe. Imagine if Nvidia changed all their functions tomorrow in a driver update. Suddenly nobody would be able to use the PyTorch nodes. None of these functions are critical to the usability of PyTorch, and yet if they don't work, our PyTorch nodes won't either.

@RunDevelopment (Member)

> Imagine if Nvidia changed all their functions tomorrow in a driver update.

That's an unreasonable expectation. Nvml is an Nvidia library. If Nvidia does change their drivers, it's their job to make sure nvml keeps working, not ours.

> None of these functions are critical to the usability of PyTorch, and yet if they don't work, our PyTorch nodes won't either.

My problem isn't that you catch errors, but how. You essentially moved the error handling logic as far down the call stack as you could. The result is that you added 10 catches, and the PyTorch nodes still won't show up if there's a bug elsewhere in NvidiaHelper.

Move the error handling up. All you had to do to make the PyTorch nodes work was this:

# backend\src\packages\chaiNNer_pytorch\settings.py

# Determine fp16 support once, up front; any failure inside the helper
# just leaves fp16 disabled instead of crashing the whole package.
should_fp16 = False
try:
    nv = get_nvidia_helper()
    if nv is not None:
        should_fp16 = nv.supports_fp16()
except Exception as e:
    logger.warning(e)
if is_arm_mac:
    should_fp16 = True

And in addition to that, I would still use nv.NVML_DEVICE_ARCH_UNKNOWN as the default value for nv.nvmlDeviceGetArchitecture in case it fails, because this default value makes sense.
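
For illustration, the reviewed call site below could then drop the getattr check and just use such a wrapper (reusing the hypothetical safe_get_architecture sketched above):

arch=safe_get_architecture(handle),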

@joeyballentine changed the title from "Wrap nvml calls with try/excepts" to "Make nvml access safer & more compatible across devices" on Apr 13, 2024
Comment on lines +66 to +68
arch=nv.nvmlDeviceGetArchitecture(handle)
if getattr(nv, "nvmlDeviceGetArchitecture", None) is not None
else nv.NVML_DEVICE_ARCH_UNKNOWN,
@RunDevelopment (Member)

This doesn't fix the error the user had. The error was

chaiNNer\python\python\Lib\site-packages\pynvml\nvml.py", line 4163, in nvmlDeviceGetArchitecture

pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found

The error occurs inside nvmlDeviceGetArchitecture and here's the code of that function:

def nvmlDeviceGetArchitecture(device):
    arch = _nvmlDeviceArchitecture_t()
    # this lookup of the C symbol is what raises NVMLError_FunctionNotFound
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetArchitecture")
    ret = fn(device, byref(arch))
    _nvmlCheckReturn(ret)
    return arch.value

The problem isn't that the Python function nvmlDeviceGetArchitecture doesn't exist, but that the C API can't find the function nvmlDeviceGetArchitecture in whatever driver DLL nvml loads.
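
To illustrate the difference (a sketch, not code from the PR; assumes pynvml is installed and an Nvidia GPU is present):

import pynvml as nv

nv.nvmlInit()
handle = nv.nvmlDeviceGetHandleByIndex(0)

# The Python-level attribute exists no matter how old the driver is,
# so a getattr/hasattr check can never detect this failure.
assert hasattr(nv, "nvmlDeviceGetArchitecture")

# The failure only surfaces when the call tries to resolve the C symbol,
# so it has to be caught around the call itself.
try:
    arch = nv.nvmlDeviceGetArchitecture(handle)
except nv.NVMLError_FunctionNotFound:
    arch = nv.NVML_DEVICE_ARCH_UNKNOWN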

@joeyballentine (Member, Author)

oh

@joeyballentine (Member, Author)

@RunDevelopment would you mind just doing this the way you want it? I'm going to close this PR.
