
Make nvml access safer & more compatible across devices #2790

Closed
wants to merge 1 commit

Conversation

@joeyballentine (Member) commented Apr 12, 2024

To prevent stuff like this when someone has old drivers or semi-unsupported GPUs: https://discord.com/channels/930865462852591648/930875396277280788/1228445525540339842

Since these bindings link against the driver, old drivers may or may not have any of these functions, so we can't be sure about any of them.

Yes, it's ugly. But it prevents crashing.

@RunDevelopment (Member)

Oh yeah, absolutely not. You're just making up garbage data.

That user even gave you the exact function that failed, so why are you even try-catching anything that is called before that function? All you needed was a wrapper around nv.nvmlDeviceGetArchitecture to return nv.NVML_DEVICE_ARCH_UNKNOWN and maybe something for get_current_vram_usage and supports_fp16(self, ...). That would have been enough.
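
A minimal sketch of such a wrapper, assuming nv is the pynvml module (safe_get_architecture is a hypothetical name, not code from the PR):

import pynvml as nv

def safe_get_architecture(handle) -> int:
    # Fall back to the UNKNOWN sentinel whenever nvml raises, e.g. because
    # the driver's library doesn't export the underlying C function.
    try:
        return nv.nvmlDeviceGetArchitecture(handle)
    except nv.NVMLError:
        return nv.NVML_DEVICE_ARCH_UNKNOWN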

@joeyballentine (Member, Author)

I just wanted to be extra safe. Imagine if Nvidia changed all their functions tomorrow in a driver update. Suddenly nobody would be able to use the PyTorch nodes. None of these functions are critical to the usability of PyTorch, and yet if they don't work, our PyTorch nodes won't either.

@RunDevelopment (Member)

> Imagine if Nvidia changed all their functions tomorrow in a driver update.

That's an unreasonable expectation. Nvml is an Nvidia library. If Nvidia does change their drivers, it's their job to make sure nvml keeps working, not ours.

> None of these functions are critical to the usability of PyTorch, and yet if they don't work, our PyTorch nodes won't either.

My problem isn't that you catch errors, but how. You essentially moved the error handling logic as far down the call stack as you could. The result is that you added 10 catches, and the PyTorch nodes still won't show up if there's a bug elsewhere in NvidiaHelper.

Move the error handling up. All you had to do to make the PyTorch nodes work was this:

# backend\src\packages\chaiNNer_pytorch\settings.py

# Determine fp16 support once, up front; any failure inside the helper
# just leaves fp16 disabled instead of crashing the whole package.
should_fp16 = False
try:
    nv = get_nvidia_helper()
    if nv is not None:
        should_fp16 = nv.supports_fp16()
except Exception as e:
    logger.warning(e)
if is_arm_mac:
    should_fp16 = True

And in addition to that, I would still use nv.NVML_DEVICE_ARCH_UNKNOWN as the default value for nv.nvmlDeviceGetArchitecture in case it fails, because this default value makes sense.
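
For illustration, the reviewed call site below could then drop the getattr check and just use such a wrapper (reusing the hypothetical safe_get_architecture sketched above):

arch=safe_get_architecture(handle),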

@joeyballentine changed the title from "Wrap nvml calls with try/excepts" to "Make nvml access safer & more compatible across devices" on Apr 13, 2024
Comment on lines +66 to +68
arch=nv.nvmlDeviceGetArchitecture(handle)
if getattr(nv, "nvmlDeviceGetArchitecture", None) is not None
else nv.NVML_DEVICE_ARCH_UNKNOWN,
@RunDevelopment (Member)

This doesn't fix the error the user had. The error was

chaiNNer\python\python\Lib\site-packages\pynvml\nvml.py", line 4163, in nvmlDeviceGetArchitecture

pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found

The error occurs inside nvmlDeviceGetArchitecture and here's the code of that function:

def nvmlDeviceGetArchitecture(device):
    arch = _nvmlDeviceArchitecture_t()
    # this lookup of the C symbol is what raises NVMLError_FunctionNotFound
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetArchitecture")
    ret = fn(device, byref(arch))
    _nvmlCheckReturn(ret)
    return arch.value

The problem isn't that the Python function nvmlDeviceGetArchitecture doesn't exist, but that the C API can't find the function nvmlDeviceGetArchitecture in whatever driver DLL nvml loads.
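
To illustrate the difference (a sketch, not code from the PR; assumes pynvml is installed and an Nvidia GPU is present):

import pynvml as nv

nv.nvmlInit()
handle = nv.nvmlDeviceGetHandleByIndex(0)

# The Python-level attribute exists no matter how old the driver is,
# so a getattr/hasattr check can never detect this failure.
assert hasattr(nv, "nvmlDeviceGetArchitecture")

# The failure only surfaces when the call tries to resolve the C symbol,
# so it has to be caught around the call itself.
try:
    arch = nv.nvmlDeviceGetArchitecture(handle)
except nv.NVMLError_FunctionNotFound:
    arch = nv.NVML_DEVICE_ARCH_UNKNOWN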

@joeyballentine (Member, Author)

oh

@joeyballentine (Member, Author)

@RunDevelopment would you mind just doing this the way you want it? I'm going to close this PR.
