-
-
Notifications
You must be signed in to change notification settings - Fork 287
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make nvml access safer & more compatible across devices #2790
Conversation
Oh yeah, absolutely not. You're just making up garbage data. That user even gave you the exact function that failed, so why are you even try-catching anything that is called before that function? All you needed was a wrapper around |
I just wanted to be extra safe. Imagine if Nvidia change all their functions tomorrow in a driver update. Suddenly nobody would be able to use the pytorch nodes. None of these functions are critical to the usability of pytorch and yet if they don't work, our pytorch nodes won't either. |
That's an unreasonable expectation. Nvml is an Nvidia library. If Nvidia does change their drivers, it's their job to make sure nvml keeps working, not ours.
My problem isn't that you catch errors, but how. You essentially moved the error handling logic as far down the call stack as you could. The result is that you added 10 catches and that PyTorch node still won't show up if there's a bug in Move the error handling up. All you had to do to make PyTorch nodes work as this: # backend\src\packages\chaiNNer_pytorch\settings.py
should_fp16 = False
try:
nv = get_nvidia_helper()
if nv is not None:
should_fp16 = nv.supports_fp16()
except Exception as e:
logger.warn(e)
if is_arm_mac:
should_fp16 = True And in addition to that, I would still use |
290655e
to
61ebd56
Compare
arch=nv.nvmlDeviceGetArchitecture(handle) | ||
if getattr(nv, "nvmlDeviceGetArchitecture", None) is not None | ||
else nv.NVML_DEVICE_ARCH_UNKNOWN, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't fix the error the user had. The error was
chaiNNer\python\python\Lib\site-packages\pynvml\nvml.py", line 4163, in nvmlDeviceGetArchitecture
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found
The error occurs inside nvmlDeviceGetArchitecture
and here's the code of that function:
def nvmlDeviceGetArchitecture(device):
arch = _nvmlDeviceArchitecture_t()
fn = _nvmlGetFunctionPointer("nvmlDeviceGetArchitecture")
ret = fn(device, byref(arch))
_nvmlCheckReturn(ret)
return arch.value
The problem isn't that the python function nvmlDeviceGetArchitecture
doesn't exist, but that the C API can't find the function nvmlDeviceGetArchitecture
is whatever DLL from the driver nvml loads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh
@RunDevelopment would you mind just doing this the way you want it? I'm going to close this PR. |
To prevent stuff like this if someone has old asss drivers or semi-unsupported GPUs: https://discord.com/channels/930865462852591648/930875396277280788/1228445525540339842
Since these bindings link to drivers, old drivers may or may not have any of these functions. so we can't be sure about any of them.
Yes, its ugly. But it prevents crashing.