-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update PyNVML and set upper pin #1130
Update PyNVML and set upper pin #1130
Conversation
Handling the str vs. bytes discrepancy should have been covered by the changes in rapidsai#1118.
There were two pins in the PR below, but only one unpin in this PR. Should |
Oh thanks, I suspect so (pushed that change). Thanks for the sharp eyes! |
One CI job is failing with this error: Unable to start CUDA Context
Traceback (most recent call last):
File "/opt/conda/envs/test/lib/python3.8/site-packages/pynvml/nvml.py", line 850, in _nvmlGetFunctionPointer
_nvmlGetFunctionPointer_cache[name] = getattr(nvmlLib, name)
File "/opt/conda/envs/test/lib/python3.8/ctypes/__init__.py", line 386, in __getattr__
func = self.__getitem__(name)
File "/opt/conda/envs/test/lib/python3.8/ctypes/__init__.py", line 391, in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v3
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/envs/test/lib/python3.8/site-packages/dask_cuda/initialize.py", line 31, in _create_cuda_context
distributed.comm.ucx.init_once()
File "/opt/conda/envs/test/lib/python3.8/site-packages/distributed/comm/ucx.py", line 136, in init_once
pre_existing_cuda_context = has_cuda_context()
File "/opt/conda/envs/test/lib/python3.8/site-packages/distributed/diagnostics/nvml.py", line 219, in has_cuda_context
if _running_process_matches(handle):
File "/opt/conda/envs/test/lib/python3.8/site-packages/distributed/diagnostics/nvml.py", line 179, in _running_process_matches
running_processes = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
File "/opt/conda/envs/test/lib/python3.8/site-packages/pynvml/nvml.py", line 2608, in nvmlDeviceGetComputeRunningProcesses
return nvmlDeviceGetComputeRunningProcesses_v3(handle);
File "/opt/conda/envs/test/lib/python3.8/site-packages/pynvml/nvml.py", line 2576, in nvmlDeviceGetComputeRunningProcesses_v3
fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
File "/opt/conda/envs/test/lib/python3.8/site-packages/pynvml/nvml.py", line 853, in _nvmlGetFunctionPointer
raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)
pynvml.nvml.NVMLError_FunctionNotFound: Function Not Found |
Rerunning CI to see if the Dask |
I imagine the problem is that pynvml has been updated to require a v3 version of a function in nvml, but that doesn't exist in cuda 11.2? |
This is WIP until such time as a solution for backwards compat is decided on in nvidia-ml-py (and/or pynvml). So until then we should just keep pynvml at 11.4.1 |
Going to double check this, but my understanding is we want PyNVML 11.5 for CUDA 12 support |
Agreed, but it seems we need that fix to land in nvidia-ml-py first as we can't work around that in a reasonable manner. |
I don't think that is necessary, unless we need features in nvml that were only introduced in cuda 12. Specifically, I have CTK 12 on my system, I install pvnvml < 11.5, and all the queries work. The C API preserves backwards compatibility so old versions of pynvml work fine with new versions of libnvidia-ml.so. The problem is the other way round, new versions of pynvml don't work with old versions of libnvidia-ml.so. |
This pending resolution of NVBug 4008080. |
Curious where things landed here. AFAICT this is still pinned |
/ok to test |
/ok to test |
/ok to test |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Lawrence and Peter! 🙏
/merge |
Handling the str vs. bytes discrepancy should have been covered by the changes in #1118.