Unable to create LocalCUDACluster #1119

Closed
randerzander opened this issue Feb 16, 2023 · 5 comments

@randerzander
Contributor

With the latest nightlies:

conda list | grep dask
dask                      2023.1.1           pyhd8ed1ab_0    conda-forge
dask-core                 2023.1.1           pyhd8ed1ab_0    conda-forge
dask-cuda                 23.04.00a       py310_230215_g8134e6b_25    rapidsai-nightly
dask-cudf                 23.02.00a230209 cuda_11_py310_gac60656bc9_313    rapidsai-nightly
dask-sql                  2023.2.0+15.gf265f58           dev_0    <develop>

Creating a LocalCUDACluster fails:

from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster()

Trace:

2023-02-16 09:22:58,422 - distributed.deploy.spec - WARNING - Cluster closed without starting up
Traceback (most recent call last):
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/deploy/spec.py", line 319, in _start
    self.scheduler = cls(**self.scheduler_spec.get("options", {}))
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/scheduler.py", line 3662, in __init__
    ServerNode.__init__(
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/core.py", line 348, in __init__
    self.monitor = SystemMonitor()
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/system_monitor.py", line 96, in __init__
    gpu_extra = nvml.one_time()
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/diagnostics/nvml.py", line 336, in one_time
    "name": _get_name(h),
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/diagnostics/nvml.py", line 319, in _get_name
    return pynvml.nvmlDeviceGetName(h).decode()
AttributeError: 'str' object has no attribute 'decode'. Did you mean: 'encode'?

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/nfs/rgelhausen/projects/presto-test/simple.py", line 4, in <module>
    cluster = LocalCUDACluster()
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/dask_cuda/local_cuda_cluster.py", line 336, in __init__
    super().__init__(
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/deploy/local.py", line 253, in __init__
    super().__init__(
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/deploy/spec.py", line 286, in __init__
    self.sync(self._start)
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/utils.py", line 338, in sync
    return sync(
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/utils.py", line 405, in sync
    raise exc.with_traceback(tb)
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/utils.py", line 378, in f
    result = yield future
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/tornado/gen.py", line 769, in run
    value = future.result()
  File "/raid/rgelhausen/conda/envs/pynds3/lib/python3.10/site-packages/distributed/deploy/spec.py", line 330, in _start
    raise RuntimeError(f"Cluster failed to start: {e}") from e
RuntimeError: Cluster failed to start: 'str' object has no attribute 'decode'
@wence-
Contributor

wence- commented Feb 16, 2023

Should have been fixed by dask/distributed#7544 and #1118

Workaround: downgrade pynvml.
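
For context, the traceback above fails on `pynvml.nvmlDeviceGetName(h).decode()` because the value returned is already a `str`. A minimal sketch of a conversion that tolerates both return types, assuming only that the call may yield either `bytes` or `str` depending on the installed pynvml. `to_text` is a hypothetical helper name and the GPU name is a made-up sample value; this is not necessarily how the linked fixes are written:

```python
def to_text(value):
    """Return value as str, decoding only when it is bytes."""
    if isinstance(value, bytes):
        return value.decode()
    return value


# Sample values standing in for what nvmlDeviceGetName might return.
print(to_text(b"Tesla V100"))
print(to_text("Tesla V100"))
```

Checking the type rather than the pynvml version keeps the helper independent of which release changed the return type.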

@randerzander
Contributor Author

mamba install -c conda-forge pynvml=11.4.1

Fixed this for me.

@wence-
Contributor

wence- commented Feb 17, 2023

This was fixed in #1118. One remaining problem: if you just run mamba install -c rapidsai-nightly dask-cuda=23.04* then you don't get a distributed nightly AFAICT (see the discussion in the distributed issue), so things still break. @jakirkham what would be the right way to fix this?

@jakirkham
Member

One needs to add the dask/label/dev channel to get Dask nightlies.
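
To see which builds an environment actually ended up with, something like the following may help. It is a rough sketch using only the standard library; the function name `installed_versions` and the default package list are assumptions chosen for this thread, not part of dask-cuda or distributed:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("dask", "distributed", "dask-cuda", "pynvml")):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Comparing the reported distributed and pynvml versions before and after adding a channel shows whether the solver picked up a nightly or a release build.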

@wence-
Contributor

wence- commented Mar 1, 2023

I think this is now fixed.

wence- closed this as completed Mar 1, 2023