ec2-multi: Dask EC2 cluster example timing out #455
Labels: bug, cloud/aws, help wanted
Description
For #434, I tried running the steps at https://docs.rapids.ai/deployment/nightly/cloud/aws/ec2-multi/ using the 24.10 version of RAPIDS (which pulls in dask==2024.9.0 and distributed==2024.9.0).

That example uses dask_cloudprovider.aws.EC2Cluster to create a cluster of EC2 instances, then encourages users to try creating a cudf DataFrame, distributing it over the cluster, and operating on it in parallel (see "Reproducible Example" below).

I saw EC2Cluster() successfully create a cluster of EC2 instances (1 scheduler and 3 workers), but the first time I tried to call .compute() on a dask_cudf.DataFrame to get a result back from the cluster, that operation hung indefinitely (did not time out after 15 minutes).

Reproducible Example
I created an EC2 instance using the "NVIDIA GPU-Optimized AMI", following https://docs.rapids.ai/deployment/nightly/cloud/aws/ec2/.
Then ran the following:
    docker run \
        --rm \
        --gpus all \
        --network host \
        --env AWS_PROFILE=oct2024 \
        --entrypoint="" \
        -it rapidsai/notebooks:24.10a-cuda12.5-py3.11 \
        bash
Created a file ~/.aws/credentials in the container, with an [oct2024] profile.

Installed dask-cloudprovider, with the same version as distributed and dask.

    conda install --yes -c rapidsai-nightly -c conda-forge \
        'dask-cloudprovider==2024.9.0'
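For anyone reproducing this setup, the following is a minimal pre-flight sketch, not part of the documented example. It checks that the credentials file contains the expected profile and reports the installed versions of the three packages named above. The helper names (has_profile, installed_versions, versions_match) are hypothetical, and it assumes the credentials file is plain INI that configparser can read. It only inspects the local container, not the environment on the scheduler or worker instances.

```python
import configparser
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path


def has_profile(credentials_path, profile):
    """Return True if the INI-style credentials file has a [profile] section."""
    parser = configparser.ConfigParser()
    parser.read(credentials_path)  # a missing file is silently skipped
    return parser.has_section(profile)


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def versions_match(versions):
    """True when every package is installed and all report the same version."""
    distinct = set(versions.values())
    return None not in distinct and len(distinct) == 1


if __name__ == "__main__":
    # Profile name and package list are taken from the steps above.
    creds = Path.home() / ".aws" / "credentials"
    print("profile oct2024 present:", has_profile(creds, "oct2024"))

    pins = installed_versions(["dask", "distributed", "dask-cloudprovider"])
    for name, ver in pins.items():
        print(f"{name}: {ver or 'not installed'}")
    print("all versions identical:", versions_match(pins))
```

Identical version strings are what the steps above aim for; this sketch does not establish that a mismatch is the cause of the hang.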
Tried running the example code.
This warning is emitted immediately
And then that .mean().compute() does not complete (after 15+ minutes).
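To make a hang like this surface as an error instead of blocking the session, one option is to bound the wait. The sketch below uses only the standard library; call_with_timeout is a hypothetical helper, not something from dask or RAPIDS, and it does not cancel the underlying work. It only stops waiting for it.

```python
import queue
import threading


def call_with_timeout(fn, timeout_s):
    """Run fn() in a daemon thread; raise TimeoutError if it has not finished in time."""
    results = queue.Queue(maxsize=1)

    def runner():
        try:
            results.put(("ok", fn()))
        except BaseException as exc:  # forward any failure to the caller
            results.put(("error", exc))

    # A daemon thread does not keep the interpreter alive if fn() never returns.
    threading.Thread(target=runner, daemon=True).start()
    try:
        status, value = results.get(timeout=timeout_s)
    except queue.Empty:
        raise TimeoutError(f"no result after {timeout_s} s") from None
    if status == "error":
        raise value
    return value
```

Hypothetical usage in the session above would be call_with_timeout(lambda: df.x.mean().compute(), 300), where df stands in for the dask_cudf.DataFrame from the example. When a distributed Client is available, client.compute(...) returns a future whose result(timeout=...) gives a similar bound without an extra thread.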
Notes
Maybe related: #343