
Address CI failures caused by upstream distributed and cupy changes #993

Merged

Conversation

@rjzamora (Member) commented Sep 23, 2022

After dask/distributed#7028, the "distributed.comm.ucx"-specific config options passed down to the LocalCluster superclass are no longer merged with the global options in dask.config.config on the worker. This means that the workers only inherit the "distributed.comm.ucx"-specific options.

This PR explicitly merges the "distributed.comm.ucx" options with dask.config.config within LocalCUDACluster. Without this change, CI will fail.
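
As a rough sketch of the kind of merge this describes (the UCX option names and values below are placeholders, not the actual PR diff):

```python
import dask.config

# Hypothetical UCX-specific options; after dask/distributed#7028 these would
# be the only config the workers inherit if passed through unmerged.
ucx_config = {"distributed": {"comm": {"ucx": {"tcp": True, "nvlink": False}}}}

# dask.config.merge combines nested config dicts, with later arguments taking
# precedence, so the result carries the global defaults plus the UCX overrides.
merged_config = dask.config.merge(dask.config.config, ucx_config)
```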

The original purpose of this PR has been resolved upstream (see: dask/distributed#7069). This PR now only modifies the tests that are still failing in CI.

@rjzamora rjzamora added bug Something isn't working 2 - In Progress Currently a work in progress non-breaking Non-breaking change labels Sep 23, 2022
@rjzamora rjzamora requested a review from a team as a code owner September 23, 2022 17:08
@github-actions github-actions bot added the python python code needed label Sep 23, 2022
@pentschev (Member) left a comment

This looks good to me. There are some new failures which don't seem related, but I'm still unable to reproduce them locally. I'll see if I can reproduce them on Monday; please keep me posted if you happen to find something in the meantime.

@rjzamora (Member Author)

> This looks good to me. There are some new failures which don't seem related, but I'm still unable to reproduce them locally. I'll see if I can reproduce them on Monday; please keep me posted if you happen to find something in the meantime.

Unfortunately, the failure is indeed related. It turns out that, since we are merging in the global config dictionary in the LocalCUDACluster initializer, we do not pick up the proper option for the "DASK_DISTRIBUTED__COMM__COMPRESSION" environment variable. It is not yet clear to me if there is a way to fix this.

The good news is that d893881 demonstrates that using dask.config now works fine for disabling compression.
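
For reference, a minimal sketch of the config-based approach (the exact option value used in d893881 is an assumption here):

```python
import dask

# Disable comm compression through dask.config instead of the
# DASK_DISTRIBUTED__COMM__COMPRESSION environment variable.
with dask.config.set({"distributed.comm.compression": None}):
    pass  # start LocalCUDACluster and run the spill assertions here
```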

Comment on lines 210 to 213 (removed in this PR)
# Disabling compression via environment variable seems to be the only way
# respected here. It is necessary to ensure spilled size matches the actual
# data size.
with patch.dict(os.environ, {"DASK_DISTRIBUTED__COMM__COMPRESSION": "False"}):
@rjzamora (Member Author)

The good news is that the config approach seems to work fine. The bad news is that this env approach does not :/

Not yet sure if there is a clean way to support both.

@wence- (Contributor) commented Sep 26, 2022

My reading of dask/distributed#7028 is that what we were doing with config was correct, but that there is a bug in the way LocalCluster merges config options (it should do dask.config["distributed"] | config), so that most of this should not be needed once that is fixed?
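
A rough illustration of the merge being suggested (the override dict is hypothetical, and a real fix would likely need a recursive merge rather than the shallow dict union shown here):

```python
import dask.config

# Hypothetical cluster-specific overrides passed as LocalCluster(config=...).
cluster_config = {"comm": {"ucx": {"create-cuda-context": True}}}

# Start from the global "distributed" options and layer the overrides on top;
# with the dict union operator the right-hand side wins on key conflicts.
merged = dask.config.get("distributed") | cluster_config
```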

@rjzamora rjzamora marked this pull request as draft September 26, 2022 13:56
@rjzamora (Member Author)

> My reading of dask/distributed#7028 is that what we were doing with config was correct, but that there is a bug in the way LocalCluster merges config options (it should do dask.config["distributed"] | config), so that most of this should not be needed once that is fixed?

Yes, exactly. Leaving this as "draft" in the hope that the fix can land upstream (and then we can close this).

@rjzamora (Member Author)

It looks like dask/distributed#7069 resolved the issue of global config options not being passed through to the worker. However, we will still see failures in dask_cuda/tests/test_spill.py::test_cudf_cluster_device_spill, because the dask.config defaults will already be set before the environment variable is patched. I submitted dask/distributed#7070 to document this behavior, but I'm not sure it should be considered a real "bug".
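
A minimal illustration (not the test itself) of the timing issue, assuming distributed has already been imported by the time the environment variable is patched:

```python
import os
from unittest.mock import patch

import dask
import distributed  # noqa: F401  importing registers the "distributed" config defaults

with patch.dict(os.environ, {"DASK_DISTRIBUTED__COMM__COMPRESSION": "False"}):
    # The defaults were already read from the environment at import time,
    # so patching the variable here does not change the value dask sees.
    print(dask.config.get("distributed.comm.compression"))
```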

@pentschev (Member)

@rjzamora thanks for pushing on this. However, the errors are different this time: they're in dask_cuda.tests.test_proxy.test_cupy_matmul (and imatmul). I briefly discussed this with @madsbk earlier, and it happens on CuPy >= 11, probably because array access has changed. Since JIT-Unspill and CuPy don't currently work without compatibility mode, Mads suggested we disable this test. Could you mark it as xfail for now?

@rjzamora rjzamora changed the title Fix LocalCUDACluster config initialization Address CI failures caused by upstream distributed and cupy changes Sep 26, 2022
@rjzamora (Member Author)

> Could you mark it as xfail for now?

Thanks for explaining that, @pentschev. I marked those tests as xfail for now (for cupy>=11.0).
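
Roughly what such a marker looks like (the reason string and placement in test_proxy.py are assumptions, not the merged diff):

```python
import cupy
import pytest
from packaging.version import parse as parse_version

@pytest.mark.xfail(
    parse_version(cupy.__version__) >= parse_version("11.0"),
    reason="JIT-Unspill does not currently work with CuPy >= 11 outside compatibility mode",
)
def test_cupy_matmul():
    ...
```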

@rjzamora rjzamora marked this pull request as ready for review September 26, 2022 17:22
@pentschev (Member) left a comment

Thanks @rjzamora for fixing this!

@pentschev (Member)

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 4f68949 into rapidsai:branch-22.10 Sep 26, 2022
@rjzamora rjzamora deleted the fix-localcudacluster-config branch September 26, 2022 18:27
Labels
bug (Something isn't working), 2 - In Progress (Currently a work in progress), non-breaking (Non-breaking change), python (python code needed)