Address CI failures caused by upstream distributed and cupy changes #993
This looks good to me. There are some new failures which don't seem related. I'm still unable to reproduce them locally, though. I'll see if I can reproduce them on Monday, but please keep me posted if you happen to find something in the meantime.
Unfortunately, the failure is indeed related. It turns out that, since we are merging in the global config dictionary in the LocalCUDACluster initializer, we do not pick up the proper option for the The good news is that d893881 demonstrates that using
# Disabling compression via environment variable seems to be the only way
# respected here. It is necessary to ensure spilled size matches the actual
# data size.
with patch.dict(os.environ, {"DASK_DISTRIBUTED__COMM__COMPRESSION": "False"}):
The good news is that the config approach seems to work fine. The bad news is that this env
approach does not :/
Not yet sure if there is a clean way to support both.
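For anyone following along, here is a rough sketch of what the environment-variable approach in the test does. It assumes the usual Dask convention that `DASK_`-prefixed variables are lowercased, split into nested keys on double underscores, and have their values parsed as Python literals. `env_to_config` is a hypothetical, simplified stand-in written for illustration, not the function Dask itself uses.

```python
import ast
import os
from unittest.mock import patch


def env_to_config(environ, prefix="DASK_"):
    """Hypothetical, simplified stand-in for Dask's env-var collection."""
    config = {}
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue
        # DASK_DISTRIBUTED__COMM__COMPRESSION -> ["distributed", "comm", "compression"]
        keys = name[len(prefix):].lower().split("__")
        try:
            value = ast.literal_eval(raw)  # "False" -> False
        except (SyntaxError, ValueError):
            value = raw  # keep anything that is not a Python literal as a string
        node = config
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = value
    return config


# patch.dict sets the variable only for the duration of the block and
# restores the previous environment on exit.
with patch.dict(os.environ, {"DASK_DISTRIBUTED__COMM__COMPRESSION": "False"}):
    inside = env_to_config(os.environ)

print(inside["distributed"]["comm"]["compression"])
```

Since `patch.dict` modifies the real process environment while the block is active, processes spawned inside the block inherit the variable, which would explain why this path reaches the workers even when the config path does not.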
My reading of dask/distributed#7028 is that what we were doing with config was correct, but that there is a bug in the way the localcluster merges config options (it should do
Yes, exactly - Leaving this as "draft" with the hope that the fix can be upstream (and then we can close this).
It looks like dask/distributed#7069 resolved the issue of global config options not being passed through to the worker. However, we will still see failures in
@rjzamora thanks for pushing on this. However, the errors are different this time, they're in
Thanks for explaining that @pentschev - I marked those tests as
Thanks @rjzamora for fixing this!
@gpucibot merge
After dask/distributed#7028, the "distributed.comm.ucx"-specific config options being passed down to the LocalCluster super class are no longer merged with the global options in dask.config.config on the worker. This means that the workers only inherit the "distributed.comm.ucx"-specific options.
This PR explicitly merges the "distributed.comm.ucx" options with dask.config.config within LocalCUDACluster. Without this change, CI will fail.
The original purpose of this PR has been resolved upstream (see: dask/distributed#7069). This PR now only modifies the tests that are still failing in CI.
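To illustrate the difference described above, here is a minimal sketch in plain Python. `merge_nested` is a hypothetical helper written for this example (Dask provides its own `dask.config.merge`, which the sketch deliberately avoids depending on), and the config values are made up for illustration rather than taken from the PR.

```python
import copy


def merge_nested(base, update):
    """Recursively merge `update` into a copy of `base`; `update` wins on conflicts."""
    result = copy.deepcopy(base)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_nested(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Illustrative stand-in for the global config (assumed values).
global_config = {
    "distributed": {"comm": {"compression": False, "ucx": {"tcp": None}}}
}

# Illustrative stand-in for the UCX-specific options built by the cluster.
ucx_options = {"distributed": {"comm": {"ucx": {"tcp": True, "nvlink": True}}}}

# Passing only the UCX options: the worker loses the global compression setting.
worker_only_ucx = ucx_options

# Merging first: the worker sees both the global and the UCX-specific options.
worker_merged = merge_nested(global_config, ucx_options)

print("compression" in worker_only_ucx["distributed"]["comm"])
print(worker_merged["distributed"]["comm"])
```

The merge returns a new dictionary instead of mutating `global_config`, so the global options stay untouched for any other code that reads them.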