Introduce distributed.comm.ucx.environment config slot #7164
Can be used for setting arbitrary UCX configuration in addition to the specific high-level options in those cases where one needs fine-grained control over the UCX configuration.
This means that we don't accidentally skip validation for names in two parts of the hierarchy that happen to collide with a skipped name.
We cannot control (say) transport-level (uct) options when calling ucp.init, but if the relevant UCX environment variable is set in the environment at the time we make the call, then it will be respected. So take any user-provided low-level settings and temporarily insert them into the environment.
Precedence (highest to lowest):
1. Externally specified environment overrides
2. High-level distributed ucx settings
3. Low-level distributed ucx.environment settings
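The precedence rules above can be sketched in a few lines. This is a minimal illustration and not the code from this PR: `filter_low_level`, the `UCX_EXAMPLE_*` names, and the convention that a high-level option `FOO` corresponds to the variable `UCX_FOO` are all assumptions made for the example, and the `ucp.init` call is only mentioned in a comment so the snippet runs without UCX installed.

```python
import os
from unittest.mock import patch


def filter_low_level(low_level, high_level, environ):
    """Keep only low-level UCX variables that nothing higher-priority sets.

    low_level:  {"UCX_FOO": value} from a hypothetical ucx.environment slot
    high_level: {"FOO": value} options that would be passed to ucp.init
    environ:    the external process environment (e.g. os.environ)
    """
    kept = {}
    for name, value in low_level.items():
        if name in environ:
            continue  # 1. the external environment wins
        if name.removeprefix("UCX_") in high_level:
            continue  # 2. a high-level setting wins
        kept[name] = str(value)  # 3. otherwise the low-level value applies
    return kept


# Hypothetical inputs, chosen so that each precedence level is exercised.
external = {"UCX_EXAMPLE_A": "from-shell"}
high_level = {"EXAMPLE_B": "from-high-level"}
low_level = {
    "UCX_EXAMPLE_A": "ignored",
    "UCX_EXAMPLE_B": "ignored",
    "UCX_EXAMPLE_C": 8192,
}
kept = filter_low_level(low_level, high_level, external)

with patch.dict(os.environ, kept):
    # A real caller would invoke ucp.init(options=high_level, ...) here,
    # while the extra variables are visible in the environment.
    seen_inside = os.environ.get("UCX_EXAMPLE_C")
# patch.dict restores the environment on exit, so the setting is temporary.
seen_after = os.environ.get("UCX_EXAMPLE_C")
```

Because the filtered dictionary never contains a name that is already set externally or covered by a high-level option, patching it into `os.environ` cannot clobber anything with higher precedence.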
"bokeh-application",
"environ",
"pre-spawn-environ",
"distributed.scheduler.default-task-durations",
Why are these changes needed?
The schema validation in this test traverses the full yaml schema from its root to any leaves. Normally, any `object` property (mapping to a `dict` in Python) is required to have `properties` set (indicating which keys are valid in the dict). In some cases (for example the nanny environment) the keys can be arbitrary, so we skip validation.
Previously this skipping was just done on the leaf key name (disregarding the path to the leaf), so it could have resulted in false-positive skipping (e.g. previously any key named `environ` would not have had its properties validated even in the case where they existed).
To avoid this, I'm changing the skip list to specify the fully-qualified key names, and remembering the path to the leaf when traversing.
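As a rough sketch of that idea (not the test code in this PR), the traversal below carries the path to each node and compares the dotted, fully-qualified name against the skip list. The function name `find_unvalidated_objects` and the toy schema, including the `distributed.worker.environ` entry, are made up for illustration.

```python
def find_unvalidated_objects(schema, skip, path=()):
    """Return dotted paths of object nodes that lack ``properties``.

    A node is exempt only when its fully-qualified path is in ``skip``,
    so a leaf name shared by two branches is not skipped by accident.
    """
    missing = []
    if schema.get("type") == "object":
        dotted = ".".join(path)
        if "properties" not in schema:
            if dotted not in skip:
                missing.append(dotted)
        else:
            for name, sub in schema["properties"].items():
                missing.extend(
                    find_unvalidated_objects(sub, skip, path + (name,))
                )
    return missing


# Toy schema: two leaves share the name "environ" but only one is exempt.
schema = {
    "type": "object",
    "properties": {
        "distributed": {
            "type": "object",
            "properties": {
                "nanny": {
                    "type": "object",
                    "properties": {"environ": {"type": "object"}},
                },
                "worker": {
                    "type": "object",
                    "properties": {"environ": {"type": "object"}},
                },
            },
        }
    },
}

skip = {"distributed.nanny.environ"}
missing = find_unvalidated_objects(schema, skip)
```

With a skip list keyed on the bare leaf name `environ`, both leaves would have been exempted; with the qualified name, only the intended one is.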
Thanks @wence-, overall this looks great. I've left a few comments on things that we probably need to change.
with patch.dict(os.environ, ucx_environment):
    # We carefully ensure that ucx_environment only contains things
    # that don't override ucx_config or existing slots in the
    # environment, so the user's external environment can safely
    # override things here.
    ucp.init(options=ucx_config, env_takes_precedence=True)
Nice handling!
So that users have some useful information if they specify overlapping configuration options in both the high-level ucx configuration and directly by setting the environment, report when the ucx.environment is overridden by either high-level options or the external environment.
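One way such reporting could look is sketched below. It is an illustration only: the logger name, the message wording, the helper `report_overrides`, and the assumption that a high-level option `FOO` shadows the variable `UCX_FOO` are all hypothetical, not taken from this PR.

```python
import logging

logger = logging.getLogger("ucx_config_example")  # hypothetical logger name


def report_overrides(low_level, high_level, environ):
    """Warn about each low-level entry that will not take effect.

    Returns the list of messages so callers (and tests) can inspect them.
    """
    messages = []
    for name in low_level:
        if name in environ:
            messages.append(
                f"{name} ignored: already set in the external environment"
            )
        elif name.removeprefix("UCX_") in high_level:
            messages.append(
                f"{name} ignored: overridden by a high-level UCX option"
            )
    for message in messages:
        logger.warning(message)
    return messages


# Hypothetical settings: A collides with the shell, B with a high-level option.
messages = report_overrides(
    low_level={"UCX_EXAMPLE_A": "1", "UCX_EXAMPLE_B": "2", "UCX_EXAMPLE_C": "3"},
    high_level={"EXAMPLE_B": "x"},
    environ={"UCX_EXAMPLE_A": "y"},
)
```

The external environment is checked first so that, when a name collides with both sources, the message names the one that actually wins.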
This checking was a leftover from dask#3515 when users could set arbitrary options in the top-level ucx configuration slots. Now we construct the high level options ourselves and can therefore be sure that the configuration names we will pass to UCX are valid. Conversely, we cannot check the names of the arbitrary ucx.environment options because ucp.get_config() will only return UCP-level configuration variables (and not transport-level UCT options).
This is not complete since UCX_TLS=all may enable CUDA transports, but we can certainly catch user-specified settings.
LGTM, thanks so much for all the work here @wence- !
Generally looks good. @pentschev @wence-, now that TLS=ALL generally works, what do you think about adding that as a control flag to the standard UCX config here as well? So instead of having to set all the config options (nvlink, cuda-copy, infiniband, ...), we can, just like with dask-cuda, turn it all on. Would that be too confusing?
If no UCX_TLS is specified, that implies all, so the current code works I think.
The only purpose for those flags today is to have a user-friendly way of explicitly enabling/disabling transports that we generally care about. One could just as well set
I'm rerunning the failed jobs. Many seemed to have failed
Looks like the failures are a known issue:
pre-commit run --all-files
cc @pentschev, @quasiben