Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce distributed.comm.ucx.environment config slot #7164

Merged
merged 10 commits into from
Oct 24, 2022

Conversation

wence-
Copy link
Contributor

@wence- wence- commented Oct 20, 2022

Can be used for setting arbitrary UCX configuration in addition to the specific high-level options in those cases where one needs fine-grained control over the UCX configuration.

  • Tests added / passed
  • Passes pre-commit run --all-files

cc @pentschev, @quasiben

Can be used for setting arbitrary UCX configuration in addition to the
specific high-level options in those cases where one needs
fine-grained control over the UCX configuration.
distributed/comm/ucx.py Outdated Show resolved Hide resolved
This means that we don't accidentally skip validation for names in two
parts of the hierarchy that happen to collide with a skipped name.
distributed/comm/ucx.py Outdated Show resolved Hide resolved
distributed/comm/ucx.py Outdated Show resolved Hide resolved
@github-actions
Copy link
Contributor

github-actions bot commented Oct 20, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±0         15 suites  ±0   6h 24m 54s ⏱️ ±0s
  3 150 tests ±0    3 060 ✔️ +2    85 💤  - 1    5  - 1 
23 307 runs  ±0  22 372 ✔️ ±0  915 💤 +1  20  - 1 

For more details on these failures, see this check.

Results for commit f653811. ± Comparison against base commit 6afce9c.

♻️ This comment has been updated with latest results.

We cannot control (say) transport-level (uct) options when calling
ucp.init, but if the relevant UCX environment variable lives in the
environment at the time we make the call, then it will be respected.
So take any user-provided low-level settings and temporarily insert
into the environment.

Precedence (highest to lowest):

1. Externally specified environment overrides
2. High-level distributed ucx settings
3. Low-level distributed ucx.environment settings
@wence- wence- force-pushed the wence/feature/ucx-environment branch from f8e83a6 to 217cb29 Compare October 20, 2022 16:13
@wence- wence- force-pushed the wence/feature/ucx-environment branch from 217cb29 to 6ca877c Compare October 20, 2022 16:27
"bokeh-application",
"environ",
"pre-spawn-environ",
"distributed.scheduler.default-task-durations",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these changes needed?

Copy link
Contributor Author

@wence- wence- Oct 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schema validation in this test traverses the full yaml schema from its root to any leaves. Normally, any object property (mapping to a dict in python) is required to have properties set (indicating which keys are valid in the dict). In some cases (for example the nanny environment) the keys can be arbitrary, so we skip validation.

Previously this skipping was just done on the leaf key name (disregarding the path to the leaf), so could have resulted in false positive skipping (e.g. previously any key name environ would not have had its properties validated even in the case where they existed).

To avoid this, I'm changing the skip list to specify the fully-qualified key names, and remembering the path to the leaf when traversing.

Copy link
Member

@pentschev pentschev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wence- , overall this looks great. I've left a few comments that we probably need to change.

Comment on lines +130 to +135
with patch.dict(os.environ, ucx_environment):
# We carefully ensure that ucx_environment only contains things
# that don't override ucx_config or existing slots in the
# environment, so the user's external environment can safely
# override things here.
ucp.init(options=ucx_config, env_takes_precedence=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice handling!

distributed/comm/ucx.py Outdated Show resolved Hide resolved
distributed/comm/ucx.py Outdated Show resolved Hide resolved
So that users have some useful information if they specifying
overlapping configuration options in both the high level ucx
configuration and directly by setting the environment, report when the
ucx.environment is overridden by either high level options or the
external environment.
@wence- wence- force-pushed the wence/feature/ucx-environment branch from 97a3cb4 to b9c4bcd Compare October 21, 2022 10:10
This checking was a leftover from dask#3515 when users could set arbitrary
options in the top-level ucx configuration slots. Now we construct the
high level options ourselves and can therefore be sure that the
configuration names we will pass to UCX are valid.

Conversely, we cannot check the names of the arbitrary ucx.environment
options because ucp.get_config() will only return UCP-level
configuration variables (and not transport-level UCT options).
This is not complete since UCX_TLS=all may enable CUDA transports, but
we can certainly catch user-specified settings.
Copy link
Member

@pentschev pentschev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks so much for all the work here @wence- !

@quasiben
Copy link
Member

Generally looks good. @pentschev @wence-, now that TLS=ALL generally works what do you think about adding that as a control flag to the standard UCX config here as well ? So instead of having to set all the config options: nvlink, cuda-copy, infiniband, ... we can (just like with dask-cuda), turn it all on. Would that be too confusing ?

@wence-
Copy link
Contributor Author

wence- commented Oct 21, 2022

Generally looks good. @pentschev @wence-, now that TLS=ALL generally works what do you think about adding that as a control flag to the standard UCX config here as well ? So instead of having to set all the config options: nvlink, cuda-copy, infiniband, ... we can (just like with dask-cuda), turn it all on. Would that be too confusing ?

If no UCX_TLS is specified, that implies all, so the current code works I think.

@pentschev
Copy link
Member

So instead of having to set all the config options: nvlink, cuda-copy, infiniband, ... we can (just like with dask-cuda), turn it all on. Would that be too confusing ?

The only purpose for those flags today is to have a user-friendly way of explicitly enabling/disabling transports that we generally care about. One could just as well set distributed.comm.ucx.environment.tls=tcp,cuda_copy,cuda_ipc for example (equivalent to enable_tcp_over_ucx=True enable_nvlink=True enable_infiniband=False). Generally I think we want to make things more widely available to users, but as a developer I wouldn't mind removing them, as I can just set UCX_TLS specifically to what I need, but that is unlikely to be true for every user, which would require a more detailed understanding at what each transport does.

@quasiben
Copy link
Member

I'm rerunning the failed jobs. Many seemed to have failed test_connection_made_with_extra_conn_args and I'm being a bit cautious. I'll plan to merge after CI finishes

@quasiben
Copy link
Member

Looks like the failures are a known issue:
#7168

@quasiben quasiben merged commit ecf5e93 into dask:main Oct 24, 2022
@pentschev
Copy link
Member

Thanks @wence- for working on this, and @quasiben for merging! 😄

@wence- wence- deleted the wence/feature/ucx-environment branch November 2, 2022 15:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants