Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically configure enable_tcp_over_ucx by default when protocol=ucx in LocalCUDACluster #424

Closed
beckernick opened this issue Oct 20, 2020 · 12 comments
Labels
inactive-30d inactive-90d proposal A code change suggestion to be considered or discussed

Comments

@beckernick
Copy link
Member

beckernick commented Oct 20, 2020

Today, I configure a UCX-enabled cluster with some values for the following four arguments:

  • protocol="ucx"
  • enable_tcp_over_ucx
  • enable_infiniband
  • enable_nvlink

In general, the common practice is to enable NVLink, possibly enable InfiniBand, and always enable TCP over UCX. While there are scenarios in which we would want to disable TCP over UCX, by and large we enable it. To reduce the number of parameters that users must tweak to get UCX working, it might be nice to:

  • Change the default of enable_tcp_over_ucx to None
  • Automatically configure enable_tcp_over_ucx=True when a user leaves enable_tcp_over_ucx as None and protocol="ucx",
  • Allow users to pass enable_tcp_over_ucx=False if desired, with the corresponding expected behavior

Another approach might be to instead switch this to simply be disable_tcp_over_ucx and have it otherwise be implicitly set to True, if that is preferred. cc @quasiben for thoughts as we were discussing this earlier.

@pentschev
Copy link
Member

The reason I enforced users to set enable_tcp_over_ucx=True is to make sure they indeed notice they are using protocol="ucx". I know this is somewhat annoying, but I fear for those who have protocol="ucx" somewhere where they really want to just use the default TCP.

@pentschev
Copy link
Member

Also, why would you pass enable_tcp_over_ucx=None rather than True or False? To me, if we decided to enable it by default, what we would do is to just modify constructors to enable_tcp_over_ucx=True, and None will continue to be an invalid choice.

@beckernick
Copy link
Member Author

beckernick commented Oct 21, 2020

I agree. I suggested the None to try to closely match the existing behavior, but I prefer what you're describing.

I know this is somewhat annoying, but I fear for those who have protocol="ucx" somewhere where they really want to just use the default TCP.

I think this is absolutely a potential risk. Do we know of any examples in which someone sets protocol="ucx" and actually wanted to use TCP?

My thinking is that if someone wants to switch between UCX and TCP, ideally that can be done by one parameter. Keeping two parameters to protect against accidental setting of one parameter adds some complexity on the user side, and the protocol mixup may be something that doesn't happen too much in practice.

@pentschev
Copy link
Member

My thinking is that if someone wants to switch between UCX and TCP, ideally that can be done by one parameter. Keeping two parameters to protect against accidental setting of one parameter adds some complexity on the user side, and the protocol mixup may be something that doesn't happen too much in practice.

There's actually another culprit, if we make enable_tcp_over_ucx=True by default, then UCX will immediately seem to work and be slow because the correct transports haven't been enabled by the user. We intend to make things transparent and let UCX decide what transports to use, but we're not there yet because there are several limitations that exist today, for example the selection of proper IB device.

Summarizing:

  1. The default enable_tcp_over_ucx=False can be annoying for experienced users. However, it protects from people who just accidentally switch to protocol="ucx" when they wanted protocol="tcp", and more importantly from those who just expect UCX will just be fast out-of-the-box without need to specify transports;
  2. We should improve Dask-CUDA (more of something for UCX/UCX-Py) and make it more robust w.r.t. transport selection, and let UCX figure transports out by itself.

I do agree that the current behavior is annoying, but the alternative is today to let people think switching to protocol="ucx" will immediately make everything fast, which isn't the case yet. I'm in favor of item 1 above, but I'd strongly prefer that we have 2 figured out first.

@pentschev
Copy link
Member

To make myself even clearer, from my experience with users over the last year with Dask-CUDA, they DON'T read the documentation regarding UCX, so switching the default to enable_tcp_over_ucx=True now would mean an increased number of questions in the sense of "Why is UCX so slow when I have capable HW?". To prevent that, I personally prefer to force users to read the error when they're getting started for the time being.

@quasiben
Copy link
Member

I think the general ask here is that we try and reduce the the number of configurations for the common case. So when, for example, we have nvlink and/or inifiniband set to True, we automatically configure for tcp as well. My understanding is that we trying to handle the case where UCX can be configured with RDMA/RDMACM, is that correct ?

@pentschev
Copy link
Member

I think the general ask here is that we try and reduce the the number of configurations for the common case. So when, for example, we have nvlink and/or inifiniband set to True, we automatically configure for tcp as well.

If that's the case, then you just proved my point about documentation:

enable_infiniband: bool
Set environment variables to enable UCX InfiniBand support, requires
protocol='ucx' and implies enable_tcp_over_ucx=True.

enable_nvlink: bool
Set environment variables to enable UCX NVLink support, requires
protocol='ucx' and implies enable_tcp_over_ucx=True.

@quasiben
Copy link
Member

Yes, I'm sorry :(

Looking again at the code, the minimum needed is:

  • enable_infiniband
  • enable_nvlink

Because we also will set the protocol correctly when either those variables are set:

if enable_tcp_over_ucx or enable_infiniband or enable_nvlink:
if protocol is None:
protocol = "ucx"
elif protocol != "ucx":
raise TypeError("Enabling InfiniBand or NVLink requires protocol='ucx'")

Is there a way we should change the docs to not overload users with options as their first introduction to dask-cuda+UCX ?

@pentschev
Copy link
Member

I don't really understand the question, if one of those options evaluate to True, we automatically set protocol="ucx". Are you suggesting that we force changing to protocol="ucx" even if the user explicitly passed protocol="tcp" instead of erroring?

@pentschev pentschev added feature request New feature or request proposal A code change suggestion to be considered or discussed and removed feature request New feature or request labels Jan 8, 2021
@github-actions
Copy link

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@pentschev
Copy link
Member

Since we now support automatic UCX configuration by default as of #792 , I believe this issue is not relevant anymore. Tentatively closing it, please feel free to reopen if there is still anything needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
inactive-30d inactive-90d proposal A code change suggestion to be considered or discussed
Projects
None yet
Development

No branches or pull requests

3 participants