Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve UCX documentation and samples #544

Closed
pentschev opened this issue Mar 4, 2021 · 16 comments
Closed

Improve UCX documentation and samples #544

pentschev opened this issue Mar 4, 2021 · 16 comments
Assignees
Labels
2 - In Progress Currently a work in progress doc Documentation inactive-30d

Comments

@pentschev
Copy link
Member

We currently have some documentation on using UCX with Dask-CUDA available in https://dask-cuda.readthedocs.io/en/latest/ucx.html . However, I feel that documentation is a bit convoluted, so reviewing and improving its text is a good idea. Furthermore, it could be easier for readers if they have a small set of samples they can refer to inside the dask-cuda repository, perhaps under a new directory samples/ucx, where we could have a few simple scripts to run a single-node cluster with LocalCUDACluster, as well as one or more scripts to run dask-scheduler + dask-cuda-worker (possibly more than just one, for multi-node clusters) + client code.

@charlesbluca is this something you would be interested in doing? There's no rush on it, I just thought someone like you would be perfect for it, as you still don't have the bias on writing all the options by heart. 🙂

@pentschev pentschev added 0 - Backlog In queue waiting for assignment doc Documentation labels Mar 4, 2021
@charlesbluca
Copy link
Member

Happy to take this on! I don't have as much experience setting up UCX through the CLI, but we can use this issue to discuss that when the time comes.

@charlesbluca
Copy link
Member

charlesbluca commented Mar 4, 2021

Just now looking into this! A question on the UCX configuation - are the listed environment variables being processed through dask.config? If so, maybe the "Configuration" section could be cleaned up a bit by adding these options to the Dask configuration reference, which we could then link to for more complicated UCX options. Then maybe for the mandatory settings, we could make a ucx.yaml configuration that includes them all?

ucx:
  cuda-copy: true # required
  tcp: true # required
  nvlink: true # required for NVLink
  infiniband: true # required for InfiniBand
  rdmacm: true # recommended for IB
  net-devices: 'mlx5_0:1' # important to set to 'mlx*****' with IB enabled, otherwise Ethernet device
rmm:
  pool-size: 1GB # recommended to prevent Dask scheduler crashes

@charlesbluca charlesbluca self-assigned this Mar 5, 2021
@charlesbluca charlesbluca added 2 - In Progress Currently a work in progress and removed 0 - Backlog In queue waiting for assignment labels Mar 5, 2021
@pentschev
Copy link
Member Author

Just now looking into this! A question on the UCX configuation - are the listed environment variables being processed through dask.config? If so, maybe the "Configuration" section could be cleaned up a bit by adding these options to the Dask configuration reference, which we could then link to for more complicated UCX options.

I agree that the configuration reference page could be improved with descriptions for the sections that say "No Comment". However, I think we still should list them in our docs because it's very easy to introduce typos, for example note that there is a double-underscore after DASK_UCX.

Then maybe for the mandatory settings, we could make a ucx.yaml configuration that includes them all?

This is yet another thing I would normally prefer not to. We can always add that as an option for users, but not necessarily suggest it by default -- and by the way, these configurations already live in ~/.config/dask/distributed.yaml along with all other Distributed configurations, so we won't have a ucx.yaml only for that. My reasoning for that is these options become implicit and harder to track or even know they are defined when starting a Dask cluster, for example. I would any day prefer to have another 50 lines in my script so I know exactly what I'm setting, rather than rely on some configuration file I don't remember I did set a year ago in some machine I don't use very often, plus if you're testing different transports, this becomes burdensome very quickly. Summarizing: I'm fine providing that as an option, but not as the only option.

@charlesbluca
Copy link
Member

charlesbluca commented Mar 5, 2021

I think we still should list them in our docs because it's very easy to introduce typos, for example note that there is a double-underscore after DASK_UCX.

That was my motivation behind listing the variables in their Dask config form (e.g. ucx.cuda_copy), as at least from my perspective that might be more readable and less likely to result in typos - although it does come with the assumption that a user is already familiar with Dask configuration and knows the mapping between environment variables and the dask.config dictionary. This could potentially be rectified if we link to the Dask configuration section on environment variables at some point before we show examples using them.

I would any day prefer to have another 50 lines in my script so I know exactly what I'm setting, rather than rely on some configuration file I don't remember I did set a year ago in some machine I don't use very often, plus if you're testing different transports, this becomes burdensome very quickly.

You're right, it's better to encourage explicitly setting these variables on a per-run basis rather than setting implicitly for all runs. In that case, I think a good way to showcase setting the options without environment variables would be to use dask.config.set(...) instead of os.environ[...] in the Python examples.

@charlesbluca
Copy link
Member

charlesbluca commented Mar 5, 2021

I think a good way to showcase setting the options without environment variables would be to use dask.config.set(...) instead of os.environ[...] in the Python examples.

Actually maybe the client configuration would be better suited for dask_cuda.initialize.initialize(...)?

@pentschev
Copy link
Member Author

You're right, it's better to encourage explicitly setting these variables on a per-run basis rather than setting implicitly for all runs. In that case, I think a good way to showcase setting the options without environment variables would be to use dask.config.set(...) instead of os.environ[...] in the Python examples.

To provide some hope here, I expect that in the mid-term we can greatly reduce the number of options we need to pass to UCX as things become more stable. With that said, I would be happy enough if we have good docs and samples now even if they're a bit convoluted. I think in either case the number of variables that need to be set are the biggest problem.

Actually maybe the client configuration would be better suited for dask_cuda.initialize.initialize(...)?

Do you mean to use dask_cuda.initialize.initialize rather than setting environment variables? If so, yes, they should be interchangeable. Some people (like me) may still prefer to set environment variables when running from the command line instead of having to modify the client code if they need to test different options. I think the best is to document both cases and let the user choose, but also make it clear that they should use one xor the other, because otherwise it becomes difficult to know which one actually has precedence.

@charlesbluca
Copy link
Member

I agree, samples are definitely a good place where we can more explicitly contextualize some of these options.

Do you mean to use dask_cuda.initialize.initialize rather than setting environment variables? If so, yes, they should be interchangeable.

Good! Just making sure. I think using initialize would probably make sense for the Python examples in the docs right now, since those are still setting the variables in the client code itself, just not with a Dask-CUDA utility. I definitely agree that clarifying environment variables can be set from the command line per-process for both the CLI and general Python code is good.

@charlesbluca
Copy link
Member

Opened up a draft PR #545

rapids-bot bot pushed a commit that referenced this issue Mar 10, 2021
Addressing #544, this PR aims to clarify the requirements, configuration, and usage of UCX with Dask-CUDA.

Still a lot to be done:

- [x] Flesh out hardware/software requirements
- [x] Rework CLI/Python usage examples
- [x] Clarify some uncertainties in the Configuration section
- [ ] Add standalone examples of UCX usage

Authors:
  - Charles Blackmon-Luca (@charlesbluca)

Approvers:
  - Peter Andreas Entschev (@pentschev)

URL: #545
@charlesbluca
Copy link
Member

With #545 merged, now we can focus specifically on what code examples could look like. Some that come to mind:

  • Single-node cluster with LocalCUDACluster
    • With NVLink and InfiniBand enabled
    • With either NVLink or InfiniBand
  • Single-node and multi-node clusters with dask-cuda-worker
    • ...

@pentschev
Copy link
Member Author

Just for completeness, #551 aims at adding samples to address this.

@charlesbluca
Copy link
Member

With #551 merged, is there anything else that could be done to improve the UCX docs?

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@charlesbluca
Copy link
Member

With dask/distributed#4683 merged, we now have the UCX configuring variables documented on Dask's main site. Personally, I feel like with this change it might be good to remove the documentation of these variables from our docs and simple direct users there for more information. Then, we can change our UCX configuration section to a condensed list of what variables are required and for what.

@pentschev
Copy link
Member Author

Sounds good to me, thanks @charlesbluca for keeping track of all this!

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@pentschev
Copy link
Member Author

We did largely improve UCX documentation over the past few months, thanks mostly to @charlesbluca . I don't see any immediate needs for additional documentation, therefore I'm closing this for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress doc Documentation inactive-30d
Projects
None yet
Development

No branches or pull requests

2 participants