Improve UCX documentation and samples #544
Comments
Happy to take this on! I don't have as much experience setting up UCX through the CLI, but we can use this issue to discuss that when the time comes. |
Just now looking into this! A question on the UCX configuration - are the listed environment variables being processed through
ucx:
  cuda-copy: true # required
  tcp: true # required
  nvlink: true # required for NVLink
  infiniband: true # required for InfiniBand
  rdmacm: true # recommended for IB
  net-devices: 'mlx5_0:1' # important to set to 'mlx*****' with IB enabled, otherwise Ethernet device
rmm:
  pool-size: 1GB # recommended to prevent Dask scheduler crashes
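As a rough illustration of how the nested keys above relate to Dask's environment-variable form, here is a minimal sketch. It assumes Dask's usual convention (a `DASK_` prefix, upper case, and a double underscore between nesting levels); `flatten` and `to_env_var` are hypothetical helpers written for this example, not part of Dask or Dask-CUDA, and the top-level `ucx`/`rmm` layout is copied from the snippet above and may be nested differently in other Dask releases.

```python
def flatten(config, prefix=""):
    """Flatten a nested dict into {"dotted.key": value} pairs."""
    flat = {}
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


def to_env_var(dotted_key):
    """Hypothetical helper: dotted config key -> DASK_* variable name."""
    # Hyphens become single underscores; each nesting level ('.')
    # becomes a double underscore, which is the easy part to mistype.
    return "DASK_" + dotted_key.upper().replace("-", "_").replace(".", "__")


# Values taken from the configuration quoted in the comment above.
config = {
    "ucx": {
        "cuda-copy": True,
        "tcp": True,
        "nvlink": True,
        "infiniband": True,
        "rdmacm": True,
        "net-devices": "mlx5_0:1",
    },
    "rmm": {"pool-size": "1GB"},
}

flat = flatten(config)
env_vars = {to_env_var(key): str(value) for key, value in flat.items()}

for name, value in env_vars.items():
    print(f"{name}={value}")
```

Generating the names from the nested form, rather than typing them by hand, is one way to avoid the single-versus-double underscore mistakes discussed in this thread.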
I agree that the configuration reference page could be improved with descriptions for the sections that say "No Comment". However, I think we still should list them in our docs because it's very easy to introduce typos, for example note that there is a double-underscore after
This is yet another thing I would normally prefer not to. We can always add that as an option for users, but not necessarily suggest it by default -- and by the way, these configurations already live in |
That was my motivation behind listing the variables in their Dask config form (e.g.
You're right, it's better to encourage explicitly setting these variables on a per-run basis rather than setting implicitly for all runs. In that case, I think a good way to showcase setting the options without environment variables would be to use |
Actually maybe the client configuration would be better suited for |
To provide some hope here, I expect that in the mid-term we can greatly reduce the number of options we need to pass to UCX as things become more stable. With that said, I would be happy enough if we have good docs and samples now even if they're a bit convoluted. I think in either case the number of variables that need to be set is the biggest problem.
Do you mean to use |
I agree, samples are definitely a good place where we can more explicitly contextualize some of these options.
Good! Just making sure. I think using |
Opened up a draft PR #545 |
Addressing #544, this PR aims to clarify the requirements, configuration, and usage of UCX with Dask-CUDA. Still a lot to be done:
- [x] Flesh out hardware/software requirements
- [x] Rework CLI/Python usage examples
- [x] Clarify some uncertainties in the Configuration section
- [ ] Add standalone examples of UCX usage
Authors:
- Charles Blackmon-Luca (@charlesbluca)
Approvers:
- Peter Andreas Entschev (@pentschev)
URL: #545
With #545 merged, now we can focus specifically on what code examples could look like. Some that come to mind:
Just for completeness, #551 aims at adding samples to address this. |
With #551 merged, is there anything else that could be done to improve the UCX docs? |
This issue has been labeled |
With dask/distributed#4683 merged, we now have the UCX configuration variables documented on Dask's main site. Personally, I feel that with this change it might be good to remove the documentation of these variables from our docs and simply direct users there for more information. Then, we can change our UCX configuration section to a condensed list of which variables are required and for what.
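To make the "condensed list" idea concrete, here is a minimal sketch that encodes only the required/recommended annotations quoted earlier in this thread. `ucx_options` is a hypothetical helper for illustration, not a Dask-CUDA API, and the option names are those from the snippet above.

```python
def ucx_options(nvlink=False, infiniband=False):
    """Return the UCX options needed for the requested transports.

    The required/recommended split follows the comments in the
    configuration quoted earlier in this thread.
    """
    # Always required.
    options = {"cuda-copy": True, "tcp": True}
    if nvlink:
        # Required for NVLink.
        options["nvlink"] = True
    if infiniband:
        # Required for InfiniBand.
        options["infiniband"] = True
        # Recommended (not strictly required) for InfiniBand.
        options["rdmacm"] = True
    return options


for label, kwargs in [
    ("TCP only", {}),
    ("NVLink", {"nvlink": True}),
    ("InfiniBand", {"infiniband": True}),
]:
    print(label, sorted(ucx_options(**kwargs)))
```

A table in the docs could carry the same information; the function form just makes the dependencies between options explicit.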
Sounds good to me, thanks @charlesbluca for keeping track of all this! |
This issue has been labeled |
We did largely improve UCX documentation over the past few months, thanks mostly to @charlesbluca. I don't see any immediate needs for additional documentation, therefore I'm closing this for now.
We currently have some documentation on using UCX with Dask-CUDA available at https://dask-cuda.readthedocs.io/en/latest/ucx.html. However, I feel that documentation is a bit convoluted, so reviewing and improving its text is a good idea. Furthermore, it could be easier for readers if they have a small set of samples they can refer to inside the dask-cuda repository, perhaps under a new directory samples/ucx, where we could have a few simple scripts to run a single-node cluster with LocalCUDACluster, as well as one or more scripts to run dask-scheduler + dask-cuda-worker (possibly more than just one, for multi-node clusters) + client code.
@charlesbluca is this something you would be interested in doing? There's no rush on it, I just thought someone like you would be perfect for it, as you still don't have the bias of knowing all the options by heart. 🙂
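For illustration, a single-node sample of the kind proposed here might look roughly like the sketch below. It is not the script that was eventually added to the repository. It assumes a machine with NVIDIA GPUs and with `dask-cuda` and UCX support installed; the keyword arguments are `LocalCUDACluster` options as I understand them and should be checked against the installed Dask-CUDA version. `UCX_CLUSTER_KWARGS` and `run_single_node_example` are names made up for this example.

```python
# Assumed settings for a single node with NVLink; adjust to the hardware.
UCX_CLUSTER_KWARGS = {
    "protocol": "ucx",            # use UCX instead of the default TCP comms
    "enable_tcp_over_ucx": True,  # TCP transport handled by UCX
    "enable_nvlink": True,        # only meaningful if the GPUs have NVLink
    "rmm_pool_size": "1GB",       # RMM pool size, as recommended in this thread
}


def run_single_node_example():
    """Start a local UCX-enabled cluster and run a trivial task on it."""
    # Imported lazily so this file can be read or imported on a machine
    # without GPUs or without dask-cuda installed.
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    with LocalCUDACluster(**UCX_CLUSTER_KWARGS) as cluster:
        with Client(cluster) as client:
            # A trivial task, just to confirm the workers are reachable.
            return client.submit(sum, range(10)).result()
```

Calling `run_single_node_example()` on a suitable machine starts the cluster, runs the task, and shuts everything down. A multi-node variant would instead start dask-scheduler and dask-cuda-worker from the command line and connect a client to the scheduler address.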