Add distributed-ucxx subproject #60
I have some small suggestions but broadly this looks good to me.
Things that I think are worthwhile fixing:
- remove `versioneer` stuff
- rename `test_ucx.py` -> `test_ucxx.py`
await _worker_close(*args, **kwargs)

if worker._protocol.startswith("ucxx") and worker.nanny is not None:
    _stop_notifier_thread_and_progress_tasks()
Presume it does not matter if `ucxx.stop_notifier_thread()` is called multiple times.
Correct, it's a no-op if already stopped.
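For readers following the thread, the "no-op if already stopped" behaviour can be illustrated with a minimal sketch. This is not the UCXX implementation: the `Notifier` class and its methods are hypothetical names, and a plain `threading.Thread` stands in for the real notifier thread.

```python
import threading


class Notifier:
    """Hypothetical stand-in for a background notifier thread (not UCXX code)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._stop_event = threading.Event()
        self._thread = None

    def start(self):
        with self._lock:
            if self._thread is not None:
                return  # already running
            self._stop_event.clear()
            self._thread = threading.Thread(target=self._run, daemon=True)
            self._thread.start()

    def _run(self):
        # Block until a stop is requested.
        self._stop_event.wait()

    def stop(self):
        """Stop the thread; return False (a no-op) if it was not running."""
        with self._lock:
            # Take ownership of the thread so only one caller shuts it down.
            thread, self._thread = self._thread, None
        if thread is None:
            return False
        self._stop_event.set()
        thread.join()
        return True
```

Under this pattern the second and later calls to `stop()` return immediately, so a caller such as the worker-close hook above would not need to track whether shutdown has already happened.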
# TODO: We don't know if any frames are CUDA, investigate whether
# we need to synchronize device here.
frames = await self.ep.recv_multi()
On the receiving side, surely the fact that one has received the message is sufficient to sync the relevant stream. Since `recv_multi` allocates the received frames internally and doesn't expose a stream, it should behave synchronously, I think.
That is a good point. Unfortunately this is something I overlooked, so I filed issue #112 to handle it.
[tool.versioneer]
VCS = "git"
style = "pep440"
versionfile_source = "distributed_ucxx/_version.py"
versionfile_build = "distributed_ucxx/_version.py"
tag_prefix = "v"
parentdir_prefix = "distributed_ucxx-"
Should we use the same approach as elsewhere and not introduce `versioneer`, which I think we've migrated away from?
Absolutely, I'll handle that in a follow-up PR. For now I'll keep it as is just to have packages installable for other projects to test.
Co-authored-by: Lawrence Mitchell <[email protected]>
/merge
Proposes removing the build-time dependency on `tomli` for wheels and conda packages. It doesn't appear to be used anywhere here.

```shell
git grep tomli
```

## Notes for Reviewers

I originally noticed something similar in `ucx-py` (rapidsai/ucx-py#1042), then went searching for similar cases across RAPIDS.

That dependency was added for `distributed-ucxx` back in #60. I'm not sure why, but I suspect it was related to the use of `versioneer` in this project at the time.

Reference: python-versioneer/python-versioneer#338 (comment)

This project doesn't use `versioneer` any more (#114). I strongly suspect that the dependency on `tomli` can be removed.

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)
- https://github.com/jakirkham
- Ray Douglass (https://github.com/raydouglass)

URL: #228
Add new subproject `distributed-ucxx`, providing a plugin for Distributed with a new `protocol="ucxx"` that can be specified by the user to enable UCXX as backend for communication. This is completely independent of UCX-Py for now, which may still be chosen by specifying `protocol="ucx"`.

Most of the changes here are actually reimplementing the UCX-Py backend from Distributed, with minor changes such as `ucp` -> `ucxx` and to adapt to API changes in UCXX. The tests in this PR are also those that currently test UCX-Py in Distributed, similarly with `ucp` -> `ucxx` and API adaptations.

Packaging and distribution may still require further work that will be addressed in follow-up PRs.
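As a rough illustration of how a user might select the new protocol, here is a minimal sketch. It assumes `distributed` and `distributed-ucxx` are installed, that the plugin makes `"ucxx"` available to Distributed once installed, and that the machine has a working UCX setup. The helper names `start_ucxx_cluster` and `scheduler_address` are hypothetical and not part of this PR.

```python
def scheduler_address(host, port, protocol="ucxx"):
    """Build a Distributed-style address of the form protocol://host:port."""
    return f"{protocol}://{host}:{port}"


def start_ucxx_cluster(n_workers=2, protocol="ucxx"):
    """Start a local cluster using the given comm protocol (sketch only).

    Assumes distributed-ucxx is installed so that "ucxx" is a known
    protocol; pass protocol="ucx" to select UCX-Py instead.
    """
    # Imported lazily so that defining this helper does not require Distributed.
    from distributed import Client, LocalCluster

    cluster = LocalCluster(protocol=protocol, n_workers=n_workers)
    client = Client(cluster)
    return cluster, client
```

Calling `start_ucxx_cluster()` would return a cluster whose scheduler address should begin with `ucxx://`, and passing `protocol="ucx"` would keep the existing UCX-Py backend, which matches the independence between the two backends described above.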