Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add distributed-ucxx subproject #60

Merged
merged 15 commits into from
Oct 30, 2023

Conversation

pentschev
Copy link
Member

@pentschev pentschev commented Jun 14, 2023

Add new subproject distributed-ucxx, providing a plugin for Distributed with a new protocol="ucxx" that can be specified by the user to enable UCXX as backend for communication. This is completely independent of UCX-Py for now, which may still be chosen by specifying protocol="ucx".

Most of the changes here are actually reimplementing the UCX-Py backend from Distributed, with minor changes such as ucp->ucxx and to adapt to API changes in UCXX. The tests in this PR are also those that currently test UCX-Py in Distributed, similarly with ucp->ucxx and API adaptations.

Packaging and distribution may still require further work that will be addressed in follow-up PRs.

@copy-pr-bot

This comment was marked as outdated.

@pentschev pentschev changed the base branch from branch-0.33 to branch-0.35 October 5, 2023 10:34
@pentschev pentschev force-pushed the distributed branch 3 times, most recently from 08e271d to 2a89d6f Compare October 26, 2023 17:05
@pentschev pentschev marked this pull request as ready for review October 30, 2023 12:36
@pentschev pentschev requested review from a team as code owners October 30, 2023 12:36
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some small suggestions but broadly this looks good to me.

Things that I think are worthwhile fixing:

  • remove versioneer stuff
  • rename test_ucx.py -> test_ucxx.py

await _worker_close(*args, **kwargs)

if worker._protocol.startswith("ucxx") and worker.nanny is not None:
_stop_notifier_thread_and_progress_tasks()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presume it does not matter if ucxx.stop_notifier_thread() is called multiple times.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, it's a no-op if already stopped.

Comment on lines +376 to +378
# TODO: We don't know if any frames are CUDA, investigate whether
# we need to synchronize device here.
frames = await self.ep.recv_multi()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the receiving side, surely the fact that one has received the message is sufficient to sync the relevant stream. Since recv_multi allocates the received frames internally and doesn't expose a stream, it should behave synchronously I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point, unfortunately this is something I overlooked and filed issue #112 to handle that.

python/distributed-ucxx/distributed_ucxx/ucxx.py Outdated Show resolved Hide resolved
python/distributed-ucxx/distributed_ucxx/ucxx.py Outdated Show resolved Hide resolved
python/distributed-ucxx/distributed_ucxx/ucxx.py Outdated Show resolved Hide resolved
Comment on lines +111 to +117
[tool.versioneer]
VCS = "git"
style = "pep440"
versionfile_source = "distributed_ucxx/_version.py"
versionfile_build = "distributed_ucxx/_version.py"
tag_prefix = "v"
parentdir_prefix = "distributed_ucxx-"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use the same approach as elsewhere and not introduce versioneer? Which I think we've migrated away from.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely, I'll handle that in a follow-up PR. For now I'll keep it as is just to have packages installable for other projects to test.

python/distributed-ucxx/distributed_ucxx/tests/test_ucx.py Outdated Show resolved Hide resolved
@pentschev
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 7af9f19 into rapidsai:branch-0.35 Oct 30, 2023
18 checks passed
@pentschev pentschev deleted the distributed branch October 30, 2023 17:13
rapids-bot bot pushed a commit that referenced this pull request May 7, 2024
Proposes removing the build-time dependency on `tomli` for wheels and conda packages.

It doesn't appear to be used anywhere here.

```shell
git grep tomli
```

## Notes for Reviewers

I originally noticed something similar in `ucx-py` (rapidsai/ucx-py#1042), then went searching for similar cases across RAPIDS.

That dependency was added for `distributed-ucxx` back in #60. I'm not sure why, but I suspect it was related to the use of `versioneer` in this project at the time. Reference: python-versioneer/python-versioneer#338 (comment)

This project doesn't use `versioneer` any more (#114). I strongly suspect that the dependency on `tomli` can be removed.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - https://github.com/jakirkham
  - Ray Douglass (https://github.com/raydouglass)

URL: #228
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants