[core][compiled graphs] Add CPU-based NCCL communicator for development #47936

Closed
Tracked by #47983
stephanie-wang opened this issue Oct 8, 2024 · 7 comments · Fixed by #48440
Assignees
Labels
beta Beta release feture compiled-graphs core Issues that should be addressed in Ray Core enhancement Request for new feature and/or capability good-first-issue Great starter issue for someone just starting to contribute to Ray P1 Issue that should be fixed within a few weeks

Comments

@stephanie-wang
Contributor

Description

For development and debugging, it's useful to be able to run compiled graphs that contain NCCL transport hints using a CPU-based communicator. The communicator could use the Ray object store / a Ray actor to perform p2p and collective ops.
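
To make the p2p half of this idea concrete, here is a minimal sketch of send/recv routed through a shared store instead of NCCL. It is not Ray code: `MailboxStore` is a hypothetical name, and a plain Python object with threads stands in for the Ray object store or actor that the description proposes.

```python
import queue
import threading
from collections import defaultdict


class MailboxStore:
    """Hypothetical stand-in for a shared store: one FIFO queue per (src, dst) pair."""

    def __init__(self):
        self._lock = threading.Lock()
        self._boxes = defaultdict(queue.Queue)

    def _box(self, src, dst):
        # Guard queue creation so a sender and receiver always share one queue.
        with self._lock:
            return self._boxes[(src, dst)]

    def send(self, src, dst, value):
        self._box(src, dst).put(value)

    def recv(self, src, dst, timeout=5.0):
        # Blocks until the matching send arrives; raises queue.Empty on timeout.
        return self._box(src, dst).get(timeout=timeout)


store = MailboxStore()
received = []


def receiver():
    # Rank 1 waits for a message from rank 0.
    received.append(store.recv(src=0, dst=1))


t = threading.Thread(target=receiver)
t.start()
store.send(src=0, dst=1, value=[1.0, 2.0, 3.0])
t.join()
```

Keying the queues by (src, dst) keeps messages between different rank pairs from interleaving, which is roughly the ordering guarantee a p2p channel is expected to give.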

Use case

No response

@stephanie-wang stephanie-wang added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) compiled-graphs good-first-issue Great starter issue for someone just starting to contribute to Ray labels Oct 8, 2024
@Bye-legumes
Contributor

What is the difference between this CPU-based NCCL communicator and the mock NCCL used in tests? Or is this not for testing, but rather a fallback to CPU/shared memory when NCCL is not available?

@anyscalesam anyscalesam added the core Issues that should be addressed in Ray Core label Oct 16, 2024
@dengwxn
Contributor

dengwxn commented Oct 17, 2024

Assigned to @tfsingh and @anyadontfly.
I think this is mainly for tests.
For starters, we should work on all-reduce first. Here's a good picture to explain collectives.
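
As a reference for what all-reduce is expected to compute, here is a small sketch in plain Python. The function name `allreduce_sum` is hypothetical, and sum is only one possible reduction: every rank contributes a vector and every rank gets back the same elementwise result.

```python
def allreduce_sum(per_rank_values):
    """Reference semantics: every rank ends up with the elementwise sum."""
    # zip(*...) groups the i-th element of every rank's vector together.
    reduced = [sum(column) for column in zip(*per_rank_values)]
    # Each rank receives its own copy of the reduced result.
    return [list(reduced) for _ in per_rank_values]


inputs = [[1, 2], [10, 20], [100, 200]]  # one vector per rank
outputs = allreduce_sum(inputs)
```

A function like this can serve as an oracle when testing a real communicator: run the collective, then compare each rank's output with the corresponding entry returned here.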

@stephanie-wang
Contributor Author

Actually it would be good to make this work for non-testing purposes, so that users can debug DAGs with collective ops on CPU.

@tfsingh
Contributor

tfsingh commented Oct 18, 2024

Commenting for assignment

@anyadontfly
Contributor

Commenting for assignment

@rkooo567
Contributor

who's going to take this task?

@rkooo567 rkooo567 added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 24, 2024
@stephanie-wang
Contributor Author

> who's going to take this task?

@tfsingh and @anyadontfly are working on this.

@stephanie-wang stephanie-wang added the beta Beta release feture label Nov 27, 2024
stephanie-wang pushed a commit that referenced this issue Dec 19, 2024
…nt (#48440)

This allows developers to debug DAGs with collective ops on CPU.
Currently, a Ray actor is used to perform the allreduce.

```python
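import ray
from ray.dag import InputNode, MultiOutputNode
from ray.experimental import collective

# CPUTorchTensorWorker and CPUCommunicator are assumed to be defined
# elsewhere (this snippet does not show where they are imported from).
num_workers = 2
workers = [CPUTorchTensorWorker.remote() for _ in range(num_workers)]
cpu_group = CPUCommunicator(num_workers, workers)

with InputNode() as inp:
    computes = [worker.compute.bind(inp) for worker in workers]
    # Run the allreduce over the CPU communicator instead of NCCL.
    collectives = collective.allreduce.bind(computes, transport=cpu_group)
    recvs = [
        worker.recv.bind(reduced)
        for worker, reduced in zip(workers, collectives)
    ]
    dag = MultiOutputNode(recvs)

compiled_dag = dag.experimental_compile()
ray.get(compiled_dag.execute())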
```
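
The idea of routing an allreduce through one coordinating actor can be sketched without Ray. This is not the code from the PR: `AllReduceCoordinator` is a hypothetical class, threads stand in for workers, and a `threading.Barrier` stands in for whatever synchronization the actor actually uses.

```python
import threading


class AllReduceCoordinator:
    """Hypothetical stand-in for a coordinating actor; not Ray's implementation."""

    def __init__(self, world_size):
        self._contributions = [None] * world_size
        self._barrier = threading.Barrier(world_size)

    def allreduce(self, rank, values):
        self._contributions[rank] = values
        # Wait until every rank has contributed its values.
        self._barrier.wait()
        reduced = [sum(column) for column in zip(*self._contributions)]
        # Wait until every rank has read, so the buffer can be reused next round.
        self._barrier.wait()
        return reduced


world_size = 3
coordinator = AllReduceCoordinator(world_size)
results = [None] * world_size


def worker(rank):
    # Rank r contributes [r + 1, 10 * (r + 1)].
    results[rank] = coordinator.allreduce(rank, [rank + 1, 10 * (rank + 1)])


threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The second barrier is what makes repeated calls safe: no rank can overwrite its slot for the next round until all ranks have finished reading the current one.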

Closes [#47936](#47936)
---------

Signed-off-by: tfsingh <[email protected]>
Co-authored-by: Tej Singh <[email protected]>
Co-authored-by: tfsingh <[email protected]>
Co-authored-by: Puyuan Yao <[email protected]>