[core][compiled graphs] Add CPU-based NCCL communicator for development #47936
Comments
What is the difference between this CPU-based NCCL communicator and the mock NCCL used in tests? Or is this not for testing, but rather a fallback to CPU/shared memory when NCCL is not available?
Assigned to @tfsingh and @anyadontfly.
Actually it would be good to make this work for non-testing purposes, so that users can debug DAGs with collective ops on CPU.
Commenting for assignment
Commenting for assignment
Who's going to take this task?
@tfsingh and @anyadontfly are working on this.
…nt (#48440)

This allows developers to debug DAGs with collective ops on CPU. Currently we use a Ray actor to perform allreduce.

```python
num_workers = 2
workers = [CPUTorchTensorWorker.remote() for _ in range(num_workers)]
cpu_group = CPUCommunicator(num_workers, workers)

with InputNode() as inp:
    computes = [worker.compute.bind(inp) for worker in workers]
    collectives = collective.allreduce.bind(computes, transport=cpu_group)
    recvs = [
        worker.recv.bind(collective)
        for worker, collective in zip(workers, collectives)
    ]
    dag = MultiOutputNode(recvs)

compiled_dag = dag.experimental_compile()
ray.get(compiled_dag.execute())
```

Closes [#47936](#47936)

---------

Signed-off-by: tfsingh <[email protected]>
Co-authored-by: Tej Singh <[email protected]>
Co-authored-by: tfsingh <[email protected]>
Co-authored-by: Puyuan Yao <[email protected]>
Description
For development and debugging, it's useful to be able to run compiled graphs that contain NCCL transport hints using a CPU-based communicator. The communicator could use the Ray object store / a Ray actor to perform p2p and collective ops.
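Below is a minimal, hypothetical sketch of how a Ray actor could serve as such a CPU-based communicator for a single allreduce. The names (`CPUAllReduceActor`, `Worker`, `allreduce`) and the single-shot design are illustrative assumptions for discussion, not the API merged in the eventual PR.

```python
# Sketch only: a central async Ray actor gathers one tensor per rank,
# sums them, and returns the reduced tensor to every caller.
import asyncio

import ray
import torch


@ray.remote
class CPUAllReduceActor:
    """Hypothetical CPU 'communicator' supporting one allreduce round."""

    def __init__(self, world_size: int):
        self.world_size = world_size
        self.buffer = {}
        self.done = asyncio.Event()

    async def allreduce(self, rank: int, tensor: torch.Tensor) -> torch.Tensor:
        # Record this rank's contribution; the last rank to arrive
        # performs the reduction and unblocks everyone else.
        self.buffer[rank] = tensor
        if len(self.buffer) == self.world_size:
            self.result = torch.stack(list(self.buffer.values())).sum(dim=0)
            self.done.set()
        await self.done.wait()
        return self.result


@ray.remote
class Worker:
    def __init__(self, rank: int, comm):
        self.rank = rank
        self.comm = comm

    def compute_and_reduce(self, x: float) -> torch.Tensor:
        local = torch.ones(4) * x * (self.rank + 1)
        # Block until every rank has contributed, then return the sum.
        return ray.get(self.comm.allreduce.remote(self.rank, local))


if __name__ == "__main__":
    world_size = 2
    comm = CPUAllReduceActor.remote(world_size)
    workers = [Worker.remote(r, comm) for r in range(world_size)]
    outputs = ray.get([w.compute_and_reduce.remote(2.0) for w in workers])
    print(outputs)  # both ranks observe the same reduced tensor
```

Because everything runs in ordinary Ray actors on CPU, this kind of communicator lets a DAG with collective ops be stepped through and debugged without any GPUs or NCCL installation.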