
[core][compiled graphs] Add CPU-based NCCL communicator for development #48440

Merged: 49 commits into ray-project:master on Dec 19, 2024

Conversation

@anyadontfly (Contributor) commented Oct 30, 2024

Why are these changes needed?

This allows developers to debug DAGs with collective ops on CPU. Currently, we use a Ray actor to perform the allreduce.

# Assumes ray, InputNode, MultiOutputNode, and collective are imported, and that
# CPUTorchTensorWorker and CPUCommunicator are defined (a sketch follows below).
num_workers = 2
workers = [CPUTorchTensorWorker.remote() for _ in range(num_workers)]
cpu_group = CPUCommunicator(num_workers, workers)

with InputNode() as inp:
    computes = [worker.compute.bind(inp) for worker in workers]
    # Allreduce the workers' outputs over the CPU-based communicator.
    collectives = collective.allreduce.bind(computes, transport=cpu_group)
    recvs = [
        worker.recv.bind(collective)
        for worker, collective in zip(workers, collectives)
    ]
    dag = MultiOutputNode(recvs)

compiled_dag = dag.experimental_compile()
# Pass a value for InputNode when executing the compiled DAG.
ray.get(compiled_dag.execute(1))
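For reference, CPUTorchTensorWorker is not defined in this snippet. A minimal sketch of what such an actor could look like, assuming torch is installed; the actual test worker in this PR may differ:

import torch
import ray


@ray.remote
class CPUTorchTensorWorker:
    # Hypothetical worker for illustration only.
    def compute(self, value: int) -> torch.Tensor:
        # Produce a CPU tensor from the DAG input.
        return torch.ones(3, dtype=torch.float32) * value

    def recv(self, tensor: torch.Tensor) -> list:
        # Return the allreduced tensor as a plain list so ray.get() can display it.
        return tensor.tolist()

With an input of 1, each worker should then see the summed tensor, e.g. [2.0, 2.0, 2.0], assuming the default reduce op is a sum.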

Related issue number

Closes #47936

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

tfsingh and others added 3 commits October 28, 2024 22:46
…er of allreduce ops; minor change: used defaultdict on line 63 for simplicity; added line 94 for unsupported collective ops
@stephanie-wang (Contributor)

Could you also add a test for this?

@AndyUB (Contributor) left a comment

LGTM.

@jcotant1 added the core (Issues that should be addressed in Ray Core) and compiled-graphs labels on Nov 18, 2024
@tfsingh (Contributor) commented Nov 22, 2024

Just a heads-up that @anyadontfly and I are writing some e2e tests with the compiled DAG API; we should have them done by tomorrow.

@stephanie-wang (Contributor) left a comment

Nice, this looks pretty good! Left some comments for cleanup/clarification, but overall the structure looks good.

self.collective_data: Dict[int, List[torch.Tensor]] = defaultdict(list)
# Buffer for the number of actors seen, each entry is one p2p op.
self.num_actors_seen = defaultdict(int)
# Number of actors who have read the result, and are about the exit the function.

Suggested change:
- # Number of actors who have read the result, and are about the exit the function.
+ # Number of actors who have read the result, and are about to exit the function.

self.communicators.add(comm)

received_tensor = ray.get(comm.wait_p2p.remote(self.num_ops[comm_key]))
assert (

In this case you can probably just return the received_tensor directly (the allocator is needed for cases where a receive buffer has to be allocated before the recv happens).
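A sketch of that simplification, with the method signature and helper names assumed rather than taken from the PR:

# Hypothetical sketch only; the PR's actual recv implementation may differ.
def recv(self, shape, dtype, peer_rank, allocator=None):
    comm_key = self._comm_key(peer_rank)          # assumed helper
    comm = self._get_communicator(comm_key)       # assumed helper
    self.communicators.add(comm)
    # The sending actor already materialized the tensor, so no receive buffer
    # has to be allocated first: just return the received tensor directly.
    received_tensor = ray.get(comm.wait_p2p.remote(self.num_ops[comm_key]))
    return received_tensor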

assert (
    ray.get_gpu_ids()
), "Actors participating in NCCL group must have at least one GPU assigned"
if not custom_nccl_group or not isinstance(custom_nccl_group, CPUNcclGroup):

Suggested change:
- if not custom_nccl_group or not isinstance(custom_nccl_group, CPUNcclGroup):
+ if not (custom_nccl_group and isinstance(custom_nccl_group, CPUNcclGroup)):

Nit: a bit more readable with a single not.

return result


class CPUCommunicator:

Can we make this inherit from GPUCommunicator?

One suggestion is to rename GPUCommunicator to a generic Communicator or DeviceCommunicator, and have it return a string naming the resource type that actors in the group are expected to have.
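A rough sketch of that shape; the abstract-method set here is illustrative, not necessarily the interface the PR finally landed:

from abc import ABC, abstractmethod


class Communicator(ABC):
    """Generic communicator; both GPU (NCCL) and CPU groups implement this."""

    @abstractmethod
    def get_device_type(self) -> str:
        """Resource type ("gpu" or "cpu") each actor in the group must have."""

    @abstractmethod
    def allreduce(self, send_buf, recv_buf, op) -> None:
        """Reduce send_buf across the group into recv_buf."""


class CPUCommunicator(Communicator):
    def get_device_type(self) -> str:
        return "cpu"

    def allreduce(self, send_buf, recv_buf, op) -> None:
        # In the PR, the actual reduction is delegated to a Ray actor.
        ...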

@@ -572,17 +574,46 @@ def _do_init_nccl_group(
)


def _do_init_cpu_group(

Can you see if we can reuse the existing _do_init_nccl_group instead?
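One way to read this, sketched with assumed names (neither _do_init_communicator nor ctx.communicators is necessarily what the merged code uses):

# Hypothetical sketch: a single worker-side init helper shared by both
# transports instead of a parallel _do_init_cpu_group.
# ChannelContext is the existing channel context already used in this module.
def _do_init_communicator(self, group_id: str, communicator) -> None:
    ctx = ChannelContext.get_current()
    # Register the communicator (NCCL-backed or CPU-backed) in one table so
    # that downstream lookups do not need to branch on the transport.
    ctx.communicators[group_id] = communicator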

recv_buf = torch.empty_like(send_buf)
nccl_group.allreduce(send_buf, recv_buf, self._op)
ctx = ChannelContext.get_current()
if ctx.nccl_groups:

Same here, I think you can restructure this to just take the "default group", whether it's NCCL or CPU, and then the code inside the if-else branches is the same for both.
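Roughly like this; the communicators table and the self._group_id attribute are assumptions for illustration:

ctx = ChannelContext.get_current()
# Look up the default group for this op, whether it is NCCL- or CPU-backed...
group = ctx.communicators[self._group_id]
# ...after which the allreduce path is identical for both transports.
recv_buf = torch.empty_like(send_buf)
group.allreduce(send_buf, recv_buf, self._op)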

Comment on lines 129 to 131
self.nccl_groups: Dict[str, "GPUCommunicator"] = {}
# Used for the torch.Tensor CPU transport.
self.cpu_groups: Dict[str, "CPUCommunicator"] = {}

# group ID -> Communicator
Option 1: self.device_groups: Dict[str, Communicator]

# resource label -> group ID -> Communicator
Option 2: self.device_groups: Dict[str, Dict[str, Communicator]]
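For illustration only (the group IDs and resource labels are placeholders), lookup under each option would look like:

# Option 1: flat mapping keyed by group ID.
group = self.device_groups[group_id]

# Option 2: nested mapping, keyed first by the resource the actors need.
gpu_group = self.device_groups["gpu"][group_id]
cpu_group = self.device_groups["cpu"][group_id]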

@anyadontfly (Contributor, Author) commented Dec 5, 2024

Can we keep the name self.nccl_groups?

@stephanie-wang (Contributor) left a comment

Thanks, this looks great! A couple of minor comments about naming, then we can merge it. Let's have get_device_type return "gpu" instead of "nccl".

@abstractmethod
def get_device_type() -> str:
    """
    Return the type of the communicator (nccl or cpu).

Suggested change:
- Return the type of the communicator (nccl or cpu).
+ Return the type of the communicator (gpu or cpu).

@@ -307,3 +304,6 @@ def destroy(self) -> None:
        # flag is True when they exit from the abort.
        self._comm.abort()
        self._comm.destroy()
+
+    def get_device_type(self) -> str:
+        return "nccl"

Suggested change:
- return "nccl"
+ return "gpu"

"nccl" is the transport name but "gpu" is the device that each actor is expected to have.

You could add a get_transport_name and that can return NCCL instead.
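Taken together, the suggestion amounts to something like the following sketch (class names are illustrative, and the CPU transport string is a placeholder):

class _NcclGroup:  # GPU/NCCL-backed communicator
    def get_device_type(self) -> str:
        # The resource each participating actor must have.
        return "gpu"

    def get_transport_name(self) -> str:
        # The mechanism used to move tensors between those actors.
        return "nccl"


class CPUCommunicator:  # CPU-backed communicator for development
    def get_device_type(self) -> str:
        return "cpu"

    def get_transport_name(self) -> str:
        return "cpu"  # placeholder; the PR may use a different name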

@@ -856,7 +856,7 @@ def __init__(
        self._use_default_nccl_group = False
        # This is set to the specified custom nccl group
        # if there exists a type hint of `transport=nccl_group`.
-       self._custom_nccl_group_p2p: Optional[GPUCommunicator] = None
+       self._custom_nccl_group_p2p: Optional[Communicator] = None

For consistency, it would be good to replace all occurrences of nccl_group with communicator.

@anyadontfly (Contributor, Author) replied:

Got it. Also, we have functions for initializing and destroying CPUCommunicator in torch_tensor_nccl_channel.py. Do we have to rename torch_tensor_nccl_channel.py, or should we put those functions in a separate file?

@stephanie-wang added the go label (add ONLY when ready to merge, run all tests) on Dec 17, 2024
@stephanie-wang changed the title from "(WIP) [core][compiled graphs] Add CPU-based NCCL communicator for development" to "[core][compiled graphs] Add CPU-based NCCL communicator for development" on Dec 19, 2024
@stephanie-wang merged commit 6676cc1 into ray-project:master on Dec 19, 2024
4 of 5 checks passed
Labels
core (Issues that should be addressed in Ray Core), compiled-graphs, go (add ONLY when ready to merge, run all tests)

Projects
None yet

Development
Successfully merging this pull request may close these issues:
[core][compiled graphs] Add CPU-based NCCL communicator for development

6 participants