[core][compiled-graphs] Don't persist input_nodes in _CollectiveOperation to avoid wrong understanding about DAGs #48463

kevin85421 · 2024-10-31T07:23:37Z

Why are these changes needed?

If we persist input_nodes in _CollectiveOperation, all input_nodes will be added to the upstream_nodes when building the DAG. However, not all input_nodes belong to the args of the DAG node. This could potentially cause issues when compiling the graph.
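
To illustrate the description above, here is a minimal sketch using toy classes. ToyNode, ToyCollectiveOp, find_nodes, and upstream_names are hypothetical stand-ins, not Ray's API, and the sketch assumes the upstream scan walks into helper objects stored in a node's bound arguments:

```python
from typing import Any, Dict, List, Optional, Tuple


class ToyNode:
    """Stand-in for a DAG node (not Ray's DAGNode)."""

    def __init__(
        self,
        name: str,
        args: Tuple[Any, ...] = (),
        other_args: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.name = name
        self.args = args
        self.other_args = other_args or {}


class ToyCollectiveOp:
    """Stand-in helper shared by all output nodes of one collective."""

    def __init__(self, input_nodes: List[ToyNode], persist: bool) -> None:
        self.num_inputs = len(input_nodes)
        if persist:
            # Behaviour before the change: keep references to every input node.
            self.input_nodes = list(input_nodes)


def find_nodes(value: Any) -> List[ToyNode]:
    """Recursively find ToyNode instances inside containers and helper objects."""
    if isinstance(value, ToyNode):
        return [value]
    if isinstance(value, (list, tuple)):
        return [n for item in value for n in find_nodes(item)]
    if hasattr(value, "__dict__"):
        return [n for item in vars(value).values() for n in find_nodes(item)]
    return []


def upstream_names(node: ToyNode) -> List[str]:
    """Names of every node reachable from the node's bound arguments."""
    values = list(node.args) + list(node.other_args.values())
    return sorted({n.name for v in values for n in find_nodes(v)})


a, b = ToyNode("a"), ToyNode("b")
persisted = ToyNode(
    "out_a", args=(a,), other_args={"op": ToyCollectiveOp([a, b], persist=True)}
)
lean = ToyNode(
    "out_a", args=(a,), other_args={"op": ToyCollectiveOp([a, b], persist=False)}
)

print(upstream_names(persisted))  # includes "b", which is not an arg of out_a
print(upstream_names(lean))       # only the node's own arg
```

In this toy model, the node built with the persisting helper reports both a and b as upstream even though only a is one of its args; dropping the stored references removes the extra entry.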

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: kaihsun <[email protected]>

kevin85421 · 2024-10-31T07:25:40Z

cc @dengwxn would you mind reviewing this PR? Thanks!

dengwxn · 2024-10-31T18:12:48Z

LGTM! It is a simple and direct change.

ruisearch42

The change looks good. Just to better understand, why "not all input_nodes belong to the args of the DAG node" in the case of collectives?

ruisearch42 · 2024-10-31T23:18:03Z

python/ray/dag/collective_node.py

Should dag_node.py::_collect_upstream_nodes() scan only certain fields? For example, instead of scanning every dict entry of other_args_to_resolve, it could exclude COLLECTIVE_OPERATION_KEY.

I think in any case we should state the assumptions of dag_node.py::_collect_upstream_nodes() in its docstring. Currently the assumption is that all nodes appearing in the following fields are considered upstream nodes:

self._bound_args, self._bound_kwargs, self._bound_other_args_to_resolve,
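
As a rough sketch of what a documented assumption plus an optional key exclusion could look like: the function below is a toy, not the real _collect_upstream_nodes(). The skip_keys parameter is hypothetical, and the value given to COLLECTIVE_OPERATION_KEY is assumed for illustration only.

```python
from typing import Any, Dict, FrozenSet, List, Tuple

# Assumed value, for illustration only.
COLLECTIVE_OPERATION_KEY = "collective_operation"


class ToyNode:
    """Stand-in for a DAG node (not Ray's DAGNode)."""

    def __init__(self, name: str) -> None:
        self.name = name


def collect_upstream_nodes(
    bound_args: Tuple[Any, ...],
    bound_kwargs: Dict[str, Any],
    other_args_to_resolve: Dict[str, Any],
    skip_keys: FrozenSet[str] = frozenset(),
) -> List[ToyNode]:
    """Return the nodes treated as upstream of a node.

    Assumption: every ToyNode found in bound_args, bound_kwargs, or
    other_args_to_resolve is upstream, except entries of
    other_args_to_resolve whose key is listed in skip_keys.
    """
    candidates = list(bound_args) + list(bound_kwargs.values())
    candidates += [v for k, v in other_args_to_resolve.items() if k not in skip_keys]
    upstream: List[ToyNode] = []
    for candidate in candidates:
        items = candidate if isinstance(candidate, (list, tuple)) else [candidate]
        for item in items:
            # Deduplicate by identity while preserving discovery order.
            if isinstance(item, ToyNode) and item not in upstream:
                upstream.append(item)
    return upstream


a, b = ToyNode("a"), ToyNode("b")
other = {COLLECTIVE_OPERATION_KEY: [a, b]}

scan_all = [n.name for n in collect_upstream_nodes((a,), {}, other)]
scan_skip = [
    n.name
    for n in collect_upstream_nodes(
        (a,), {}, other, frozenset({COLLECTIVE_OPERATION_KEY})
    )
]
print(scan_all, scan_skip)
```

With the key skipped, only the node's own arg is reported; without it, every node stored under that key is reported as well.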

> I think in any case we should state the assumptions of dag_node.py::_collect_upstream_nodes() in its docstring. Currently the assumption is that all nodes appearing in the following fields are considered upstream nodes:

Agree on this. It sounds ok to not scan everything in other_args_to_resolve.

I have considered skipping COLLECTIVE_OPERATION_KEY. My concern is that it seems odd for a base class (DAGNode) to implement logic that belongs to classes built on top of it. Additionally, the code path applies to both DAG and ADAG, and I am a bit worried about future complexity if we keep adding ADAG-specific logic to the shared code path. What do you think?

If you two think skipping the key is better, I will update the PR.

Or perhaps add a new field _bound_other_args_not_to_resolve and avoid scanning it?

What are the expected upstream nodes for a CollectiveOutputNode? Are they all the input nodes from all the actors, or only the one input node from the same actor?

> This could potentially cause issues when compiling the graph.

What are the potential issues?

I chatted with Kaihsun a bit yesterday, but +1 to Weixin's question.

I think the key issue is the definition of "upstream nodes", especially in the special case of collectives mentioned above. This definition needs to make sense based on how we use upstream nodes in DAG and ADAG. Once this is clarified, we will know the right thing to do. @kevin85421 Can we define that?

> What are the expected upstream nodes for a CollectiveOutputNode? Are they all the input nodes from all the actors, or only the one input node from the same actor?
> What are the potential issues?

I think for now the upstream nodes for a CollectiveOutputNode should be the args of the DAGNode, so that DAG and ADAG have the same understanding of the same graph.

For example, compiled_dag_node.py sets up the upstream/downstream relationship inside preprocess by treating args as a DAGNode's upstream nodes.

However, in dag_node.py, all DAGNodes inside self._bound_args, self._bound_kwargs, and self._bound_other_args_to_resolve are considered upstream nodes.
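
The mismatch between the two views can be sketched with a toy model. The names below (ToyNode, upstream_by_args, upstream_by_all_fields, extra_upstream) are hypothetical and only mirror the two definitions described above; they are not the code in compiled_dag_node.py or dag_node.py.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set, Tuple


@dataclass(eq=False)
class ToyNode:
    """Stand-in for a DAG node with the three kinds of bound arguments."""

    name: str
    args: Tuple[Any, ...] = ()
    kwargs: Dict[str, Any] = field(default_factory=dict)
    other_args: Dict[str, Any] = field(default_factory=dict)


def upstream_by_args(node: ToyNode) -> Set[str]:
    """First view: only nodes passed as positional args are upstream."""
    return {a.name for a in node.args if isinstance(a, ToyNode)}


def upstream_by_all_fields(node: ToyNode) -> Set[str]:
    """Second view: nodes in args, kwargs, and other_args are all upstream."""
    values = list(node.args) + list(node.kwargs.values())
    for value in node.other_args.values():
        values.extend(value if isinstance(value, (list, tuple)) else [value])
    return {v.name for v in values if isinstance(v, ToyNode)}


def extra_upstream(node: ToyNode) -> Set[str]:
    """Nodes that only the second view treats as upstream."""
    return upstream_by_all_fields(node) - upstream_by_args(node)


x, y = ToyNode("x"), ToyNode("y")
plain = ToyNode("plain", args=(x,))
collective = ToyNode("collective", args=(x,), other_args={"inputs": [x, y]})

print(extra_upstream(plain))
print(extra_upstream(collective))
```

For an ordinary node the two views agree; for the toy collective node, the second view additionally reports y, which is the kind of inconsistency discussed in this thread.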

Synced offline with @ruisearch42: updated the comments and opened issue #48520 to track the progress.

Our conclusion is to introduce a new field only if we observe more and more issues caused by the inconsistency.

dengwxn · 2024-11-01T00:24:17Z

> The change looks good. Just to better understand, why "not all input_nodes belong to the args of the DAG node" in the case of collectives?

input_nodes is passed to the helper class _CollectiveOperation, and the _CollectiveOperation is passed to each CollectiveOutputNode. In this way, the collective-related logic is hidden in _CollectiveOperation. Only one _CollectiveOperation is created for all the CollectiveOutputNodes.
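
A minimal sketch of that structure, assuming toy classes (ToyInput, ToyCollectiveOperation, ToyCollectiveOutputNode, and bind_collective are hypothetical names, not Ray's implementation): one helper object is shared by every output node, each output node takes only its own input as an arg, and the helper keeps derived data rather than the nodes themselves.

```python
from typing import List


class ToyInput:
    """Stand-in for one actor's input node."""

    def __init__(self, actor: str) -> None:
        self.actor = actor


class ToyCollectiveOperation:
    """Stand-in helper that hides the collective-related logic."""

    def __init__(self, input_nodes: List[ToyInput]) -> None:
        if not input_nodes:
            raise ValueError("expected at least one input node")
        # Keep only derived data; do not hold references to the nodes.
        self.num_participants = len(input_nodes)


class ToyCollectiveOutputNode:
    """Stand-in output node: its only arg is the input from the same actor."""

    def __init__(self, input_node: ToyInput, operation: ToyCollectiveOperation) -> None:
        self.args = (input_node,)
        self.operation = operation


def bind_collective(input_nodes: List[ToyInput]) -> List[ToyCollectiveOutputNode]:
    """Create one shared operation and one output node per input node."""
    operation = ToyCollectiveOperation(input_nodes)
    return [ToyCollectiveOutputNode(node, operation) for node in input_nodes]


inputs = [ToyInput("actor_0"), ToyInput("actor_1")]
outputs = bind_collective(inputs)

# Every output node points at the same helper object.
print(all(out.operation is outputs[0].operation for out in outputs))
```

Because the helper does not store the input nodes, a scan of any single output node's bound arguments in this toy model can only reach that node's own input.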

Signed-off-by: kaihsun <[email protected]>

…tion to avoid wrong understanding about DAGs (ray-project#48463) If we persist input_nodes in _CollectiveOperation, all input_nodes will be added to the upstream_nodes when building the DAG. However, not all input_nodes belong to the args of the DAG node. This could potentially cause issues when compiling the graph.

…tion to avoid wrong understanding about DAGs (ray-project#48463) If we persist input_nodes in _CollectiveOperation, all input_nodes will be added to the upstream_nodes when building the DAG. However, not all input_nodes belong to the args of the DAG node. This could potentially cause issues when compiling the graph. Signed-off-by: JP-sDEV <[email protected]>

…tion to avoid wrong understanding about DAGs (ray-project#48463) If we persist input_nodes in _CollectiveOperation, all input_nodes will be added to the upstream_nodes when building the DAG. However, not all input_nodes belong to the args of the DAG node. This could potentially cause issues when compiling the graph. Signed-off-by: mohitjain2504 <[email protected]>

update

0f86c11

Signed-off-by: kaihsun <[email protected]>

kevin85421 assigned ruisearch42 and rkooo567 Oct 31, 2024

kevin85421 marked this pull request as ready for review October 31, 2024 07:26

kevin85421 added the "go" label (add ONLY when ready to merge, run all tests) Oct 31, 2024

ruisearch42 reviewed Oct 31, 2024

update

76d4ece

Signed-off-by: kaihsun <[email protected]>

kevin85421 mentioned this pull request Nov 3, 2024

[core][compiled-graphs] Revisit the definition of upstream / downstream definitions in DAG and RayCG #48520

Open

ruisearch42 approved these changes Nov 3, 2024

rkooo567 merged commit 3581e62 into ray-project:master Nov 4, 2024
5 checks passed
