Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core][compiled-graphs] Don't persist input_nodes in _CollectiveOperation to avoid wrong understanding about DAGs #48463

Conversation

kevin85421
Copy link
Member

@kevin85421 kevin85421 commented Oct 31, 2024

Why are these changes needed?

If we persist input_nodes in _CollectiveOperation, all input_nodes will be added to the upstream_nodes when building the DAG. However, not all input_nodes belong to the args of the DAG node. This could potentially cause issues when compiling the graph.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: kaihsun <[email protected]>
@kevin85421
Copy link
Member Author

cc @dengwxn would you mind reviewing this PR? Thanks!

@kevin85421 kevin85421 marked this pull request as ready for review October 31, 2024 07:26
@dengwxn
Copy link
Contributor

dengwxn commented Oct 31, 2024

LGTM! It is simple and direct change.

@kevin85421 kevin85421 added the go add ONLY when ready to merge, run all tests label Oct 31, 2024
Copy link
Contributor

@ruisearch42 ruisearch42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good. Just to better understand, why "not all input_nodes belong to the args of the DAG node" in the case of collectives?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we only scan certain fields for dag_node.py::_collect_upstream_nodes(), say not all dict entries of other_args_to_resolve, but exclude COLLECTIVE_OPERATION_KEY?

I think in any case we should mention in the docstring of dag_node.py::_collect_upstream_nodes() its assumptions. Currently the assumption is all nodes appear in the following are considered as upstream nodes:

                self._bound_args,
                self._bound_kwargs,
                self._bound_other_args_to_resolve,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in any case we should mention in the docstring of dag_node.py::_collect_upstream_nodes() its assumptions. Currently the assumption is all nodes appear in the following are considered as upstream nodes:

Agree on this. It sounds ok to not scan everything in other_args_to_resolve.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have considered skipping COLLECTIVE_OPERATION_KEY. My concern is that it seems a bit odd for a basic class (DAGNode) to implement logic from other classes built on top of it. Additionally, the code path applies to both DAG and ADAG. I am a bit worried about the complexity in the future if we add more and more ADAG-specific logic inside the shared code path. HDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you two think skipping the key is better, I will update the PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or perhaps add a new field _bound_other_args_not_to_resolve and avoid scanning it?

Copy link
Contributor

@dengwxn dengwxn Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the expected upstream nodes for a CollectiveOutputNode? Are they all the input nodes from all the actors, or simply the only one input node from the same actor?

This could potentially cause issues when compiling the graph.

What are the potential issues?

Copy link
Contributor

@ruisearch42 ruisearch42 Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chatted with Kaihsun a bit yesterday, but +1 to Weixin's question.

I think the key issue is what's the definition of "upstream nodes", especially in the special case of collectives mentioned above. This definition needs to make sense based on how we use them in DAG and ADAG. Once this is clarified, we know what should be the right thing to do. @kevin85421 Can we define that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the expected upstream nodes for a CollectiveOutputNode? Are they all the input nodes from all the actors, or simply the only one input node from the same actor?
What are the potential issues?

I think for now the upstream nodes for a CollectiveOutputNode should be the args of the DAGNode so that DAG and ADAG can have the same understanding for the same graph.

For example, compiled_dag_node.py sets up the upstream/downstream relationship inside preprocess by treating args as a DAGNode's upstream nodes.

However, in dag_node.py, all DAGNodes inside self._bound_args, self._bound_kwargs, and self._bound_other_args_to_resolve are considered as the upstream nodes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sync offline with @ruisearch42 : update the comments, and open an issue to track the progress #48520.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our conclusion is introducing a new field only when we observe more and more issues are caused by the inconsistency.

@dengwxn
Copy link
Contributor

dengwxn commented Nov 1, 2024

The change looks good. Just to better understand, why "not all input_nodes belong to the args of the DAG node" in the case of collectives?

The input_nodes is passed to the helper class _CollectiveOperation, and the _CollectiveOperation is passed to the CollectiveOutputNode. In this way, the collective related logic is hided in _CollectiveOperation. There is only one _CollectiveOperation created for all the CollectiveOutputNode.

Signed-off-by: kaihsun <[email protected]>
@rkooo567 rkooo567 merged commit 3581e62 into ray-project:master Nov 4, 2024
5 checks passed
Jay-ju pushed a commit to Jay-ju/ray that referenced this pull request Nov 5, 2024
…tion to avoid wrong understanding about DAGs (ray-project#48463)

If we persist input_nodes in _CollectiveOperation, all input_nodes will be added to the upstream_nodes when building the DAG. However, not all input_nodes belong to the args of the DAG node. This could potentially cause issues when compiling the graph.
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
…tion to avoid wrong understanding about DAGs (ray-project#48463)

If we persist input_nodes in _CollectiveOperation, all input_nodes will be added to the upstream_nodes when building the DAG. However, not all input_nodes belong to the args of the DAG node. This could potentially cause issues when compiling the graph.

Signed-off-by: JP-sDEV <[email protected]>
mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024
…tion to avoid wrong understanding about DAGs (ray-project#48463)

If we persist input_nodes in _CollectiveOperation, all input_nodes will be added to the upstream_nodes when building the DAG. However, not all input_nodes belong to the args of the DAG node. This could potentially cause issues when compiling the graph.

Signed-off-by: mohitjain2504 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants