-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core][compiled-graphs] Don't persist input_nodes in _CollectiveOperation to avoid wrong understanding about DAGs #48463
[core][compiled-graphs] Don't persist input_nodes in _CollectiveOperation to avoid wrong understanding about DAGs #48463
Conversation
Signed-off-by: kaihsun <[email protected]>
cc @dengwxn would you mind reviewing this PR? Thanks! |
LGTM! It is simple and direct change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change looks good. Just to better understand, why "not all input_nodes belong to the args of the DAG node" in the case of collectives?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we only scan certain fields for dag_node.py::_collect_upstream_nodes()
, say not all dict entries of other_args_to_resolve
, but exclude COLLECTIVE_OPERATION_KEY
?
I think in any case we should mention in the docstring of dag_node.py::_collect_upstream_nodes()
its assumptions. Currently the assumption is all nodes appear in the following are considered as upstream nodes:
self._bound_args,
self._bound_kwargs,
self._bound_other_args_to_resolve,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in any case we should mention in the docstring of dag_node.py::_collect_upstream_nodes() its assumptions. Currently the assumption is all nodes appear in the following are considered as upstream nodes:
Agree on this. It sounds ok to not scan everything in other_args_to_resolve
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have considered skipping COLLECTIVE_OPERATION_KEY
. My concern is that it seems a bit odd for a basic class (DAGNode
) to implement logic from other classes built on top of it. Additionally, the code path applies to both DAG and ADAG. I am a bit worried about the complexity in the future if we add more and more ADAG-specific logic inside the shared code path. HDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you two think skipping the key is better, I will update the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or perhaps add a new field _bound_other_args_not_to_resolve
and avoid scanning it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the expected upstream nodes for a CollectiveOutputNode
? Are they all the input nodes from all the actors, or simply the only one input node from the same actor?
This could potentially cause issues when compiling the graph.
What are the potential issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chatted with Kaihsun a bit yesterday, but +1 to Weixin's question.
I think the key issue is what's the definition of "upstream nodes", especially in the special case of collectives mentioned above. This definition needs to make sense based on how we use them in DAG and ADAG. Once this is clarified, we know what should be the right thing to do. @kevin85421 Can we define that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the expected upstream nodes for a CollectiveOutputNode? Are they all the input nodes from all the actors, or simply the only one input node from the same actor?
What are the potential issues?
I think for now the upstream nodes for a CollectiveOutputNode
should be the args
of the DAGNode
so that DAG and ADAG can have the same understanding for the same graph.
For example, compiled_dag_node.py
sets up the upstream/downstream relationship inside preprocess by treating args
as a DAGNode
's upstream nodes.
However, in dag_node.py
, all DAGNode
s inside self._bound_args
, self._bound_kwargs
, and self._bound_other_args_to_resolve
are considered as the upstream nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sync offline with @ruisearch42 : update the comments, and open an issue to track the progress #48520.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our conclusion is introducing a new field only when we observe more and more issues are caused by the inconsistency.
The |
Signed-off-by: kaihsun <[email protected]>
…tion to avoid wrong understanding about DAGs (ray-project#48463) If we persist input_nodes in _CollectiveOperation, all input_nodes will be added to the upstream_nodes when building the DAG. However, not all input_nodes belong to the args of the DAG node. This could potentially cause issues when compiling the graph.
…tion to avoid wrong understanding about DAGs (ray-project#48463) If we persist input_nodes in _CollectiveOperation, all input_nodes will be added to the upstream_nodes when building the DAG. However, not all input_nodes belong to the args of the DAG node. This could potentially cause issues when compiling the graph. Signed-off-by: JP-sDEV <[email protected]>
…tion to avoid wrong understanding about DAGs (ray-project#48463) If we persist input_nodes in _CollectiveOperation, all input_nodes will be added to the upstream_nodes when building the DAG. However, not all input_nodes belong to the args of the DAG node. This could potentially cause issues when compiling the graph. Signed-off-by: mohitjain2504 <[email protected]>
Why are these changes needed?
If we persist
input_nodes
in_CollectiveOperation
, allinput_nodes
will be added to the upstream_nodes when building the DAG. However, not allinput_nodes
belong to the args of the DAG node. This could potentially cause issues when compiling the graph.Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.