-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Review TreeNode Refactor #1
Review TreeNode Refactor #1
Conversation
Thanks for the fix @berkaysynnada, the changes look good/acceptable to me but let me check it thoroughly tomorrow and come back to you. |
self.expressions(), | ||
new_children.into_iter().map(|child| child.data).collect(), | ||
) | ||
.map(Transformed::yes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you want to propogate up the transformed
flag based on the detection result (old_children.into_iter().zip(new_children.iter()).any(|(c1, c2)| c1 != &c2.data)
) and not based on if the transformation closure (f
) applied on any children reported (propagated up)transformed
true state?
I mean I left similar TODOs in DynTreeNode
and ConcreteTreeNode
where the same conflic between the "detected" and "reported" transformed state can happen.
In my PR I always propagated up the trasnformed state from the closures despite the detection could be different. We can agree on to propagate the detected but then IMO we should change this in DynTreeNode
and ConcreteTreeNode
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Relevant topic in the main PR: https://github.com/apache/arrow-datafusion/pull/8891/files/e1a499ab51aea3623d1c2754f16fb70e6176d119#r1504355259
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially thought there may exist some rules such that it transforms the tree in a few iterations, but at the final stage, the final data is the same as the initial data. Therefore, it would be better to check the data's itself rather than trusting the transformed flag. However, now I think that the equality check at each step could be expensive, and I think we need to rely on transformed flag. Otherwise it would have no meaning to carry.
I am converting it to propagate the transformed information up if any children is transformed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, my take on this general, user provided transformed
flag in the transformation closure results is that it is pointless. In the current code on the main
branch it doesn't work at all and so it isn't used anywhere for anything. This tells me that in most of the cases noone misses it. (You know I wanted to remove it entirelly first: apache#8835.)
Now the above doesn't mean that I can't imagine a case where we want to check if changes were made to the tree during a transformation. But I see other options without using the transformed
flag.
- If it is a tree with
Arc
"links" andenum
nodes likeLogicalPlan
we can compare the old cloned top level node with the new top level node we got back from transform to detect changes. This comparison is cheap (and similar to theold_children.into_iter().zip(new_children.iter()).any(|(c1, c2)| c1 != &c2.data)
above). - If it is a tree with
Arc<dyn trait>
nodes (Arc<DynTreeNode>
andConcreteTreeNode
trees fall into this category) then we can compare the old cloned top level node data ptr addres with the new top level node data prt address we got back from transform to detect changes. This is similar to howwith_new_children_if_necessary
works. - If we have an
Expr
tree (withBox
links) then it would be costly to clone the tree before the transformation so the only cheap way to detect if changes were made is to ask the API user to provide the information if they changed a node. But instead of requesting the transformed flag in all transformation closures from the API users, I would rather offer an new "transform with payload" like API (Transform with payload apache/datafusion#8664) to let the user propagate arbitrary state during transformation. Thetransformed
flag in theTransformed
struct is nothing more than a special payload in that regard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The options you mentioned make sense, but instead of going through so much effort, we should use the already existing transform. I can't see any advantage of removing Transform either. I also use it a lot in the projection optimizer rule I'm currently working on. As far as I remember, there were also some ideas: apache#8835 (comment).
I have sent a new commit related with these. Could you please take a look? I think you are okey with propagating the transformed result according to the children flags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'm ok with keeping the flag, especially if you are going to use it in rules. But I have some concerns with that commit: #1 (comment)
/// ``` | ||
/// | ||
/// See [`TreeNodeRecursion`] for more details on how the traversal can be controlled. | ||
/// | ||
/// If `f_down` or `f_up` returns [`Err`], recursion is stopped immediately. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to remove this? I really need this and I would rather keep this here than adding it back in the next PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't like that we reuse the name transform
then we can rename it like transform_down_up
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of introducing a new API, could we consider using transform_down
followed by transform_up
? I'm not trying to make things difficult for you, but I'm just curious if we could achieve the same result by adding an extra line of code where needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, that's not the same. I will need a short form of rewrite()
(top-down and bottom-up closures provided for one combined traversal). This is an example how I would like to use such an API:
apache#8891 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got your point. Not related with the topic, but I guess contains_pattern
in your example deep dives on each node until the leaves, so adding such an f_down
may get worse the performance actually.
However, in general there may exist such use cases where f_down can speed up the f_up phase. I'm reverting this change. Can I just ask from you to modify the transform adapting the same behavior with updated visit and rewrite? I have updated the test results, but cannot adapt the transform implementation yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think calculating the pattern bitset will be very cheap. (BTW, I want to do that lazyly.) Also, I think in many cases several subtrees remain unchanged during transformations and the patterns of those subtrees don't need to be recalculated, which means that multiple subsequent transformations can take advantage of some patterns calculated during the first transformation.
I'm reverting this change. Can I just ask from you to modify the transform adapting the same behavior with updated visit and rewrite?
Thanks. Sure, I will do that. (Most likely only tomorrow...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sending a commit for the mentioned parts. Thanks for your cooperation :)
self.expressions(), | ||
new_children.into_iter().map(|child| child.data).collect(), | ||
) | ||
.map(Transformed::yes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially thought there may exist some rules such that it transforms the tree in a few iterations, but at the final stage, the final data is the same as the initial data. Therefore, it would be better to check the data's itself rather than trusting the transformed flag. However, now I think that the equality check at each step could be expensive, and I think we need to rely on transformed flag. Otherwise it would have no meaning to carry.
I am converting it to propagate the transformed information up if any children is transformed.
/// ``` | ||
/// | ||
/// See [`TreeNodeRecursion`] for more details on how the traversal can be controlled. | ||
/// | ||
/// If `f_down` or `f_up` returns [`Err`], recursion is stopped immediately. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of introducing a new API, could we consider using transform_down
followed by transform_up
? I'm not trying to make things difficult for you, but I'm just curious if we could achieve the same result by adding an extra line of code where needed.
Sure, thanks for the fixes! |
let new_children = children.into_iter().map_until_stop_and_collect(f)?; | ||
// Propagate up `new_children.transformed` and `new_children.tnr` | ||
// along with the node containing transformed children. | ||
if new_children.transformed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this very risky? You now are relying on if the transformation closures filled the transformed
flag correctly. What if they don't? So far noone used that flag for anything and noone checked if the flag is correcty filled. If a closure changes a child node but reports transformed=false (incorrectly) then the node's child will not be changed.
And such incorrect closures can exists not only in Apache DataFusion code, but anything that depends on DataFusion...
I think it would be safer to always call let t2 = self.with_new_arc_children(arc_self, t.data)?;
. What I'm undecided about is if we should return:
Ok(Transformed::new(t2.data, t.transformed, t.tnr))
- or
Ok(Transformed::new(t2.data, t2.transformed, t.tnr))
- or
Ok(Transformed::new(t2.data, t.transformed || t2.transformed, t.tnr))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can understand your concern, but we are providing an API and I think we should assume it is used correctly. If we provide a good documentation, I don't think there will be any problems. It is clear and not hard to understand, so I don't think someone would return modified data as transformed::no. Carrying this information along would be a burden if we're not going to use it.
Maybe we can change the default value of the methods that create new objects to transformed::yes, or we can close the direct access to data, and force users to modify it through a method (set_data()
), and when this method is run at least once, we can automatically set the transformed to yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, let's run the tests and see if the current closures use the flag correctly.
BTW, why is ConcreteTreeNode
different? As far as I see you didn't apply the if new_children.transformed
check there..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tnr
needs to come from new_children
, returning it from self.with_new_arc_children(arc_self, new_children.data)
will not work.
How about:
if new_children.transformed {
new_children.map_data(|data| {
let arc_self = Arc::clone(&self);
self.with_new_arc_children(arc_self, data).map(|t| t.data)
})
} else {
Ok(Transformed::no(self))
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, why is
ConcreteTreeNode
different? As far as I see you didn't apply theif new_children.transformed
check there..
You are right. I thought with_new_children
already does that propagation from children, but it only checks plans, not the data. I am updating it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tnr
needs to come fromnew_children
, returning it fromself.with_new_arc_children(arc_self, new_children.data)
will not work.
if new_children.transformed
is true, I believe there is no possibility of that self.with_new_arc_children(arc_self, new_children.data)
returns transformed::no. Therefore I thought the existing implementation is safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tnr
needs to come fromnew_children
, returning it fromself.with_new_arc_children(arc_self, new_children.data)
will not work.if
new_children.transformed
is true, I believe there is no possibility of thatself.with_new_arc_children(arc_self, new_children.data)
returns transformed::no. Therefore I thought the existing implementation is safe.
I'm not talking about .transformed
but about the .tnr
field of the result. self.with_new_arc_children(arc_self, new_children.data)
always returns TreeNodeRecursion::Continue
so we need to return .trn
from new_children
.
cb7f965
to
dbe24db
Compare
I believe this PR is ready after the |
Looks good to me now. Let me merge this PR and fix the I see that you renamed |
I tried to emphasize the difference of calling transform_down then transform_up with that. Renamed it transform_down_up only now. Maybe you can show up with better ones. It is an API that should be used with caution and I would like to emphasize that it is quite different from the old transform. The only TODO is converting transform and related macros to apply f_up on the jumped node also. |
Ok, thanks! I will merge this PR later today and fix the TODO. |
271a0fd
into
peter-toth:refactor-treenode-rewrite
* refactor `TreeNode::rewrite()` * use handle_tree_recursion in `Expr` * use macro for transform recursions * fix api * minor fixes * fix * don't trust `t.transformed` coming from transformation closures, keep the old way of detecting if changes were made * rephrase todo comment, always propagate up `t.transformed` from the transformation closure, fix projection pushdown closure * Fix `TreeNodeRecursion` docs * extend Skip (Prune) functionality to Jump as it is defined in https://synnada.notion.site/synnada/TreeNode-Design-Proposal-bceac27d18504a2085145550e267c4c1 * fix Jump and add tests * jump test fixes * fix clippy * unify "transform" traversals using macros, fix "visit" traversal jumps, add visit jump tests, ensure consistent naming `f` instead of `op`, `f_down` instead of `pre_visit` and `f_up` instead of `post_visit` * fix macro rewrite * minor fixes * minor fix * refactor tests * add transform tests * add apply, transform_down and transform_up tests * refactor tests * test jump on both a and e nodes in both top-down and bottom-up traversals * better transform/rewrite tests * minor fix * simplify tests * add stop tests, reorganize tests * fix previous merges and remove leftover file * Review TreeNode Refactor (#1) * Minor changes * Jump doesn't ignore f_up * update test * Update rewriter * LogicalPlan visit update and propagate from children flags * Update tree_node.rs * Update map_children's --------- Co-authored-by: Mustafa Akur <[email protected]> * fix * minor fixes * fix f_up call when f_down returns jump * simplify code * minor fix * revert unnecessary changes * fix `DynTreeNode` and `ConcreteTreeNode` `transformed` and `tnr` propagation * introduce TransformedResult helper * fix docs * restore transform as alias to trassform_up * restore transform as alias to trassform_up 2 * Simplifications and comment improvements (#2) --------- Co-authored-by: Berkay Şahin <[email protected]> Co-authored-by: Mustafa Akur <[email protected]> Co-authored-by: Mehmet Ozan Kabak <[email protected]>
Here is an explanation of the changes I made and their reasons:
visit
andrewrite
were skipping the f_up (post_visit) of the node if f_down (pre_visit) results with "jump". This behavior is inconsistent with the post_visit "jump"s. In two passes traversals, "jump" actually does jump to the next f_down if the operation is f_up, or jump to the next f_up if the operation is f_down (bypassing the nodes children). It is important to establish the same behavior.pop_enter_mark
method is adapted to updated behavior of jump.map_children
is resolved.transform
method. Old behavior of transform has been changed while its name is preserved. Since it is a simplification and ease-of-use PR rather than adding new features, we should add it with a separate PR when it is really needed. I also think that having the assumption of first top-down and then bottom-up may also be misleading for a general purpose API.handle_transform_recursion
anymore.