Use schema_name to create the physical_name #11977
More consistency and less opportunity for column name mismatch.
Thank you @joroKr21 -- this is an amazing PR (amazing to delete so much code but not change any tests -- that really shows a deep understanding of the code)
cc @mustafasrepo and @jayzhan211 as I vaguely remember we discussed the need for this alternate path for display expressions. Clearly as all the tests pass the second copy isn't needed it seems
@@ -1104,6 +1103,7 @@ impl Expr {
    }

    /// Returns a full and complete string representation of this expression.
    #[deprecated(note = "use format! instead")]
👍
/// The name of the column (field) that this `Expr` will produce in the physical plan.
/// The difference from [Expr::schema_name] is that top-level columns are unqualified.
pub fn physical_name(expr: &Expr) -> Result<String> {
    if let Expr::Column(col) = expr {
😍
@@ -2179,6 +2179,7 @@ mod tests {
    .map(|order_by_expr| {
        let ordering_req = order_by_expr.unwrap_or_default();
        AggregateExprBuilder::new(array_agg_udaf(), vec![Arc::clone(col_a)])
            .alias("a")
why is this change needed?
This is the issue: we are not able to get the correct name from the args, so the alias is a workaround.
Do we really need it? Is there a situation where we would not provide the alias to the physical expression?
I made this change because the alias generated previously was incorrect (it didn't use the arguments).
We should get physical name from Physical expression not from
Is there a use case for this? The way I see it, physical expressions are low-level enough that you would have to explicitly provide an alias. But I'm curious to know.
In datafusion-comet, they rely heavily on the physical layer. They build the physical expressions directly, without logical expressions.
I guess it is possible to create the name based on the arguments in most cases, and they can also alias an unreadable complex expression to a nicer name.
Maybe @andygrove / @viirya could also provide some feedback. I don't understand the complexities of the naming requirements in Comet. If there are other non-obvious requirements for Comet, it would be great to get some additional tests in DataFusion that demonstrate the use case, so that we don't accidentally break something for Comet.
        internal_err!("Create physical name does not support OuterReferenceColumn")
    }
/// The name of the column (field) that this `Expr` will produce in the physical plan.
/// The difference from [Expr::schema_name] is that top-level columns are unqualified.
It would be better to provide an example, like the one for schema_name, to show the difference.
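As an illustration of the difference the doc comment describes, here is a minimal sketch on a toy expression type. `ToyExpr`, `col`, and both function bodies are hypothetical stand-ins written for this example; they are not DataFusion's `Expr` or its actual implementation, and the name format for binary expressions is an assumption.

```rust
// Toy stand-in for a logical expression; NOT DataFusion's real `Expr`.
enum ToyExpr {
    // A column with an optional table qualifier, e.g. `t.a`.
    Column { relation: Option<String>, name: String },
    // A binary operation such as `t.a + t.b`.
    Binary { left: Box<ToyExpr>, op: &'static str, right: Box<ToyExpr> },
}

// Hypothetical helper that builds a qualified column.
fn col(relation: &str, name: &str) -> ToyExpr {
    ToyExpr::Column {
        relation: Some(relation.to_string()),
        name: name.to_string(),
    }
}

// Schema-style name: columns keep their qualifier at every level.
fn schema_name(expr: &ToyExpr) -> String {
    match expr {
        ToyExpr::Column { relation: Some(r), name } => format!("{}.{}", r, name),
        ToyExpr::Column { relation: None, name } => name.clone(),
        ToyExpr::Binary { left, op, right } => {
            format!("{} {} {}", schema_name(left), op, schema_name(right))
        }
    }
}

// Physical-style name: only a top-level column drops its qualifier;
// every other case delegates to `schema_name`.
fn physical_name(expr: &ToyExpr) -> String {
    match expr {
        ToyExpr::Column { name, .. } => name.clone(),
        other => schema_name(other),
    }
}

fn main() {
    let a = col("t", "a");
    let sum = ToyExpr::Binary {
        left: Box::new(col("t", "a")),
        op: "+",
        right: Box::new(col("t", "b")),
    };
    println!("{}", schema_name(&a)); // t.a
    println!("{}", physical_name(&a)); // a
    println!("{}", schema_name(&sum)); // t.a + t.b
    println!("{}", physical_name(&sum)); // t.a + t.b
}
```

In this sketch the two names differ only for the bare column; nested columns inside the binary expression stay qualified in both.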
I don't see the appearance of
As Comet doesn't go through DataFusion's query analysis, I think the name resolution change should be fine.
    create_physical_name(e, true)
}

fn create_physical_name(e: &Expr, is_first_expr: bool) -> Result<String> {
@viirya this was the flag, I remembered it incorrectly: is_first_expr
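For readers unfamiliar with the flag, the following is a rough sketch of how a flag-threading name function of this shape can behave. It assumes the flag only controls whether a column drops its qualifier, which is how the PR description characterizes it. `ToyExpr` and `col` are hypothetical, and the function body is a guess at the general pattern, not the removed DataFusion code.

```rust
// Toy stand-in for a logical expression; NOT DataFusion's real `Expr`.
enum ToyExpr {
    Column { relation: Option<String>, name: String },
    Binary { left: Box<ToyExpr>, op: &'static str, right: Box<ToyExpr> },
}

// Hypothetical helper that builds a qualified column.
fn col(relation: &str, name: &str) -> ToyExpr {
    ToyExpr::Column {
        relation: Some(relation.to_string()),
        name: name.to_string(),
    }
}

// Assumed pattern: the caller passes `true` for the outermost expression,
// and every recursive call passes `false`.
fn create_physical_name(e: &ToyExpr, is_first_expr: bool) -> String {
    match e {
        ToyExpr::Column { relation, name } => match relation {
            // Nested columns keep their qualifier.
            Some(r) if !is_first_expr => format!("{}.{}", r, name),
            // Top-level (or unqualified) columns use the bare name.
            _ => name.clone(),
        },
        ToyExpr::Binary { left, op, right } => format!(
            "{} {} {}",
            create_physical_name(left, false),
            op,
            create_physical_name(right, false)
        ),
    }
}

fn main() {
    let sum = ToyExpr::Binary {
        left: Box::new(col("t", "a")),
        op: "+",
        right: Box::new(col("t", "b")),
    };
    println!("{}", create_physical_name(&col("t", "a"), true)); // a
    println!("{}", create_physical_name(&col("t", "a"), false)); // t.a
    println!("{}", create_physical_name(&sum, true)); // t.a + t.b
}
```

A design like this needs a second full traversal that must stay in sync with the schema naming code, which is the duplication the PR removes.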
It looks good to have a consistent approach to the physical name of an expression, instead of both physical_name and schema_name.
What is the current status of this PR? I am a little confused: @viirya seems to suggest this would be a good change, but @jayzhan211 points out this change would mean anyone who built
So I guess my question is "if we merge this PR would it mess up comet (or other users of only physical exprs)?"
Hmm, as we build physical operators directly (without going through the analysis stage in DataFusion), the bindings are done during Spark analysis, and Comet uses those bindings when creating physical expressions and operators. So we don't rely on name resolution in DataFusion. That said, I assume that the names of physical expressions don't matter in physical operators. The change to DataFusion's analysis resolution should not have an impact. Do we still have additional name resolution in the physical stage in DataFusion?
The name is part of the schema and is verified against the column name inside the projection mapping. It is also the name displayed in the explain statement.
datafusion/datafusion/physical-expr/src/equivalence/projection.rs, lines 66 to 74 in 67cf1d6
It looks like a simple check that a column's name matches the corresponding field name in the input schema, isn't it? As we provide a matching input schema and column names to the projection, I think it should be all good. I'm not sure how this PR impacts us.
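The kind of check being discussed can be sketched roughly as follows. This is an assumption-laden illustration: `ToyColumn` and `check_column_matches_schema` are hypothetical names, the input schema is reduced to a slice of field names, and the real code in projection.rs may differ in both shape and error handling.

```rust
// Hypothetical stand-in for a physical column reference (name plus index
// into the input schema); not DataFusion's actual `Column` type.
struct ToyColumn {
    name: String,
    index: usize,
}

// Verify that the column's name equals the input field name at its index.
fn check_column_matches_schema(
    col: &ToyColumn,
    input_fields: &[&str],
) -> Result<(), String> {
    match input_fields.get(col.index) {
        Some(field) if *field == col.name.as_str() => Ok(()),
        Some(field) => Err(format!(
            "column '{}' at index {} does not match input field '{}'",
            col.name, col.index, field
        )),
        None => Err(format!(
            "index {} is out of bounds for {} input fields",
            col.index,
            input_fields.len()
        )),
    }
}

fn main() {
    let fields = ["a", "b"];
    let ok = ToyColumn { name: "b".to_string(), index: 1 };
    let mismatched = ToyColumn { name: "t.b".to_string(), index: 1 };
    println!("{:?}", check_column_matches_schema(&ok, &fields));
    println!("{:?}", check_column_matches_schema(&mismatched, &fields));
}
```

Under these assumptions, a qualified name such as `t.b` fails against an unqualified field `b`, which is the sort of mismatch that makes the exact physical name matter.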
It will not. Merging this PR does not impact Comet.
My response was just to point out that we still have name resolution in the physical stage in DataFusion, so the names of physical expressions matter, and it might not be a good idea to get them directly from the logical expression. Update: I think we could merge this PR too; we can change it later if a counterexample turns up.
Ok, I'll plan to merge this PR tomorrow unless anyone else has a different opinion. Thank you all for the discussion. I am glad we got to a good place.
🚀
Marked as API change and updated description to note it removes
More consistency and less opportunity for column name mismatch.
Which issue does this PR close?
Part of #11782.
Rationale for this change
It's not manageable to have these two slightly different ways of obtaining a name from an expression.
Also, minor differences in the implementation can cause schema mismatch errors.
What changes are included in this PR?
physical_name delegates to schema_name for all cases except top-level columns, which are unqualified. This was previously the purpose of the is_first_expr flag in create_physical_name.
Are these changes tested?
Relying on existing tests.
Are there any user-facing changes?
Expr::canonical_name is deprecated because it's just an alias for format!
create_function_physical_name is removed
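To illustrate the first item, the sketch below shows the general pattern of deprecating a method that only forwards to `Display` formatting. `ToyExpr` and its field are hypothetical; only the attribute text `use format! instead` is taken from the diff in this PR.

```rust
use std::fmt;

// Toy expression type with a `Display` impl; NOT DataFusion's `Expr`.
struct ToyExpr {
    text: String,
}

impl fmt::Display for ToyExpr {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.text)
    }
}

impl ToyExpr {
    // The method adds nothing beyond `format!`, so callers are pointed there.
    #[deprecated(note = "use format! instead")]
    fn canonical_name(&self) -> String {
        format!("{}", self)
    }
}

// Returns (old, new) so the two spellings can be compared side by side.
#[allow(deprecated)]
fn old_and_new(e: &ToyExpr) -> (String, String) {
    (e.canonical_name(), format!("{}", e))
}

fn main() {
    let e = ToyExpr { text: "t.a + t.b".to_string() };
    let (old, new) = old_and_new(&e);
    println!("{} / {}", old, new);
}
```

Callers migrating away from the deprecated method would replace `e.canonical_name()` with `format!("{}", e)` (or `e.to_string()`), which yields the same string in this sketch.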