Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use schema_name to create the physical_name #11977

Merged
merged 1 commit into from
Aug 21, 2024

Conversation

joroKr21
Copy link
Contributor

@joroKr21 joroKr21 commented Aug 14, 2024

More consistency and less opportunity for column name mismatch.

Which issue does this PR close?

Part of #11782.

Rationale for this change

It's not manageable to have these two slightly different ways of obtaining a name from an expression.
Also, minor differences in the implementation can cause schema mismatch errors.

What changes are included in this PR?

physical_name delegates to schema_name for all cases except top-level columns which are unqualified.
This was previously the purpose of the is_first_expr flag in create_physical_name.

Are these changes tested?

Relying on existing tests.

Are there any user-facing changes?

  1. Expr::canonical_name is deprecated because it's just an alias for format!
  2. create_function_physical_name is removed

@github-actions github-actions bot added logical-expr Logical plan and expressions core Core DataFusion crate labels Aug 14, 2024
More consistency and less opportunity for column name mismatch.
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @joroKr21 -- this is an amazing PR (amazing to delete so much code but not change any tests -- that really shows a deep understanding of the code)

cc @mustafasrepo and @jayzhan211 as I vaguely remember we discussed the need for this alternate path for display expressions. Clearly as all the tests pass the second copy isn't needed it seems

@@ -1104,6 +1103,7 @@ impl Expr {
}

/// Returns a full and complete string representation of this expression.
#[deprecated(note = "use format! instead")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

/// The name of the column (field) that this `Expr` will produce in the physical plan.
/// The difference from [Expr::schema_name] is that top-level columns are unqualified.
pub fn physical_name(expr: &Expr) -> Result<String> {
if let Expr::Column(col) = expr {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍

@@ -2179,6 +2179,7 @@ mod tests {
.map(|order_by_expr| {
let ordering_req = order_by_expr.unwrap_or_default();
AggregateExprBuilder::new(array_agg_udaf(), vec![Arc::clone(col_a)])
.alias("a")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this change needed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the issue, we are not able to get the correct name from the args, so alias is a workaround solution

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need it? Is there a situation where we would not provide the alias to the physical expression?
I made this change because the alias generated previously was incorrect (it didn't use the arguments).

@jayzhan211
Copy link
Contributor

jayzhan211 commented Aug 14, 2024

We should get physical name from Physical expression not from Expr. If we are building the expr from builder, there is no Expr that you could get the name from.
Btw, if you are interesting to solve this issue, let me know, so I can work on other task

@joroKr21
Copy link
Contributor Author

If we are building the expr from builder, there is no Expr that you could get the name from.

Is there a use case for this? The way I see it, physical expressions are low-level enough that you would have to explicitly provide an alias. But I'm curious to know.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Aug 15, 2024

If we are building the expr from builder, there is no Expr that you could get the name from.

Is there a use case for this? The way I see it, physical expressions are low-level enough that you would have to explicitly provide an alias. But I'm curious to know.

In datafusion-comet, they heavily rely on physical layer stuffs. They directly build the physical expression with out logical expression.

you would have to explicitly provide an alias

I guess it is possible to create the name based on arguments for most of the case, and they can also alias the unreadable complex expression to a nicer one.

@alamb
Copy link
Contributor

alamb commented Aug 15, 2024

Maybe @andygrove / @viirya could also provide some feedback -- I don't understand the complexities of the naming requirements in comet

If there are some other non obvious requirements for comet, it would be great to get some additional tests in DataFusion that demonstrate the usecase so that we don't break something for comet accidentally.

internal_err!("Create physical name does not support OuterReferenceColumn")
}
/// The name of the column (field) that this `Expr` will produce in the physical plan.
/// The difference from [Expr::schema_name] is that top-level columns are unqualified.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better if we can provide example like schema_name to show difference.

@viirya
Copy link
Member

viirya commented Aug 15, 2024

This was previously the purpose of the is_first_expression flag in create_physical_name.

I don't see the appearance of is_first_expression flag in the diff. So the description looks a bit confused.

@viirya
Copy link
Member

viirya commented Aug 15, 2024

Maybe @andygrove / @viirya could also provide some feedback -- I don't understand the complexities of the naming requirements in comet

If there are some other non obvious requirements for comet, it would be great to get some additional tests in DataFusion that demonstrate the usecase so that we don't break something for comet accidentally.

As Comet doesn't go through the query analysis of DataFusion, I think it should be fine from name resolution change.

create_physical_name(e, true)
}

fn create_physical_name(e: &Expr, is_first_expr: bool) -> Result<String> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya this was the flag, I remembered it incorrectly: is_first_expr

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to have a consistent approach for physical name for expression, instead of both physical_name and schema_name.

@alamb
Copy link
Contributor

alamb commented Aug 19, 2024

What is the current status of this PR?

I am a little confused

@viirya seems to suggest this would be a good change, but @jayzhan211 points out this change would mean anyone who built PhysicalExprs without a corresponding Expr would have to create their own schema names

We should get physical name from Physical expression not from Expr. If we are building the expr from builder, there is no Expr that you could get the name from.

So I guess my question is "if we merge this PR would it mess up comet (or other users of only physical exprs)?"

@viirya
Copy link
Member

viirya commented Aug 19, 2024

@viirya seems to suggest this would be a good change, but @jayzhan211 points out this change would mean anyone who built PhysicalExprs without a corresponding Expr would have to create their own schema names

Hmm, as we built physical operators directly (without going through analysis stage in DataFusion), the bindings are done at Spark analysis and Comet uses the bindings when creating physical expressions and operators. So we don't rely on name resolution in DataFusion.

That's said, I assume that names of physical expressions don't matter in physical operators. The change to DataFusion analysis resolution should not have impact. Do we still have additional name resolution in physical stage in DataFusion?

@jayzhan211
Copy link
Contributor

jayzhan211 commented Aug 20, 2024

Do we still have additional name resolution in physical stage in DataFusion?

The name is part of the schema and is verified against column name inside projection mapping and it is also the name displayed in explain statement

let idx = col.index();
let matching_input_field = input_schema.field(idx);
if col.name() != matching_input_field.name() {
return internal_err!("Input field name {} does not match with the projection expression {}",
matching_input_field.name(),col.name())
}
let matching_input_column =
Column::new(matching_input_field.name(), idx);
Ok(Transformed::yes(Arc::new(matching_input_column)))

@viirya
Copy link
Member

viirya commented Aug 20, 2024

It looks like a simple check that if a column's name matches corresponding input field name in input schema. Isn't? As we provide matched input schema and column names to projection, I think it should be all good. I'm not sure how does this PR impact us.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Aug 20, 2024

I'm not sure how does this PR impact us.

It will not. Merge this PR does not impact Comet.

Do we still have additional name resolution in physical stage in DataFusion?

The response for this is just to inform we still have name resolution in physical stage in Datafusion so the name of physical expressions matters and might not be a good idea to get it from the logical expression directly.

Upd: I think we could merge this PR too, we can change it later on with the counter example if there is

@alamb
Copy link
Contributor

alamb commented Aug 20, 2024

Ok, I'll plan to merge this PR tomorrow unless anyone else has a different opinion

Thank you all for the discussion. I am glad we got to a good place

@alamb alamb merged commit 1c7209b into apache:main Aug 21, 2024
26 checks passed
@alamb
Copy link
Contributor

alamb commented Aug 21, 2024

🚀

@alamb alamb added the api change Changes the API exposed to users of the crate label Aug 21, 2024
@alamb
Copy link
Contributor

alamb commented Aug 21, 2024

Marked as API change and updated description to note it removes create_function_physical_name which is public: https://docs.rs/datafusion/latest/datafusion/logical_expr/expr/fn.create_function_physical_name.html

@joroKr21 joroKr21 deleted the bugfix/expr-name branch August 21, 2024 15:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants