Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add LogicalPlan::SubqueryAlias #2172

Merged
merged 9 commits into from
Apr 12, 2022
Merged

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented Apr 6, 2022

Which issue does this PR close?

Closes ##2164.

Rationale for this change

I would like the logical plan to better represent aliased relations to help with translation to a different execution engine. Currently, original table names are discarded during SQL planning.

What changes are included in this PR?

Introduces new SubqueryAlias node in LogicalPlan and updates the SQL planner to wrap this around TableScan rather than just use the alias name in the TableScan.

Are there any user-facing changes?

The query plan will look different now.

@andygrove andygrove self-assigned this Apr 6, 2022
@github-actions github-actions bot added ballista datafusion Changes in the datafusion crate labels Apr 6, 2022
@@ -518,6 +518,18 @@ impl LogicalPlanBuilder {
})))
}

/// Apply an alias
pub fn alias(&self, alias: &str) -> Result<Self> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way that this same aliasing can be represented is a Projection node

So if the input has schema with columns a, b, c

Then to implement alias("foo") you could build a LogicalPlanNode that was Project([a foo.a, b as foo.b, c as foo.c])

There may be some good reason to introduce a new type of LogicalPlanNode too that I don't understand, but I wanted to point out this alternative

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, and I see you don't want to do that :) in the description

I wonder if you could use the DFSchema::relation name for this purpose?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I need something to very specifically say that a table is being used as an alias. An easier change might be just to add an additional field to the TableScan to record the original table name. I think I will put up a separate PR for that approach.

So to explain why I need this. I want to use DataFusion as a SQL parser and planner but I want to execute the query in a different engine. I can provide a TableProvider so that DataFuision can get the schema for table person and I get a logical plan. When I go to translate that plan to a physical plan, it refers to a table called peeps (the alias) and I have no way to know the actual table name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the Spark plan for this use case:

'Project [*]
+- 'SubqueryAlias peeps
   +- 'UnresolvedRelation [person], [], false

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I prefer the approach taken in this PR, as it could handle subqueries rather than just table scans (as suggested by the name).

I think this is a cleaner, more generic approach.

@mingmwang
Copy link
Contributor

Looks like in DataFusion, the logical Project node already support Alias. It should be similar to SparkSQL's SubqueryAlias.
And we can also use it to represent a NamedStruct


pub struct Projection {
    /// The list of expressions
    pub expr: Vec<Expr>,
    /// The incoming logical plan
    pub input: Arc<LogicalPlan>,
    /// The schema description of the output
    pub schema: DFSchemaRef,
    /// Projection output relation alias
    pub alias: Option<String>,
}


@andygrove andygrove changed the title Add LogicalPlan::AliasedRelation Add LogicalPlan::SubqeryAlias Apr 10, 2022
@andygrove andygrove marked this pull request as ready for review April 10, 2022 20:16
@andygrove andygrove changed the title Add LogicalPlan::SubqeryAlias Add LogicalPlan::SubqueryAlias Apr 10, 2022
@andygrove
Copy link
Member Author

@alamb @Dandandan @mingmwang I have fixed the test failures and renamed AliasedRelation to SubqueryAlias. This is only implemented for TableScan so far but if this approach is acceptable, I will file follow-on issues to use the same approach for projection and union, which both currently have their own alias fields.

// take alias into account to support `JOIN table1 as table2`
alias
.as_ref()
.map(|a| a.name.value.as_str())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like an improvement to me -- I think the projection pushdown code might benefit from an additional test, but I also think the existing coverage is likely sufficient.

@@ -432,6 +432,34 @@ fn optimize_plan(
alias: alias.clone(),
}))
}
LogicalPlan::SubqueryAlias(SubqueryAlias { input, alias, .. }) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it worth a test for projection pushdown specifically?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Test added.

@@ -3458,7 +3458,8 @@ mod tests {
let expected = "Projection: #person.first_name, #person.id\
\n Inner Join: Using #person.id = #person2.id\
\n TableScan: person projection=None\
\n TableScan: person2 projection=None";
\n SubqueryAlias: person2\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the appearance of SubqueryAlias in this query is somewhat strange to me as there is no subquery in this SQL (maybe a better name is RelationAlias or something)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally named this AliasedRelation and then changed it to SubqueryAlias because that is what Spark uses. I don't have a strong preference either way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

me neither. Whatever you prefer is fine with me

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's leave it as is for now since we will make use of it for subqueries (soon, hopefully).

I filed follow-on issues for use this approach for projection and union:

@alamb
Copy link
Contributor

alamb commented Apr 11, 2022

Thanks @andygrove

@andygrove andygrove merged commit ee95d41 into apache:master Apr 12, 2022
@andygrove andygrove deleted the aliased-relation-wip branch April 12, 2022 19:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants