Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Improve the join keys of logical plan #4389

Closed
ygf11 opened this issue Nov 27, 2022 · 18 comments · Fixed by #4602
Closed

Proposal: Improve the join keys of logical plan #4389

ygf11 opened this issue Nov 27, 2022 · 18 comments · Fixed by #4602
Labels
enhancement New feature or request

Comments

@ygf11
Copy link
Contributor

ygf11 commented Nov 27, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Join keys in LogicalPlan::Join are columns now, which is the same as physical plan.
To add features base on it, we need add additional projections in logical plan level, like #4193 and #4353. There are some drawbacks:

  • Join keys may need alias when do type coercion, and the alias will display in the logical plan (also make display verbose).
  • Our logical plan will be verbose with a lot additional projections.
  • In optimizing stage, we have to check the additional projections if we want to optimize joins.

Join key need alias when do type coercion

One join key may happen one more time in join on condition, then it may also need do type coercion one more time, to distinguish each position, we have to add alias for them.
For example, we have:

--- t0.a t1.a -> UInt8
--- t0.b t1.b -> Int16
--- t0.c t1.c -> Int32
select t0.a, t1.a from test0 as t0
                  inner join test1 as t1
                  on t0.a = t1.b and t0.a = t1.c      

then the logical plan will:

-- before optimizer
Projection: t0.a, t1.a                                                                                                                                                                                 
 Inner Join: t0.a = t1.b, t0.a = t1.c                                                                                                                                                             
  Projection: t0.a, t0.b, t0.c                                                                                                                   
   SubqueryAlias: t0                                                                                                                                                                                
    TableScan: test0 projection=[a,b,c]                                                                                                                                                               
   SubqueryAlias: t1 

-- after type coercion
Projection: t0.a, t1.a                                                                                                                                                                                 
 Inner Join: t0.a#0 = t1.b, t0.a#1 = t1.c                                                                                                                                                             
  Projection: t0.a, CAST(t0.a AS Int16) AS t0.a#0, CAST(t0.a AS Int32) AS t0.a#1                                                                                                                     
   SubqueryAlias: t0                                                                                                                                                                                
    TableScan: test0 projection=[a]                                                                                                                                                               
   SubqueryAlias: t1                                                                                                                                                                                 

This makes our logical plan more complex.

logical plan is verbose

for sql:

select t0.a, t1.b
       from test0 as t0
       inner join test1 as t1
       on t0.a + 1 = t1.a * 2;

the logical plan will be like(does not do type coercion):

Projection: t0.a, t1.b                                                                                                                          
 Inner Join: t0.a + Int64(1) = t1.a * Int64(2)                                                                                                        
  Projection: t0.a, CAST(t0.a AS Int64) + Int64(1)                                                                                                  
   SubqueryAlias: t0                                                                                                                                
    TableScan: test0 projection=[a]                                                                                                               
  Projection: t1.b, CAST(t1.a AS Int64) * Int64(2)                                                                                                   
   SubqueryAlias: t1                                                                                                                                
    TableScan: test1 projection=[a, b]                                                                                                             

Two projections will be added before join.

Describe the solution you'd like
I would like to fold the additional projections to join(logical-plan), and keep the physical join as before. I think it can make our logical plan clean and easy to extend.

/// Join two logical plans on one or more join columns
#[derive(Clone)]
pub struct Join {
    ...
    /// Equijoin clause expressed as pairs of (left, right) join columns
    pub on: Vec<(Expr,Expr)>,
   ...
}

After I check the current code base, the main changes are the optimizers:

  • eliminate_cross_join.
  • filter_null_join_key.
  • filter_push_down.
  • projection_pushdown.
  • subquery_filter_to_join.

We can step by step to finish it, and refactor the LogicalPlan::Join finally.

Describe alternatives you've considered

Additional context

@ygf11 ygf11 added the enhancement New feature or request label Nov 27, 2022
@ygf11
Copy link
Contributor Author

ygf11 commented Nov 27, 2022

@alamb
Copy link
Contributor

alamb commented Nov 27, 2022

I think this makes a lot of sense to me. Thank you @ygf11 .

I would suggest a few other fields as well:

  1. For equijoins as you propose
  2. For other predicates that must be evaluated in the join (for outer joins, these have special semantics)

This design also extends naturally to adding other specialized join predicates in the future (eg. range joins)

/// Join two logical plans on one or more join columns
#[derive(Clone)]
pub struct Join {
    ...
    /// Equijoin clause expressed as pairs of (left, right) join exprs
    pub equi_preds: Vec<(Expr,Expr)>,

    /// other (non equi) join predicates
    pub on_preds: Vec<Expr>,
   ...
}

@jackwener
Copy link
Member

jackwener commented Nov 27, 2022

A meaning proposal ! Thanks @ygf11 .

Agree with @alamb, some database is design like this.

presto

    private final List<EquiJoinClause> criteria;
    private final Optional<Expression> filter;

doris

    private final List<Expr> hashJoinCondition;
    private final List<Expr> otherJoinCondition;

@ygf11
Copy link
Contributor Author

ygf11 commented Nov 28, 2022

Thanks for your suggestion @alamb @jackwener, I will work on it.

@liukun4515
Copy link
Contributor

equi_preds

in the spark just the

case class Join(
    left: LogicalPlan,
    right: LogicalPlan,
    joinType: JoinType,
    condition: Option[Expression],
    hint: JoinHint)

cc @ygf11

@liukun4515
Copy link
Contributor

liukun4515 commented Nov 30, 2022

Can we change the logical plan of join to presto or doris? and extract the on condition to the option<expr>

If we can change the pub on: Vec<(column,column)> to option<expr>, we don't need to do the join type coercion specifically for the expr in the Join plan.

@liukun4515
Copy link
Contributor

😭, I also confused about we split the join to join and crossjoin in the logical phase, I think we can combine these two together and just add crossjoin join_type for this. This is a historical issue and don't know when the datafusion add the cross join in the logical plan.

I think it's necessary to refactor this to combine them together in the logical phase, but it will bring a api break changes in the logical.proto.

@andygrove @alamb @Dandandan

@ygf11
Copy link
Contributor Author

ygf11 commented Nov 30, 2022

If we can change the pub on: Vec<(column,column)> to option, we don't need to do the #4353 specifically for the expr in the Join plan.

Thank you @liukun4515.
Avoid adding type coercion specifically for join makes sense to me, but we need split the expression to join keys every time we use it, this is a regular operation in optimizer.

Another way to avoid adding the special type coercion, is:

  • Still use Vec<(Expr, Expr)> in LogicalPlan::Join.
  • Combine the left and right join key to equality expression in logical_plan.expressions().
  • Split the casted equality expression to left and right join key in from_plan when doing type coercion.

I'm not sure if this way will change a lot, so need some tries.

@mingmwang
Copy link
Contributor

mingmwang commented Nov 30, 2022

😭, I also confused about we split the join to join and crossjoin in the logical phase, I think we can combine these two together and just add crossjoin join_type for this. This is a historical issue and don't know when the datafusion add the cross join in the logical plan.

I think it's necessary to refactor this to combine them together in the logical phase, but it will bring a api break changes in the logical.proto.

@andygrove @alamb @Dandandan

I would prefer to keep the current two logical joins, join and crossjoin, but change the equal join conditions Vec<(Expr, Expr)> to an Option<Vec<(Expr, Expr)>>

@liukun4515
Copy link
Contributor

liukun4515 commented Nov 30, 2022

If we can change the pub on: Vec<(column,column)> to option, we don't need to do the #4353 specifically for the expr in the Join plan.

Thank you @liukun4515. Avoid adding type coercion specifically for join makes sense to me, but we need split the expression to join keys every time we use it, this is a regular operation in optimizer.

Another way to avoid adding the special type coercion, is:

  • Still use Vec<(Expr, Expr)> in LogicalPlan::Join.
  • Combine the left and right join key to equality expression in logical_plan.expressions().

make sense for me, we can try it.

  • Split the casted equality expression to left and right join key in from_plan when doing type coercion.

I'm not sure if this way will change a lot, so need some tries.

Thanks

@liukun4515
Copy link
Contributor

😭, I also confused about we split the join to join and crossjoin in the logical phase, I think we can combine these two together and just add crossjoin join_type for this. This is a historical issue and don't know when the datafusion add the cross join in the logical plan.
I think it's necessary to refactor this to combine them together in the logical phase, but it will bring a api break changes in the logical.proto.
@andygrove @alamb @Dandandan

I would prefer to keep the current two logical joins, join and crossjoin, but change the equal join conditions Vec<(Expr, Expr)> to an Option<Vec<(Expr, Expr)>>

cc @ygf11 we should make the equal join condition to Option, there is bug for join.

@Dandandan
Copy link
Contributor

Wouldn't an empty list for the join conditions effectively the same as None? We don't have to make it an Option but just use a join with an empty Vec for cross joins.

@ygf11
Copy link
Contributor Author

ygf11 commented Dec 1, 2022

cc @ygf11 we should make the equal join condition to Option, there is #4363 for join.

Currently the planner will convert join to cross join when equal Join condition is empty.
I guess the reason is that we do not have a Physical Plan(e.g. Nested-loop-join) to handle join whose equal Join condition is empty. So If we add the implementation of Nested-loop-join, I think the bug will be fixed.
https://github.com/apache/arrow-datafusion/blob/3fe542f2afcd5360edc9abb7ad1e8243b560a6b2/datafusion/sql/src/planner.rs#L769-L776

I also have the same confuse of empty list and None<Vec<_>>, I think Vec is enough, can you please explain more? @liukun4515 @mingmwang

@liukun4515
Copy link
Contributor

cc @ygf11 we should make the equal join condition to Option, there is #4363 for join.

Currently the planner will convert join to cross join when equal Join condition is empty. I guess the reason is that we do not have a Physical Plan(e.g. Nested-loop-join) to handle join whose equal Join condition is empty. So If we add the implementation of Nested-loop-join, I think the bug will be fixed.

Yes, If i have time, and will do this.

https://github.com/apache/arrow-datafusion/blob/3fe542f2afcd5360edc9abb7ad1e8243b560a6b2/datafusion/sql/src/planner.rs#L769-L776

I also have the same confuse of empty list and None<Vec<_>>, I think Vec is enough, can you please explain more? @liukun4515 @mingmwang

Vec is enough.

@mingmwang
Copy link
Contributor

@ygf11
I'm OK with Vec.

One thing that I'm not very clear is that you will change to use Vec<(Expr, Expr)> in LogicalPlan::Join, but still keep Vec<(Column, Column)> in physical HashJoinExec, there will be gaps between logic plan and physical plan, how you will
address the gap ?

@ygf11
Copy link
Contributor Author

ygf11 commented Dec 1, 2022

One thing that I'm not very clear is that you will change to use Vec<(Expr, Expr)> in LogicalPlan::Join, but still keep Vec<(Column, Column)> in physical HashJoinExec, there will be gaps between logic plan and physical plan, how you will
address the gap

I plan to add physical projection when transforming logical plan to physical plan.

@mingmwang
Copy link
Contributor

One thing that I'm not very clear is that you will change to use Vec<(Expr, Expr)> in LogicalPlan::Join, but still keep Vec<(Column, Column)> in physical HashJoinExec, there will be gaps between logic plan and physical plan, how you will
address the gap

I plan to add physical projection when transforming logical plan to physical plan.

Thanks for the explanation.

@alamb
Copy link
Contributor

alamb commented Dec 1, 2022

I think @liukun4515 's suggestion of #4389 (comment) is consistent with what I have seen in other systems. I don't really know why we have Join / CrossJoin in the logical plan and I think it would make more sense to encode that information in a single LogicalPlan node

That being said, I don't have a strong opinion on the matter and I leave the decision to those doing the work

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants