Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Substrait: Combine join on and filter expressions in a single Substrait JoinRel's field #7611

Closed
nseekhao opened this issue Sep 20, 2023 · 4 comments · Fixed by #7612
Closed
Labels
enhancement New feature or request

Comments

@nseekhao
Copy link
Contributor

nseekhao commented Sep 20, 2023

Is your feature request related to a problem or challenge?

As of version 31.0.0, datafusion has a join on field which is used for equi-join conditions and a join filter field for non-equi-join conditions.

Currently, the producer puts the non-equi-join conditions in the Substrait post_join_filter. However, we can also combine all conditions and put it in the expression field of the JoinRel in Substrait.

Please refer to the motivation in the following comment. The motivation behind this change request is that this will let other DB systems decide what to do with the entire condition, as opposed to having to process them separately. Right now, if there is no equal condition, the producer will output just Literal(True) as the join expression, and put the rest of the condition in the post_join_filter. Having a redundant Literal expression adds unnecessary overhead of evaluating this condition. It also implies that you’re performing a cartesian product THEN a filter, as opposed to just a non-equi-join, which does not completely align with the original plan intent.

Describe the solution you'd like

All valid join conditions to be encoded in the Substrait JoinRel's expression field.

Describe alternatives you've considered

None.
The current approach works correctly semantically, but it can make downstream query execution inefficient.

Additional context

No response

@nseekhao
Copy link
Contributor Author

There is a few things in my description that I need to correct here.

  1. The issue should not have been labeled enhancement, but bug
  2. If the join on field is empty, the producer does not put Literal(True) in the field, it will just produce None
  3. The reason this issue should be labeled as bug is because the current use of post_join_filter is incorrect.

According to the ANSI join syntax (explained by IBM), there are two types of join filters: filter and post join filter.

  • Any predicates specified in the ON clause is considered a filter
  • Any predicates specified in the WHERE clause is considered a post_join_filter

In the case of an INNER join, the join filter and post_join_filter achieves the same results. However, in the case of LEFT join, the results will be different depending on whether a filter is pre/during or post join. Please refer to this reference for an example.

Since the current version of datafusion does not have a post-join filter field in a join relation, the datafusion-substrait producer should not produce a plan with a post_join_filter.

@nseekhao
Copy link
Contributor Author

@Dandandan @metesynnada @berkaysynnada Do you guys have any thoughts on this?

@berkaysynnada
Copy link
Contributor

@Dandandan @metesynnada @berkaysynnada Do you guys have any thoughts on this?

Thanks for the explanation :) @metesynnada may give you more accurate answers since has more experience with joins., but let me give you my thoughts.

"If the join on field is empty, the producer does not put Literal(True) in the field, it will just produce None"
Even if they are practically the same thing, I believe it will not be very correct if we give a different meaning to something that is None. Does producing as None cause an error?

"The reason this issue should be labeled as bug is because the current use of post_join_filter is incorrect."
AFAIK there is no distinction between post and pre-filters in datafusion joins, all non-equi filter predicates are collected in one place. Wouldn't it be misleading to make such a distinction in Substrait? Likewise, on and filter are different fields, and collecting them under a single roof can be confusing when producing and consuming. I have not encountered a such combination.

From your first explanation, I can understand why you need this but I'm unsure of the best solution.

@nseekhao
Copy link
Contributor Author

Even if they are practically the same thing, I believe it will not be very correct if we give a different meaning to something that is None. Does producing as None cause an error?

This is from the original code. There is actually nothing wrong with this part. I just described it incorrectly in the issue description so I wanted to correct it.

AFAIK there is no distinction between post and pre-filters in datafusion joins, all non-equi filter predicates are collected in one place.

You're exactly right. datafusion does not have a post-join filter. It only contains join on and filter fields. This is why the generated Substrait plan, if produced from datafusion should produce None in the post_join_filter field, since as you pointed out, datafusion has no notion of post-join filter. The current producer encodes the datafusion join filter as Substrait post_join_filter <-- this is incorrect.

The Join struct has only two fields for condition and according to the comments, neither of them are to be applied post-join.

    /// Equijoin clause expressed as pairs of (left, right) join expressions
    pub on: Vec<(Expr, Expr)>,
    /// Filters applied during join (non-equi conditions)
    pub filter: Option<Expr>,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants