Implement foundational filter selectivity analysis #3868

Merged: 2 commits into apache:master on Oct 20, 2022

Conversation

@isidentical (Contributor) commented Oct 18, 2022:

Which issue does this PR close?

Part of #3845.

Rationale for this change

DataFusion has some cost-based optimizer rules (operating on the physical plan) that leverage statistics to pick a version of the plan with a lower estimated cost. One great example is side swapping on hash joins, where we try to make the build side as small as possible. These rules work reasonably well when there is no interference between the joins and the table provider (since joins can now estimate their output cardinality), but one common thing that can cut this flow is a filter node, which currently does not support producing statistics (as shown in @Dandandan's example in #3845).

What changes are included in this PR?

This PR does not implement the statistics propagation for filters, but I can also include that. The main reasoning was that, since this part is very self-contained (the new expression statistics API and the filter selectivity analysis), it could land separately, and we can then build the filter statistics estimation methods on top of it. If it makes sense to include it here as well, please let me know (or if this PR is too big by itself, I can also split it further, depending on what is easier to review!)

Expression Statistics

Since we needed a way of propagating statistics at the expression level (rather than the plan level), we now have a new API that is more suitable for it. Instead of dealing with individual columns, this API deals with boundaries of expression results. For the initial work, this is implemented for the following expressions, but it can be extended further (knowing the boundary of a + 1 or max(a) is easier once we have these three); a small sketch follows the list:

  • Literals, where the boundary's min/max is the value the literal holds and distinct_count is 1.
  • Columns, where the boundary is the same as the boundary from the column statistics (which are passed from the plan into the expressions when using this API).
  • Comparisons (=, >, >=, <, <=) between one column and one literal (more precisely, an expression whose boundary can be reduced to a single scalar). Using the filter selectivity analysis (also implemented in this PR), we estimate new min/max values and reason about the expression's lower and upper bounds, as well as its selectivity.
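
As a rough illustration of what these boundaries look like (the struct and helpers below are a hedged sketch with made-up names, not the exact API introduced here):

```rust
// Illustrative sketch only: names and field layout are hypothetical,
// not the exact API added by this PR.
#[derive(Debug, Clone, PartialEq)]
struct ExprBoundaries {
    min: f64,                    // lower bound of the expression's values
    max: f64,                    // upper bound of the expression's values
    distinct_count: Option<u64>, // number of distinct values, if known
}

// A literal's boundary collapses to the single value it holds.
fn literal_boundaries(value: f64) -> ExprBoundaries {
    ExprBoundaries { min: value, max: value, distinct_count: Some(1) }
}

// A column's boundary mirrors the column statistics handed down by the plan.
fn column_boundaries(min: f64, max: f64, distinct: Option<u64>) -> ExprBoundaries {
    ExprBoundaries { min, max, distinct_count: distinct }
}

fn main() {
    assert_eq!(literal_boundaries(5.0).distinct_count, Some(1));
    let col = column_boundaries(1.0, 100.0, Some(100));
    assert_eq!((col.min, col.max), (1.0, 100.0));
}
```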

There is also a new API for updating the statistics of a column with new, known boundaries within a separate context. This is currently not used anywhere (although implemented and tested), but its primary purpose will be tightening the known maximums and minimums once we have support for composite predicates (for A = 50 AND B > A, we can see that A is actually 50, at least in that context, so that different contexts can apply to different splits of the predicate); a sketch of the idea follows.
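
A minimal sketch of that idea, assuming a per-context map from column to bounds (all names here are hypothetical, not this PR's API):

```rust
// Hypothetical sketch of a context-scoped statistics update; not this PR's API.
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
struct Bounds { min: i64, max: i64 }

// One analysis context: known bounds per column, valid only within that context.
type Context = HashMap<String, Bounds>;

// Seeing `column = value` lets this context pin the column to a single value.
fn apply_equality(ctx: &mut Context, column: &str, value: i64) {
    ctx.insert(column.to_string(), Bounds { min: value, max: value });
}

fn main() {
    let mut ctx = Context::from([("A".to_string(), Bounds { min: 0, max: 100 })]);
    // After `A = 50`, a later predicate in the same context (e.g. `B > A`)
    // can treat A as exactly 50.
    apply_equality(&mut ctx, "A", 50);
    assert_eq!(ctx["A"], Bounds { min: 50, max: 50 });
}
```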

Filter Selectivity Analysis

The selectivity analysis here is based on the uniform distribution assumption (since we currently don't have histograms, and they might not arrive in the short term) and uses the basic column bounds to match the ranges under this assumption.

Considering that $ is a uniformly distributed column (e.g. list(range(1, 100+1))), the selectivity of each operation is as follows:

| $ = [1, 100] | min | max | selectivity (formula) | selectivity |
| --- | --- | --- | --- | --- |
| $ = 75 | 75 | 75 | 1 / range($) | 1% |
| $ >= 75 | 75 | max($) = 100 | ((max($) - 75) + 1) / range($) | 26% |
| $ > 75 | 76* | max($) = 100 | (max($) - 75) / range($) | 25% |
| $ <= 75 | min($) = 1 | 75 | ((75 - min($)) + 1) / range($) | 75% |
| $ < 75 | min($) = 1 | 74* | (75 - min($)) / range($) | 74% |

(* for the strict comparisons, the bound is tightened by one since the column holds integers)

This uses the set bounds and assumes partial matches fall in the intersections (which is easier when we know one of the sets has a static bound, {75}). With this, we produce three different values: a new min (or the old one, depending on the operator), a new max (same), and a filter selectivity rate: the percentage of rows that the given expression would select if it were used as a predicate on a uniformly distributed column. The percentages for non-uniform distributions might differ from the ones above (if you are interested in playing with it, I have a very small script that shows them).
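
For example, the $ >= 75 row of the table can be reproduced with a few lines (a sketch under the uniform-distribution and integer-column assumptions, not the PR's actual code):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Bounds { min: i64, max: i64 }

// Number of distinct integer values in the closed interval [min, max].
fn range(b: Bounds) -> i64 { b.max - b.min + 1 }

// Analyze `col >= value`: the minimum tightens to `value`, the maximum stays,
// and under the uniform assumption the selectivity is the fraction of the
// original range that survives the predicate.
fn analyze_gt_eq(col: Bounds, value: i64) -> (Bounds, f64) {
    let new_bounds = Bounds { min: value, max: col.max };
    let surviving = (col.max - value + 1).max(0);
    (new_bounds, surviving as f64 / range(col) as f64)
}

fn main() {
    // $ = [1, 100], predicate `$ >= 75`: new bounds [75, 100], selectivity 26%.
    let (bounds, selectivity) = analyze_gt_eq(Bounds { min: 1, max: 100 }, 75);
    assert_eq!(bounds, Bounds { min: 75, max: 100 });
    assert!((selectivity - 0.26).abs() < 1e-9);
}
```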

Low Priority / Other (but still needed)

  • Moving ColumnStatistics and Statistics out of core/physical_plan (and into datafusion-common). They are still re-exported from core/physical_plan so as not to break the public API.

  • Implementing a general absolute-distance method that works with numeric scalars: scalar.distance(other) would return usize(|scalar - other|). This API is something I found was needed quite a bit when dealing with statistics, but if it is a bit too much (since we already have scalar.sub), I could also be persuaded to make it a utility method somewhere else (a sketch follows this list).

  • Operator.swap() for returning the swapped version of an operator, i.e. one that produces the same result when lhs/rhs are also swapped.
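
A tiny sketch of what the distance helper could look like for plain integers (the real method would live on ScalarValue and handle more types; this is only the shape of the idea):

```rust
// Hedged sketch: absolute distance between two numeric scalars as a usize.
// The proposed scalar.distance(other) would generalize this over ScalarValue.
fn distance(a: i64, b: i64) -> Option<usize> {
    // abs_diff avoids the overflow that a plain `(a - b).abs()` could hit.
    usize::try_from(a.abs_diff(b)).ok()
}

fn main() {
    assert_eq!(distance(3, 10), Some(7));
    assert_eq!(distance(10, 3), Some(7));
    // Operator.swap() would similarly map e.g. `<` to `>` so that
    // `lhs < rhs` can be rewritten as `rhs > lhs`.
}
```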

Are there any user-facing changes?

Quite a bit if you count the new APIs; if not, not much.

@github-actions bot added the core (Core DataFusion crate), logical-expr (Logical plan and expressions), and physical-expr (Physical Expressions) labels on Oct 18, 2022
@isidentical marked this pull request as ready for review on October 18, 2022 13:11
@alamb (Contributor) commented Oct 18, 2022:

Thanks @isidentical -- I will try and review this tomorrow

@alamb (Contributor) left a review:

Thank you @isidentical -- I think this is a really exciting step forward for DataFusion

This PR does not implement the statistics propagation for filters, but I can also include that. ... If it makes sense to include it here as well, please let me know (or if this PR is too big by itself, I can also split it further, depending on what is easier to review!)

It is great that you are splitting the PRs up to make them easier to review. Thank you both for explaining the context and for keeping the changes in reviewable pieces.

The only part of this PR I am not sure about is the calculation of min/max values for the boolean binary expressions -- however, I think it would also be fine to merge this PR as is and iterate on master.

@isidentical (Contributor, Author) commented Oct 19, 2022:

Thank you so much for your reviews @alamb and @Dandandan! I'll try to address them, but yes, I am also quite excited about the expression boundary analysis framework 🚀

The only part of this PR I am not sure about is the calculation of min/max values for the boolean binary expressions -- however, I think it would also be fine to merge this PR as is and iterate on master.

@alamb would you mind giving an example? Expression boundaries for a boolean field (like a, or true) would be ExpressionBoundaries { min: true, max: true, distinct: 1 }. And for a filter like a < 5, we would (or should, unless I am missing a scenario) not return a result, since the operands have different datatypes; this analysis is strictly for cases where both datatypes are the same.

In any case, I'd be happy to add a test for that case as well (if I understood the question correctly).

@alamb (Contributor) commented Oct 19, 2022:

@alamb would you mind giving an example? Expression boundaries for a boolean field (like a, or true) would be ExpressionBoundaries { min: true, max: true, distinct: 1 }. And for a filter like a < 5, we would (or should, unless I am missing a scenario) not return a result, since the operands have different datatypes; this analysis is strictly for cases where both datatypes are the same.

I guess I was thinking of an expression like

```
a < 5
```

If a has boundaries like

```
ExprBoundaries { min: 0, max: 10 }
```

I would expect the output boundaries for a < 5 to be

```
ExprBoundaries { min: true, max: true }
```

(which by the way devolves into something that could be used for range based pruning!)

But the code in this PR seems like it would return

```
ExprBoundaries { min: 0, max: 4 }
```

Or something

@isidentical (Contributor, Author) commented Oct 19, 2022:

That is definitely an interesting point of view 👀 I was thinking about it in a more restricted way, towards what the filter's outcome would be (what sort of a's can exist after we execute a < 5, hence [0, 4]), but I also see your point. I think it also highly relates to what an ExprBoundaries is (and what else we collect besides it, like the discussion in #3868 (comment)).

I would expect the output boundaries for a < 5 to be ExprBoundaries { min: true, max: true }

My only worry is that a technically correct version should produce ExprBoundaries { min: false, max: true }, since we don't know what a is. We know it can be true (since min(a) < 5) but we also know it can be false (since max(a) >= 5). So I am not sure how useful that will be (but as you have already noticed, we also collect the selectivity rate, so for us that is more than enough). But if we had some sort of statistics context (and an expression boundary aggregator), I think this would definitely be possible.

As in, we would record what a could be at a different level (a = [0, 4] after a < 5 has been processed, e.g. if it is inside an AND chain; a < 5 = [true, false]) than what the expression itself evaluates to. I think I understand it in general 👍🏻 (sorry for the confusing comment above, I was thinking of something entirely different.)
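
To make the two levels concrete, here is a toy sketch (hypothetical code, not what this PR implements) of the boolean boundaries of a < 5 given a's value bounds:

```rust
// Boolean boundaries of `col < value` given the column's value bounds.
// Returns (min, max) of the boolean result.
fn lt_bool_bounds(col_min: i64, col_max: i64, value: i64) -> (bool, bool) {
    let can_be_true = col_min < value;  // at least one possible row satisfies it
    let always_true = col_max < value;  // every possible row satisfies it
    (always_true, can_be_true)
}

fn main() {
    // a in [0, 10], predicate a < 5: sometimes true -> { min: false, max: true }.
    assert_eq!(lt_bool_bounds(0, 10, 5), (false, true));
    // a in [0, 4]: always true -> { min: true, max: true }, i.e. simplifiable.
    assert_eq!(lt_bool_bounds(0, 4, 5), (true, true));
}
```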

@alamb (Contributor) commented Oct 19, 2022:

My only worry is that a technically correct version should produce ExprBoundaries { min: false, max: true }

Yes, you are correct -- and in fact that is good information (it means the expression cannot be simplified further)

@isidentical (Contributor, Author) commented:

Since there was a lot of discussion (and many thanks for it @alamb, it was very inspiring), here is a quick summary:

@alamb (Contributor) commented Oct 20, 2022:

Thank you @isidentical -- I think this is great work. I will read / comment on #3898 carefully later today.

@alamb merged commit 6d44791 into apache:master on Oct 20, 2022