Implement foundational filter selectivity analysis #3868

Merged: 2 commits into apache:master on Oct 20, 2022

Conversation

@isidentical (Contributor) commented Oct 18, 2022:

Which issue does this PR close?

Part of #3845.

Rationale for this change

DataFusion has some cost-based optimizer rules (operating on the physical plan) that leverage statistics to pick a version of the plan with a lower estimated cost. One great example is side swapping on hash joins, where we try to make the build side as small as possible. These rules work reasonably well when there is no interference between the joins and the table provider (since joins can now estimate their output cardinality), but one common thing that can cut this flow is a filter node, which currently does not support producing statistics (as shown in @Dandandan's example in #3845).

What changes are included in this PR?

This PR does not implement the statistics propagation for filters, but I can also include that. The main reasoning was that, since this part is very self-contained (the new expression statistics API and the filter selectivity analysis), it could land separately, and we can then build the filter statistics estimation methods on top of it. If it makes sense to include it here as well, please let me know (or if this PR is too big by itself, I can also split it further, depending on what is easier to review!)

Expression Statistics

Since we needed a way of propagating statistics at the expression level (rather than the plan level), we now have a new API that is more suitable for it. Instead of dealing with individual columns, this API deals with boundaries of expression results. For the initial work, this is implemented for the following expressions, but it can be extended further (knowing the boundary of a + 1 or max(a) is easier once we have these three); a small sketch follows the list:

  • Literals, where the boundary's min/max is the value the literal holds and distinct_count is 1.
  • Columns, where the boundary is the same as the boundary from the column statistics (which are passed from the plan into the expressions when using this API).
  • Comparisons (=, >, >=, <, <=) between one column and one literal (more precisely, an expression whose boundary can be reduced to a single scalar). Using the filter selectivity analysis (also implemented in this PR), we estimate new min/max values and reason about the expression's lower and upper bounds, as well as its selectivity.
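
As a rough illustration of what these boundaries look like (the struct and helpers below are a hedged sketch with made-up names, not the exact API introduced here):

```rust
// Illustrative sketch only: names and field layout are hypothetical,
// not the exact API added by this PR.
#[derive(Debug, Clone, PartialEq)]
struct ExprBoundaries {
    min: f64,                    // lower bound of the expression's values
    max: f64,                    // upper bound of the expression's values
    distinct_count: Option<u64>, // number of distinct values, if known
}

// A literal's boundary collapses to the single value it holds.
fn literal_boundaries(value: f64) -> ExprBoundaries {
    ExprBoundaries { min: value, max: value, distinct_count: Some(1) }
}

// A column's boundary mirrors the column statistics handed down by the plan.
fn column_boundaries(min: f64, max: f64, distinct: Option<u64>) -> ExprBoundaries {
    ExprBoundaries { min, max, distinct_count: distinct }
}

fn main() {
    assert_eq!(literal_boundaries(5.0).distinct_count, Some(1));
    let col = column_boundaries(1.0, 100.0, Some(100));
    assert_eq!((col.min, col.max), (1.0, 100.0));
}
```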

There is also a new API for updating the statistics of a column with new, known boundaries within a separate context. This is currently not used anywhere (although implemented and tested), but its primary purpose will be tightening the known maximums and minimums once we have support for composite predicates (for A = 50 AND B > A, we can see that A is actually 50, at least in that context, so that different contexts can apply to different splits of the predicate); a sketch of the idea follows.
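
A minimal sketch of that idea, assuming a per-context map from column to bounds (all names here are hypothetical, not this PR's API):

```rust
// Hypothetical sketch of a context-scoped statistics update; not this PR's API.
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
struct Bounds { min: i64, max: i64 }

// One analysis context: known bounds per column, valid only within that context.
type Context = HashMap<String, Bounds>;

// Seeing `column = value` lets this context pin the column to a single value.
fn apply_equality(ctx: &mut Context, column: &str, value: i64) {
    ctx.insert(column.to_string(), Bounds { min: value, max: value });
}

fn main() {
    let mut ctx = Context::from([("A".to_string(), Bounds { min: 0, max: 100 })]);
    // After `A = 50`, a later predicate in the same context (e.g. `B > A`)
    // can treat A as exactly 50.
    apply_equality(&mut ctx, "A", 50);
    assert_eq!(ctx["A"], Bounds { min: 50, max: 50 });
}
```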

Filter Selectivity Analysis

The selectivity analysis here is based on the uniform distribution assumption (since we currently don't have histograms, and they might not arrive in the short term) and uses the basic column bounds to match the ranges under this assumption.

Considering that $ is a uniformly distributed column (e.g. list(range(1, 100+1))), the selectivity of each operation is as follows:

| $ = [1, 100] | min | max | selectivity (formula) | selectivity |
| --- | --- | --- | --- | --- |
| $ = 75 | 75 | 75 | 1 / range($) | 1% |
| $ >= 75 | 75 | max($) = 100 | ((max($) - 75) + 1) / range($) | 26% |
| $ > 75 | 76* | max($) = 100 | (max($) - 75) / range($) | 25% |
| $ <= 75 | min($) = 1 | 75 | ((75 - min($)) + 1) / range($) | 75% |
| $ < 75 | min($) = 1 | 74* | (75 - min($)) / range($) | 74% |

(* for the strict comparisons, the bound is tightened by one since the column holds integers)

This uses the set bounds and assumes partial matches fall in the intersections (which is easier when we know one of the sets has a static bound, {75}). With this, we produce three different values: a new min (or the old one, depending on the operator), a new max (same), and a filter selectivity rate: the percentage of rows that the given expression would select if it were used as a predicate on a uniformly distributed column. The percentages for non-uniform distributions might differ from the ones above (if you are interested in playing with it, I have a very small script that shows them).
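
For example, the $ >= 75 row of the table can be reproduced with a few lines (a sketch under the uniform-distribution and integer-column assumptions, not the PR's actual code):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Bounds { min: i64, max: i64 }

// Number of distinct integer values in the closed interval [min, max].
fn range(b: Bounds) -> i64 { b.max - b.min + 1 }

// Analyze `col >= value`: the minimum tightens to `value`, the maximum stays,
// and under the uniform assumption the selectivity is the fraction of the
// original range that survives the predicate.
fn analyze_gt_eq(col: Bounds, value: i64) -> (Bounds, f64) {
    let new_bounds = Bounds { min: value, max: col.max };
    let surviving = (col.max - value + 1).max(0);
    (new_bounds, surviving as f64 / range(col) as f64)
}

fn main() {
    // $ = [1, 100], predicate `$ >= 75`: new bounds [75, 100], selectivity 26%.
    let (bounds, selectivity) = analyze_gt_eq(Bounds { min: 1, max: 100 }, 75);
    assert_eq!(bounds, Bounds { min: 75, max: 100 });
    assert!((selectivity - 0.26).abs() < 1e-9);
}
```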

Low Priority / Other (but still needed)

  • Moving ColumnStatistics and Statistics out of core/physical_plan (and into datafusion-common). They are still re-exported from core/physical_plan so as not to break the public API.

  • Implementing a general absolute-distance method that works with numeric scalars: scalar.distance(other) would return usize(|scalar - other|). This API is something I found was needed quite a bit when dealing with statistics, but if it is a bit too much (since we already have scalar.sub), I could also be persuaded to make it a utility method somewhere else (a sketch follows this list).

  • Operator.swap() for returning the swapped version of an operator, i.e. one that produces the same result when lhs/rhs are also swapped.
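
A tiny sketch of what the distance helper could look like for plain integers (the real method would live on ScalarValue and handle more types; this is only the shape of the idea):

```rust
// Hedged sketch: absolute distance between two numeric scalars as a usize.
// The proposed scalar.distance(other) would generalize this over ScalarValue.
fn distance(a: i64, b: i64) -> Option<usize> {
    // abs_diff avoids the overflow that a plain `(a - b).abs()` could hit.
    usize::try_from(a.abs_diff(b)).ok()
}

fn main() {
    assert_eq!(distance(3, 10), Some(7));
    assert_eq!(distance(10, 3), Some(7));
    // Operator.swap() would similarly map e.g. `<` to `>` so that
    // `lhs < rhs` can be rewritten as `rhs > lhs`.
}
```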

Are there any user-facing changes?

Quite a bit if you count the new APIs; if not, not much.

@github-actions bot added the core (Core DataFusion crate), logical-expr (Logical plan and expressions), and physical-expr (Physical Expressions) labels on Oct 18, 2022
@isidentical marked this pull request as ready for review on October 18, 2022 13:11
@alamb (Contributor) commented Oct 18, 2022:

Thanks @isidentical -- I will try and review this tomorrow

@alamb (Contributor) left a review:

Thank you @isidentical -- I think this is a really exciting step forward for DataFusion

This PR does not implement the statistics propagation for filters, but I can also include that. ... If it makes sense to include it here as well, please let me know (or if this PR is too big by itself, I can also split it further, depending on what is easier to review!)

It is great that you are splitting the PRs up to make them easier to review. Thank you both for explaining the context and for keeping the changes in reviewable pieces.

The only part of this PR I am not sure about is the calculation of min/max values for the boolean binary expressions -- however, I think it would also be fine to merge this PR as is and iterate on master.

@isidentical (Contributor, Author) commented Oct 19, 2022:

Thank you so much for your reviews @alamb and @Dandandan! I'll try to address them, but yes, I am also quite excited about the expression boundary analysis framework 🚀

The only part of this PR I am not sure about is the calculation of min/max values for the boolean binary expressions -- however, I think it would also be fine to merge this PR as is and iterate on master.

@alamb would you mind giving an example? Expression boundaries for a boolean field (like a, or true) would be ExpressionBoundaries { min: true, max: true, distinct: 1 }. And for a filter like a < 5, we would (or should, unless I am missing a scenario) not return a result, since the operands have different datatypes; this analysis is strictly for cases where both datatypes are the same.

In any case, I'd be happy to add a test for that case as well (if I understood the question correctly).

@alamb (Contributor) commented Oct 19, 2022:

@alamb would you mind giving an example? Expression boundaries for a boolean field (like a, or true) would be ExpressionBoundaries { min: true, max: true, distinct: 1 }. And for a filter like a < 5, we would (or should, unless I am missing a scenario) not return a result, since the operands have different datatypes; this analysis is strictly for cases where both datatypes are the same.

I guess I was thinking of an expression like

```
a < 5
```

If a has boundaries like

```
ExprBoundaries { min: 0, max: 10 }
```

I would expect the output boundaries for a < 5 to be

```
ExprBoundaries { min: true, max: true }
```

(which by the way devolves into something that could be used for range based pruning!)

But the code in this PR seems like it would return

```
ExprBoundaries { min: 0, max: 4 }
```

Or something

@isidentical (Contributor, Author) commented Oct 19, 2022:

That is definitely an interesting point of view 👀 I was thinking about it in a more restricted way, towards what the filter's outcome would be (what sort of a's can exist after we execute a < 5, hence [0, 4]), but I also see your point. I think it also highly relates to what an ExprBoundaries is (and what else we collect besides it, like the discussion in #3868 (comment)).

I would expect the output boundaries for a < 5 to be ExprBoundaries { min: true, max: true }

My only worry is that a technically correct version should produce ExprBoundaries { min: false, max: true }, since we don't know what a is. We know it can be true (since min(a) < 5) but we also know it can be false (since max(a) >= 5). So I am not sure how useful that will be (but as you have already noticed, we also collect the selectivity rate, so for us that is more than enough). But if we had some sort of statistics context (and an expression boundary aggregator), I think this would definitely be possible.

As in, we would record what a could be at a different level (a = [0, 4] after a < 5 has been processed, e.g. if it is inside an AND chain; a < 5 = [true, false]) than what the expression itself evaluates to. I think I understand it in general 👍🏻 (sorry for the confusing comment above, I was thinking of something entirely different.)
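
To make the two levels concrete, here is a toy sketch (hypothetical code, not what this PR implements) of the boolean boundaries of a < 5 given a's value bounds:

```rust
// Boolean boundaries of `col < value` given the column's value bounds.
// Returns (min, max) of the boolean result.
fn lt_bool_bounds(col_min: i64, col_max: i64, value: i64) -> (bool, bool) {
    let can_be_true = col_min < value;  // at least one possible row satisfies it
    let always_true = col_max < value;  // every possible row satisfies it
    (always_true, can_be_true)
}

fn main() {
    // a in [0, 10], predicate a < 5: sometimes true -> { min: false, max: true }.
    assert_eq!(lt_bool_bounds(0, 10, 5), (false, true));
    // a in [0, 4]: always true -> { min: true, max: true }, i.e. simplifiable.
    assert_eq!(lt_bool_bounds(0, 4, 5), (true, true));
}
```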

@alamb (Contributor) commented Oct 19, 2022:

My only worry is that a technically correct version should produce ExprBoundaries { min: false, max: true }

Yes, you are correct -- and in fact that is good information (it means the expression cannot be simplified further)

@isidentical (Contributor, Author) commented:

Since there was a lot of discussion (and many thanks for it @alamb, it was very inspiring), here is a quick summary:

@alamb (Contributor) commented Oct 20, 2022:

Thank you @isidentical -- I think this is great work. I will read / comment on #3898 carefully later today.

@alamb merged commit 6d44791 into apache:master on Oct 20, 2022