Implement foundational filter selectivity analysis #3868
Conversation
Thanks @isidentical -- I will try and review this tomorrow
Thank you @isidentical -- I think this is a really exciting step forward for DataFusion
This PR does not implement the statistics propagation for filters, but I can also include that. ... If it makes sense to include it here as well, please let me know (or if this PR is too big by itself, I can also split it further, depending on what is easier to review!)
It is great that you are splitting the PRs up to make them easier to review. Thank you both for explaining the context and for trying to keep them in reviewable pieces.
The only part of this PR I am not sure about is the calculation of min/max values for the boolean binary expressions -- however, I think it would also be fine to merge this PR as is and iterate on master.
Thank you so much for your reviews @alamb and @Dandandan! I'll try to address them, but yes, I am also quite excited for the expression boundary analysis framework 🚀
@alamb would you mind giving an example? Expression boundaries for a boolean field (like …). In any case, I'd be happy to also add a test for that case as well (if I understood the question correctly).
I guess I was thinking of an expression like …. If a has boundaries like …, I would expect the output boundaries for … to be … (which, by the way, devolves into something that could be used for range-based pruning!). But the code in this PR seems like it would return …. Or something.
That is definitely an interesting point of view 👀 I was thinking more restricted towards what the filter's outcome would be (what sort of …). My only worry is that a technically correct version should produce …. As in, we would record what ….
Yes, you are correct -- and in fact that is good information (it means the expression cannot be simplified further)
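For what it's worth, the two interpretations discussed above can be sketched like this (illustrative Python only, with a hypothetical predicate `a > 5` and made-up bounds `[1, 100]`; this is not code from the PR):

```python
def boolean_output_boundaries():
    """Boundaries of the predicate's own (boolean) result:
    the expression can evaluate to anything between false and true."""
    return (False, True)

def pruned_input_boundaries(a_min, a_max, threshold):
    """Range-pruning view: the boundaries that `a` itself can have
    on the rows where the hypothetical predicate `a > threshold` holds."""
    lo = max(a_min, threshold + 1)
    return (lo, a_max) if lo <= a_max else None

# With a in [1, 100] and the predicate `a > 5`:
print(boolean_output_boundaries())         # (False, True)
print(pruned_input_boundaries(1, 100, 5))  # (6, 100)
```

The first view describes the filter expression's output; the second is the one that devolves into range-based pruning of the column itself.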
Since there was a lot of discussion (and many thanks for it @alamb, it was very inspiring), here is a quick summary:
Thank you @isidentical -- I think this is great work. I will read / comment on #3898 carefully later today
Which issue does this PR close?
Part of #3845.
Rationale for this change
DataFusion has some cost-based optimizer rules (that operate on the physical plan) where statistics are leveraged to pick a new version of that plan with lower estimated cost. One great example is side swapping in the hash join, where we try to make the build side as small as possible. These rules work reasonably well when there is no interference between the joins and the table provider (since joins can now estimate their output cardinality), but one common thing that can cut this flow is a filter node, which currently does not support producing statistics (as shown in @Dandandan's example in #3845).
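To make the motivation concrete, here is a minimal sketch (hypothetical helper name, not DataFusion's actual API) of how a cost-based rule might pick the hash join build side from cardinality estimates, and how a node without statistics breaks the chain:

```python
def choose_build_side(left_rows, right_rows):
    """Pick the smaller estimated side as the hash join build side.
    Row counts are Optional[int]; a node that cannot produce statistics
    (like a filter today) yields None, and the rule cannot fire."""
    if left_rows is None or right_rows is None:
        return None  # no estimate available
    return "left" if left_rows <= right_rows else "right"

print(choose_build_side(1_000_000, 500))  # "right": build on the small side
print(choose_build_side(None, 500))       # None: a statistics-less filter broke the estimate
```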
What changes are included in this PR?
This PR does not implement the statistics propagation for filters, but I can also include that. The main reasoning was that, since it is very self-contained (the new expression statistics API and the filter selectivity analysis), it could be had separately and then we can build the filter statistics estimation methods on top of it. If it makes sense to include it here as well, please let me know (or if this PR is too big by itself, I can also split it further, depending on what is easier to review!)
Expression Statistics
Since we needed a way of propagating statistics at the expression level (rather than the plan level), we now have a new API that is more suitable for it. Instead of dealing with individual columns, this API deals with the boundaries of expression results. For the initial work, this is implemented for the following expressions, but it can also be extended further (knowing the boundary of `a + 1` or `max(a)` is easier once we have these 3):

- Literals: the `min`/`max` are the value that it holds and the `distinct_count` is `1`.
- Columns: from the column statistics' `min`/`max` values (and the selectivity of the statistics) we can reason about the expression's lower and upper bounds (as well as its selectivity).

There is also a new API for updating the statistics of a column with new, known boundaries in a separate context. This is currently not used anywhere (although implemented and tested), but its primary functionality will be limiting the known maximums and minimums once we have support for composite predicates (`A = 50 AND B > A` can now see that `A` is actually `50`, at least in that context, so that different contexts can apply between different splits of predicates).
Filter Selectivity Analysis
The selectivity analysis here is based on the uniform distribution assumption (since we currently don't have histograms, and they might never come in the short term) and uses the basic column bounds to match the ranges with this assumption.
Considering that `$` is a uniformly distributed column (e.g. `list(range(1, 100+1))`), the selectivity of each operation is the following:

| operation | example bound | selectivity |
|---|---|---|
| `$ = 50` | | `1 / range($)` |
| `$ >= 50` | `max($) = 100` | `((max($) - 50) + 1) / range($)` |
| `$ > 50` | `max($) = 100` | `(max($) - 50) / range($)` |
| `$ <= 75` | `min($) = 1` | `((75 - min($)) + 1) / range($)` |
| `$ < 75` | `min($) = 1` | `(75 - min($)) / range($)` |
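The formulas above can be checked against a brute-force count over the example column. A small illustrative Python sketch, mirroring the uniform `list(range(1, 100+1))` column:

```python
col_min, col_max = 1, 100
value_range = col_max - col_min + 1  # range($) for the integer column

# selectivities from the formulas above
eq_50 = 1 / value_range                     # $ = 50
ge_50 = ((col_max - 50) + 1) / value_range  # $ >= 50
gt_50 = (col_max - 50) / value_range        # $ > 50
le_75 = ((75 - col_min) + 1) / value_range  # $ <= 75
lt_75 = (75 - col_min) / value_range        # $ < 75

# brute-force check on the actual uniform column
col = list(range(1, 100 + 1))
assert eq_50 == sum(1 for x in col if x == 50) / len(col)  # 0.01
assert ge_50 == sum(1 for x in col if x >= 50) / len(col)  # 0.51
assert gt_50 == sum(1 for x in col if x > 50) / len(col)   # 0.50
assert le_75 == sum(1 for x in col if x <= 75) / len(col)  # 0.75
assert lt_75 == sum(1 for x in col if x < 75) / len(col)   # 0.74
```

On a genuinely uniform column the formula and the exact count agree; on skewed data they would diverge, which is exactly the limitation of the uniform-distribution assumption.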
This uses set bounds and tries to assume partial matches would be in the intersections (which is easier when we know one of the sets has a static bound, `{75}`), and with this we produce three different values: a new `min` (or the old one, depending on the operator), a new `max` (same), and a filter selectivity rate: the percentage of rows that the given expression would select if it was used as a predicate on a uniformly distributed column. The non-uniform percentages might be different from the ones above (if you are interested in playing with it, I have a very small script that shows them).
Low Priority / Other (but still needed)
- Moving `ColumnStatistics` and `Statistics` out of `core/physical_plan` (and into `datafusion-common`). They are still exposed from `core/physical_plan` so as not to break the public API.
- Implementing a general absolute distance method that works with numeric scalars: `scalar.distance(other)` would return `usize(|scalar - other|)`. This API is something that I saw was needed quite a bit when dealing with statistics, but if it is a bit too much (since we already have `scalar.sub`), I could also be persuaded into making it a utility method somewhere else.
- `Operator.swap()` for returning a swapped version of the same operator that can function if lhs/rhs are also swapped.

Are there any user-facing changes?
Quite a bit if you count the new APIs; if not, not much.
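For illustration, the semantics of the two helpers from the list above could look roughly like this (a Python sketch of the idea only, not the actual Rust implementation in the PR):

```python
def swap_operator(op):
    """Sketch of the `Operator.swap()` idea: return the operator that
    yields the same result once lhs and rhs are swapped (a < b == b > a)."""
    return {"=": "=", "!=": "!=", "<": ">", ">": "<", "<=": ">=", ">=": "<="}[op]

def distance(a, b):
    """Sketch of `scalar.distance(other)`: the absolute distance
    between two numeric scalars, i.e. |a - b|."""
    return abs(a - b)

print(swap_operator("<"))  # ">"
print(distance(3, 10))     # 7
```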