-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement contained
API in PruningPredicate
#8440
Conversation
18c042c
to
e935241
Compare
/// container, return `None` (the default). | ||
/// | ||
/// Note: the returned array must contain [`Self::num_containers`] rows | ||
fn contains( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the new API -- it is slightly different than the proposal because it takes a HashSet
rather than a single value, which is necessary to support x IN (....)
type predicates
@@ -276,21 +383,21 @@ fn is_always_true(expr: &Arc<dyn PhysicalExpr>) -> bool { | |||
/// Handles creating references to the min/max statistics | |||
/// for columns as well as recording which statistics are needed | |||
#[derive(Debug, Default, Clone)] | |||
pub(crate) struct RequiredStatColumns { | |||
pub(crate) struct RequiredColumns { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed this to be more specific and since it is crate
private it is not a breaking API change
One thing I did try was encoding the columns needed for literal guarantees in this structure, but I found the code was very specific to min/max/count
statistics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, when I worked on statistics where I only needed min and max, I did not see the need to to use the available struct that include a lot more info
field.data_type().clone(), | ||
field.is_nullable(), | ||
); | ||
// may be null if statistics are not present |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A non-nullable column may appear as NULL in the min/max
statistic values if the min or max values are not known, even if the original column can not contain null
&schema, | ||
&mut RequiredStatColumns::new(), | ||
); | ||
let predicate_expr = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is just reformatting resulting in a shorter name for RequiredStatColumns
@@ -2484,10 +2614,376 @@ mod tests { | |||
// TODO: add other negative test for other case and op | |||
} | |||
|
|||
#[test] | |||
fn prune_with_contains_one_column() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spent quite a while on these tests and I think they are pretty thorough
e935241
to
7c29979
Compare
7c29979
to
890c68b
Compare
contains
API in PruningPredicatecontained
API in PruningPredicate
e05a18c
to
5e166c9
Compare
5e166c9
to
66e212c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. I have a question but I think I was confused about the inference
// column is only in the set of values so we can prune the | ||
// container | ||
Guarantee::NotIn => { | ||
builder.append_array(&arrow::compute::not(&results)?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Guarantee In and NotIn are used very nice here 👍
// conjunct so we can't prune any containers based on that | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These append functions are nice. Easy to understand.
@@ -276,21 +383,21 @@ fn is_always_true(expr: &Arc<dyn PhysicalExpr>) -> bool { | |||
/// Handles creating references to the min/max statistics | |||
/// for columns as well as recording which statistics are needed | |||
#[derive(Debug, Default, Clone)] | |||
pub(crate) struct RequiredStatColumns { | |||
pub(crate) struct RequiredColumns { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, when I worked on statistics where I only needed min and max, I did not see the need to to use the available struct that include a lot more info
&schema, | ||
&statistics, | ||
// rule out containers ('false) where we know foo is not present | ||
vec![true, false, true, true, false, true, true, false, true], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment
// logically this predicate can't possibly be true (the column can't | ||
// take on both values) but we could rule it out if the stats tell | ||
// us that both values are not present | ||
vec![true, true, true, true, true, true, true, true, true], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but we could rule it out if the stats tell
// us that both values are not present
Is it possible to have a test container false
to say both values are not present
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you can say that by returning false
for contained("s1", {foo, bar})
However, in this case I think what happens is we end up with two distinct literal guarantees and the container would have to know that a container only had foo AND only had bar, which is logically impossible.
So in other words, this expression
Pruning with expr: s1 != Utf8("foo") AND s2 != Utf8("bar")
Generates these guarantees:
Got guarantees: [
LiteralGuarantee { column: Column { relation: None, name: "s1" }, guarantee: NotIn, literals: {Utf8("foo")} },
LiteralGuarantee { column: Column { relation: None, name: "s2" }, guarantee: NotIn, literals: {Utf8("bar")} }
]
I think it would be possiible to do another round of analysis on this and prove this can never be true. I am not sure how important the use case is however.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense.Thanks Andrew
vec![false, false, false, true, true, true, true, true, true], | ||
); | ||
|
||
// s1 != foo AND s1 != bar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I start getting confused. What is the difference between this and !(s1 = 'foo' OR s1 = 'bar')
. Their results are not opposite of each other. I guess some inference here that I cannot figure out yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least in this case it has to do with what is known / provided. In this case, the logic operates on the two conjuncts separately so it consults what it knows about s1 and
fooand what it knows about
s1and
bar` separately.
In order to reason about s1 = 'foo' OR s1 = 'bar'
it needs to used what it knows about s1
and {foo, bar}
rather than about them individually
However, in this case I think what would make sense (and probably what actally happens) is that !(s1 = 'foo' OR s1 = 'bar')
would be simplified to `s1 != 'foo' AND s1 != 'bar' at a higher level
Co-authored-by: Nga Tran <[email protected]>
…n into alamb/contains_api
@@ -993,95 +1102,139 @@ mod tests { | |||
/// | |||
/// Note All `ArrayRefs` must be the same size. | |||
struct ContainerStats { | |||
min: ArrayRef, | |||
max: ArrayRef, | |||
min: Option<ArrayRef>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modified this fixture to support different combinations of min/max/contained
@@ -2484,10 +2617,376 @@ mod tests { | |||
// TODO: add other negative test for other case and op | |||
} | |||
|
|||
#[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new tests start here
// logically this predicate can't possibly be true (the column can't | ||
// take on both values) but we could rule it out if the stats tell | ||
// us that both values are not present | ||
vec![true, true, true, true, true, true, true, true, true], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you can say that by returning false
for contained("s1", {foo, bar})
However, in this case I think what happens is we end up with two distinct literal guarantees and the container would have to know that a container only had foo AND only had bar, which is logically impossible.
So in other words, this expression
Pruning with expr: s1 != Utf8("foo") AND s2 != Utf8("bar")
Generates these guarantees:
Got guarantees: [
LiteralGuarantee { column: Column { relation: None, name: "s1" }, guarantee: NotIn, literals: {Utf8("foo")} },
LiteralGuarantee { column: Column { relation: None, name: "s2" }, guarantee: NotIn, literals: {Utf8("bar")} }
]
I think it would be possiible to do another round of analysis on this and prove this can never be true. I am not sure how important the use case is however.
Sorry for the delay... I'm to review this today or tomorrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've checked the new contained
API and the usage of Guarantee
, and it's smooth to review this PR after getting clear of Guarantee
before. Thanks for submitting this and splitting them into small pieces 🚀
/// A min/max pruning predicate (rewritten in terms of column min/max | ||
/// values, which are supplied by statistics) | ||
predicate_expr: Arc<dyn PhysicalExpr>, | ||
/// The statistics required to evaluate this predicate | ||
required_columns: RequiredStatColumns, | ||
/// Original physical predicate from which this predicate expr is derived (required for serialization) | ||
/// Description of which statistics are required to evaluate `predicate_expr` | ||
required_columns: RequiredColumns, | ||
/// Original physical predicate from which this predicate expr is derived | ||
/// (required for serialization) | ||
orig_expr: Arc<dyn PhysicalExpr>, | ||
/// [`LiteralGuarantee`]s that are used to try and prove a predicate can not | ||
/// possibly evaluate to `true`. | ||
literal_guarantees: Vec<LiteralGuarantee>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those docs are very helpful 👍
// Next, try to prove the predicate can't be true for the containers based | ||
// on min/max values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize min/max has the same confidence with bloom filter and guarantee. Thinking this way might make it easier to verify Guarantee
/// | ||
/// # Panics | ||
/// If `value` is not boolean | ||
fn append_value(&mut self, value: ColumnarValue) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A random thought about the naming: "append" sometimes implies "push and extend", but from the implementation, this method looks closer to "and"(&) the given boolean array with the existing one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a very good point. I have renamed them to combine_array
and combine_value
in 71c41f2 which I think better explains what they are doing
Note: the follow on PR #8442 rewrites the Bloom filter implementation to use this new API -- see the POC PR #8397 for how it fits together)
Which issue does this PR close?
Part 2 (of 3) of #8376
Rationale for this change
I am generalizing the pruning predicate to support bloom filters and other structures that can test set membership. See #8376 for more details.
This both helps DataFusion's bloom filter support, but also can be used by other systems that use
PruningPredicate
sWhat changes are included in this PR?
contained
API toPruningStatistics
contained
APIAre these changes tested?
Yes, there are many new tests
Are there any user-facing changes?
Yes, there is a new API for
PruningPredicate
, but otherwise I don't think there is anything else