Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement contained API in PruningPredicate #8440

Merged
merged 7 commits into from
Dec 23, 2023

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Dec 6, 2023

Note: the follow on PR #8442 rewrites the Bloom filter implementation to use this new API -- see the POC PR #8397 for how it fits together)

Which issue does this PR close?

Part 2 (of 3) of #8376

Rationale for this change

I am generalizing the pruning predicate to support bloom filters and other structures that can test set membership. See #8376 for more details.

This both helps DataFusion's bloom filter support, but also can be used by other systems that use PruningPredicates

What changes are included in this PR?

  1. Adds the contained API to PruningStatistics
  2. Connect PruningPredicate logic to contained API
  3. tests

Are these changes tested?

Yes, there are many new tests

Are there any user-facing changes?

Yes, there is a new API for PruningPredicate, but otherwise I don't think there is anything else

@alamb alamb changed the title Alamb/contains api Implement contains API in PruningPredicate Dec 6, 2023
@github-actions github-actions bot added physical-expr Physical Expressions core Core DataFusion crate labels Dec 6, 2023
@alamb alamb force-pushed the alamb/contains_api branch from 18c042c to e935241 Compare December 6, 2023 17:07
/// container, return `None` (the default).
///
/// Note: the returned array must contain [`Self::num_containers`] rows
fn contains(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the new API -- it is slightly different than the proposal because it takes a HashSet rather than a single value, which is necessary to support x IN (....) type predicates

@@ -276,21 +383,21 @@ fn is_always_true(expr: &Arc<dyn PhysicalExpr>) -> bool {
/// Handles creating references to the min/max statistics
/// for columns as well as recording which statistics are needed
#[derive(Debug, Default, Clone)]
pub(crate) struct RequiredStatColumns {
pub(crate) struct RequiredColumns {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed this to be more specific and since it is crate private it is not a breaking API change

One thing I did try was encoding the columns needed for literal guarantees in this structure, but I found the code was very specific to min/max/count statistics

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, when I worked on statistics where I only needed min and max, I did not see the need to to use the available struct that include a lot more info

field.data_type().clone(),
field.is_nullable(),
);
// may be null if statistics are not present
Copy link
Contributor Author

@alamb alamb Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A non-nullable column may appear as NULL in the min/max statistic values if the min or max values are not known, even if the original column can not contain null

&schema,
&mut RequiredStatColumns::new(),
);
let predicate_expr =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just reformatting resulting in a shorter name for RequiredStatColumns

@@ -2484,10 +2614,376 @@ mod tests {
// TODO: add other negative test for other case and op
}

#[test]
fn prune_with_contains_one_column() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent quite a while on these tests and I think they are pretty thorough

@alamb alamb force-pushed the alamb/contains_api branch from e935241 to 7c29979 Compare December 6, 2023 17:16
@alamb alamb force-pushed the alamb/contains_api branch from 7c29979 to 890c68b Compare December 18, 2023 21:47
@github-actions github-actions bot removed the physical-expr Physical Expressions label Dec 18, 2023
@alamb alamb changed the title Implement contains API in PruningPredicate Implement contained API in PruningPredicate Dec 18, 2023
@alamb alamb force-pushed the alamb/contains_api branch from e05a18c to 5e166c9 Compare December 19, 2023 15:32
@alamb alamb force-pushed the alamb/contains_api branch from 5e166c9 to 66e212c Compare December 19, 2023 15:56
@alamb alamb marked this pull request as ready for review December 19, 2023 17:46
@alamb
Copy link
Contributor Author

alamb commented Dec 19, 2023

FYI @waynexia @NGA-TRAN here is the next installment of pruning with equality predicates, if you have time to review I would apprecaiate it

Copy link
Contributor

@NGA-TRAN NGA-TRAN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I have a question but I think I was confused about the inference

// column is only in the set of values so we can prune the
// container
Guarantee::NotIn => {
builder.append_array(&arrow::compute::not(&results)?)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Guarantee In and NotIn are used very nice here 👍

// conjunct so we can't prune any containers based on that
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These append functions are nice. Easy to understand.

@@ -276,21 +383,21 @@ fn is_always_true(expr: &Arc<dyn PhysicalExpr>) -> bool {
/// Handles creating references to the min/max statistics
/// for columns as well as recording which statistics are needed
#[derive(Debug, Default, Clone)]
pub(crate) struct RequiredStatColumns {
pub(crate) struct RequiredColumns {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, when I worked on statistics where I only needed min and max, I did not see the need to to use the available struct that include a lot more info

datafusion/core/src/physical_optimizer/pruning.rs Outdated Show resolved Hide resolved
&schema,
&statistics,
// rule out containers ('false) where we know foo is not present
vec![true, false, true, true, false, true, true, false, true],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment

// logically this predicate can't possibly be true (the column can't
// take on both values) but we could rule it out if the stats tell
// us that both values are not present
vec![true, true, true, true, true, true, true, true, true],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but we could rule it out if the stats tell
// us that both values are not present

Is it possible to have a test container false to say both values are not present?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you can say that by returning false for contained("s1", {foo, bar})

However, in this case I think what happens is we end up with two distinct literal guarantees and the container would have to know that a container only had foo AND only had bar, which is logically impossible.

So in other words, this expression

Pruning with expr: s1 != Utf8("foo") AND s2 != Utf8("bar")

Generates these guarantees:

   Got guarantees: [
    LiteralGuarantee { column: Column { relation: None, name: "s1" }, guarantee: NotIn, literals: {Utf8("foo")} },
    LiteralGuarantee { column: Column { relation: None, name: "s2" }, guarantee: NotIn, literals: {Utf8("bar")} }
  ]

I think it would be possiible to do another round of analysis on this and prove this can never be true. I am not sure how important the use case is however.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.Thanks Andrew

vec![false, false, false, true, true, true, true, true, true],
);

// s1 != foo AND s1 != bar
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I start getting confused. What is the difference between this and !(s1 = 'foo' OR s1 = 'bar'). Their results are not opposite of each other. I guess some inference here that I cannot figure out yet

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least in this case it has to do with what is known / provided. In this case, the logic operates on the two conjuncts separately so it consults what it knows about s1 and fooand what it knows abouts1andbar` separately.

In order to reason about s1 = 'foo' OR s1 = 'bar' it needs to used what it knows about s1 and {foo, bar} rather than about them individually

However, in this case I think what would make sense (and probably what actally happens) is that !(s1 = 'foo' OR s1 = 'bar') would be simplified to `s1 != 'foo' AND s1 != 'bar' at a higher level

datafusion/core/src/physical_optimizer/pruning.rs Outdated Show resolved Hide resolved
datafusion/core/src/physical_optimizer/pruning.rs Outdated Show resolved Hide resolved
@@ -993,95 +1102,139 @@ mod tests {
///
/// Note All `ArrayRefs` must be the same size.
struct ContainerStats {
min: ArrayRef,
max: ArrayRef,
min: Option<ArrayRef>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified this fixture to support different combinations of min/max/contained

@@ -2484,10 +2617,376 @@ mod tests {
// TODO: add other negative test for other case and op
}

#[test]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new tests start here

// logically this predicate can't possibly be true (the column can't
// take on both values) but we could rule it out if the stats tell
// us that both values are not present
vec![true, true, true, true, true, true, true, true, true],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you can say that by returning false for contained("s1", {foo, bar})

However, in this case I think what happens is we end up with two distinct literal guarantees and the container would have to know that a container only had foo AND only had bar, which is logically impossible.

So in other words, this expression

Pruning with expr: s1 != Utf8("foo") AND s2 != Utf8("bar")

Generates these guarantees:

   Got guarantees: [
    LiteralGuarantee { column: Column { relation: None, name: "s1" }, guarantee: NotIn, literals: {Utf8("foo")} },
    LiteralGuarantee { column: Column { relation: None, name: "s2" }, guarantee: NotIn, literals: {Utf8("bar")} }
  ]

I think it would be possiible to do another round of analysis on this and prove this can never be true. I am not sure how important the use case is however.

@waynexia
Copy link
Member

Sorry for the delay... I'm to review this today or tomorrow

Copy link
Member

@waynexia waynexia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked the new contained API and the usage of Guarantee, and it's smooth to review this PR after getting clear of Guarantee before. Thanks for submitting this and splitting them into small pieces 🚀

Comment on lines +170 to +180
/// A min/max pruning predicate (rewritten in terms of column min/max
/// values, which are supplied by statistics)
predicate_expr: Arc<dyn PhysicalExpr>,
/// The statistics required to evaluate this predicate
required_columns: RequiredStatColumns,
/// Original physical predicate from which this predicate expr is derived (required for serialization)
/// Description of which statistics are required to evaluate `predicate_expr`
required_columns: RequiredColumns,
/// Original physical predicate from which this predicate expr is derived
/// (required for serialization)
orig_expr: Arc<dyn PhysicalExpr>,
/// [`LiteralGuarantee`]s that are used to try and prove a predicate can not
/// possibly evaluate to `true`.
literal_guarantees: Vec<LiteralGuarantee>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those docs are very helpful 👍

Comment on lines +264 to +265
// Next, try to prove the predicate can't be true for the containers based
// on min/max values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize min/max has the same confidence with bloom filter and guarantee. Thinking this way might make it easier to verify Guarantee

///
/// # Panics
/// If `value` is not boolean
fn append_value(&mut self, value: ColumnarValue) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A random thought about the naming: "append" sometimes implies "push and extend", but from the implementation, this method looks closer to "and"(&) the given boolean array with the existing one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a very good point. I have renamed them to combine_array and combine_value in 71c41f2 which I think better explains what they are doing

@alamb
Copy link
Contributor Author

alamb commented Dec 22, 2023

Thank you very much for the review @waynexia and @NGA-TRAN 🙏

@alamb alamb merged commit 8524d58 into apache:main Dec 23, 2023
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants