Implement `contained` API in PruningPredicate #8440

alamb · 2023-12-06T15:19:50Z

Note: the follow-on PR #8442 rewrites the Bloom filter implementation to use this new API (see the POC PR #8397 for how it fits together).

Which issue does this PR close?

Part 2 (of 3) of #8376

Rationale for this change

I am generalizing the pruning predicate to support bloom filters and other structures that can test set membership. See #8376 for more details.

This helps DataFusion's bloom filter support, and it can also be used by other systems that use PruningPredicates.

What changes are included in this PR?

Adds the `contained` API to PruningStatistics
Connects the PruningPredicate logic to the `contained` API
Adds tests

Are these changes tested?

Yes, there are many new tests

Are there any user-facing changes?

Yes, there is a new API for PruningPredicate, but otherwise I don't think there is anything else

alamb · 2023-12-06T17:08:10Z

datafusion/core/src/physical_optimizer/pruning.rs

+    /// container, return `None` (the default).
+    ///
+    /// Note: the returned array must contain [`Self::num_containers`] rows
+    fn contains(


This is the new API. It is slightly different from the proposal because it takes a HashSet rather than a single value, which is necessary to support `x IN (....)` type predicates.
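To make the shape of the API concrete, here is a minimal sketch using only standard library types. It is not the code in this PR: the trait name `SetMembershipStatistics`, the struct `ExactValues`, and the helper `string_set` are hypothetical, and strings plus `Vec<Option<bool>>` stand in for the scalar values and boolean array used by the real API. The three-way answer per container (`true` / `false` / unknown) is the part that matters.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified stand-in for the statistics trait: one entry per container
/// (for example, one per Parquet row group).
trait SetMembershipStatistics {
    /// Number of containers described by these statistics.
    fn num_containers(&self) -> usize;

    /// For each container, report what is known about `column` and `values`:
    /// `Some(true)`  : every value in the column is one of `values`
    /// `Some(false)` : no value in the column is one of `values`
    /// `None`        : neither can be proven
    ///
    /// An outer `None` means there is no information for this column at all.
    fn contained(&self, column: &str, values: &HashSet<String>) -> Option<Vec<Option<bool>>>;
}

/// Hypothetical statistics that know the exact distinct values per container.
struct ExactValues {
    /// column name -> per-container distinct values (`None` = unknown)
    columns: HashMap<String, Vec<Option<HashSet<String>>>>,
    num_containers: usize,
}

impl SetMembershipStatistics for ExactValues {
    fn num_containers(&self) -> usize {
        self.num_containers
    }

    fn contained(&self, column: &str, values: &HashSet<String>) -> Option<Vec<Option<bool>>> {
        let per_container = self.columns.get(column)?;
        Some(
            per_container
                .iter()
                .map(|known| {
                    // Unknown distinct values: nothing can be proven.
                    let known = known.as_ref()?;
                    if known.is_subset(values) {
                        Some(true)
                    } else if known.is_disjoint(values) {
                        Some(false)
                    } else {
                        None
                    }
                })
                .collect(),
        )
    }
}

/// Small helper to build a set of owned strings.
fn string_set(items: &[&str]) -> HashSet<String> {
    items.iter().map(|s| s.to_string()).collect()
}
```

Taking a set rather than a single value lets one call answer an `IN` list such as `s1 IN ('foo', 'bar')`, which a sequence of single-value calls could not do for the "only contains these values" case.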

alamb · 2023-12-06T17:12:17Z

datafusion/core/src/physical_optimizer/pruning.rs

@@ -276,21 +383,21 @@ fn is_always_true(expr: &Arc<dyn PhysicalExpr>) -> bool {
 /// Handles creating references to the min/max statistics
 /// for columns as well as recording which statistics are needed
 #[derive(Debug, Default, Clone)]
-pub(crate) struct RequiredStatColumns {
+pub(crate) struct RequiredColumns {


I renamed this to be more specific and since it is crate private it is not a breaking API change

One thing I did try was encoding the columns needed for literal guarantees in this structure, but I found the code was very specific to min/max/count statistics

Yeah, when I worked on statistics where I only needed min and max, I did not see the need to use the available struct that includes a lot more info

alamb · 2023-12-06T17:12:55Z

datafusion/core/src/physical_optimizer/pruning.rs

-                field.data_type().clone(),
-                field.is_nullable(),
-            );
+            // may be null if statistics are not present


A non-nullable column may appear as NULL in the min/max statistic values if the min or max values are not known, even if the original column can not contain null
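A small sketch of why the statistics fields need to be nullable, assuming plain `Option<i64>` values in place of Arrow arrays (the function name `keep_for_gt` is made up for illustration). A missing max is treated as "unknown", so the container has to be kept:

```rust
/// Decide which containers to keep for the predicate `col > threshold`,
/// given the per-container max statistic of a non-nullable `i64` column.
/// `None` means the statistic is missing, not that the column holds NULLs.
fn keep_for_gt(max_values: &[Option<i64>], threshold: i64) -> Vec<bool> {
    max_values
        .iter()
        .map(|max| match max {
            // A row can only match if the largest value exceeds the threshold.
            Some(max) => *max > threshold,
            // Unknown statistic: nothing can be proven, so keep the container.
            None => true,
        })
        .collect()
}
```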

alamb · 2023-12-06T17:14:27Z

datafusion/core/src/physical_optimizer/pruning.rs

-            &schema,
-            &mut RequiredStatColumns::new(),
-        );
+        let predicate_expr =


This is just reformatting caused by the shorter name (RequiredColumns instead of RequiredStatColumns)

alamb · 2023-12-06T17:15:03Z

datafusion/core/src/physical_optimizer/pruning.rs

@@ -2484,10 +2614,376 @@ mod tests {
        // TODO: add other negative test for other case and op
    }

+    #[test]
+    fn prune_with_contains_one_column() {


I spent quite a while on these tests and I think they are pretty thorough

alamb · 2023-12-19T17:46:57Z

FYI @waynexia @NGA-TRAN here is the next installment of pruning with equality predicates. If you have time to review, I would appreciate it

NGA-TRAN

Nice. I have a question but I think I was confused about the inference

NGA-TRAN · 2023-12-19T21:03:55Z

datafusion/core/src/physical_optimizer/pruning.rs

+                    // column is only in the set of values so we can prune the
+                    // container
+                    Guarantee::NotIn => {
+                        builder.append_array(&arrow::compute::not(&results)?)


The Guarantee In and NotIn are used very nicely here 👍
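For readers following along, the way the two guarantee kinds turn a `contained` answer into a keep/prune decision can be sketched roughly as follows. This is a simplified illustration with `Option<bool>` slices instead of Arrow boolean arrays; the function `keep_containers` is hypothetical, and the enum here only mirrors the `In` / `NotIn` variants visible in the diff.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Guarantee {
    /// The predicate can only be true if the column is one of the literals.
    In,
    /// The predicate can only be true if the column is none of the literals.
    NotIn,
}

/// `contained[i]` is what the statistics report for container `i`:
/// `Some(true)` (only values from the set), `Some(false)` (no values from
/// the set), or `None` (unknown).
/// Returns `true` for containers that must be kept, `false` for pruned ones.
fn keep_containers(guarantee: Guarantee, contained: &[Option<bool>]) -> Vec<bool> {
    contained
        .iter()
        .map(|&c| match (guarantee, c) {
            // Needs `col IN set`, but no value in the container is in the set.
            (Guarantee::In, Some(false)) => false,
            // Needs `col NOT IN set`, but every value in the container is in the set.
            (Guarantee::NotIn, Some(true)) => false,
            // Everything else, including unknown, cannot be pruned.
            _ => true,
        })
        .collect()
}
```

The `NotIn` arm corresponds to the `arrow::compute::not(&results)` call in the hunk above: negating the `contained` result turns "only contains these values" into "prune".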

NGA-TRAN · 2023-12-19T21:11:14Z

datafusion/core/src/physical_optimizer/pruning.rs

+                // conjunct so we can't prune any containers based on that
+            }
+        }
+    }


These append functions are nice. Easy to understand.

NGA-TRAN · 2023-12-19T21:12:34Z

datafusion/core/src/physical_optimizer/pruning.rs

@@ -276,21 +383,21 @@ fn is_always_true(expr: &Arc<dyn PhysicalExpr>) -> bool {
 /// Handles creating references to the min/max statistics
 /// for columns as well as recording which statistics are needed
 #[derive(Debug, Default, Clone)]
-pub(crate) struct RequiredStatColumns {
+pub(crate) struct RequiredColumns {


Yeah, when I worked on statistics where I only needed min and max, I did not see the need to use the available struct that includes a lot more info

NGA-TRAN · 2023-12-19T21:25:03Z

datafusion/core/src/physical_optimizer/pruning.rs

+            &schema,
+            &statistics,
+            // rule out containers ('false) where we know foo is not present
+            vec![true, false, true, true, false, true, true, false, true],


Thanks for the comment

NGA-TRAN · 2023-12-19T21:29:41Z

datafusion/core/src/physical_optimizer/pruning.rs

+            // logically this predicate can't possibly be true (the column can't
+            // take on both values) but we could rule it out if the stats tell
+            // us that both values are not present
+            vec![true, true, true, true, true, true, true, true, true],


"but we could rule it out if the stats tell us that both values are not present"

Is it possible to add a test where a container returns false, to say that both values are not present?

Yes, you can say that by returning false for contained("s1", {foo, bar})

However, in this case I think what happens is we end up with two distinct literal guarantees and the container would have to know that a container only had foo AND only had bar, which is logically impossible.

So in other words, this expression

Pruning with expr: s1 != Utf8("foo") AND s2 != Utf8("bar")

Generates these guarantees:

Got guarantees: [ LiteralGuarantee { column: Column { relation: None, name: "s1" }, guarantee: NotIn, literals: {Utf8("foo")} }, LiteralGuarantee { column: Column { relation: None, name: "s2" }, guarantee: NotIn, literals: {Utf8("bar")} } ]

I think it would be possible to do another round of analysis on this and prove this can never be true. However, I am not sure how important the use case is.

Makes sense. Thanks Andrew

NGA-TRAN · 2023-12-19T21:36:14Z

datafusion/core/src/physical_optimizer/pruning.rs

+            vec![false, false, false, true, true, true, true, true, true],
+        );
+
+        // s1 != foo AND s1 != bar


I am starting to get confused. What is the difference between this and `!(s1 = 'foo' OR s1 = 'bar')`? Their results are not the opposite of each other. I guess there is some inference here that I cannot figure out yet

At least in this case it has to do with what is known / provided. In this case, the logic operates on the two conjuncts separately, so it consults what it knows about `s1` and `foo` and what it knows about `s1` and `bar` separately.

In order to reason about `s1 = 'foo' OR s1 = 'bar'` it needs to use what it knows about `s1` and `{foo, bar}` together rather than about them individually

However, in this case I think what would make sense (and probably what actually happens) is that `!(s1 = 'foo' OR s1 = 'bar')` would be simplified to `s1 != 'foo' AND s1 != 'bar'` at a higher level
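The simplification mentioned here is De Morgan's law. As a quick sanity check (a sketch only, unrelated to how the simplifier is implemented), the two forms can be compared under SQL three-valued logic, with `None` standing in for NULL. All function names below are made up for illustration.

```rust
/// `col = lit` under SQL semantics: NULL input gives NULL (`None`).
fn eq(col: Option<&str>, lit: &str) -> Option<bool> {
    col.map(|v| v == lit)
}

/// Three-valued NOT.
fn not(a: Option<bool>) -> Option<bool> {
    a.map(|v| !v)
}

/// Three-valued OR: true wins, otherwise NULL is contagious.
fn or(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(true), _) | (_, Some(true)) => Some(true),
        (Some(false), Some(false)) => Some(false),
        _ => None,
    }
}

/// Three-valued AND: false wins, otherwise NULL is contagious.
fn and(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

/// `!(s1 = 'foo' OR s1 = 'bar')`
fn negated_or(s1: Option<&str>) -> Option<bool> {
    not(or(eq(s1, "foo"), eq(s1, "bar")))
}

/// `s1 != 'foo' AND s1 != 'bar'`
fn and_of_not_equals(s1: Option<&str>) -> Option<bool> {
    and(not(eq(s1, "foo")), not(eq(s1, "bar")))
}
```

Both functions agree for every input, so the pruning result differs between the two spellings only if the expression reaches the pruning logic before it has been rewritten.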

alamb · 2023-12-19T14:51:53Z

datafusion/core/src/physical_optimizer/pruning.rs

@@ -993,95 +1102,139 @@ mod tests {
    ///
    /// Note All `ArrayRefs` must be the same size.
    struct ContainerStats {
-        min: ArrayRef,
-        max: ArrayRef,
+        min: Option<ArrayRef>,


I modified this fixture to support different combinations of min/max/contained

alamb · 2023-12-19T14:53:41Z

datafusion/core/src/physical_optimizer/pruning.rs

@@ -2484,10 +2617,376 @@ mod tests {
        // TODO: add other negative test for other case and op
    }

+    #[test]


The new tests start here

alamb · 2023-12-20T18:46:40Z

datafusion/core/src/physical_optimizer/pruning.rs

+            // logically this predicate can't possibly be true (the column can't
+            // take on both values) but we could rule it out if the stats tell
+            // us that both values are not present
+            vec![true, true, true, true, true, true, true, true, true],


Yes, you can say that by returning false for contained("s1", {foo, bar})

However, in this case I think what happens is we end up with two distinct literal guarantees and the container would have to know that a container only had foo AND only had bar, which is logically impossible.

So in other words, this expression

Pruning with expr: s1 != Utf8("foo") AND s2 != Utf8("bar")

Generates these guarantees:

Got guarantees: [ LiteralGuarantee { column: Column { relation: None, name: "s1" }, guarantee: NotIn, literals: {Utf8("foo")} }, LiteralGuarantee { column: Column { relation: None, name: "s2" }, guarantee: NotIn, literals: {Utf8("bar")} } ]

I think it would be possible to do another round of analysis on this and prove this can never be true. However, I am not sure how important the use case is.

waynexia · 2023-12-21T03:08:20Z

Sorry for the delay... I plan to review this today or tomorrow

waynexia

I've checked the new contained API and the usage of Guarantee. This PR was smooth to review after having gotten familiar with Guarantee earlier. Thanks for submitting this and splitting the work into small pieces 🚀

waynexia · 2023-12-22T14:25:30Z

datafusion/core/src/physical_optimizer/pruning.rs

+    /// A min/max pruning predicate (rewritten in terms of column min/max
+    /// values, which are supplied by statistics)
    predicate_expr: Arc<dyn PhysicalExpr>,
-    /// The statistics required to evaluate this predicate
-    required_columns: RequiredStatColumns,
-    /// Original physical predicate from which this predicate expr is derived (required for serialization)
+    /// Description of which statistics are required to evaluate `predicate_expr`
+    required_columns: RequiredColumns,
+    /// Original physical predicate from which this predicate expr is derived
+    /// (required for serialization)
    orig_expr: Arc<dyn PhysicalExpr>,
+    /// [`LiteralGuarantee`]s that are used to try and prove a predicate can not
+    /// possibly evaluate to `true`.
+    literal_guarantees: Vec<LiteralGuarantee>,


Those docs are very helpful 👍

waynexia · 2023-12-22T14:34:03Z

datafusion/core/src/physical_optimizer/pruning.rs

+        // Next, try to prove the predicate can't be true for the containers based
+        // on min/max values


I realize min/max has the same confidence as bloom filter and guarantee. Thinking this way might make it easier to verify Guarantee

waynexia · 2023-12-22T14:37:27Z

datafusion/core/src/physical_optimizer/pruning.rs

+    ///
+    /// # Panics
+    /// If `value` is not boolean
+    fn append_value(&mut self, value: ColumnarValue) {


A random thought about the naming: "append" sometimes implies "push and extend", but from the implementation, this method looks closer to "and"-ing (&) the given boolean array with the existing one.

That is a very good point. I have renamed them to combine_array and combine_value in 71c41f2 which I think better explains what they are doing
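To illustrate the "and" behaviour behind the new names, here is a rough sketch of such a builder using `Vec<bool>` and `Option<bool>` instead of Arrow types. The struct and method names follow the ones mentioned in this thread, but the bodies are an assumption written for illustration, not the code from 71c41f2.

```rust
/// Tracks, per container, whether it still has to be read (`true`)
/// or has been proven prunable (`false`).
struct BoolVecBuilder {
    inner: Vec<bool>,
}

impl BoolVecBuilder {
    /// Start with every container marked as "keep".
    fn new(num_containers: usize) -> Self {
        Self {
            inner: vec![true; num_containers],
        }
    }

    /// Combine per-container results into the current state. Only a
    /// definite `Some(false)` flips a container to pruned; `Some(true)`
    /// and unknown (`None`) leave it unchanged.
    fn combine_array(&mut self, array: &[Option<bool>]) {
        assert_eq!(array.len(), self.inner.len());
        for (cur, new) in self.inner.iter_mut().zip(array) {
            if *new == Some(false) {
                *cur = false;
            }
        }
    }

    /// Combine a single result that applies to every container.
    fn combine_value(&mut self, value: Option<bool>) {
        if value == Some(false) {
            self.inner.iter_mut().for_each(|cur| *cur = false);
        }
    }

    /// Final keep/prune decision per container.
    fn build(self) -> Vec<bool> {
        self.inner
    }
}
```

Because a container is only ever flipped from keep to pruned, the order in which the literal guarantees and the min/max predicate are combined does not change the result.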

alamb · 2023-12-22T19:45:15Z

Thank you very much for the review @waynexia and @NGA-TRAN 🙏

alamb changed the title ~~Alamb/contains api~~ Implement contains API in PruningPredicate Dec 6, 2023

github-actions bot added physical-expr Physical Expressions core Core DataFusion crate labels Dec 6, 2023

alamb force-pushed the alamb/contains_api branch from 18c042c to e935241 Compare December 6, 2023 17:07

alamb force-pushed the alamb/contains_api branch from e935241 to 7c29979 Compare December 6, 2023 17:16

This was referenced Dec 6, 2023

Minor: reduce code duplication in PruningPredicate test #8441

Merged

Rewrite bloom filters to use contains API #8442

Merged

Support general pruning based on <col> = 'const' in PruningPredicate #8376

Closed

alamb force-pushed the alamb/contains_api branch from 7c29979 to 890c68b Compare December 18, 2023 21:47

github-actions bot removed the physical-expr Physical Expressions label Dec 18, 2023

alamb changed the title ~~Implement contains API in PruningPredicate~~ Implement contained API in PruningPredicate Dec 18, 2023

alamb force-pushed the alamb/contains_api branch from e05a18c to 5e166c9 Compare December 19, 2023 15:32

Implement contains API in PruningPredicate

66e212c

alamb force-pushed the alamb/contains_api branch from 5e166c9 to 66e212c Compare December 19, 2023 15:56

alamb marked this pull request as ready for review December 19, 2023 17:46

NGA-TRAN approved these changes Dec 19, 2023

alamb and others added 4 commits December 20, 2023 13:34

Merge remote-tracking branch 'apache/main' into alamb/contains_api

c37ff9e

Apply suggestions from code review

324cb10

Co-authored-by: Nga Tran <[email protected]>

Merge branch 'alamb/contains_api' of github.com:alamb/arrow-datafusio…

537bd07

…n into alamb/contains_api

Add comment to len(), fix fmt

3775f0f

waynexia approved these changes Dec 22, 2023

alamb added 2 commits December 22, 2023 14:43

rename BoolVecBuilder::append* to BoolVecBuilder::combine*

71c41f2

Merge remote-tracking branch 'apache/main' into alamb/contains_api

3552f8d

alamb mentioned this pull request Dec 22, 2023

Config the length of list when using In_list on parquet, rather than a const of 20. #8609

Open

alamb merged commit 8524d58 into apache:main Dec 23, 2023
22 checks passed

alamb deleted the alamb/contains_api branch December 23, 2023 12:11

yahoNanJing mentioned this pull request Dec 28, 2023

Implement the contained method of RowGroupPruningStatistics introduce by #8440 #8668

Open

matthewgapp mentioned this pull request Jan 11, 2024

matt/feat/recursive ctes/config flag matthewgapp/arrow-datafusion#3

Closed
