opt: index accelerate array overlap operator `&&` #75477

mgartner · 2022-01-25T00:43:25Z

The array overlap operator, &&, provides a convenient syntax for filtering for arrays that have one more more elements in common with another array. However, the && operator is not index accelerated, so an efficient inverted index scan in not planned. The current workaround is to rewrite the query by combining the containment operator with OR expressions:

CREATE TABLE movies (
  id INT PRIMARY KEY,
  name STRING,
  tags STRING [],
  INVERTED INDEX (tags)
);

-- With the overlap operator, the query is not index accerlated:
EXPLAIN SELECT * FROM movies WHERE tags && ARRAY['sci-fi', 'fantasy'];
                     info
-----------------------------------------------
  distribution: full
  vectorized: true

  • filter
  │ filter: tags && ARRAY['sci-fi','fantasy']
  │
  └── • scan
        missing stats
        table: movies@primary
        spans: FULL SCAN
(10 rows)

-- This alternative query returns the same results and is index accelerated:
EXPLAIN SELECT * FROM movies WHERE tags @> ARRAY['sci-fi'] OR tags @> ARRAY['fantasy'];
                                         info
--------------------------------------------------------------------------------------
  distribution: full
  vectorized: true

  • index join
  │ estimated row count: 0
  │ table: movies@primary
  │
  └── • inverted filter
      │ inverted column: tags_inverted_key
      │ num spans: 2
      │
      └── • scan
            estimated row count: 0 (11% of the table; stats collected 2 minutes ago)
            table: movies@movies_tags_idx
            spans: 2 spans
(15 rows)

Jira issue: CRDB-12686

The text was updated successfully, but these errors were encountered:

RajivTS · 2022-01-25T21:59:16Z

Hi, I would like to take this up but I am fairly new to CockroachDB and would need some pointers to get started. Let me know if that's ok :)

mgartner · 2022-01-26T22:35:19Z

Hi @RajivTS! You're absolutely welcome to give this one a shot. Now that I think about it, I may have incorrectly labelled this a "good first issue" - it's a bit challenging for a newcomer. But not impossible, so don't let me deter you!

This issue requires recognizing the && operator (a memo.OverlapsExpr within the optimizer) in a filter and generating a semantically equivalent inverted.Expression that can be used to perform an inverted index scan. A good starting place would be to look at #61219 where the <@ (contained-by operator) was index accelerated. You'll be extending the same code and adding similar tests. I'm happy to answer any specific questions that come up.

RajivTS · 2022-01-27T22:34:34Z

Hi @mgartner!

Thank you for the high level overview and the reference PR. Based on what I have seen so far, I have the following understanding of the changes needed to address this issue. I might be way off the mark here so please do correct me wherever I am wrong.

The set of changes needed (except tests) are:

Add a case for tree.Overlaps in the Operator.Symbol switch in NewJSONOrArrayDatumsToInvertedExpr function.
Add a case for *memo.OverlapsExpr in extractInvertedFilterConditionFromLeaf which invokes the newly created extractJSONOrArrayOverlapsCondition. The implementation would remain similar to the contains/containedBy counterpart with respect to the selection of indexed column and constant value from the left and right operands.
Once the operands are decided and the constant datum is extracted, the new getInvertedExprForJSONOrArrayIndexForOverlaps function is invoked which returns the final inverted expression. This function depends on the equivalent encoding function.
Implement an encoding (e.g. EncodeOverlapsInvertedIndexSpans) for the && operator in index_encoding.go that returns an inverted.Expression which is a span representing the union of all paths through the array (overlaps == even one element in common works).
The logic in EncodeOverlapsInvertedIndexSpans would be very similar (for the most part) to EncodeContainedArrayInvertedIndexSpans since both of them are interested in union.
The two notable differences between the encoding for overlaps and containedBy are pertaining to empty array and tightness of the resultant span expression.
The encoding for containedBy always returns an empty array span expression since it is contained by everything. But for overlaps to evaluate to true, (I think) there must be a non-empty overlap so we don't need an empty array span expression.
In contrast to the containedBy encoding, the final inverted expression generated for overlaps should be tight given that we don't need to further filter out the results returned by this scan.

I also have a few open questions (and will probably have more as I progress):

Will the && operator always be used in the structure shown above i.e. array column on left and const array on right? Or can we have indexed columns on both sides of the operator? If yes, then would that make a difference? Do we need to handle it separately?
Specifically if there is no constant array on either side of the operator and we instead have a JSON fetch operator, should it be handled similar to the contains/containedBy case?
What should be the behavior of overlaps with null values, i.e. should SELECT ARRAY['A', Null] && ARRAY['B', Null] return true?

mgartner · 2022-01-28T00:09:29Z

Implement an encoding (e.g. EncodeOverlapsInvertedIndexSpans) for the && operator in index_encoding.go that returns an inverted.Expression which is a span representing the union of all paths through the array (overlaps == even one element in common works).

The logic in EncodeOverlapsInvertedIndexSpans would be very similar (for the most part) to EncodeContainedArrayInvertedIndexSpans since both of them are interested in union.

The two notable differences between the encoding for overlaps and containedBy are pertaining to empty array and tightness of the resultant span expression.

In contrast to the containedBy encoding, the final inverted expression generated for overlaps should be tight given that we don't need to further filter out the results returned by this scan.

This sounds right. Beware of NULL semantics - what is the behavior when there are NULLs in either or both arrays on each side of the ?? operator. NULLs always act in surprising ways 😈 . Here are some examples of the behavior in Postgres:

marcus=# SELECT ARRAY[1]::INT[] && ARRAY[1]::INT[];
 ?column?
----------
 t
(1 row)

marcus=# SELECT ARRAY[1]::INT[] && ARRAY[1, NULL]::INT[];
 ?column?
----------
 t
(1 row)

marcus=# SELECT ARRAY[1, NULL]::INT[] && ARRAY[1]::INT[];
 ?column?
----------
 t
(1 row)

marcus=# SELECT ARRAY[1, NULL]::INT[] && ARRAY[NULL]::INT[];
 ?column?
----------
 f
(1 row)

marcus=# SELECT ARRAY[NULL]::INT[] && ARRAY[1, NULL]::INT[];
 ?column?
----------
 f
(1 row)

marcus=# SELECT ARRAY[NULL]::INT[] && ARRAY[NULL]::INT[];
 ?column?
----------
 f
(1 row)

The logic might be as simple as this block of code:

cockroach/pkg/sql/rowenc/index_encoding.go

Lines 797 to 802 in 98bd5a8

    
           for _, key := range keys { 
        
           	spanExpr := inverted.ExprForSpan( 
        
           		inverted.MakeSingleValSpan(key), false, /* tight */ 
        
           	) 
        
           	invertedExpr = inverted.Or(invertedExpr, spanExpr) 
        
           }

Will the && operator always be used in the structure shown above i.e. array column on left and const array on right? Or can we have indexed columns on both sides of the operator? If yes, then would that make a difference? Do we need to handle it separately?

Great question! We already have a normalization rule that commutes binary operators so that variables are always on the left, so that other rules and logic doesn't need to handle both cases. && is commutative so we should add Overlaps here:

cockroach/pkg/sql/opt/norm/rules/scalar.opt

Line 9 in a0e6f39

You can extend this test to make sure it works properly:

cockroach/pkg/sql/opt/norm/testdata/rules/scalar

Lines 27 to 39 in a812e22

    
           norm expect=CommuteVar 
        
           SELECT 
        
               (1+i) = k AS r, 
        
               (2-k) <> i AS s, 
        
               (i+1) IS NOT DISTINCT FROM k AS t, 
        
               (i-1) IS DISTINCT FROM k AS u, 
        
               (i*2) + k AS v, 
        
               (i+2) * k AS w, 
        
               (i^2) & k AS x, 
        
               (i^2) | k AS y, 
        
               (i*i) # k AS z 
        
           FROM a

Feel free to submit that as a separate commit or even a separate PR. It's a simpler, self-contained change and it might be nice to make incremental progress towards the larger goal.

Specifically if there is no constant array on either side of the operator and we instead have a JSON fetch operator, should it be handled similar to the contains/containedBy case?

I'm not sure I understand the question, sorry. The && operator only works for ARRAY types, not JSON, so I don't think you'll need to worry about anything JSON related.

In Postgres:

marcus=# SELECT '[1, 2]'::JSON && '[2, 3]'::JSON;
ERROR:  42883: operator does not exist: json && json
LINE 1: SELECT '[1, 2]'::JSON && '[2, 3]'::JSON;
                              ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

What should be the behavior of overlaps with null values, i.e. should SELECT ARRAY['A', Null] && ARRAY['B', Null] return true?

Another great question! You anticipate my warning about NULLs above 😄 . When in doubt, it's good to check the behavior of Postgres and of the latest version of CRDB to figure out the desired behavior (queries with && already work in CRDB, they just aren't index accelerated). Be sure to add tests with various combinations of NULLs for index accelerated queries so we have confidence that they behave correctly.

RajivTS · 2022-01-31T23:33:50Z

Thank you @mgartner for the clarifications! I do have a few remaining questions before I can complete my changes.

We already have a normalization rule that commutes binary operators so that variables are always on the left, so that other rules and logic doesn't need to handle both cases

So does this mean the below structure (from contains implementation) for deciding the index column and the constant value is not required in case of overlaps (i.e. left = index and right = const always)?

cockroach/pkg/sql/opt/invertedidx/json_array.go

Lines 374 to 384 in 30ee32c

    
           if isIndexColumn(j.tabID, j.index, left, j.computedColumns) && memo.CanExtractConstDatum(right) { 
        
           	// When the first argument is a variable or expression corresponding to the 
        
           	// index column and the second argument is a constant, we get the 
        
           	// InvertedExpression for left @> right or left <@ right. 
        
           	indexColumn, constantVal = left, right 
        
           } else if isIndexColumn(j.tabID, j.index, right, j.computedColumns) && memo.CanExtractConstDatum(left) { 
        
           	// When the second argument is a variable or expression corresponding to 
        
           	// the index column and the first argument is a constant, we get the 
        
           	// equivalent InvertedExpression for right <@ left or right @> left. 
        
           	indexColumn, constantVal = right, left 
        
           	containedBy = !containedBy

You can extend this test to make sure it works properly:

cockroach/pkg/sql/opt/norm/testdata/rules/scalar

Lines 27 to 39 in a812e22
norm expect=CommuteVar
SELECT
(1+i) = k AS r,
(2-k) <> i AS s,
(i+1) IS NOT DISTINCT FROM k AS t,
(i-1) IS DISTINCT FROM k AS u,
(i2) + k AS v,
(i+2) * k AS w,
(i^2) & k AS x,
(i^2) | k AS y,
(ii) # k AS z
FROM a

I am sorry but I don't quite understand how the above tests need to be modified to validate the commutativity property of the && operator. Would the following be a step in the right direction?

SELECT
    (ARRAY[k]::INT[] && arr) AS ov,
    (arr && ARRAY[k]::INT[]) IS NOT DISTINCT FROM ov AS result
FROM a

I am also having difficulty in following the project output that is added below each test which shows the columns, scans, projections etc.

cockroach/pkg/sql/opt/norm/testdata/rules/scalar

Lines 213 to 218 in 3e0da46

    
           project 
        
            ├── columns: i:8 arr:9 s:10 s:11 i:12 s:13 array:14 array:15 
        
            ├── immutable 
        
            ├── scan a 
        
            │    └── columns: a.i:2 a.s:4 a.arr:5 
        
            └── projections

Do I need to hardcode these? Or will these be generated via EXPLAIN command? Given that they would be even present in other tests that I write, is there any documentation that I can follow on how to generate these?

mgartner · 2022-02-01T00:40:36Z

So does this mean the below structure (from contains implementation) for deciding the index column and the constant value is not required in case of overlaps (i.e. left = index and right = const always)?

It shouldn't be after you update the CommuteVar rule to include the Overlaps operator. It currently only commutes variables to the left and constants to the right for the operators listed here. Sorry, I don't think my previous message made it clear that you'll need to update the CommuteVar rule. Updating this rule and the associated test was what I was suggesting could be its own commit/PR.

Also, I forgot to mention any pointers about these .opt files in my previous post. These are optgen files. Optgen is a language we created for writing optimizer transformation rules that compile to Go code. You can read more about opgten here. There's more information on normalization and exploration rules here.

Do I need to hardcode these? Or will these be generated via EXPLAIN command? Given that they would be even present in other tests that I write, is there any documentation that I can follow on how to generate these?

These are data driven tests with a command, query, and expected output. I've annotated a test to explain a bit more:

norm                        <----- "norm" is the command telling the test runner to print out the normalized query tree.
SELECT COALESCE(i) FROM a   <----- The query to normalize.
----
project                     <---
 ├── columns: coalesce:8       |
 ├── scan a                    |-- The expected expression tree. Can be auto-regenerated.
 │    └── columns: i:2         |
 └── projections               |
      └── i:2               <---

The supported optimizer data-driven test commands and options are documented here.

The output of data driven tests can be regenerated with the -rewrite flag. For example, to regenerate the test output in pkg/sql/opt/norm/testdata/scalar you can run:

make test PKG=pkg/sql/opt/norm TESTS='TestNormRules/scalar' TESTFLAGS='-rewrite'

If you want to rewrite all test files in the norm package you could do:

make test PKG=pkg/sql/opt/norm TESTFLAGS='-rewrite'

You can do the same for any package with datadriven tests.

I am sorry but I don't quite understand how the above tests need to be modified to validate the commutativity property of the && operator. Would the following be a step in the right direction?

The test won't validate that the operator is commutative, but it will validate that your changes to the CommuteVar rule work as expected. You can add something like ARRAY[1, 2] && arr AS o to the SELECTed columns. The output should show that arr and ARRAY[1,2] have been flipped in the projected expression.

RajivTS · 2022-02-02T00:33:24Z

Thanks a bunch @mgartner! I will send out a draft PR once I have the first working version of the changes ready.

RajivTS · 2022-02-08T00:11:27Z

Hi @mgartner

I have created the first draft version of the PR addressing this issue. Note that this does not contain the changes for Overlaps operator commutativity. That will be a separate PR.

…dexes Previously, we did not support index acceleration for queries involving array overlaps (&&). This change adds support for using the inverted index with && expressions on Array columns. When there is an inverted index available, a scan will be done on the Array column using the spans found from the constant value. Fixes: cockroachdb#75477 Release note (performance improvement): Expressions using the overlaps (&&) operator for arrays now support index-acceleration for faster execution in some cases. Release justification: low risk, high benefit changes to existing functionality.

mgartner added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) good first issue E-quick-win Likely to be a quick win for someone experienced. labels Jan 25, 2022

blathers-crl bot added the T-sql-queries SQL Queries Team label Jan 25, 2022

mgartner mentioned this issue Jan 25, 2022

opt: inverted index improvements #59331

Open

31 tasks

RajivTS mentioned this issue Feb 8, 2022

opt: index accelerate && (overlaps) expressions for array inverted indexes #76212

Closed

This was referenced Mar 1, 2022

opt: index accelerate && (overlaps) expressions for array inverted indexes #77200

Closed

opt: index accelerate && (overlaps) expressions for array inverted indexes #77418

Merged

craig bot closed this as completed in 6d9aee6 Apr 9, 2022

jlinder added the sync-me-3 label May 24, 2022

mgartner added this to SQL Queries Jul 24, 2023

mgartner moved this to Done in SQL Queries Jul 24, 2023

jctrouble mentioned this issue Aug 20, 2024

Need to update to a newer CRDB version to support S2Cell index queries interuss/dss#1070

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

opt: index accelerate array overlap operator `&&` #75477

opt: index accelerate array overlap operator `&&` #75477

mgartner commented Jan 25, 2022 •

edited by cockroach-jira-scripts

Loading

RajivTS commented Jan 25, 2022

mgartner commented Jan 26, 2022

RajivTS commented Jan 27, 2022 •

edited

Loading

mgartner commented Jan 28, 2022

RajivTS commented Jan 31, 2022

mgartner commented Feb 1, 2022

RajivTS commented Feb 2, 2022

RajivTS commented Feb 8, 2022

opt: index accelerate array overlap operator && #75477

opt: index accelerate array overlap operator && #75477

Comments

mgartner commented Jan 25, 2022 • edited by cockroach-jira-scripts Loading

RajivTS commented Jan 25, 2022

mgartner commented Jan 26, 2022

RajivTS commented Jan 27, 2022 • edited Loading

mgartner commented Jan 28, 2022

RajivTS commented Jan 31, 2022

mgartner commented Feb 1, 2022

RajivTS commented Feb 2, 2022

RajivTS commented Feb 8, 2022

opt: index accelerate array overlap operator `&&` #75477

opt: index accelerate array overlap operator `&&` #75477

mgartner commented Jan 25, 2022 •

edited by cockroach-jira-scripts

Loading

RajivTS commented Jan 27, 2022 •

edited

Loading