Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

opt: index accelerate array overlap operator && #75477

Closed
Tracked by #59331
mgartner opened this issue Jan 25, 2022 · 8 comments · Fixed by #77418
Closed
Tracked by #59331

opt: index accelerate array overlap operator && #75477

mgartner opened this issue Jan 25, 2022 · 8 comments · Fixed by #77418
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) E-quick-win Likely to be a quick win for someone experienced. good first issue T-sql-queries SQL Queries Team

Comments

@mgartner
Copy link
Collaborator

mgartner commented Jan 25, 2022

The array overlap operator, &&, provides a convenient syntax for filtering for arrays that have one more more elements in common with another array. However, the && operator is not index accelerated, so an efficient inverted index scan in not planned. The current workaround is to rewrite the query by combining the containment operator with OR expressions:

CREATE TABLE movies (
  id INT PRIMARY KEY,
  name STRING,
  tags STRING [],
  INVERTED INDEX (tags)
);

-- With the overlap operator, the query is not index accerlated:
EXPLAIN SELECT * FROM movies WHERE tags && ARRAY['sci-fi', 'fantasy'];
                     info
-----------------------------------------------
  distribution: full
  vectorized: true

  • filter
  │ filter: tags && ARRAY['sci-fi','fantasy']
  │
  └── • scan
        missing stats
        table: movies@primary
        spans: FULL SCAN
(10 rows)

-- This alternative query returns the same results and is index accelerated:
EXPLAIN SELECT * FROM movies WHERE tags @> ARRAY['sci-fi'] OR tags @> ARRAY['fantasy'];
                                         info
--------------------------------------------------------------------------------------
  distribution: full
  vectorized: true

  • index join
  │ estimated row count: 0
  │ table: movies@primary
  │
  └── • inverted filter
      │ inverted column: tags_inverted_key
      │ num spans: 2
      │
      └── • scan
            estimated row count: 0 (11% of the table; stats collected 2 minutes ago)
            table: movies@movies_tags_idx
            spans: 2 spans
(15 rows)

Jira issue: CRDB-12686

@mgartner mgartner added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) good first issue E-quick-win Likely to be a quick win for someone experienced. labels Jan 25, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Jan 25, 2022
@RajivTS
Copy link
Contributor

RajivTS commented Jan 25, 2022

Hi, I would like to take this up but I am fairly new to CockroachDB and would need some pointers to get started. Let me know if that's ok :)

@mgartner
Copy link
Collaborator Author

Hi @RajivTS! You're absolutely welcome to give this one a shot. Now that I think about it, I may have incorrectly labelled this a "good first issue" - it's a bit challenging for a newcomer. But not impossible, so don't let me deter you!

This issue requires recognizing the && operator (a memo.OverlapsExpr within the optimizer) in a filter and generating a semantically equivalent inverted.Expression that can be used to perform an inverted index scan. A good starting place would be to look at #61219 where the <@ (contained-by operator) was index accelerated. You'll be extending the same code and adding similar tests. I'm happy to answer any specific questions that come up.

@RajivTS
Copy link
Contributor

RajivTS commented Jan 27, 2022

Hi @mgartner!

Thank you for the high level overview and the reference PR. Based on what I have seen so far, I have the following understanding of the changes needed to address this issue. I might be way off the mark here so please do correct me wherever I am wrong.

The set of changes needed (except tests) are:

  • Add a case for tree.Overlaps in the Operator.Symbol switch in NewJSONOrArrayDatumsToInvertedExpr function.
  • Add a case for *memo.OverlapsExpr in extractInvertedFilterConditionFromLeaf which invokes the newly created extractJSONOrArrayOverlapsCondition. The implementation would remain similar to the contains/containedBy counterpart with respect to the selection of indexed column and constant value from the left and right operands.
  • Once the operands are decided and the constant datum is extracted, the new getInvertedExprForJSONOrArrayIndexForOverlaps function is invoked which returns the final inverted expression. This function depends on the equivalent encoding function.
  • Implement an encoding (e.g. EncodeOverlapsInvertedIndexSpans) for the && operator in index_encoding.go that returns an inverted.Expression which is a span representing the union of all paths through the array (overlaps == even one element in common works).
  • The logic in EncodeOverlapsInvertedIndexSpans would be very similar (for the most part) to EncodeContainedArrayInvertedIndexSpans since both of them are interested in union.
  • The two notable differences between the encoding for overlaps and containedBy are pertaining to empty array and tightness of the resultant span expression.
  • The encoding for containedBy always returns an empty array span expression since it is contained by everything. But for overlaps to evaluate to true, (I think) there must be a non-empty overlap so we don't need an empty array span expression.
  • In contrast to the containedBy encoding, the final inverted expression generated for overlaps should be tight given that we don't need to further filter out the results returned by this scan.

I also have a few open questions (and will probably have more as I progress):

  • Will the && operator always be used in the structure shown above i.e. array column on left and const array on right? Or can we have indexed columns on both sides of the operator? If yes, then would that make a difference? Do we need to handle it separately?
  • Specifically if there is no constant array on either side of the operator and we instead have a JSON fetch operator, should it be handled similar to the contains/containedBy case?
  • What should be the behavior of overlaps with null values, i.e. should SELECT ARRAY['A', Null] && ARRAY['B', Null] return true?

@mgartner
Copy link
Collaborator Author

  • Implement an encoding (e.g. EncodeOverlapsInvertedIndexSpans) for the && operator in index_encoding.go that returns an inverted.Expression which is a span representing the union of all paths through the array (overlaps == even one element in common works).
  • The logic in EncodeOverlapsInvertedIndexSpans would be very similar (for the most part) to EncodeContainedArrayInvertedIndexSpans since both of them are interested in union.
  • The two notable differences between the encoding for overlaps and containedBy are pertaining to empty array and tightness of the resultant span expression.
  • In contrast to the containedBy encoding, the final inverted expression generated for overlaps should be tight given that we don't need to further filter out the results returned by this scan.

This sounds right. Beware of NULL semantics - what is the behavior when there are NULLs in either or both arrays on each side of the ?? operator. NULLs always act in surprising ways 😈 . Here are some examples of the behavior in Postgres:

marcus=# SELECT ARRAY[1]::INT[] && ARRAY[1]::INT[];
 ?column?
----------
 t
(1 row)

marcus=# SELECT ARRAY[1]::INT[] && ARRAY[1, NULL]::INT[];
 ?column?
----------
 t
(1 row)

marcus=# SELECT ARRAY[1, NULL]::INT[] && ARRAY[1]::INT[];
 ?column?
----------
 t
(1 row)

marcus=# SELECT ARRAY[1, NULL]::INT[] && ARRAY[NULL]::INT[];
 ?column?
----------
 f
(1 row)

marcus=# SELECT ARRAY[NULL]::INT[] && ARRAY[1, NULL]::INT[];
 ?column?
----------
 f
(1 row)

marcus=# SELECT ARRAY[NULL]::INT[] && ARRAY[NULL]::INT[];
 ?column?
----------
 f
(1 row)

The logic might be as simple as this block of code:

for _, key := range keys {
spanExpr := inverted.ExprForSpan(
inverted.MakeSingleValSpan(key), false, /* tight */
)
invertedExpr = inverted.Or(invertedExpr, spanExpr)
}

  • Will the && operator always be used in the structure shown above i.e. array column on left and const array on right? Or can we have indexed columns on both sides of the operator? If yes, then would that make a difference? Do we need to handle it separately?

Great question! We already have a normalization rule that commutes binary operators so that variables are always on the left, so that other rules and logic doesn't need to handle both cases. && is commutative so we should add Overlaps here:

(Eq | Ne | Is | IsNot | Plus | Mult | Bitand | Bitor | Bitxor

You can extend this test to make sure it works properly:

norm expect=CommuteVar
SELECT
(1+i) = k AS r,
(2-k) <> i AS s,
(i+1) IS NOT DISTINCT FROM k AS t,
(i-1) IS DISTINCT FROM k AS u,
(i*2) + k AS v,
(i+2) * k AS w,
(i^2) & k AS x,
(i^2) | k AS y,
(i*i) # k AS z
FROM a

Feel free to submit that as a separate commit or even a separate PR. It's a simpler, self-contained change and it might be nice to make incremental progress towards the larger goal.

  • Specifically if there is no constant array on either side of the operator and we instead have a JSON fetch operator, should it be handled similar to the contains/containedBy case?

I'm not sure I understand the question, sorry. The && operator only works for ARRAY types, not JSON, so I don't think you'll need to worry about anything JSON related.

In Postgres:

marcus=# SELECT '[1, 2]'::JSON && '[2, 3]'::JSON;
ERROR:  42883: operator does not exist: json && json
LINE 1: SELECT '[1, 2]'::JSON && '[2, 3]'::JSON;
                              ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.
  • What should be the behavior of overlaps with null values, i.e. should SELECT ARRAY['A', Null] && ARRAY['B', Null] return true?

Another great question! You anticipate my warning about NULLs above 😄 . When in doubt, it's good to check the behavior of Postgres and of the latest version of CRDB to figure out the desired behavior (queries with && already work in CRDB, they just aren't index accelerated). Be sure to add tests with various combinations of NULLs for index accelerated queries so we have confidence that they behave correctly.

@RajivTS
Copy link
Contributor

RajivTS commented Jan 31, 2022

Thank you @mgartner for the clarifications! I do have a few remaining questions before I can complete my changes.

We already have a normalization rule that commutes binary operators so that variables are always on the left, so that other rules and logic doesn't need to handle both cases

So does this mean the below structure (from contains implementation) for deciding the index column and the constant value is not required in case of overlaps (i.e. left = index and right = const always)?

if isIndexColumn(j.tabID, j.index, left, j.computedColumns) && memo.CanExtractConstDatum(right) {
// When the first argument is a variable or expression corresponding to the
// index column and the second argument is a constant, we get the
// InvertedExpression for left @> right or left <@ right.
indexColumn, constantVal = left, right
} else if isIndexColumn(j.tabID, j.index, right, j.computedColumns) && memo.CanExtractConstDatum(left) {
// When the second argument is a variable or expression corresponding to
// the index column and the first argument is a constant, we get the
// equivalent InvertedExpression for right <@ left or right @> left.
indexColumn, constantVal = right, left
containedBy = !containedBy

You can extend this test to make sure it works properly:

cockroach/pkg/sql/opt/norm/testdata/rules/scalar

Lines 27 to 39 in a812e22
norm expect=CommuteVar
SELECT
(1+i) = k AS r,
(2-k) <> i AS s,
(i+1) IS NOT DISTINCT FROM k AS t,
(i-1) IS DISTINCT FROM k AS u,
(i2) + k AS v,
(i+2) * k AS w,
(i^2) & k AS x,
(i^2) | k AS y,
(i
i) # k AS z
FROM a

I am sorry but I don't quite understand how the above tests need to be modified to validate the commutativity property of the && operator. Would the following be a step in the right direction?

SELECT
    (ARRAY[k]::INT[] && arr) AS ov,
    (arr && ARRAY[k]::INT[]) IS NOT DISTINCT FROM ov AS result
FROM a

I am also having difficulty in following the project output that is added below each test which shows the columns, scans, projections etc.

project
├── columns: i:8 arr:9 s:10 s:11 i:12 s:13 array:14 array:15
├── immutable
├── scan a
│ └── columns: a.i:2 a.s:4 a.arr:5
└── projections

Do I need to hardcode these? Or will these be generated via EXPLAIN command? Given that they would be even present in other tests that I write, is there any documentation that I can follow on how to generate these?

@mgartner
Copy link
Collaborator Author

mgartner commented Feb 1, 2022

So does this mean the below structure (from contains implementation) for deciding the index column and the constant value is not required in case of overlaps (i.e. left = index and right = const always)?

It shouldn't be after you update the CommuteVar rule to include the Overlaps operator. It currently only commutes variables to the left and constants to the right for the operators listed here. Sorry, I don't think my previous message made it clear that you'll need to update the CommuteVar rule. Updating this rule and the associated test was what I was suggesting could be its own commit/PR.

Also, I forgot to mention any pointers about these .opt files in my previous post. These are optgen files. Optgen is a language we created for writing optimizer transformation rules that compile to Go code. You can read more about opgten here. There's more information on normalization and exploration rules here.

Do I need to hardcode these? Or will these be generated via EXPLAIN command? Given that they would be even present in other tests that I write, is there any documentation that I can follow on how to generate these?

These are data driven tests with a command, query, and expected output. I've annotated a test to explain a bit more:

norm                        <----- "norm" is the command telling the test runner to print out the normalized query tree.
SELECT COALESCE(i) FROM a   <----- The query to normalize.
----
project                     <---
 ├── columns: coalesce:8       |
 ├── scan a                    |-- The expected expression tree. Can be auto-regenerated.
 │    └── columns: i:2         |
 └── projections               |
      └── i:2               <---

The supported optimizer data-driven test commands and options are documented here.

The output of data driven tests can be regenerated with the -rewrite flag. For example, to regenerate the test output in pkg/sql/opt/norm/testdata/scalar you can run:

make test PKG=pkg/sql/opt/norm TESTS='TestNormRules/scalar' TESTFLAGS='-rewrite'

If you want to rewrite all test files in the norm package you could do:

make test PKG=pkg/sql/opt/norm TESTFLAGS='-rewrite'

You can do the same for any package with datadriven tests.

I am sorry but I don't quite understand how the above tests need to be modified to validate the commutativity property of the && operator. Would the following be a step in the right direction?

The test won't validate that the operator is commutative, but it will validate that your changes to the CommuteVar rule work as expected. You can add something like ARRAY[1, 2] && arr AS o to the SELECTed columns. The output should show that arr and ARRAY[1,2] have been flipped in the projected expression.

@RajivTS
Copy link
Contributor

RajivTS commented Feb 2, 2022

Thanks a bunch @mgartner! I will send out a draft PR once I have the first working version of the changes ready.

@RajivTS
Copy link
Contributor

RajivTS commented Feb 8, 2022

Hi @mgartner

I have created the first draft version of the PR addressing this issue. Note that this does not contain the changes for Overlaps operator commutativity. That will be a separate PR.

RajivTS added a commit to RajivTS/cockroach that referenced this issue Mar 6, 2022
…dexes

Previously, we did not support index acceleration for queries involving
array overlaps (&&).

This change adds support for using the inverted index with && expressions on
Array columns. When there is an inverted index available, a scan will be done on
the Array column using the spans found from the constant value.

Fixes: cockroachdb#75477

Release note (performance improvement): Expressions using the overlaps (&&)
operator for arrays now support index-acceleration for faster execution in
some cases.

Release justification: low risk, high benefit changes to existing
functionality.
RajivTS added a commit to RajivTS/cockroach that referenced this issue Apr 9, 2022
…dexes

Previously, we did not support index acceleration for queries involving
array overlaps (&&).

This change adds support for using the inverted index with && expressions on
Array columns. When there is an inverted index available, a scan will be done on
the Array column using the spans found from the constant value.

Fixes: cockroachdb#75477

Release note (performance improvement): Expressions using the overlaps (&&)
operator for arrays now support index-acceleration for faster execution in
some cases.

Release justification: low risk, high benefit changes to existing
functionality.
RajivTS added a commit to RajivTS/cockroach that referenced this issue Apr 9, 2022
…dexes

Previously, we did not support index acceleration for queries involving
array overlaps (&&).

This change adds support for using the inverted index with && expressions on
Array columns. When there is an inverted index available, a scan will be done on
the Array column using the spans found from the constant value.

Fixes: cockroachdb#75477

Release note (performance improvement): Expressions using the overlaps (&&)
operator for arrays now support index-acceleration for faster execution in
some cases.

Release justification: low risk, high benefit changes to existing
functionality.
@craig craig bot closed this as completed in 6d9aee6 Apr 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) E-quick-win Likely to be a quick win for someone experienced. good first issue T-sql-queries SQL Queries Team
Projects
Archived in project
3 participants