Short circuit AND/OR in ANSI mode #4760

amahussein · 2022-02-11T18:05:12Z

fixes #4526

Compatibility:

For AND if the RHS has side effects we only process it for rows where the LHS is not false (this includes nulls).
For OR if the RHS has side effects we only process it for rows where the LHS is not true (this includes nulls).
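The two rules above can be sketched row by row in plain Python. This is only an illustration of the intended semantics, not the plugin's GPU code: `None` stands in for SQL NULL, and `overflowing_rhs` is a hypothetical right-hand side that raises on row 0 the way an ANSI overflow would.

```python
from typing import Callable, List, Optional

Bool3 = Optional[bool]  # True, False, or None standing in for SQL NULL


def and3(l: Bool3, r: Bool3) -> Bool3:
    """Three-valued AND: False dominates, then NULL."""
    if l is False or r is False:
        return False
    if l is None or r is None:
        return None
    return True


def or3(l: Bool3, r: Bool3) -> Bool3:
    """Three-valued OR: True dominates, then NULL."""
    if l is True or r is True:
        return True
    if l is None or r is None:
        return None
    return False


def short_circuit_and(lhs: List[Bool3], rhs_fn: Callable[[int], Bool3]) -> List[Bool3]:
    # The RHS is evaluated only for rows where the LHS is not False (True or NULL).
    return [False if l is False else and3(l, rhs_fn(i)) for i, l in enumerate(lhs)]


def short_circuit_or(lhs: List[Bool3], rhs_fn: Callable[[int], Bool3]) -> List[Bool3]:
    # The RHS is evaluated only for rows where the LHS is not True (False or NULL).
    return [True if l is True else or3(l, rhs_fn(i)) for i, l in enumerate(lhs)]


def overflowing_rhs(i: int) -> Bool3:
    # Hypothetical RHS with a side effect: row 0 raises, as an ANSI overflow would.
    if i == 0:
        raise ArithmeticError("overflow")
    return True


and_result = short_circuit_and([False, True, None], overflowing_rhs)
or_result = short_circuit_or([True, False, None], overflowing_rhs)
```

In both calls row 0 is decided by the LHS alone, so the raising RHS is never reached for that row.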

Testing:

Added two integration tests in logic_test.py

Code changes:

Override the implementation of columnarEval for both GpuAnd and GpuOr.
The above change requires some refactoring of ConditionalExpressions, since the new code needs some of the methods defined there, such as gather and filterBatch.

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

jlowe · 2022-02-11T18:41:56Z

integration_tests/src/main/python/logic_test.py

+@pytest.mark.parametrize('expr', _get_arithmetic_overflow_expr('OR'))
+def test_or_with_side_effect(expr, ansi_enabled, lhs_predicate):
+    ansi_conf = {'spark.sql.ansi.enabled': ansi_enabled}
+    if ansi_enabled == 'true' and (not(lhs_predicate)):


Suggested change

if ansi_enabled == 'true' and (not(lhs_predicate)):

if ansi_enabled == 'true' and not lhs_predicate:

jlowe · 2022-02-11T18:43:51Z

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/predicates.scala

@@ -66,6 +67,125 @@ case class GpuAnd(left: Expression, right: Expression) extends CudfBinaryOperato

  override def binaryOp: BinaryOp = BinaryOp.NULL_LOGICAL_AND
  override def astOperator: Option[BinaryOperator] = Some(ast.BinaryOperator.NULL_LOGICAL_AND)
+
+  protected def filterBatch(


Rather than copying all this code, I expected GpuAnd to use the existing GpuConditionalExpression trait.

Yes, I absolutely agree.
I was thinking of creating a new trait for helpers related to ColumnVector, since those methods are not really specific to "ConditionalExpression". WDYT?

Maybe, although it starts to get outside the scope of the PR a bit. If we are moving stuff around, filterBatch should be part of GpuFilter which does something almost identical to this method.

As @jlowe said, adding with GpuConditionalExpression will bring the needed methods in and avoid the need to duplicate code. Perhaps a better name for this trait now is GpuConditionalEvaluation?

I could not add GpuConditionalExpression out of the box because of conflicts in the trait hierarchy. So I pulled the generic methods into a new trait that GpuAnd/GpuOr can extend.
Another approach would be to move them to a helper object, but I did not like that because it would cause more scattered changes in conditionalExpressions.scala.

gerashegalov · 2022-02-11T18:52:21Z

integration_tests/src/main/python/logic_test.py

+@pytest.mark.parametrize('ansi_enabled', ['false', 'true'])
+@pytest.mark.parametrize('lhs_predicate', [False, True])
+@pytest.mark.parametrize('expr', _get_arithmetic_overflow_expr('OR'))
+def test_or_with_side_effect(expr, ansi_enabled, lhs_predicate):


It looks like the difference between the tests is rather small, and you could combine them by parametrizing over 'AND' and 'OR'. Then you would not need _get_arithmetic_overflow_expr to be global.

Done!
Initially, when I looked at the existing Python tests, I thought there was a preference for keeping separate tests for readability.
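A rough sketch of what combining the two tests could look like, written without pytest so it runs on its own. The operator becomes one more parameter and the expression builder becomes a local helper instead of a global. The helper name `overflow_expr` and the SQL text it produces are assumptions for illustration; the real helper's body is not shown in this thread.

```python
from itertools import product

LOGIC_OPS = ["AND", "OR"]
ANSI_MODES = ["false", "true"]
LHS_PREDICATES = [False, True]


def build_cases():
    """Return every (op, ansi_enabled, lhs_predicate, expr) combination."""

    def overflow_expr(op, lhs_predicate):
        # Hypothetical expression: a boolean literal combined with an
        # arithmetic comparison that could overflow in ANSI mode.
        lhs_sql = str(lhs_predicate).upper()
        return f"{lhs_sql} {op} (int_col + 1) > 0"

    return [
        (op, ansi, lhs, overflow_expr(op, lhs))
        for op, ansi, lhs in product(LOGIC_OPS, ANSI_MODES, LHS_PREDICATES)
    ]


cases = build_cases()
```

With pytest, each of the three lists would instead be its own stacked `pytest.mark.parametrize` decorator on a single test function, as in the diff above.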

gerashegalov · 2022-02-11T19:09:44Z

integration_tests/src/main/python/logic_test.py

+# process the RHS if they cannot figure out the result from just the LHS.
+# Tests the GPU short-circuits the predicates without throwing Exception in ANSI mode.
+@pytest.mark.parametrize('ansi_enabled', ['true', 'false'])
+@pytest.mark.parametrize('lhs_predicate', [True, False])


can we add a Null lhs predicate for completeness?

Good Suggestion! Done!

revans2 · 2022-02-15T17:39:10Z

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/predicates.scala

+          }
+        } else {
+          // replace nulll values
+          val lhsNoNulls = withResource(Scalar.fromBool(true)) { trueScalar =>


How expensive is this replacement? And should we be checking whether there are any nulls first?

I looked up all references to columnVector.replaceNulls() and did not see any of them guarded by a check for nulls. So I decided to follow the same pattern, assuming that replaceNulls() is not expensive.
I fixed that in the recent commits.
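The guard being discussed can be illustrated with plain Python lists. This is a stand-in, not the cuDF API: `replace_nulls` and `replace_nulls_if_needed` are hypothetical names, and `None` plays the role of a null entry.

```python
from typing import List, Optional


def replace_nulls(col: List[Optional[bool]], value: bool) -> List[bool]:
    """Always allocates a new column, even when there is nothing to replace."""
    return [value if v is None else v for v in col]


def replace_nulls_if_needed(col: List[Optional[bool]], value: bool) -> List[Optional[bool]]:
    """Skip the copy entirely when the column has no nulls."""
    if not any(v is None for v in col):
        return col  # reuse the input, no new allocation
    return replace_nulls(col, value)


no_nulls = [True, False]
with_nulls = [True, None]
```

Here the check is itself a linear scan; the guard only pays off when checking for nulls is cheaper than the replacement, for example when the column already tracks its null count.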

I mostly wanted some kind of evidence of whether it is slow or not. I think in this case my gut is probably okay, but it would be nice to have something better than that. Do you have any benchmarks that we could test this with? I know that AND/OR ended up being very slow in the past and caused measurable degradation in some TPC-DS queries, so it should be possible to build a project with lots of ANDs/ORs in it and see.

I was hoping you would never say that :-)
I will take a look into benchmarking. It may take me some time to get back with meaningful data though.

andygrove · 2022-02-16T17:48:52Z

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/predicates.scala

+                case (f: GpuScalar) =>
+                  doColumnar(lhsBool, f)


This path is not covered by the current tests. Would we ever hit this path? I don't think that scalar expressions can have side-effects and we would only call this method if the RHS has side-effects.

Wow, that is an interesting situation. The only place right now where we can return a Scalar from processing when the input is not a foldable constant is some corner cases with Coalesce, and Coalesce can totally have side effects associated with it for downstream tasks that are a part of it. Perhaps we should try to write a test for this just to be sure...

I've been trying to write a test that hits the scalar path and have failed to come up with anything so far. It looks like Coalesce will only return a scalar if the first non-null parameter is a scalar, but Spark will optimize the Coalesce out in that case. I am back to wondering if the scalar path is even possible, but maybe I am missing the edge case here.

I don't think Spark had that optimization in older versions, so you might try that. Also, this is enough of a corner case that I am fine with it as is.

amahussein · 2022-02-17T16:04:49Z

build

revans2 · 2022-02-17T16:46:36Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/conditionalExpressions.scala


+trait GpuColumnVectorHelper extends Arm {


Can we find a better name for this? It was GpuConditionalExpression, so it was clear that it was supposed to be used with those, but now it is a generic utility library with an even more generic name and no clear contract on how these APIs should be used. Could we change this to be an object instead of a trait? Could we restore the name GpuConditionalExpression so it is still clear that the methods are intended to be used with these types of classes? Also, I really dislike that we are overriding isAllTrue; that just does not appear to make any sense to me. If we have different ways of calculating isAllTrue, then we should either have separate utility APIs with names that make it clear how each works, or we should pass a flag to the API that lets you select how it works. Does that make sense to you?

Yes, that makes sense.
I initially thought of using the existing object GpuExpressionsUtils. Would that be fine? Or would you prefer a new helper object, and if so, what should it be called?
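The flag-based alternative to overriding isAllTrue might look roughly like this in Python. The function name and the `nulls_as_true` parameter are hypothetical and only illustrate the shape of the suggestion; they are not the plugin's API.

```python
from typing import List, Optional


def is_all_true(col: List[Optional[bool]], *, nulls_as_true: bool) -> bool:
    """Return True when every entry counts as true.

    The keyword-only flag makes each caller state how NULL (None) entries are
    treated, instead of hiding that choice in an overridden method.
    """
    return all((nulls_as_true if v is None else v) for v in col)


strict = is_all_true([True, None, True], nulls_as_true=False)
lenient = is_all_true([True, None, True], nulls_as_true=True)
```

Making the flag keyword-only keeps the two behaviours visible at every call site, which is the contract concern raised in the review.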

revans2

This looks good. I have a few nits that would be nice to address, but nothing that I think is required.

revans2 · 2022-02-18T16:48:08Z

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/predicates.scala

+   * side-effects, such as throwing exceptions for invalid inputs.
+   *
+   * This method performs lazy evaluation on the GPU by first filtering
+   * the input batch a  where the LHS predicate is True.


This line needs another look. The "a " between "batch" and "where" feels off.
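For readers following the doc comment, the filter, evaluate, gather flow it describes for AND can be sketched with Python lists. This is a simplified model under stated assumptions: the batch is one integer column, `rhs_add_one_positive` is a hypothetical RHS that raises on overflow, and `None` stands in for NULL.

```python
from typing import Callable, List, Optional

Bool3 = Optional[bool]
INT_MAX = 2**31 - 1


def and3(l: Bool3, r: Bool3) -> Bool3:
    if l is False or r is False:
        return False
    if l is None or r is None:
        return None
    return True


def rhs_add_one_positive(rows: List[int]) -> List[Bool3]:
    # Hypothetical RHS "x + 1 > 0" that raises on 32-bit overflow, as in ANSI mode.
    out: List[Bool3] = []
    for x in rows:
        if x == INT_MAX:
            raise ArithmeticError("integer overflow")
        out.append(x + 1 > 0)
    return out


def and_with_side_effects(
    lhs: List[Bool3], batch: List[int], rhs_fn: Callable[[List[int]], List[Bool3]]
) -> List[Bool3]:
    # 1. Treat NULL as True so that NULL rows are kept by the filter.
    keep = [True if l is None else l for l in lhs]
    # 2. Filter the batch down to the rows whose RHS must be evaluated.
    filtered = [row for row, k in zip(batch, keep) if k]
    # 3. Evaluate the RHS on the filtered rows only.
    rhs_small = iter(rhs_fn(filtered))
    # 4. Gather back to the original row positions; skipped rows get NULL,
    #    which is harmless because their LHS is False.
    rhs_full = [next(rhs_small) if k else None for k in keep]
    # 5. Combine with the original LHS (nulls intact) using three-valued AND.
    return [and3(l, r) for l, r in zip(lhs, rhs_full)]


result = and_with_side_effects([False, True, None], [INT_MAX, 5, -3], rhs_add_one_positive)
```

The overflowing row is filtered out before the RHS runs, so no exception is raised even though evaluating the RHS on the full batch would fail.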

revans2 · 2022-02-18T16:53:14Z

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/predicates.scala

+                    doColumnar(lhsBool, GpuColumnVector.from(combinedVector, dataType))
+                  }
+                case f: GpuScalar =>
+                  doColumnar(lhsBool, f)


As stated before, this code is not covered by any test. I have my doubts that it can be covered, but I think the simplest solution is to call columnarEvalToColumn, which will return the scalar as a column, and not worry about it.

withResource(GpuExpressionsUtils.columnarEvalToColumn(rightExpr, leftTrueBatch)) { rEval =>
  withResource(gather(lhsNoNulls, rEval)) { combinedVector =>
    doColumnar(lhsBool, GpuColumnVector.from(combinedVector, dataType))
  }
}

It also makes the code much smaller, at the expense of extra memory in a case we don't know is even possible to hit.
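The trade-off described here, one code path in exchange for materializing a scalar, can be shown with a small Python stand-in. `eval_to_column` is a hypothetical helper that mimics what the comment says columnarEvalToColumn does; it is not the real implementation.

```python
from typing import List, Optional, Union

Bool3 = Optional[bool]


def eval_to_column(result: Union[Bool3, List[Bool3]], num_rows: int) -> List[Bool3]:
    """Normalize an evaluation result to a column.

    A scalar is expanded to num_rows copies, which costs extra memory but lets
    the caller handle a single case instead of matching on scalar vs column.
    """
    if isinstance(result, list):
        return result
    return [result] * num_rows


from_scalar = eval_to_column(True, 3)
from_column = eval_to_column([True, None, False], 3)
```

After this normalization the caller can always run the gather step on a column, so the untested scalar branch disappears.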

amahussein · 2022-02-18T20:31:22Z

build

amahussein · 2022-02-22T23:00:08Z

This looks good. I have a few nits that would be nice to address, but nothing that I think is required.

Thanks @revans2 !
I pushed one final commit to address the remaining comments.
Does it look good to you?

amahussein added 3 commits February 11, 2022 10:02

Add integration tests for shortcircuit logic

46045e0

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Add columnarEvalWithSideEffects to predicate AND

0aea97d

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

fix LOGICAL_AND

777ac38

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

sameerz added the bug Something isn't working label Feb 11, 2022

sameerz assigned amahussein Feb 11, 2022

sameerz added this to the Feb 14 - Feb 25 milestone Feb 11, 2022

amahussein marked this pull request as draft February 11, 2022 18:42

jlowe reviewed Feb 11, 2022

gerashegalov reviewed Feb 11, 2022

andygrove reviewed Feb 11, 2022

amahussein added 7 commits February 14, 2022 10:12

remove globals from logic_test

d3b21f1

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

add schema to test_logic

96d43d7

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

fix Logical_and with NULL LHS

7d94599

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

remove redundant logical test

707571d

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Merge branch 'branch-22.04' into rapids-4526-impl

2bbef30

use traits and classes to avoid copying code

7f6cc8a

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

fix Logical_OR with side effects

94011a5

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

andygrove reviewed Feb 15, 2022

andygrove reviewed Feb 15, 2022

revans2 reviewed Feb 15, 2022

andygrove reviewed Feb 15, 2022

sameerz changed the title ~~[WIP] Short cuircit AND/OR in ANSI mode~~ [WIP] Short circuit AND/OR in ANSI mode Feb 16, 2022

andygrove reviewed Feb 16, 2022

amahussein added 2 commits February 16, 2022 11:09

optimize replaceNulls and PR comments

0ea9aaf

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

optimize OR implementation by avoiding OP_NOT

4871edf

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

andygrove reviewed Feb 17, 2022

andygrove reviewed Feb 17, 2022

fix typos and blank lines

12fc06e

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

amahussein changed the title ~~[WIP] Short circuit AND/OR in ANSI mode~~ Short circuit AND/OR in ANSI mode Feb 17, 2022

amahussein marked this pull request as ready for review February 17, 2022 16:03

revans2 reviewed Feb 17, 2022

replace trait with Object and reduce code size

31af28f

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

revans2 previously approved these changes Feb 18, 2022

amahussein dismissed revans2’s stale review via 7f1e1f9 February 18, 2022 20:29

amahussein added 2 commits February 18, 2022 12:31

address last PR comments

3f89530

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

Merge branch 'branch-22.04' into rapids-4526-impl

7f1e1f9

revans2 approved these changes Feb 22, 2022

revans2 merged commit 076f36e into NVIDIA:branch-22.04 Feb 22, 2022

amahussein deleted the rapids-4526-impl branch February 22, 2022 23:37
