Short-circuit evaluation for `CaseWhen` #2068

yjshen · 2022-03-23T09:50:36Z

Which issue does this PR close?

Closes #2064.

Rationale for this change

As reported in #2064, we currently evaluate the then expr and the else expr for all tuples, regardless of whether a tuple has already failed the when expr.

What changes are included in this PR?

Evaluate expressions sequentially as they appear in CaseWhen, short-circuiting when possible.

Are there any user-facing changes?

No.

yjshen · 2022-03-23T09:58:07Z

datafusion-physical-expr/src/expressions/case.rs

+        let schema = batch.schema();
+
+        // CASE a when 0 THEN float64(null) ELSE 25.0 / cast(a, float64)  END
+        let when1 = lit(ScalarValue::Int32(Some(0)));


This test fails because we currently evaluate the else branch first for

CASE expr WHEN value THEN result ELSE result END

regardless of whether a tuple has already matched a when branch and should therefore bypass else.

Also, we are evaluating case_when from the end to the beginning. This is problematic because most engines, including PostgreSQL, Oracle, and SQL Server, use short-circuit evaluation. Users are therefore likely to write later when_thens that would fail for tuples that an earlier branch should already have handled and bypassed.

We can evaluate case_when from the beginning to the end and use evaluate_selection for the when_expr, so that we skip any further computation for records whose when_expr is already true. Do I get your idea?
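The front-to-back evaluation described above can be sketched in plain Rust, without Arrow. This is only an illustration under stated assumptions: `WhenThen`, `eval_case`, and `safe_inverse` are hypothetical names, `Option<i32>` stands in for a nullable column, and integer division replaces the float division from the test so that the hazard (a divide by zero in the else branch) is visible. The `remainder` vector plays the role of the selection passed to `evaluate_selection` in the PR.

```rust
/// One `WHEN value THEN result` pair (hypothetical type, not the DataFusion API).
/// `then` is only called for rows that match `when`.
struct WhenThen {
    when: i32,
    then: fn(i32) -> Option<i32>,
}

/// Evaluates `CASE input WHEN .. THEN .. ELSE .. END` row by row,
/// visiting the branches in the order they are written.
fn eval_case(
    input: &[Option<i32>],
    when_then: &[WhenThen],
    else_expr: fn(i32) -> Option<i32>,
) -> Vec<Option<i32>> {
    let mut output = vec![None; input.len()];
    // Rows that no earlier WHEN branch has claimed yet.
    let mut remainder = vec![true; input.len()];

    for branch in when_then {
        for (i, value) in input.iter().enumerate() {
            // Only rows still in `remainder` are tested against this branch.
            if remainder[i] && *value == Some(branch.when) {
                output[i] = (branch.then)(branch.when);
                remainder[i] = false; // short-circuit: later branches skip this row
            }
        }
    }

    for (i, value) in input.iter().enumerate() {
        if remainder[i] {
            // ELSE runs only for rows that matched no WHEN; NULL input stays NULL.
            output[i] = (*value).and_then(else_expr);
        }
    }
    output
}

/// `CASE a WHEN 0 THEN NULL ELSE 100 / a END` built on top of `eval_case`.
fn safe_inverse(a: &[Option<i32>]) -> Vec<Option<i32>> {
    let branches = [WhenThen { when: 0, then: |_| None }];
    eval_case(a, &branches, |v| Some(100 / v))
}
```

Because rows equal to 0 are removed from `remainder` before the else expression runs, `100 / v` is never evaluated for them. Evaluating else for every row first, as the old code did, would hit the division by zero.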

doki23 · 2022-03-24T03:52:08Z

datafusion-physical-expr/src/expressions/case.rs

-            let when_value = self.when_then_expr[i].0.evaluate(batch)?;
+            let when_value = self.when_then_expr[i]
+                .0
+                .evaluate_selection(batch, &remainder)?;


👍 Great! It short-circuits now.

alamb

Looks good to me -- I had some questions about scatter but otherwise looks great

alamb · 2022-03-25T17:50:01Z

datafusion-physical-expr/src/physical_expr.rs

+                indices.push(i as u64);
+            }
+        }
+        let indices = UInt64Array::from_iter_values(indices);


I assume you can't just update the null mask of the source batch to be null where validity is false because things like the divide kernel will still throw runtime exceptions if the data is 0?

No, I think the divide kernel correctly operates on only the valid indices.
Are you suggesting I should create a new RecordBatch by masking the current batch instead of the take-then-scatter way? Should I create bitmaps from the existing ones for each array with the help of arrow::bit_util, or am I missing something handy?

I was just thinking it might be possible to do something like the following pseudocode:

    let mask = and(old_array.null_mask(), selection);
    let new_array = old_array.replace_null_mask(mask);
    let result = compute_expr(new_array);

And skip having to scatter / gather.

However, given this code works and is covered by tests, maybe we can revisit the approach if there is some performance or correctness issue in the future.
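The null-mask idea in the pseudocode above can be modelled in plain Rust. This is a rough sketch, not Arrow code: `apply_selection` and `divide_nullable` are hypothetical helpers, `Option<i32>` stands in for a value plus its validity bit, and the divide here is assumed to skip null slots. Whether a real divide kernel skips them is exactly the open question in this thread, and the sketch does not settle it.

```rust
/// ANDs the existing validity with the selection: unselected rows become NULL.
/// (Hypothetical helper; `None` models a cleared validity bit.)
fn apply_selection(values: &[Option<i32>], selection: &[bool]) -> Vec<Option<i32>> {
    values
        .iter()
        .zip(selection)
        .map(|(v, &keep)| if keep { *v } else { None })
        .collect()
}

/// A null-aware divide: the division runs only where both inputs are valid.
/// This is an assumption about kernel behaviour, made for illustration.
fn divide_nullable(left: &[Option<i32>], right: &[Option<i32>]) -> Vec<Option<i32>> {
    left.iter()
        .zip(right)
        .map(|(l, r)| match (l, r) {
            (Some(l), Some(r)) => Some(*l / *r),
            _ => None, // any NULL operand gives NULL and skips the division
        })
        .collect()
}
```

With this shape the output already has the same length as the input, so no scatter step is needed. The cost is that every row is still visited, whereas take-then-scatter evaluates only the selected rows.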

alamb · 2022-03-25T18:00:16Z

datafusion-physical-expr/src/physical_expr.rs

+fn scatter(mask: &BooleanArray, truthy: &dyn Array) -> Result<ArrayRef> {
+    let truthy = truthy.data();
+
+    let mut mutable = MutableArrayData::new(vec![&*truthy], true, mask.len());


I am probably missing something here but this code looks like it always creates BooleanArrays even when truthy is some other type -- in the case examples you have, the resulting expression is always boolean, but I wonder if this is always the case

Perhaps it is worth an assert! that truthy.data_type() == DataType::Boolean?

Even better would be some unit tests showing how scatter worked (for boolean and non boolean arrays)

It extends with `mutable.extend(0, true_pos, true_pos + len);`, copying from the array at index 0 (the only truthy array), so the result has the same type as truthy. Test added as well.
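The type-preserving behaviour described above can be illustrated with a generic sketch of the take-then-scatter round trip. It uses slices and `Vec<Option<T>>` rather than Arrow arrays, and it copies one element at a time instead of extending by runs as `MutableArrayData` does. `selected_indices`, `take`, and this `scatter` are simplified stand-ins whose signatures are assumptions, not the functions in the PR.

```rust
/// Positions where the selection mask is true (the indices handed to "take").
fn selected_indices(mask: &[bool]) -> Vec<usize> {
    mask.iter()
        .enumerate()
        .filter_map(|(i, &keep)| if keep { Some(i) } else { None })
        .collect()
}

/// Gathers the selected rows into a dense, shorter vector.
fn take<T: Clone>(values: &[T], indices: &[usize]) -> Vec<T> {
    indices.iter().map(|&i| values[i].clone()).collect()
}

/// Spreads the dense `truthy` values back to the positions where `mask` is true
/// and fills the remaining positions with NULL (`None`).
/// The element type `T` is whatever `truthy` holds; nothing forces it to be boolean.
fn scatter<T: Clone>(mask: &[bool], truthy: &[T]) -> Vec<Option<T>> {
    let mut next = truthy.iter();
    mask.iter()
        .map(|&selected| if selected { next.next().cloned() } else { None })
        .collect()
}
```

The output always has `mask.len()` entries, and its element type follows `truthy`, which is the point made in the reply above.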

doki23

LGTM

alamb

Thanks again @yjshen

alamb · 2022-03-27T10:19:07Z

Thanks @doki23 for the review

WIP: case when expr works

6a61735

yjshen commented Mar 23, 2022

short-circuit case_when

bc78f97

yjshen marked this pull request as ready for review March 24, 2022 03:17

yjshen changed the title ~~WIP: case when should only evaluate then clause for true when's~~ Short-circuit evaluation for CaseWhen Mar 24, 2022

doki23 reviewed Mar 24, 2022

else

602034c

alamb reviewed Mar 25, 2022

alamb approved these changes Mar 25, 2022

Update datafusion-physical-expr/src/physical_expr.rs

be7005a

Co-authored-by: Andrew Lamb <[email protected]>

doki23 approved these changes Mar 26, 2022

yjshen and others added 3 commits March 26, 2022 17:13

resolve comments

5d64bdc

Update datafusion-physical-expr/src/expressions/case.rs

5a0c78f

Co-authored-by: Jie Han <[email protected]>

Update datafusion-physical-expr/src/expressions/case.rs

d42b877

Co-authored-by: Jie Han <[email protected]>

alamb approved these changes Mar 27, 2022

alamb merged commit ff110d6 into apache:master Mar 27, 2022

This was referenced Mar 29, 2022

panic range end index 3 out of range for slice of length 2 when evaluating CASE expression #2117

Closed

Fix case evaluation with NULLs #2118

Merged

yjshen deleted the case_fix branch April 22, 2022 08:31

sunchao mentioned this pull request Jan 27, 2023

Evaluate Kernel under Selection / Short-Circuiting Filter Evaluation apache/arrow-rs#3620

Open
