Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Short-circuit evaluation for CaseWhen #2068

Merged
merged 7 commits into from
Mar 27, 2022
Merged

Short-circuit evaluation for CaseWhen #2068

merged 7 commits into from
Mar 27, 2022

Conversation

yjshen
Copy link
Member

@yjshen yjshen commented Mar 23, 2022

Which issue does this PR close?

Closes #2064.

Rationale for this change

As reported by #2064, we are currently evaluating then expr and else expr for all tuples, regardless a tuple might already fail the when expr already.

What changes are included in this PR?

Evaluate expressions sequentially as they appear in CaseWhen, short-circuit when possible.

Are there any user-facing changes?

No.

let schema = batch.schema();

// CASE a when 0 THEN float64(null) ELSE 25.0 / cast(a, float64) END
let when1 = lit(ScalarValue::Int32(Some(0)));
Copy link
Member Author

@yjshen yjshen Mar 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test fails since we are currently evaluating else first for

CASE expr WHEN value THEN result ELSE result END

Regardless of a tuple has entered when branches before and should bypass else.

Copy link
Member Author

@yjshen yjshen Mar 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we are evaluating case_when from the end to the beginning. This is problematic since short-circuit evaluation is adopted by most engines, including PostgreSQL, Oracle, and SQL Server. Therefore, chances are users would express computation logic that would fail in later when_thens for tuples that should have been computed previously and bypassed.

Copy link
Contributor

@doki23 doki23 Mar 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can evaluate case_when from beginning to the end and use evaluate_selection for the when_expr so that we can omit the following computation of the record whose when_expr is already true. Do I get your idea?

@yjshen yjshen marked this pull request as ready for review March 24, 2022 03:17
@yjshen yjshen changed the title WIP: case when should only evaluate then clause for true when's Short-circuit evaluation for CaseWhen Mar 24, 2022
let when_value = self.when_then_expr[i].0.evaluate(batch)?;
let when_value = self.when_then_expr[i]
.0
.evaluate_selection(batch, &remainder)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Great! It's short-circuit now.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- I had some questions about scatter but otherwise looks great

datafusion-physical-expr/src/physical_expr.rs Show resolved Hide resolved
indices.push(i as u64);
}
}
let indices = UInt64Array::from_iter_values(indices);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you can't just update the null mask of the source batch to be null where validity is false because things like the divide kernel will still throw runtime exceptions if the data is 0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think divide kernel works correctly to deal with only valid indices.
Are you suggesting I should create new RecordBatch by masking the current batch instead of the take-then-scatter way? Should I create bitmaps from existing ones for each array with the help of arrow::bit_util, or do I miss something handy?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just thinking it might be possible to do something like the following psuedo code:

let mask = and(old_array.null_mask(), selection);
let new_array = old_array.replace_null_mask(mask);
let result = compute_expr(new_array);

And skip having to scatter / gather

However, given this code works and is covered by tests maybe we cn revisit the approach if there is some performance or correctness issue in the future

datafusion-physical-expr/src/physical_expr.rs Outdated Show resolved Hide resolved
fn scatter(mask: &BooleanArray, truthy: &dyn Array) -> Result<ArrayRef> {
let truthy = truthy.data();

let mut mutable = MutableArrayData::new(vec![&*truthy], true, mask.len());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am probably missing something here but this code looks like it always creates BooleanArrays even when truthy is some other type -- in the case examples you have, the resulting expression is always boolean, but I wonder if this is always the case

Perhaps it is worth an assert! that truthy.data_type() == DataType::Boolean?

Even better would be some unit tests showing how scatter worked (for boolean and non boolean arrays)

Copy link
Member Author

@yjshen yjshen Mar 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's extending with mutable.extend(0, true_pos, true_pos + len); array from index 0 (the only truthy array), so the result is of the same type with truthy. Test added as well.

Copy link
Contributor

@doki23 doki23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

datafusion-physical-expr/src/expressions/case.rs Outdated Show resolved Hide resolved
datafusion-physical-expr/src/expressions/case.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @yjshen

indices.push(i as u64);
}
}
let indices = UInt64Array::from_iter_values(indices);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just thinking it might be possible to do something like the following psuedo code:

let mask = and(old_array.null_mask(), selection);
let new_array = old_array.replace_null_mask(mask);
let result = compute_expr(new_array);

And skip having to scatter / gather

However, given this code works and is covered by tests maybe we cn revisit the approach if there is some performance or correctness issue in the future

@alamb alamb merged commit ff110d6 into apache:master Mar 27, 2022
@alamb
Copy link
Contributor

alamb commented Mar 27, 2022

Thanks @doki23 for the review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DivideByZero for CaseWhen
3 participants