-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Short-circuit evaluation for CaseWhen
#2068
Conversation
let schema = batch.schema(); | ||
|
||
// CASE a when 0 THEN float64(null) ELSE 25.0 / cast(a, float64) END | ||
let when1 = lit(ScalarValue::Int32(Some(0))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test fails since we are currently evaluating else
first for
CASE expr WHEN value THEN result ELSE result END
Regardless of a tuple has entered when branches
before and should bypass else
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, we are evaluating case_when
from the end to the beginning. This is problematic since short-circuit evaluation is adopted by most engines, including PostgreSQL, Oracle, and SQL Server. Therefore, chances are users would express computation logic that would fail in later when_thens
for tuples that should have been computed previously and bypassed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can evaluate case_when
from beginning to the end and use evaluate_selection
for the when_expr
so that we can omit the following computation of the record whose when_expr
is already true. Do I get your idea?
then
clause for true when
'sCaseWhen
let when_value = self.when_then_expr[i].0.evaluate(batch)?; | ||
let when_value = self.when_then_expr[i] | ||
.0 | ||
.evaluate_selection(batch, &remainder)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Great! It's short-circuit now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- I had some questions about scatter
but otherwise looks great
indices.push(i as u64); | ||
} | ||
} | ||
let indices = UInt64Array::from_iter_values(indices); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you can't just update the null
mask of the source batch to be null
where validity is false
because things like the divide kernel will still throw runtime exceptions if the data is 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I think divide kernel works correctly to deal with only valid indices.
Are you suggesting I should create new RecordBatch by masking the current batch instead of the take-then-scatter way? Should I create bitmaps from existing ones for each array with the help of arrow::bit_util
, or do I miss something handy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just thinking it might be possible to do something like the following psuedo code:
let mask = and(old_array.null_mask(), selection);
let new_array = old_array.replace_null_mask(mask);
let result = compute_expr(new_array);
And skip having to scatter / gather
However, given this code works and is covered by tests maybe we cn revisit the approach if there is some performance or correctness issue in the future
fn scatter(mask: &BooleanArray, truthy: &dyn Array) -> Result<ArrayRef> { | ||
let truthy = truthy.data(); | ||
|
||
let mut mutable = MutableArrayData::new(vec![&*truthy], true, mask.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am probably missing something here but this code looks like it always creates BooleanArrays
even when truthy
is some other type -- in the case examples you have, the resulting expression is always boolean, but I wonder if this is always the case
Perhaps it is worth an assert!
that truthy.data_type() == DataType::Boolean
?
Even better would be some unit tests showing how scatter worked (for boolean and non boolean arrays)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's extending with mutable.extend(0, true_pos, true_pos + len);
array from index 0 (the only truthy array), so the result is of the same type with truthy. Test added as well.
Co-authored-by: Andrew Lamb <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Co-authored-by: Jie Han <[email protected]>
Co-authored-by: Jie Han <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @yjshen
indices.push(i as u64); | ||
} | ||
} | ||
let indices = UInt64Array::from_iter_values(indices); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just thinking it might be possible to do something like the following psuedo code:
let mask = and(old_array.null_mask(), selection);
let new_array = old_array.replace_null_mask(mask);
let result = compute_expr(new_array);
And skip having to scatter / gather
However, given this code works and is covered by tests maybe we cn revisit the approach if there is some performance or correctness issue in the future
Thanks @doki23 for the review |
Which issue does this PR close?
Closes #2064.
Rationale for this change
As reported by #2064, we are currently evaluating
then expr
andelse expr
for all tuples, regardless a tuple might already fail thewhen expr
already.What changes are included in this PR?
Evaluate expressions sequentially as they appear in CaseWhen, short-circuit when possible.
Are there any user-facing changes?
No.