Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Coalesce function #1969

Merged
merged 11 commits into from
Apr 6, 2022
Merged

Add Coalesce function #1969

merged 11 commits into from
Apr 6, 2022

Conversation

msathis
Copy link
Contributor

@msathis msathis commented Mar 9, 2022

Which issue does this PR close?

Closes #1890.

Rationale for this change

To support coalesce function in SQL

What changes are included in this PR?

Basic coalesce support. Now everything is treated as string, will have to work to support other data types properly.

Are there any user-facing changes?

Users can use coalesce function in SQL now

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Mar 9, 2022
@msathis msathis marked this pull request as ready for review March 15, 2022 09:45
@msathis msathis changed the title Coalesce function - WIP Coalesce function Mar 15, 2022
@msathis
Copy link
Contributor Author

msathis commented Mar 15, 2022

@andygrove @xudong963 Please take a look & review when you find time.

for column_value in args {
match column_value {
ColumnarValue::Array(array_ref) => {
if array_ref.is_valid(i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation can likely be optimized (I think it should be >10x faster) by not converting values to ScalarValue and using is_valid, matching per value, but operating directly on Arrays. That would also allow optimizations like skipping running the coalesce althogether if all items in the array already are non null, or skipping handling the array if every value is null, etc.

This will involve some more work / code though, but maybe a comment that there are some big optimizations possible could be useful.

The current implementation also could likely already be optimized a bit by moving the for loop to the innermost place instead of the outermost place and/or implementing one of the rules like mentioned.

@yjshen yjshen added the api change Changes the API exposed to users of the crate label Mar 16, 2022
@yjshen
Copy link
Member

yjshen commented Mar 16, 2022

@msathis Could you please also add this to Ballista, as #2008 did?

@msathis
Copy link
Contributor Author

msathis commented Mar 16, 2022

@yjshen Yup. Thanks for pointing out. I'll add it to Ballista with the optimization @Dandandan suggested

@xudong963 xudong963 added the enhancement New feature or request label Mar 17, 2022
@alamb
Copy link
Contributor

alamb commented Mar 21, 2022

@yjshen Yup. Thanks for pointing out. I'll add it to Ballista with the optimization @Dandandan suggested

@msathis do you plan to do this as a follow on PR or this PR?

In other words is this PR ready for review again or are you still working on it?

@@ -111,6 +112,11 @@ pub fn return_type(
utf8_to_int_type(&input_expr_types[0], "character_length")
}
BuiltinScalarFunction::Chr => Ok(DataType::Utf8),
BuiltinScalarFunction::Coalesce => {
// COALESCE has multiple args and they might get coerced, get a preview of this
let coerced_types = data_types(input_expr_types, &signature(fun));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any doc or reference to guide for the common result type or the coercion type?

}

#[tokio::test]
async fn coalesce_sum_with_default_value() -> Result<()> {
Copy link
Contributor

@liukun4515 liukun4515 Mar 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to add other case with agg function from #2067 (comment)

select COALESCE(sum(c3),0) from table

All the value of c3 is null.
@msathis

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. i'll update the test case!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can try this case

select COALESCE(sum(c3),0) from table group by xxx

@msathis
Copy link
Contributor Author

msathis commented Mar 24, 2022

R ready for review again or are you still working on it?

Hi @alamb , Sorry, i got busy with some office work. I'll try to revise the PR in the next couple of days. I was planning to address all the performance optimisations @Dandandan suggested.

@alamb
Copy link
Contributor

alamb commented Mar 24, 2022

Thank you @msathis -- makes total sense. Let us know if you need any help!

}
ColumnarValue::Scalar(scalar) => {
if !scalar.is_null() {
// TODO: Figure out how to set value at index
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @alamb, I need your help here. I couldn't figure out how to set value to particular index of arrow::Array

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you may want to call scalar. to_array_of_size(size) here to create the appropriate array:

https://github.com/alamb/arrow-datafusion/blob/41b4e49/datafusion/common/src/scalar.rs#L1075-L1076

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alamb I think that worked.

@liukun4515
Copy link
Contributor

@msathis If the pr is ready to review, please at me.
Thanks.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @msathis -- this is a great PR. I think it could be merged as is, though I have some small comments / suggestions.

I also updated this PR description as it said "everything was treated as a string" which I do not think applies any longer

"| coalesce(test.c1,test.c2) |",
"+---------------------------+",
"| 0 |",
"| 1 |",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}

#[tokio::test]
async fn coalesce_static_empty_value() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test seems to be the same as coalesce_plan -- is that intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its an overlook, you're right. i'll remove coalesce_plan

}

#[tokio::test]
async fn coalesce_static_value_with_null() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}

#[tokio::test]
async fn coalesce_result_with_default_value() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stylistically this might be better if it were right next to coalesce_result given its similarity.

For that matter, since it uses the same data setup, it might be cool to combine the two tests (and run two different queries in them)

"+---------------------------------------------+",
];
assert_batches_eq!(expected, &actual);
Ok(())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice testing 👍

datafusion/expr/src/expr_fn.rs Outdated Show resolved Hide resolved
@alamb
Copy link
Contributor

alamb commented Apr 2, 2022

I will wait for @liukun4515 to review as well prior to merging

@msathis
Copy link
Contributor Author

msathis commented Apr 2, 2022

@alamb Updated the PR to address the review comments

@msathis
Copy link
Contributor Author

msathis commented Apr 2, 2022

@msathis If the pr is ready to review, please at me. Thanks.

@liukun4515 Your review will be great! I think the PR is ready. My knowledge is limited, i feel more optimisation is possible.

@alamb alamb changed the title Coalesce function Add Coalesce function Apr 3, 2022
@alamb
Copy link
Contributor

alamb commented Apr 3, 2022

Thanks @msathis -- I think this is looking great

@alamb
Copy link
Contributor

alamb commented Apr 5, 2022

@liukun4515 would you still like to review this prior to merging?

Copy link
Member

@yjshen yjshen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thanks @msathis!

Copy link
Contributor

@liukun4515 liukun4515 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yjshen yjshen merged commit b890190 into apache:master Apr 6, 2022
ovr pushed a commit to cube-js/arrow-datafusion that referenced this pull request Aug 10, 2022
* feat: coalesce function. WIP

* fix: add license header

* fix: support all primitive data types

* fix: add test case and remove commented code

* fix: refactor code and support coalesce from ballista

* fix: support scalar values in coalesce

* fix: clippy

* fix: clippy by removing unneeded mut

* fix: clippy by removing unneeded mut

* fix: address review comments

Co-authored-by: Sathis Kumar <[email protected]>
ovr pushed a commit to cube-js/arrow-datafusion that referenced this pull request Aug 10, 2022
* feat: coalesce function. WIP

* fix: add license header

* fix: support all primitive data types

* fix: add test case and remove commented code

* fix: refactor code and support coalesce from ballista

* fix: support scalar values in coalesce

* fix: clippy

* fix: clippy by removing unneeded mut

* fix: clippy by removing unneeded mut

* fix: address review comments

Co-authored-by: Sathis Kumar <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate datafusion Changes in the datafusion crate enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for SQL coalesce function
6 participants