Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add casting of count to Int64 in array_repeat function to ensure consistent integer type handling #14236

Merged
merged 3 commits into from
Jan 24, 2025

Conversation

jatin510
Copy link
Contributor

Which issue does this PR close?

Closes #14228.

Rationale for this change

The count in array_repeat function only works with Int64 data type. Now, we have upcasted the count variable if the dataype is Int8, Int16 and Int32.

What changes are included in this PR?

Updated the implementation of array_repeat to upcast datatype of count argument.

Are these changes tested?

Yes, Added test in array.slt file.

Are there any user-facing changes?

No

Copy link
Contributor

@korowa korowa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jatin510, LGTM.

I've left some comments on more possible enhancements here.

_ => return exec_err!("count must be an integer type"),
};

let count_array = as_int64_array(&count_array)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe uint should be used as a target type here? Right now DF panics when negative value passed as an argument, which also could be fixed along with type casting

Copy link
Contributor Author

@jatin510 jatin510 Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we want the target type of array_repeat count function to be UInt64, we have to change the behaviour of array_repeat. As of now, by default count is of type Int64.
It will be a breaking change, are we okay with that ?

@korowa

with current implementation of array_repeat and default data type of Int64.

It is behaving like this , if we are converting all Signed Integer type to Unsigned Integer, but letting the input of count to be of Int64

> select array_repeat(1,-20000);
+--------------------------------------+
| array_repeat(Int64(1),Int64(-20000)) |
+--------------------------------------+
| []                                   |
+--------------------------------------+

In Spark, it gives empty arrow for negative count argument

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By target type I've meant conversion to UInt internallly instead of Int, so it wouldn't affect any input arguments, only prevent this function from panicking on negative values (but I thought it would return cast error 🤔 ), so I don't think it's a breaking change.

let count_array = &args[1];

let count_array = match count_array.data_type() {
DataType::Int8 | DataType::Int16 | DataType::Int32 => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likely, UInt should also be allowed here

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thansk for working on this @jatin510 and @korowa -- another way to implement this same functionality would using the signature (in which case DataFusion will handle the casting for you)

It seems like the current method simply accepts any argument types

fn signature(&self) -> &Signature {

I think you could potentially use a signature
https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.Signature.html

Or maybe you could implement https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.coerce_types directly

(and simply say the second argument must be an Int64 or UInt64)

🤔

@jatin510
Copy link
Contributor Author

Thansk for working on this @jatin510 and @korowa -- another way to implement this same functionality would using the signature (in which case DataFusion will handle the casting for you)

It seems like the current method simply accepts any argument types

fn signature(&self) -> &Signature {

I think you could potentially use a signature https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.Signature.html

Or maybe you could implement https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.coerce_types directly

(and simply say the second argument must be an Int64 or UInt64)

🤔

It makes sense.
I will update the signature of the array_repeat to handle casting and type conversion.


// Coerce the second argument to Int64/UInt64 if it's a numeric type
let second = match count_type {
DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => {
Copy link
Contributor Author

@jatin510 jatin510 Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@korowa @alamb
converting negative numbers to UInt64 in this function was giving: arrow error, can't cast negative numbers to Uint.
during runtime for the negative integers.

So, converted the Int values to Int64 .

Then using array_repeat inner function to convert Int64 to UInt64 type

Copy link
Contributor

@korowa korowa Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-necessary suggestion: due to safe cast converting negative values to nulls, perhaps we should also add a simple test to verify that count array with multiple values with nulls (after casting to uint) will be processed as expected, like

select array_repeat('x', column1) from (values (-1), (2), (-3));

@korowa korowa merged commit 633eef6 into apache:main Jan 24, 2025
25 checks passed
@korowa
Copy link
Contributor

korowa commented Jan 24, 2025

Thank you @jatin510 @alamb

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Array repeat, upcast issue
3 participants