Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change ScalarValue::{List, LargeList, FixedSizedList} to take specific types rather than ArrayRef #8562

Merged
merged 18 commits into from
Jan 9, 2024

Conversation

rspears74
Copy link

@rspears74 rspears74 commented Dec 16, 2023

Which issue does this PR close?

Closes #8472

Rationale for this change

It's much preferable to constrain ScalarValue::List to accept only an Arc<ListArray> rather than any ArrayRef and fail randomly. Similar for ScalarValue::LargeList and ScalarValue::FixedSizeList.

What changes are included in this PR?

Changed the type signature of ScalarValue::List, ScalarValue::LargeList, and ScalarValue::FixedSizedList and any cascading changes that follow from this. One test compared a ListArray passed to ScalarValue::List to a LargeListArray passed to ScalarValue::List - because ScalarValue::List only accepts ListArray now, this test was changed to compare two ListArrays instead.

Are these changes tested?

Tests have been updated where needed.

Are there any user-facing changes?

Users are now only allowed to pass a ListArray to ScalarValue::List, a LargeListArray to ScalarValue::LargeList, and a FixedSizeListArray to ScalarValue::FixedSizeList.

Spears Randall added 2 commits December 15, 2023 16:38
Also ScalarValue::LargeList and ScalarValue::FixedSizeList
@github-actions github-actions bot added physical-expr Physical Expressions optimizer Optimizer rules labels Dec 16, 2023
@@ -3300,33 +3409,6 @@ mod tests {
assert_eq!(result, &expected);
}

#[test]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ScalarValue::eq_array test for List elsewhere? Otherwise, we can add the test for it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed it because it was comparing a ListArray with a LargeListArray, and thus was no longer compatible with ScalarValue::List. But I can add this test back and just do two ListArrays?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

datafusion/proto/src/logical_plan/to_proto.rs Outdated Show resolved Hide resolved
datafusion/proto/src/logical_plan/to_proto.rs Outdated Show resolved Hide resolved
datafusion/proto/src/logical_plan/to_proto.rs Show resolved Hide resolved
@alamb alamb changed the title ScalarValue::List type narrowing Change ScalarValue::{List, LargeList, FixedSizedList} to take specific types rather than ArrayRef Dec 17, 2023
@alamb alamb added the api change Changes the API exposed to users of the crate label Dec 17, 2023
Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jayzhan211
Copy link
Contributor

@alamb It seems you forget to trace this PR in #8704?

Copy link
Member

@Weijun-H Weijun-H left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @rspears74 👍

write!(f, "{value_formatter}")?
}
ScalarValue::LargeList(arr) => {
// ScalarValue List should always have a single element
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// ScalarValue List should always have a single element
// ScalarValue LargeList should always have a single element

let value_formatter = formatter.value(0);
write!(f, "{value_formatter}")?
}
ScalarValue::FixedSizeList(arr) => {
// ScalarValue List should always have a single element
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// ScalarValue List should always have a single element
// ScalarValue FixedSizeList should always have a single element

arr.to_owned() as ArrayRef,
)])
.map_err(|e| {
Error::General( format!("Error creating temporary batch while encoding ScalarValue::List: {e}"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Error::General( format!("Error creating temporary batch while encoding ScalarValue::List: {e}"))
Error::General( format!("Error creating temporary batch while encoding ScalarValue::LargeList: {e}"))

arr.to_owned() as ArrayRef,
)])
.map_err(|e| {
Error::General( format!("Error creating temporary batch while encoding ScalarValue::List: {e}"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Error::General( format!("Error creating temporary batch while encoding ScalarValue::List: {e}"))
Error::General( format!("Error creating temporary batch while encoding ScalarValue::FixedSizeList: {e}"))

@alamb
Copy link
Contributor

alamb commented Jan 4, 2024

I will try and find time to review this PR later this week . Sorry for the delay

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR @rspears74 - I think it is looking quite close. There are a few places where code gets repeated 3 times that I think we can avoid by refactoring code into specific functions. I left some more specific suggestions below.

Also thank you @jayzhan211 and @Weijun-H for the reviews

)),
})
}
ScalarValue::LargeList(arr) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to avoid the copy/paste in this method by making a function? Something like the following?

fn encode_scalar_list_value(array: &Array) -> Result<protobuf::ScalarListValue> {
...
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one was easy enough to implement.

arr.value(0)
} else {
unreachable!("Since only List / LargeList / FixedSizeList are supported, this should never happen")
fn first_array_for_list(arr: &Arc<ListArray>) -> ArrayRef {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be really nice here to avoid the redundant copy/paste code, perhaps by making a function with signature like this

fn cmp_list_array(arr1: &dyn Array, arr2: &dyn Array) -> Option<Ordering> {
..
}

Which then you can call with Arc<ListArray> and Arc<LargeListArray> types

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is a little tougher, because if the args to the function are "generic" (i.e. ArrayRef instead of a specific Arc<ListArray> type), the arr1.value(0) method is not available. The first_array_for_list function needs to be able to convert the ArrayRef into the specific type to then get the value out. So I'm struggling a bit coming up with a way here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a proposal of how to do it: rspears74#1 (that targets this PR)

@rspears74
Copy link
Author

rspears74 commented Jan 6, 2024

Welp, I had a merge conflict, and upon fixing it, there's some new code that's incompatible with these changes (not compiling), so I'm gonna have to do a little more involved work to fix those. Will try to make that happen ASAP.

@rspears74
Copy link
Author

Welp, I had a merge conflict, and upon fixing it, there's some new code that's incompatible with these changes (not compiling), so I'm gonna have to do a little more involved work to fix those. Will try to make that happen ASAP.

Alright that wasn't so bad. Fixed and passing tests.

Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @rspears74 -- I think this is looking good.

cc @wjones127 since I think you are interested in FixedSizedList in general

I think there is still some repetition that could be avoided (I left my suggestions on how to do so) but we could potentially do that as a follow on PR as well

arr.value(0)
} else {
unreachable!("Since only List / LargeList / FixedSizeList are supported, this should never happen")
fn first_array_for_list(arr: &Arc<ListArray>) -> ArrayRef {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a proposal of how to do it: rspears74#1 (that targets this PR)

// ScalarValue List should always have a single element
assert_eq!(arr.len(), 1);
let options = FormatOptions::default().with_display_error(true);
let formatter = ArrayFormatter::try_new(arr, &options).unwrap();
let formatter =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same pattern could be applied here as in rspears74#1 (use first_array_for_list)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take care of this one too. I went through my changes looking for any larger duplication sections but missed this one.

Copy link
Member

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems acceptable. Like Andrew said I think we could avoid some repetition.

I do think it's worth quickly considering whether wrapping in Scalar might be useful, as right now there isn't any strong enforcement that the array values in these scalars is length 1.

datafusion/common/src/scalar.rs Show resolved Hide resolved
datafusion/common/src/scalar.rs Show resolved Hide resolved
datafusion/common/src/scalar.rs Show resolved Hide resolved
@@ -142,13 +143,13 @@ pub enum ScalarValue {
/// Fixed size list scalar.
///
/// The array must be a FixedSizeListArray with length 1.
FixedSizeList(ArrayRef),
FixedSizeList(Arc<FixedSizeListArray>),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I'm coming in a little late, but have we considered instead making the type Scalar<Arc<FixedSizeListArray>>? (Scalar being the one in arrow-rs) This would enforce that the length is 1, which might be nice. I could also understand if that makes it too awkward to work with.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I like the enforcement of length 1 (I mentioned this problem initially in the issue I opened), but it does seem to be a bit cumbersome to deal with.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a good idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's file a follow on ticket for continuing improving the representation of list constants

@rspears74
Copy link
Author

Not sure why the MSRV check is failing?

@alamb
Copy link
Contributor

alamb commented Jan 7, 2024

Not sure why the MSRV check is failing?

You may need to merge up from main -- I think this was fixed on main recently

@alamb
Copy link
Contributor

alamb commented Jan 9, 2024

I merged up from main to resolve a merge conflict. Thank you everyone

@alamb alamb merged commit f8d8603 into apache:main Jan 9, 2024
22 checks passed
@rspears74
Copy link
Author

Thanks @alamb @jayzhan211 @wjones127 for working through this with me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate optimizer Optimizer rules physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ScalarValue::List not working as expected in UDAF state
5 participants