-
Notifications
You must be signed in to change notification settings - Fork 849
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use ArrayFormatter in Cast Kernel #3668
Conversation
131d030
to
10c9b5c
Compare
10c9b5c
to
2de1fb5
Compare
/// Helper function to cast from `GenericBinaryArray` to `GenericStringArray`. This function performs | ||
/// UTF8 validation during casting. For invalid UTF8 value, it could be Null or returning `Err` depending | ||
/// `CastOptions`. | ||
fn cast_binary_to_generic_string<I, O>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I implemented this by converting the offsets and data separately, the performance is the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had some questions, but overall it is pretty sweet to see the performance improve with less code 👍
@@ -155,13 +154,12 @@ pub fn can_cast_types(from_type: &DataType, to_type: &DataType) -> bool { | |||
(_, Boolean) => DataType::is_numeric(from_type) || from_type == &Utf8 || from_type == &LargeUtf8, | |||
(Boolean, _) => DataType::is_numeric(to_type) || to_type == &Utf8 || to_type == &LargeUtf8, | |||
|
|||
(Utf8, LargeUtf8) => true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for anyone else reviewing, these case got folded down into the other Utf8 and LargeUtf cases
@@ -182,11 +181,8 @@ pub fn can_cast_types(from_type: &DataType, to_type: &DataType) -> bool { | |||
| Time64(TimeUnit::Nanosecond) | |||
| Timestamp(TimeUnit::Nanosecond, None) | |||
) => true, | |||
(LargeUtf8, _) => DataType::is_numeric(to_type) && to_type != &Float16, | |||
(Timestamp(_, _), Utf8) | (Timestamp(_, _), LargeUtf8) => true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rationale for this change? Is the Timestamp nanosecond with no timezone seems to be covered by the cases above, but timestamp with other units and Non null timezone is not (obviously) covered
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They're rolled into
(_, Utf8 | LargeUtf8) => from_type.is_primitive(),
A few lines down
array: &dyn Array, | ||
) -> Result<ArrayRef, ArrowError> { | ||
let mut builder = GenericStringBuilder::<O>::new(); | ||
let options = FormatOptions::default(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eventually we could even pass in FormatOptions
to the cast kernel (definitely not this PR)!
@@ -5521,8 +5237,8 @@ mod tests { | |||
let b = cast(&array, &DataType::Utf8).unwrap(); | |||
let c = b.as_any().downcast_ref::<StringArray>().unwrap(); | |||
assert_eq!(&DataType::Utf8, c.data_type()); | |||
assert_eq!("1997-05-19 00:00:00", c.value(0)); | |||
assert_eq!("2018-12-25 00:00:00", c.value(1)); | |||
assert_eq!("1997-05-19T00:00:00", c.value(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did these change? This seems like a non trivial change, potentially, for downstream crates, right? Is there some way to get the old behavior back (like maybe passing FormatOptions to the cast kernel) 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They change to be RFC3339, given we've made the same change to CSV, JSON and pretty output and this is the last holdout I figured I'd just change it
Co-authored-by: Andrew Lamb <[email protected]>
Benchmark runs are scheduled for baseline = a3b344d and contender = 0b8c003. 0b8c003 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #.
Rationale for this change
Using the ArrayFormatter is less code, more consistent, and is ~10% faster as it avoids intermediate string allocations
What changes are included in this PR?
Are there any user-facing changes?
This changes the formatting of Date64 from
1997-05-19 00:00:00
to1997-05-19T00:00:00