Use ArrayFormatter in Cast Kernel #3668

Merged: 4 commits, Feb 9, 2023

tustvold (Contributor) commented Feb 6, 2023

Which issue does this PR close?

Closes #.

Rationale for this change

Using the ArrayFormatter is less code, more consistent, and ~10% faster, as it avoids intermediate string allocations:

cast i64 to string 512  time:   [10.723 µs 10.728 µs 10.733 µs]
                        change: [-11.208% -11.131% -11.052%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

cast f32 to string 512  time:   [16.210 µs 16.222 µs 16.233 µs]
                        change: [-7.7917% -7.6542% -7.5083%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
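
For readers who want to see where the saving comes from, here is a minimal std-only sketch. It does not use the arrow crates: `Packed`, `cast_with_temporaries`, and `cast_direct` are hypothetical names invented for illustration, and the layout (one contiguous buffer plus offsets) only approximates a string array. The first function allocates a temporary `String` per element, the second formats straight into the shared buffer.

```rust
use std::fmt::Write;

/// Hypothetical stand-in for a string array: one contiguous buffer plus offsets.
pub struct Packed {
    pub values: String,
    pub offsets: Vec<usize>,
}

impl Packed {
    /// Return the i-th value as a slice of the shared buffer.
    pub fn value(&self, i: usize) -> &str {
        &self.values[self.offsets[i]..self.offsets[i + 1]]
    }
}

/// Baseline: build a temporary `String` per element, then copy it in.
pub fn cast_with_temporaries(input: &[i64]) -> Packed {
    let mut values = String::new();
    let mut offsets = vec![0];
    for v in input {
        let tmp = v.to_string(); // one heap allocation per element
        values.push_str(&tmp);
        offsets.push(values.len());
    }
    Packed { values, offsets }
}

/// Direct: format each element straight into the shared buffer.
pub fn cast_direct(input: &[i64]) -> Packed {
    let mut values = String::new();
    let mut offsets = vec![0];
    for v in input {
        // `write!` appends to `values` without an intermediate String.
        write!(values, "{}", v).expect("writing to a String cannot fail");
        offsets.push(values.len());
    }
    Packed { values, offsets }
}
```

Both functions produce the same buffer and offsets; only the number of allocations differs.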

What changes are included in this PR?

Are there any user-facing changes?

This changes the formatting of Date64 values cast to strings from 1997-05-19 00:00:00 to 1997-05-19T00:00:00.
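
To make the difference concrete, here is a rough std-only sketch of formatting a Date64 value (milliseconds since the Unix epoch) both ways. This is not the arrow implementation: `format_date64` and its `sep` parameter are hypothetical, the date arithmetic is Howard Hinnant's `civil_from_days` algorithm, and the sketch simply drops any sub-second part.

```rust
/// Convert days since 1970-01-01 to (year, month, day) in the proleptic
/// Gregorian calendar (Howard Hinnant's `civil_from_days` algorithm).
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468; // shift epoch to 0000-03-01
    let era = z.div_euclid(146_097); // 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, [0, 146096]
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // [0, 399]
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year from March 1
    let mp = (5 * doy + 2) / 153; // month index from March, [0, 11]
    let d = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let m = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    let y = yoe + era * 400 + if m <= 2 { 1 } else { 0 };
    (y, m, d)
}

/// Format milliseconds since the epoch as `YYYY-MM-DD<sep>HH:MM:SS`.
/// `sep` is a hypothetical knob for this sketch: ' ' gives the old
/// output, 'T' gives the new one. Sub-second precision is discarded.
pub fn format_date64(millis: i64, sep: char) -> String {
    let secs = millis.div_euclid(1_000);
    let days = secs.div_euclid(86_400);
    let sod = secs.rem_euclid(86_400); // seconds of day
    let (y, m, d) = civil_from_days(days);
    format!(
        "{:04}-{:02}-{:02}{}{:02}:{:02}:{:02}",
        y, m, d, sep, sod / 3_600, sod % 3_600 / 60, sod % 60
    )
}
```

For example, `format_date64(864_000_000_000, ' ')` yields the old form and `format_date64(864_000_000_000, 'T')` the new one.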

@github-actions github-actions bot added arrow Changes to the arrow crate parquet Changes to the parquet crate labels Feb 6, 2023
@tustvold tustvold force-pushed the use-array-formatter-cast branch from 131d030 to 10c9b5c Compare February 6, 2023 13:37
@tustvold tustvold force-pushed the use-array-formatter-cast branch from 10c9b5c to 2de1fb5 Compare February 8, 2023 17:00
@github-actions github-actions bot removed the parquet Changes to the parquet crate label Feb 8, 2023
@tustvold tustvold added the api-change Changes to the arrow API label Feb 8, 2023
@tustvold tustvold marked this pull request as ready for review February 8, 2023 17:01
@tustvold tustvold marked this pull request as draft February 8, 2023 17:28
/// Helper function to cast from `GenericBinaryArray` to `GenericStringArray`. This function performs
/// UTF8 validation during casting. For invalid UTF8 value, it could be Null or returning `Err` depending
/// `CastOptions`.
fn cast_binary_to_generic_string<I, O>(
Contributor Author

I implemented this by converting the offsets and data separately, the performance is the same
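
A minimal sketch of what "offsets and data separately" could look like, using plain slices rather than arrow buffers. The function name and signature are hypothetical, null handling and `CastOptions` are left out, and it assumes the offsets are monotonically increasing and span the whole data buffer.

```rust
/// Hypothetical sketch: convert a binary array (i32 offsets + bytes) into a
/// large-string layout (i64 offsets + UTF-8 text).
/// Assumes offsets are sorted and cover `data`; nulls are ignored here.
pub fn binary_to_large_string(
    offsets: &[i32],
    data: &[u8],
) -> Result<(Vec<i64>, String), String> {
    // Data: validate the whole buffer once instead of value by value.
    let text = std::str::from_utf8(data).map_err(|e| format!("invalid UTF-8: {}", e))?;

    // The buffer can be valid as a whole while a value boundary splits a
    // multi-byte character, so each offset must land on a char boundary.
    // (`is_char_boundary` also returns false for out-of-range indices.)
    for &o in offsets {
        if !text.is_char_boundary(o as usize) {
            return Err(format!("offset {} splits a UTF-8 character", o));
        }
    }

    // Offsets: widen i32 -> i64 independently of the data.
    let wide = offsets.iter().map(|&o| i64::from(o)).collect();
    Ok((wide, text.to_owned()))
}
```

The design point is that neither step needs to materialize individual values: the data is checked in one pass and the offsets are converted in another.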

@tustvold tustvold marked this pull request as ready for review February 8, 2023 18:33
Contributor

@alamb alamb left a comment

I had some questions, but overall it is pretty sweet to see the performance improve with less code 👍

@@ -155,13 +154,12 @@ pub fn can_cast_types(from_type: &DataType, to_type: &DataType) -> bool {
(_, Boolean) => DataType::is_numeric(from_type) || from_type == &Utf8 || from_type == &LargeUtf8,
(Boolean, _) => DataType::is_numeric(to_type) || to_type == &Utf8 || to_type == &LargeUtf8,

(Utf8, LargeUtf8) => true,
Contributor

For anyone else reviewing: these cases got folded into the other Utf8 and LargeUtf8 cases.

@@ -182,11 +181,8 @@ pub fn can_cast_types(from_type: &DataType, to_type: &DataType) -> bool {
| Time64(TimeUnit::Nanosecond)
| Timestamp(TimeUnit::Nanosecond, None)
) => true,
(LargeUtf8, _) => DataType::is_numeric(to_type) && to_type != &Float16,
(Timestamp(_, _), Utf8) | (Timestamp(_, _), LargeUtf8) => true,
Contributor

What is the rationale for this change? Timestamp nanosecond with no timezone seems to be covered by the cases above, but timestamps with other units or a non-null timezone are not (obviously) covered.

Contributor Author

They're rolled into

        (_, Utf8 | LargeUtf8) => from_type.is_primitive(),

A few lines down
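
As a rough illustration of how a single catch-all arm can absorb the per-type arms, here is a sketch over a heavily reduced, hypothetical `Ty` enum. It is not arrow's `DataType`, the `is_primitive` definition is an assumption made for this example, and the explicit string-to-string arm is a guess at how those casts are kept.

```rust
/// Hypothetical, heavily reduced stand-in for arrow's `DataType`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Ty {
    Int64,
    Float32,
    Date64,
    Timestamp,
    Utf8,
    LargeUtf8,
    List,
}

impl Ty {
    /// Assumption for this sketch: everything except strings and lists
    /// counts as primitive.
    pub fn is_primitive(self) -> bool {
        !matches!(self, Ty::Utf8 | Ty::LargeUtf8 | Ty::List)
    }
}

pub fn can_cast(from: Ty, to: Ty) -> bool {
    use Ty::*;
    match (from, to) {
        // String-to-string casts stay explicit in this sketch.
        (Utf8 | LargeUtf8, Utf8 | LargeUtf8) => true,
        // One arm replaces per-type arms such as
        // `(Timestamp, Utf8) | (Timestamp, LargeUtf8) => true`.
        (_, Utf8 | LargeUtf8) => from.is_primitive(),
        _ => false,
    }
}
```

With this shape, adding a new primitive type needs no new match arm for the string targets.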

array: &dyn Array,
) -> Result<ArrayRef, ArrowError> {
let mut builder = GenericStringBuilder::<O>::new();
let options = FormatOptions::default();
Contributor

Eventually we could even pass FormatOptions in to the cast kernel (definitely not in this PR)!

arrow-cast/src/cast.rs
@@ -5521,8 +5237,8 @@ mod tests {
let b = cast(&array, &DataType::Utf8).unwrap();
let c = b.as_any().downcast_ref::<StringArray>().unwrap();
assert_eq!(&DataType::Utf8, c.data_type());
-assert_eq!("1997-05-19 00:00:00", c.value(0));
-assert_eq!("2018-12-25 00:00:00", c.value(1));
+assert_eq!("1997-05-19T00:00:00", c.value(0));
Contributor

Why did these change? This seems like a potentially non-trivial change for downstream crates, right? Is there some way to get the old behavior back (like maybe passing FormatOptions to the cast kernel)? 🤔

Contributor Author

They changed to RFC 3339. Given we've made the same change to CSV, JSON, and pretty output, and this is the last holdout, I figured I'd just change it.

@tustvold tustvold merged commit 0b8c003 into apache:master Feb 9, 2023
ursabot commented Feb 9, 2023

Benchmark runs are scheduled for baseline = a3b344d and contender = 0b8c003. 0b8c003 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Labels
api-change Changes to the arrow API arrow Changes to the arrow crate
3 participants