-
Notifications
You must be signed in to change notification settings - Fork 855
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Casting generic binary to generic string #3607
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could do this in two passes
- Convert offsets
- Convert to string
This will be significantly faster than the approach here as it doesn't copy any string data and performs UTF-8 validation in a single pass (not to mention less code)
arrow-cast/src/cast.rs
Outdated
.iter() | ||
.map(|maybe_value| match maybe_value { | ||
Some(value) => { | ||
let result = std::str::from_utf8(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The performance of this will be pretty poor, it would be better to cast the offsets and then use the optimised routine for UTF-8 validation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be significantly faster than the approach here as it doesn't copy any string data and performs UTF-8 validation in a single pass (not to mention less code)
This can only be applied if CastOptions.safe
as false. If CastOptions.safe
as true, we still need to iterate and validate each value (because it will be null instead returning Err
directly).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I keep forgetting about that option, it seems to be some thing we keep for legacy compatibility
23acc63
to
39aacbe
Compare
39aacbe
to
124205b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, I still wonder if there is some way to avoid re-encoding the offsets when not necessary - e.g. the common case of Binary to Utf8. Perhaps by separating offset conversion from value conversion, possibly something for a future PR
let offsets = array.value_offsets(); | ||
let values = array.value_data(); | ||
|
||
// We only need to validate that all values are valid UTF-8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems a shame to duplicate this logic, but I guess it is hard to avoid whilst having this type signature
Benchmark runs are scheduled for baseline = eeecbe5 and contender = 8cc8327. 8cc8327 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #3606.
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?