-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-42247: [C++] Support casting to and from utf8_view/binary_view #43302
Conversation
BinaryViewType::kPrefixSize); | ||
// out_view.ref.buffer_index = 0; | ||
out_view.ref.offset = static_cast<int32_t>(data_offset); | ||
// TODO(felipecrv): validate data_offsets can't overflow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a TODO for this PR? Otherwise, perhaps create a GH issue for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixing it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mapleFU this one needs to be fixed with the same fix I added in line 477 // Check against offset overflow
. I forgot that there were two places with this TODO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! memset
to 0 really handle some tricky problem in protocol layer, thanks for your effort!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, just a few nits
7d5a377
to
7d1f438
Compare
InitializeUTF8(); | ||
ArraySpanVisitor<I> visitor; | ||
Utf8Validator validator; | ||
RETURN_NOT_OK(visitor.Visit(input, &validator)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Unrelated to this pr: What reminds me is the utf8 checking in arrow-rs, maybe we can use same algorithm? apache/arrow-rs#6009 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds like a good [Parquet] issue to open
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I think I see what you mean: we could similarly assemble larger contiguous byte ranges on which we run a single Utf8 validation pass.
For the common case of views whose out-of-line data directly follows the previous out-of-line bytes, this would yield one long byte range for Utf8 validation.
Inline strings would also always be valid Utf8 since their size would consist of 3 zero bytes and one small byte plus the inline data and padding zero bytes, so we could validate on runs of inline views too.
02d7467
to
2416d19
Compare
@github-actions crossbow submit -g cpp |
Revision: 2416d19 Submitted crossbow builds: ursacomputing/crossbow @ actions-aaf4a45dfc |
CI failures are unrelated, I'll merge |
Thanks! |
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 85fc3eb. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 97 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…ew (apache#43302) ### Rationale for this change We need casts between string (binary) and string-view (binary-view) types since they are semantically equivalent. ### What changes are included in this PR? - Add `is_binary_view_like()` type predicate - Add `BinaryViewTypes()` list including `STRING_VIEW/BINARY_VIEW` - New cast kernels ### Are these changes tested? Yes, but test coverage might be improved. ### Are there any user-facing changes? More casts are available. * GitHub Issue: apache#42247 Lead-authored-by: Felipe Oliveira Carvalho <[email protected]> Co-authored-by: mwish <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Rationale for this change
We need casts between string (binary) and string-view (binary-view) types since they are semantically equivalent.
What changes are included in this PR?
is_binary_view_like()
type predicateBinaryViewTypes()
list includingSTRING_VIEW/BINARY_VIEW
Are these changes tested?
Yes, but test coverage might be improved.
Are there any user-facing changes?
More casts are available.