-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle multibyte_split
byte_range out-of-bounds offsets on host
#11885
Handle multibyte_split
byte_range out-of-bounds offsets on host
#11885
Conversation
Codecov ReportBase: 87.40% // Head: 88.11% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## branch-22.12 #11885 +/- ##
================================================
+ Coverage 87.40% 88.11% +0.70%
================================================
Files 133 133
Lines 21833 21881 +48
================================================
+ Hits 19084 19281 +197
+ Misses 2749 2600 -149
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unsure how I feel about changing something well defined like cutoff_offset
for int64_t
. I wonder if we shouldn't change it to a using like using cutoff_offset_type = int64_t
.
@hyperbolic2346 To give some context, I only introduced the cutoff_offset to handle byte ranges on the device side, but that turned out to be a little problematic to generalize for other parsers. In this PR, the offsets (which are offsets into the offset array, not offsets into the string) are no longer being cut off after the byte range, so I didn't want to use that name any more. |
I can understand that. My concern comes from how easy it is to notice a lack of |
Do you have a suggestion for a suitable name? I used the tongue-in-cheek name |
Everything I think of is either in use or isn't fitting. Offset of some sort. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple small suggestions, otherwise LGTM.
Co-authored-by: Bradley Dice <[email protected]>
using output_offset = int64_t; | ||
using byte_offset = int64_t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this? If you want to completely differentiate between these types, simple type aliasing like this is not enough. Instead, strong types should be prefered:
enum class output_offset : int64_t {};
enum class byte_offset : int64_t {};
Using strong types may be more difficult to do arithmetic operations but you can always cast the values to int64_t
when doing so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this approach has its shortcomings, I mainly wanted to make clear what different offsets point to, more avoiding raw integer types for readability than true type safety. I need to to enough arithmetic on the types that using an enum class would be really inconvenient. Something like BOOST_STRONG_TYPEDEF
would probably fit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both approaches have pros and cons. For this case, I think a simple using
is sufficient, with a weakly aliased type. It does pose some benefit to readability, and the need for arithmetic operators makes it difficult to use a strong type like an enum class
. I'm happy with the tradeoffs that @upsj took in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also: @ttnghia we need strong types for the left/right comparator indices (the last time I remember discussing this topic) to ensure that function signatures match the given types. That isn't as much of a concern here because the weakly aliased types here aren't being passed to functions with multiple overloads for plain integers and strong types.
@gpucibot merge |
Description
In order to uniformize the interface for a future combined handling of byte ranges between read_csv and read_text, this PR replaces the
cutoff_offset
by a plain integer again, and handles finding the first out-of-bounds on the host side instead.Checklist