byte_range support for multibyte_split/read_text #10150
CMake changes LGTM
A few outstanding items, but nothing blocking.
cpp/src/io/text/multibyte_split.cu
      return static_cast<int32_t>(offset - relevant_offset_first);
    });

  stream.synchronize();
I'm confused, did you mean to delete this?
@gpucibot merge
#10150 broke compiler support for GCC 11 (built locally) because it was missing `#include <optional>` in a couple of files. This fixes it. cc: @cwharris

Authors:
- Bradley Dice (https://github.com/bdice)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Karthikeyan (https://github.com/karthikeyann)

URL: #10385
This uses the `byte_range` argument added to `read_text` in #10150 to create a dask version of `read_text`.

Authors:
- https://github.com/ChrisJar

Approvers:
- Benjamin Zaitlen (https://github.com/quasiben)

URL: #10407
Adding byte_range support to multibyte_split/read_text.
Closes #9655
Providing a byte range in terms of `(offset, size)` allows multibyte_split to read a whole file, but only return the offsets within that range, as well as one additional offset (unless it's the end of the file). If thinking in terms of "records", where each delimiter dictates the end of a record, we effectively return all records which begin within the byte range provided and ignore all other records. A record which ends (but does not begin) within the range is ignored; a record which begins in the range but ends outside of it is returned in full.

Examples:
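To make the record semantics concrete, here is a minimal pure-Python sketch. It is not the cudf implementation and does not call any cudf API; the function name `split_records_in_byte_range` is hypothetical, and keeping the trailing delimiter on each record is an assumption of this sketch.

```python
from typing import List


def split_records_in_byte_range(
    data: bytes, delimiter: bytes, offset: int, size: int
) -> List[bytes]:
    """Return the records of `data` that BEGIN inside [offset, offset + size).

    Hypothetical helper for illustration only. A record starts at byte 0 or
    immediately after a delimiter, and ends with its delimiter (or at the end
    of the data). Each returned record keeps its trailing delimiter; that is
    an assumption of this sketch.
    """
    if not delimiter:
        raise ValueError("delimiter must be non-empty")

    records = []
    start = 0
    while start < len(data):
        found = data.find(delimiter, start)
        # The record ends just past its delimiter, or at the end of the data.
        end = len(data) if found == -1 else found + len(delimiter)
        # Only the record's first byte decides membership; where it ends
        # relative to the range does not matter.
        if offset <= start < offset + size:
            records.append(data[start:end])
        start = end
    return records


data = b"a,bb,ccc,dddd"  # records begin at bytes 0, 2, 5 and 9
first = split_records_in_byte_range(data, b",", 0, 3)
second = split_records_in_byte_range(data, b",", 3, 3)
third = split_records_in_byte_range(data, b",", 6, 7)
```

Under this rule, adjacent non-overlapping byte ranges that together cover the file yield every record exactly once, which is the property a partitioned reader would rely on.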