Provide `data_chunk_source` wrapper for `datasource` #11886
Should we document when this overload should be used? AFAIK, using `datasource` is often much slower than directly using an `istream_data_chunk_reader`.
Not sure - datasource can use kvikio/cuFile, which we don't yet use in the data_chunk_source. Should I add it to the benchmark to put some numbers to this question?
Yes, yes I should
man, that thread took an unexpected turn :D
For posterity: This is the difference between an `mmap`ped file and `ifstream`
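To make the two access paths in that comparison concrete, here is a minimal sketch, not the cudf implementation. The function names `read_with_ifstream` and `read_with_mmap` are hypothetical, the `mmap` path assumes a POSIX system, and the sketch says nothing about which path is faster; that is what the benchmark is for.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, munmap
#include <sys/stat.h>  // fstat
#include <unistd.h>    // close

// Hypothetical helper: read the whole file through an std::ifstream.
// The bytes pass through the stream's internal buffer before reaching the result.
std::string read_with_ifstream(std::string const& path)
{
  std::ifstream in(path, std::ios::binary);
  if (!in) { throw std::runtime_error("cannot open " + path); }
  return std::string(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
}

// Hypothetical helper: map the file into memory and copy from the mapping.
// POSIX only; the kernel pages the file in on first access.
std::string read_with_mmap(std::string const& path)
{
  int const fd = ::open(path.c_str(), O_RDONLY);
  if (fd < 0) { throw std::runtime_error("cannot open " + path); }

  struct stat st {};
  if (::fstat(fd, &st) != 0) {
    ::close(fd);
    throw std::runtime_error("cannot stat " + path);
  }

  auto const size = static_cast<std::size_t>(st.st_size);
  std::string out;
  if (size > 0) {  // mmap rejects zero-length mappings
    void* addr = ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) {
      ::close(fd);
      throw std::runtime_error("cannot mmap " + path);
    }
    out.assign(static_cast<char const*>(addr), size);
    ::munmap(addr, size);
  }
  ::close(fd);
  return out;
}
```

Both helpers return the same bytes; they differ only in how the data travels from the file to user memory.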
This is not in the scope of this PR. Should we unify the code to use a common type for source->device_read/host_read and `device_uvector_data_chunk`? Right now, one uses `uint8_t*`, the other uses `char*`. Asking @vuule for his comments too.
I believe `char*` is used because it makes sense for text input. `datasource` does not make this assumption.
I would like to use `std::byte` for such "untyped" buffers. The problem is that `std::byte` is hard to use for anything that's not a bitwise operation. We could also templatize IO APIs to avoid casting the returned buffers. IMO it would be good to discuss this at some point (maybe a meeting with @upsj is here?).
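A small sketch of the trade-off described above, assuming C++17. All three function names are hypothetical and none of this is cudf API: `std::byte` supports bitwise operators directly, anything character-like needs `std::to_integer`, and a templated function accepts `char` and `uint8_t` buffers without a cast at the call site.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Bitwise operations are defined for std::byte, so this compiles as-is.
// Something like `b + std::byte{1}` would not compile.
std::byte mask_low_nibble(std::byte b) { return b & std::byte{0x0F}; }

// Hypothetical text-oriented consumer of an "untyped" buffer: every element
// has to be converted before it can be compared with a character.
std::size_t count_newlines(std::byte const* data, std::size_t size)
{
  std::size_t count = 0;
  for (std::size_t i = 0; i < size; ++i) {
    if (std::to_integer<char>(data[i]) == '\n') { ++count; }
  }
  return count;
}

// Hypothetical templated alternative: the element type is a parameter, so
// callers holding char* or uint8_t* pass their buffers unchanged.
template <typename CharT>
std::size_t count_newlines_typed(CharT const* data, std::size_t size)
{
  static_assert(sizeof(CharT) == 1, "expects a single-byte element type");
  std::size_t count = 0;
  for (std::size_t i = 0; i < size; ++i) {
    if (static_cast<unsigned char>(data[i]) == '\n') { ++count; }
  }
  return count;
}
```

The `std::byte` version keeps the buffer honest about being untyped at the cost of a conversion per element in text code; the template moves the choice of element type to the caller.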
FWIW, zlib's deflate routine also takes `unsigned char` as input, while the iostream API uses `char`
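The mismatch shows up as a cast wherever the two conventions meet. A minimal sketch, where `byte_sum` is a hypothetical stand-in for an `unsigned char`-based consumer (it is not a zlib function) and `read_chunk` and `sum_chunk` are likewise made up for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical consumer that takes unsigned char input.
std::uint32_t byte_sum(unsigned char const* data, std::size_t size)
{
  std::uint32_t sum = 0;
  for (std::size_t i = 0; i < size; ++i) { sum += data[i]; }
  return sum;
}

// std::istream::read writes into a char buffer, so the chunk is stored as char.
std::vector<char> read_chunk(std::istream& in, std::size_t max_size)
{
  std::vector<char> buffer(max_size);
  in.read(buffer.data(), static_cast<std::streamsize>(max_size));
  buffer.resize(static_cast<std::size_t>(in.gcount()));  // keep only the bytes actually read
  return buffer;
}

// The bridge between the two conventions is a reinterpret_cast at the boundary.
std::uint32_t sum_chunk(std::istream& in, std::size_t max_size)
{
  auto const chunk = read_chunk(in, max_size);
  return byte_sum(reinterpret_cast<unsigned char const*>(chunk.data()), chunk.size());
}
```

Reading `char` data through an `unsigned char` pointer is allowed by the aliasing rules, so the cast is safe, just noisy.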