Provide `data_chunk_source` wrapper for `datasource` #11886
Codecov Report
Base: 87.40% // Head: 86.90% // Decreases project coverage

Additional details and impacted files

Coverage Diff | branch-22.12 | #11886 | +/- |
---|---|---|---|
Coverage | 87.40% | 86.90% | -0.50% |
Files | 133 | 133 | |
Lines | 21833 | 21977 | +144 |
Hits | 19084 | 19100 | +16 |
Misses | 2749 | 2877 | +128 |
if (_source->supports_device_read() && _source->is_device_read_preferred(read_size)) {
  _source->device_read_async(
    _offset, read_size, reinterpret_cast<uint8_t*>(chunk.data()), stream);
This is not in the scope of this PR, but should we unify the code to use a common type for `source->device_read`/`host_read` and `device_uvector_data_chunk`? Right now, one uses `uint8_t*`, the other uses `char*`.
Asking @vuule for his comments too.
I believe `char*` is used because it makes sense for text input. `datasource` does not make this assumption.
I would like to use `std::byte` for such "untyped" buffers. The problem is that `std::byte` is hard to use for anything that's not a bitwise operation. We could also templatize IO APIs to avoid casting the returned buffers. IMO it would be good to discuss this at some point (maybe a meeting with @upsj is here?).
FWIW, zlib's deflate routine also takes `unsigned char` as input, while the iostream API uses `char`.
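To make the `std::byte` trade-off from this thread concrete, here is a minimal C++17 sketch. The helper names (`as_bytes`, `low_nibble`, `count_byte`) are hypothetical and not part of cudf; the sketch only illustrates that bitwise operations on `std::byte` work directly, while comparisons against character values need an explicit `std::to_integer` conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: view a text buffer as untyped bytes.
// Inspecting object storage through std::byte, char, or unsigned char
// pointers is permitted by the aliasing rules.
inline std::byte const* as_bytes(std::string const& text)
{
  return reinterpret_cast<std::byte const*>(text.data());
}

// Bitwise operations are defined directly on std::byte.
inline std::byte low_nibble(std::byte b) { return b & std::byte{0x0F}; }

// Anything else (here: comparing against a delimiter character)
// needs an explicit conversion to an integer type.
inline std::size_t count_byte(std::byte const* data, std::size_t size, char value)
{
  assert(data != nullptr || size == 0);
  std::size_t count = 0;
  for (std::size_t i = 0; i < size; ++i) {
    // `data[i] == value` would not compile, because std::byte is a scoped enum.
    if (std::to_integer<unsigned char>(data[i]) == static_cast<unsigned char>(value)) { ++count; }
  }
  return count;
}
```

For example, `count_byte(as_bytes(text), text.size(), ',')` counts the commas in `text`. The extra conversion at every comparison is the friction the comment above refers to.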
 * @return the data chunk source for the provided datasource. It must not outlive the datasource
 * used to construct it.
 */
std::unique_ptr<data_chunk_source> make_source(datasource& data);
Should we document when this overload should be used? AFAIK, using `datasource` is often much slower than directly using an `istream_data_chunk_reader`.
Not sure - datasource can use kvikio/cuFile, which we don't yet use in the data_chunk_source. Should I add it to the benchmark to put some numbers to this question?
Yes, yes I should
T | size_approx | GPU Time | Encoded file size |
---|---|---|---|
file | 2^10 = 1024 | 44.450 ms | 999.000 B |
file | 2^12 = 4096 | 44.404 ms | 3.944 KiB |
file | 2^14 = 16384 | 44.271 ms | 15.718 KiB |
file | 2^16 = 65536 | 44.338 ms | 63.220 KiB |
file | 2^18 = 262144 | 44.442 ms | 251.584 KiB |
file | 2^20 = 1048576 | 44.780 ms | 1004.521 KiB |
file | 2^22 = 4194304 | 45.677 ms | 3.927 MiB |
file | 2^24 = 16777216 | 49.339 ms | 15.726 MiB |
file | 2^26 = 67108864 | 68.670 ms | 62.926 MiB |
file | 2^28 = 268435456 | 126.198 ms | 251.709 MiB |
file | 2^30 = 1073741824 | 357.786 ms | 1006.638 MiB |
file_datasource | 2^10 = 1024 | 1.063 ms | 999.000 B |
file_datasource | 2^12 = 4096 | 1.063 ms | 3.944 KiB |
file_datasource | 2^14 = 16384 | 1.072 ms | 15.718 KiB |
file_datasource | 2^16 = 65536 | 1.088 ms | 63.220 KiB |
file_datasource | 2^18 = 262144 | 1.147 ms | 251.584 KiB |
file_datasource | 2^20 = 1048576 | 1.587 ms | 1004.521 KiB |
file_datasource | 2^22 = 4194304 | 3.879 ms | 3.927 MiB |
file_datasource | 2^24 = 16777216 | 15.695 ms | 15.726 MiB |
file_datasource | 2^26 = 67108864 | 65.186 ms | 62.926 MiB |
file_datasource | 2^28 = 268435456 | 110.538 ms | 251.709 MiB |
file_datasource | 2^30 = 1073741824 | 288.055 ms | 1006.638 MiB |
man, that thread took an unexpected turn :D
For posterity: This is the difference between an `mmap`ped file and `ifstream`.
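As a rough illustration of the two read paths being compared, the sketch below reads the same file once through a buffered `std::ifstream` and once through a POSIX memory mapping. It assumes a POSIX system, the function names are hypothetical, and it is not the code that produced the benchmark numbers above; it only shows that both paths yield the same bytes while doing very different work underneath.

```cpp
#include <cassert>
#include <fcntl.h>
#include <fstream>
#include <iterator>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: read the whole file through a buffered std::ifstream.
inline std::string read_with_ifstream(std::string const& path)
{
  std::ifstream in(path, std::ios::binary);
  return std::string(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
}

// Hypothetical helper: memory-map the whole file (POSIX only) and copy it out.
// Returns an empty string on any failure or for an empty file.
inline std::string read_with_mmap(std::string const& path)
{
  int fd = ::open(path.c_str(), O_RDONLY);
  if (fd < 0) { return {}; }
  struct stat st {};
  if (::fstat(fd, &st) != 0 || st.st_size == 0) {
    ::close(fd);
    return {};
  }
  auto const size = static_cast<std::size_t>(st.st_size);
  void* mapping   = ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
  ::close(fd);  // the mapping stays valid after the descriptor is closed
  if (mapping == MAP_FAILED) { return {}; }
  std::string result(static_cast<char const*>(mapping), size);
  ::munmap(mapping, size);
  return result;
}
```

With a mapping, pages are faulted in on demand by the operating system instead of being copied through a stream buffer, which is one plausible reason for the gap at small sizes.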
A few questions and small nits
rerun tests
LGTM 👍
@gpucibot merge
Description
With `datasource` being more generic in its interface than `data_chunk_source`, this PR adds a wrapper that wraps a `datasource` in a `data_chunk_source` for use in `multibyte_split`. Its host read implementation is based on the file `data_chunk_source`.
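The adapter idea can be sketched without CUDA as follows. `mock_datasource` and `mock_chunk_reader` are hypothetical stand-ins, not the cudf classes: the first offers random-access reads by offset, the second turns them into sequential chunk reads by tracking an offset. This assumes a host-only source, so the device-read branch is present only to mirror the dispatch shown in the review and is never taken here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a datasource: random-access reads by offset.
class mock_datasource {
 public:
  explicit mock_datasource(std::string data) : _data(std::move(data)) {}
  std::size_t size() const { return _data.size(); }
  bool supports_device_read() const { return false; }  // host-only in this sketch
  bool is_device_read_preferred(std::size_t) const { return false; }
  // Copies up to `size` bytes starting at `offset` into `dst`; returns bytes copied.
  std::size_t host_read(std::size_t offset, std::size_t size, std::uint8_t* dst) const
  {
    if (offset >= _data.size()) { return 0; }
    auto const n = std::min(size, _data.size() - offset);
    std::memcpy(dst, _data.data() + offset, n);
    return n;
  }

 private:
  std::string _data;
};

// Hypothetical stand-in for a chunk reader: sequential reads over the source.
class mock_chunk_reader {
 public:
  explicit mock_chunk_reader(mock_datasource& source) : _source(&source) {}
  // Returns the next chunk of at most `read_size` bytes; empty at end of input.
  std::vector<char> get_next_chunk(std::size_t read_size)
  {
    read_size = std::min(read_size, _source->size() - _offset);
    std::vector<char> chunk(read_size);
    if (_source->supports_device_read() && _source->is_device_read_preferred(read_size)) {
      // A real implementation would read directly into device memory here.
      // Not exercised in this sketch because the mock source is host-only.
    } else if (read_size > 0) {
      // The cast mirrors the uint8_t* / char* mismatch discussed in the review.
      _source->host_read(_offset, read_size, reinterpret_cast<std::uint8_t*>(chunk.data()));
    }
    _offset += read_size;
    return chunk;
  }

 private:
  mock_datasource* _source;  // non-owning: the reader must not outlive the source
  std::size_t _offset = 0;
};
```

Holding a non-owning pointer keeps the wrapper cheap, at the cost of the lifetime requirement documented on `make_source`: the source has to stay alive for as long as the reader is used.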