Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide data_chunk_source wrapper for datasource #11886

Merged

Conversation

upsj
Copy link
Contributor

@upsj upsj commented Oct 10, 2022

Description

With datasource being more generic in its interface than data_chunk_source, this PR adds a wrapper that wraps a datasource in a data_chunk_source for use in multibyte_split. Its host read implementation is based on the file data_chunk_source

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@upsj upsj added 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Oct 10, 2022
@upsj upsj requested a review from vuule October 10, 2022 15:16
@upsj upsj requested a review from a team as a code owner October 10, 2022 15:16
@upsj upsj self-assigned this Oct 10, 2022
@upsj upsj requested a review from karthikeyann October 10, 2022 15:16
@codecov
Copy link

codecov bot commented Oct 10, 2022

Codecov Report

Base: 87.40% // Head: 86.90% // Decreases project coverage by -0.49% ⚠️

Coverage data is based on head (b302e1a) compared to base (f72c4ce).
Patch has no changes to coverable lines.

❗ Current head b302e1a differs from pull request most recent head a267ce1. Consider uploading reports for the commit a267ce1 to get more accurate results

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #11886      +/-   ##
================================================
- Coverage         87.40%   86.90%   -0.50%     
================================================
  Files               133      133              
  Lines             21833    21977     +144     
================================================
+ Hits              19084    19100      +16     
- Misses             2749     2877     +128     
Impacted Files Coverage Δ
python/strings_udf/strings_udf/lowering.py 0.00% <0.00%> (-84.40%) ⬇️
python/strings_udf/strings_udf/_typing.py 81.05% <0.00%> (-14.74%) ⬇️
python/strings_udf/strings_udf/__init__.py 84.31% <0.00%> (-12.57%) ⬇️
python/cudf/cudf/io/text.py 91.66% <0.00%> (-8.34%) ⬇️
python/cudf/cudf/core/_base_index.py 82.20% <0.00%> (-3.35%) ⬇️
python/cudf/cudf/core/column/lists.py 92.78% <0.00%> (-0.97%) ⬇️
python/cudf/cudf/testing/dataset_generator.py 72.83% <0.00%> (-0.42%) ⬇️
python/dask_cudf/dask_cudf/core.py 73.72% <0.00%> (-0.41%) ⬇️
python/dask_cudf/dask_cudf/backends.py 84.90% <0.00%> (-0.37%) ⬇️
python/cudf/cudf/core/tools/datetimes.py 84.19% <0.00%> (-0.31%) ⬇️
... and 23 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

cpp/tests/io/text/data_chunk_source_test.cpp Show resolved Hide resolved
cpp/tests/io/text/data_chunk_source_test.cpp Outdated Show resolved Hide resolved

if (_source->supports_device_read() && _source->is_device_read_preferred(read_size)) {
_source->device_read_async(
_offset, read_size, reinterpret_cast<uint8_t*>(chunk.data()), stream);
Copy link
Contributor

@karthikeyann karthikeyann Oct 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not in the scope of this PR. Should we unify the code use common type for source->device_read/host_read and device_uvector_data_chunk ? Right now, one uses uint8_t*, the other uses char*.
asking @vuule for his comments too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe char* is used because it makes sense for text input. datasource does not make this assumption.
I would like to use std::byte for such "untyped" buffers. The problem is that std::byte is hard to use for anything that's not a bitwise operation. We could also templatize IO APIs to avoid casting the returned buffers. IMO it would be good to discuss this at some point (maybe a meeting with @upsj is here?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, zlib's deflate routine also takes unsigned char as input, while the iostream API uses char

* @return the data chunk source for the provided datasource. It must not outlive the datasource
* used to construct it.
*/
std::unique_ptr<data_chunk_source> make_source(datasource& data);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we document when this overload should be used? AFAIK, using datasource is often much slower than directly using a istream_data_chunk_reader.

Copy link
Contributor Author

@upsj upsj Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure - datasource can use kvikio/cuFile, which we don't yet use in the data_chunk_source. Should I add it to the benchmark to put some numbers to this question?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, yes I should

T size_approx GPU Time Encoded file size
file 2^10 = 1024 44.450 ms 999.000 B
file 2^12 = 4096 44.404 ms 3.944 KiB
file 2^14 = 16384 44.271 ms 15.718 KiB
file 2^16 = 65536 44.338 ms 63.220 KiB
file 2^18 = 262144 44.442 ms 251.584 KiB
file 2^20 = 1048576 44.780 ms 1004.521 KiB
file 2^22 = 4194304 45.677 ms 3.927 MiB
file 2^24 = 16777216 49.339 ms 15.726 MiB
file 2^26 = 67108864 68.670 ms 62.926 MiB
file 2^28 = 268435456 126.198 ms 251.709 MiB
file 2^30 = 1073741824 357.786 ms 1006.638 MiB
file_datasource 2^10 = 1024 1.063 ms 999.000 B
file_datasource 2^12 = 4096 1.063 ms 3.944 KiB
file_datasource 2^14 = 16384 1.072 ms 15.718 KiB
file_datasource 2^16 = 65536 1.088 ms 63.220 KiB
file_datasource 2^18 = 262144 1.147 ms 251.584 KiB
file_datasource 2^20 = 1048576 1.587 ms 1004.521 KiB
file_datasource 2^22 = 4194304 3.879 ms 3.927 MiB
file_datasource 2^24 = 16777216 15.695 ms 15.726 MiB
file_datasource 2^26 = 67108864 65.186 ms 62.926 MiB
file_datasource 2^28 = 268435456 110.538 ms 251.709 MiB
file_datasource 2^30 = 1073741824 288.055 ms 1006.638 MiB

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

man, that thread took an unexpected turn :D

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For posterity: This is the difference between an mmapped file and ifstream

@upsj upsj requested a review from karthikeyann October 26, 2022 11:33
@upsj upsj added cuIO cuIO issue 4 - Needs cuIO Reviewer and removed 3 - Ready for Review Ready for review by team labels Oct 26, 2022
Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions and small nits

cpp/benchmarks/io/text/multibyte_split.cpp Show resolved Hide resolved
cpp/src/io/text/data_chunk_source_factories.cpp Outdated Show resolved Hide resolved
cpp/src/io/text/data_chunk_source_factories.cpp Outdated Show resolved Hide resolved
@upsj
Copy link
Contributor Author

upsj commented Oct 27, 2022

rerun tests

Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@upsj upsj added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Needs cuIO Reviewer labels Oct 27, 2022
@vuule
Copy link
Contributor

vuule commented Oct 27, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 1b1ca7c into rapidsai:branch-22.12 Oct 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants