Provide `data_chunk_source` wrapper for `datasource` #11886

upsj · 2022-10-10T15:16:06Z

Description

With datasource being more generic in its interface than data_chunk_source, this PR adds a wrapper that wraps a datasource in a data_chunk_source for use in multibyte_split. Its host read implementation is based on the file data_chunk_source

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

codecov · 2022-10-10T17:32:36Z

Codecov Report

Base: 87.40% // Head: 86.90% // Decreases project coverage by -0.49% ⚠️

Coverage data is based on head (b302e1a) compared to base (f72c4ce).
Patch has no changes to coverable lines.

❗ Current head b302e1a differs from pull request most recent head a267ce1. Consider uploading reports for the commit a267ce1 to get more accurate results

Additional details and impacted files

@@               Coverage Diff                @@
##           branch-22.12   #11886      +/-   ##
================================================
- Coverage         87.40%   86.90%   -0.50%     
================================================
  Files               133      133              
  Lines             21833    21977     +144     
================================================
+ Hits              19084    19100      +16     
- Misses             2749     2877     +128

Impacted Files	Coverage Δ
python/strings_udf/strings_udf/lowering.py	`0.00% <0.00%> (-84.40%)`	⬇️
python/strings_udf/strings_udf/_typing.py	`81.05% <0.00%> (-14.74%)`	⬇️
python/strings_udf/strings_udf/__init__.py	`84.31% <0.00%> (-12.57%)`	⬇️
python/cudf/cudf/io/text.py	`91.66% <0.00%> (-8.34%)`	⬇️
python/cudf/cudf/core/_base_index.py	`82.20% <0.00%> (-3.35%)`	⬇️
python/cudf/cudf/core/column/lists.py	`92.78% <0.00%> (-0.97%)`	⬇️
python/cudf/cudf/testing/dataset_generator.py	`72.83% <0.00%> (-0.42%)`	⬇️
python/dask_cudf/dask_cudf/core.py	`73.72% <0.00%> (-0.41%)`	⬇️
python/dask_cudf/dask_cudf/backends.py	`84.90% <0.00%> (-0.37%)`	⬇️
python/cudf/cudf/core/tools/datetimes.py	`84.19% <0.00%> (-0.31%)`	⬇️
... and 23 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

cpp/tests/io/text/data_chunk_source_test.cpp

karthikeyann · 2022-10-15T03:16:20Z

cpp/src/io/text/data_chunk_source_factories.cpp

+
+    if (_source->supports_device_read() && _source->is_device_read_preferred(read_size)) {
+      _source->device_read_async(
+        _offset, read_size, reinterpret_cast<uint8_t*>(chunk.data()), stream);


This is not in the scope of this PR. Should we unify the code use common type for source->device_read/host_read and device_uvector_data_chunk ? Right now, one uses uint8_t*, the other uses char*.
asking @vuule for his comments too.

I believe char* is used because it makes sense for text input. datasource does not make this assumption.
I would like to use std::byte for such "untyped" buffers. The problem is that std::byte is hard to use for anything that's not a bitwise operation. We could also templatize IO APIs to avoid casting the returned buffers. IMO it would be good to discuss this at some point (maybe a meeting with @upsj is here?).

FWIW, zlib's deflate routine also takes unsigned char as input, while the iostream API uses char

vuule · 2022-10-15T08:01:50Z

cpp/include/cudf/io/text/data_chunk_source_factories.hpp

+ * @return the data chunk source for the provided datasource. It must not outlive the datasource
+ *         used to construct it.
+ */
+std::unique_ptr<data_chunk_source> make_source(datasource& data);


Should we document when this overload should be used? AFAIK, using datasource is often much slower than directly using a istream_data_chunk_reader.

Not sure - datasource can use kvikio/cuFile, which we don't yet use in the data_chunk_source. Should I add it to the benchmark to put some numbers to this question?

Yes, yes I should

T size_approx GPU Time Encoded file size

file 2^10 = 1024 44.450 ms 999.000 B

file 2^12 = 4096 44.404 ms 3.944 KiB

file 2^14 = 16384 44.271 ms 15.718 KiB

file 2^16 = 65536 44.338 ms 63.220 KiB

file 2^18 = 262144 44.442 ms 251.584 KiB

file 2^20 = 1048576 44.780 ms 1004.521 KiB

file 2^22 = 4194304 45.677 ms 3.927 MiB

file 2^24 = 16777216 49.339 ms 15.726 MiB

file 2^26 = 67108864 68.670 ms 62.926 MiB

file 2^28 = 268435456 126.198 ms 251.709 MiB

file 2^30 = 1073741824 357.786 ms 1006.638 MiB

file_datasource 2^10 = 1024 1.063 ms 999.000 B

file_datasource 2^12 = 4096 1.063 ms 3.944 KiB

file_datasource 2^14 = 16384 1.072 ms 15.718 KiB

file_datasource 2^16 = 65536 1.088 ms 63.220 KiB

file_datasource 2^18 = 262144 1.147 ms 251.584 KiB

file_datasource 2^20 = 1048576 1.587 ms 1004.521 KiB

file_datasource 2^22 = 4194304 3.879 ms 3.927 MiB

file_datasource 2^24 = 16777216 15.695 ms 15.726 MiB

file_datasource 2^26 = 67108864 65.186 ms 62.926 MiB

file_datasource 2^28 = 268435456 110.538 ms 251.709 MiB

file_datasource 2^30 = 1073741824 288.055 ms 1006.638 MiB

man, that thread took an unexpected turn :D

For posterity: This is the difference between an mmapped file and ifstream

hyperbolic2346

A few questions and small nits

cpp/benchmarks/io/text/multibyte_split.cpp

cpp/src/io/text/data_chunk_source_factories.cpp

upsj · 2022-10-27T17:28:29Z

rerun tests

karthikeyann

LGTM 👍

vuule · 2022-10-27T20:20:05Z

@gpucibot merge

provide data_chunk_source wrapper for datasource

6863c8e

upsj added 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Oct 10, 2022

upsj requested a review from vuule October 10, 2022 15:16

upsj requested a review from a team as a code owner October 10, 2022 15:16

upsj self-assigned this Oct 10, 2022

upsj requested a review from karthikeyann October 10, 2022 15:16

karthikeyann requested changes Oct 15, 2022

View reviewed changes

vuule reviewed Oct 15, 2022

View reviewed changes

improve test content generation

16a38d5

upsj requested a review from karthikeyann October 26, 2022 11:33

add datasource to multibyte_split benchmark

6090626

upsj added cuIO cuIO issue 4 - Needs cuIO Reviewer and removed 3 - Ready for Review Ready for review by team labels Oct 26, 2022

hyperbolic2346 requested changes Oct 26, 2022

View reviewed changes

cpp/benchmarks/io/text/multibyte_split.cpp Show resolved Hide resolved

cpp/src/io/text/data_chunk_source_factories.cpp Outdated Show resolved Hide resolved

cpp/src/io/text/data_chunk_source_factories.cpp Outdated Show resolved Hide resolved

store ticket count as constexpr

a267ce1

hyperbolic2346 approved these changes Oct 27, 2022

View reviewed changes

karthikeyann approved these changes Oct 27, 2022

View reviewed changes

upsj added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Needs cuIO Reviewer labels Oct 27, 2022

rapids-bot bot merged commit 1b1ca7c into rapidsai:branch-22.12 Oct 27, 2022

karthikeyann mentioned this pull request Nov 16, 2022

byte_range support for JSON Lines format #12017

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Provide `data_chunk_source` wrapper for `datasource` #11886

Provide `data_chunk_source` wrapper for `datasource` #11886

upsj commented Oct 10, 2022

codecov bot commented Oct 10, 2022 •

edited

Loading

karthikeyann Oct 15, 2022 •

edited

Loading

vuule Oct 15, 2022

upsj Oct 15, 2022

vuule Oct 15, 2022

upsj Oct 26, 2022 •

edited

Loading

upsj Oct 26, 2022

vuule Oct 27, 2022

upsj Oct 27, 2022

hyperbolic2346 left a comment

upsj commented Oct 27, 2022

karthikeyann left a comment

vuule commented Oct 27, 2022

T	size_approx	GPU Time	Encoded file size
file	2^10 = 1024	44.450 ms	999.000 B
file	2^12 = 4096	44.404 ms	3.944 KiB
file	2^14 = 16384	44.271 ms	15.718 KiB
file	2^16 = 65536	44.338 ms	63.220 KiB
file	2^18 = 262144	44.442 ms	251.584 KiB
file	2^20 = 1048576	44.780 ms	1004.521 KiB
file	2^22 = 4194304	45.677 ms	3.927 MiB
file	2^24 = 16777216	49.339 ms	15.726 MiB
file	2^26 = 67108864	68.670 ms	62.926 MiB
file	2^28 = 268435456	126.198 ms	251.709 MiB
file	2^30 = 1073741824	357.786 ms	1006.638 MiB
file_datasource	2^10 = 1024	1.063 ms	999.000 B
file_datasource	2^12 = 4096	1.063 ms	3.944 KiB
file_datasource	2^14 = 16384	1.072 ms	15.718 KiB
file_datasource	2^16 = 65536	1.088 ms	63.220 KiB
file_datasource	2^18 = 262144	1.147 ms	251.584 KiB
file_datasource	2^20 = 1048576	1.587 ms	1004.521 KiB
file_datasource	2^22 = 4194304	3.879 ms	3.927 MiB
file_datasource	2^24 = 16777216	15.695 ms	15.726 MiB
file_datasource	2^26 = 67108864	65.186 ms	62.926 MiB
file_datasource	2^28 = 268435456	110.538 ms	251.709 MiB
file_datasource	2^30 = 1073741824	288.055 ms	1006.638 MiB

Provide data_chunk_source wrapper for datasource #11886

Provide data_chunk_source wrapper for datasource #11886

Conversation

upsj commented Oct 10, 2022

Description

Checklist

codecov bot commented Oct 10, 2022 • edited Loading

Codecov Report

karthikeyann Oct 15, 2022 • edited Loading

Choose a reason for hiding this comment

vuule Oct 15, 2022

Choose a reason for hiding this comment

upsj Oct 15, 2022

Choose a reason for hiding this comment

vuule Oct 15, 2022

Choose a reason for hiding this comment

upsj Oct 26, 2022 • edited Loading

Choose a reason for hiding this comment

upsj Oct 26, 2022

Choose a reason for hiding this comment

vuule Oct 27, 2022

Choose a reason for hiding this comment

upsj Oct 27, 2022

Choose a reason for hiding this comment

hyperbolic2346 left a comment

Choose a reason for hiding this comment

upsj commented Oct 27, 2022

karthikeyann left a comment

Choose a reason for hiding this comment

vuule commented Oct 27, 2022

Provide `data_chunk_source` wrapper for `datasource` #11886

Provide `data_chunk_source` wrapper for `datasource` #11886

codecov bot commented Oct 10, 2022 •

edited

Loading

karthikeyann Oct 15, 2022 •

edited

Loading

upsj Oct 26, 2022 •

edited

Loading