byte_range support for multibyte_split/read_text #10150

cwharris · 2022-01-27T19:25:30Z

Adding byte_range support to multibyte_split/read_text.

Providing a byte range as (offset, size) lets multibyte_split read a whole file but return only the offsets that fall within that range, plus one additional offset (unless the range reaches the end of the file). In terms of "records", where each delimiter marks the end of a record, this returns every record that begins within the byte range and ignores all others. A record that ends (but does not begin) within the range is ignored. A record that begins within the range but ends outside of it is returned in full.

examples:

input: "abc..def..ghi..jkl.."
delimiter: ..

range offset: 0
range size: 2
output: ["abc.."]

range offset: 2
range size: 9
output: ["def..", "ghi.."]

range offset: 11
range size: 2
output: []

range offset: 13
range size: 7
output: ["jkl..", ""]
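
The record-selection rule illustrated by these examples can be modeled in plain Python. This is a minimal sketch, not the libcudf implementation: `split_records` and `records_in_byte_range` are hypothetical helper names, and the treatment of the trailing (possibly empty) record, which is counted whenever the range reaches the end of the input, is an assumption chosen to match the last example above.

```python
from typing import List, Tuple


def split_records(data: bytes, delimiter: bytes) -> List[Tuple[int, bytes]]:
    """Return (start_offset, record) pairs; each record keeps its trailing delimiter."""
    records = []
    start = 0
    while True:
        hit = data.find(delimiter, start)
        if hit == -1:
            # Whatever follows the last delimiter is the final record (may be empty).
            records.append((start, data[start:]))
            return records
        end = hit + len(delimiter)
        records.append((start, data[start:end]))
        start = end


def records_in_byte_range(
    data: bytes, delimiter: bytes, offset: int, size: int
) -> List[bytes]:
    """Keep only the records whose first byte lies in [offset, offset + size)."""
    end = offset + size
    reaches_eof = end >= len(data)
    selected = []
    for start, record in split_records(data, delimiter):
        begins_in_range = offset <= start < end
        # Assumption (not confirmed by the PR text): the record that starts exactly
        # at the end of the input is included when the range reaches that end.
        trailing_at_eof = reaches_eof and offset <= start == len(data)
        if begins_in_range or trailing_at_eof:
            selected.append(record)
    return selected


if __name__ == "__main__":
    data = b"abc..def..ghi..jkl.."
    for offset, size in [(0, 2), (2, 9), (11, 2), (13, 7)]:
        print(offset, size, records_in_byte_range(data, b"..", offset, size))
```

Under this model, adjacent byte ranges that tile the input return each record exactly once, because a record belongs only to the range that contains its first byte.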

cpp/include/cudf/io/text/multibyte_split.hpp

robertmaynard

CMake changes LGTM

cpp/include/cudf/io/text/data_chunk_source.hpp

cpp/include/cudf/io/text/byte_range_info.hpp

cpp/include/cudf/io/text/data_chunk_source_factories.hpp

cpp/tests/io/text/multibyte_split_test.cpp

python/cudf/cudf/tests/test_text.py

cpp/include/cudf/io/text/byte_range_info.hpp

vyasr

A few outstanding items, but nothing blocking.

cpp/src/io/text/byte_range_info.cpp

vyasr · 2022-03-01T18:39:59Z

cpp/src/io/text/multibyte_split.cu

+                      return static_cast<int32_t>(offset - relevant_offset_first);
+                    });
+
+  stream.synchronize();


I'm confused, did you mean to delete this?

python/cudf/cudf/tests/test_text.py

cwharris · 2022-03-01T19:10:23Z

@gpucibot merge

@cwharris

#10150 broke compiler support for GCC 11 (built locally) because it was missing `#include <optional>` in a couple files. This fixes it.

cc: @cwharris

Authors:
- Bradley Dice (https://github.com/bdice)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Karthikeyan (https://github.com/karthikeyann)

URL: #10385

This uses the `byte_range` argument added to read_text with #10150 to create a dask version of `read_text`.

Authors:
- https://github.com/ChrisJar

Approvers:
- Benjamin Zaitlen (https://github.com/quasiben)

URL: #10407

cwharris added 11 commits January 12, 2022 15:48

byte_range support for multibyte_split, almost working

a220545

multibyte_split byte_range support

53a6416

Merge branch 'branch-22.02' of github.com:rapidsai/cudf into io-mbspl…

5f7435c

…it-byte_range

multibyte_split byte_range support in python read_text api

b7022d9

fix read_text when byte_range=None

5142593

Merge branch 'branch-22.04' of github.com:rapidsai/cudf into io-mbspl…

8fa93bf

…it-byte_range

add a test for read_text byte range

03ff0c4

Merge branch 'branch-22.04' of github.com:rapidsai/cudf into io-mbspl…

258259d

…it-byte_range

fix read_text byte range test

5610855

simplify read_text byte_range python test

769864d

Merge branch 'branch-22.04' of github.com:rapidsai/cudf into io-mbspl…

73823f6

…it-byte_range

cwharris requested review from a team as code owners January 27, 2022 19:25

cwharris requested review from karthikeyann, davidwendt, galipremsagar and charlesbluca January 27, 2022 19:25

github-actions bot added the Python and libcudf labels Jan 27, 2022

cwharris assigned randerzander and unassigned randerzander Jan 27, 2022

cwharris requested a review from davidwendt February 24, 2022 16:55

cwharris added 3 commits February 24, 2022 11:16

fix formatting issues

3f6f27c

Merge branch 'branch-22.04' of github.com:rapidsai/cudf into io-mbspl…

b9e3e66

…it-byte_range

Merge branch 'branch-22.04' of github.com:rapidsai/cudf into io-mbspl…

ba71127

…it-byte_range

davidwendt reviewed Feb 28, 2022

cpp/include/cudf/io/text/multibyte_split.hpp (outdated)

cwharris added 2 commits February 28, 2022 09:34

clearify multibyte_split comment and add examples

ee6932e

fix gramatically error

2b945a0

cwharris requested a review from davidwendt February 28, 2022 15:36

add doxygen wrapper around pseudo code

ef0fa38

davidwendt approved these changes Feb 28, 2022

robertmaynard approved these changes Feb 28, 2022

vyasr requested changes Feb 28, 2022

vyasr reviewed Feb 28, 2022

cpp/include/cudf/io/text/byte_range_info.hpp (outdated)

cwharris added 3 commits February 28, 2022 14:29

make byte_range_info a class, address pr comments

cbab111

Merge branch 'branch-22.04' of github.com:rapidsai/cudf into io-mbspl…

0d32bbe

…it-byte_range

address pr feedback

0f52185

cwharris requested a review from vyasr February 28, 2022 21:01

vyasr approved these changes Mar 1, 2022

cwharris added 2 commits March 1, 2022 12:47

remove unnecessary synchronize, use emplace_back instead of push_back

ada0013

Merge branch 'branch-22.04' of github.com:rapidsai/cudf into io-mbspl…

74647f2

…it-byte_range

rapids-bot bot merged commit 78b316c into rapidsai:branch-22.04 Mar 1, 2022

This was referenced Mar 1, 2022

Fix floating point data generation in benchmarks #10372

Merged

Include <optional> in multibyte split. #10385

Merged

ChrisJar mentioned this pull request Mar 10, 2022

Enable read_text with dask_cudf using byte_range #10407

Merged

shwina mentioned this pull request Mar 23, 2022

[DOC] RAPIDS 22.04 Release Blog Outline #10383

Closed
