byte_range support for multibyte_split/read_text #10150

Merged 45 commits on Mar 1, 2022
Changes from 40 commits
Commits
a220545
byte_range support for multibyte_split, almost working
cwharris Jan 12, 2022
53a6416
multibyte_split byte_range support
cwharris Jan 13, 2022
5f7435c
Merge branch 'branch-22.02' of github.com:rapidsai/cudf into io-mbspl…
cwharris Jan 13, 2022
b7022d9
multibyte_split byte_range support in python read_text api
cwharris Jan 15, 2022
5142593
fix read_text when byte_range=None
cwharris Jan 19, 2022
8fa93bf
Merge branch 'branch-22.04' of github.com:rapidsai/cudf into io-mbspl…
cwharris Jan 26, 2022
03ff0c4
add a test for read_text byte range
cwharris Jan 26, 2022
258259d
Merge branch 'branch-22.04' of github.com:rapidsai/cudf into io-mbspl…
cwharris Jan 26, 2022
5610855
fix read_text byte range test
cwharris Jan 26, 2022
769864d
simplify read_text byte_range python test
cwharris Jan 27, 2022
73823f6
Merge branch 'branch-22.04' of github.com:rapidsai/cudf into io-mbspl…
cwharris Jan 27, 2022
4166dc3
add large file support for multibyte_split
cwharris Jan 31, 2022
b1c3581
add temporary multibyte_split test case for very large files
cwharris Feb 2, 2022
1d67a01
fix memory leak in multibyte_split
cwharris Feb 2, 2022
f165587
remove faulty `read_to` function from data_chunk_reader
cwharris Feb 2, 2022
513d585
fix broken multibyte_split test
cwharris Feb 2, 2022
991c3ce
fix data_chunk_reader memory leak in a better way
cwharris Feb 2, 2022
400fc52
remove unused byte_range arguments from multibyte_split kernel
cwharris Feb 2, 2022
0bbc7cc
clean up multibyte_split, remove std::out usages
cwharris Feb 3, 2022
09b9492
Merge branch 'branch-22.04' of github.com:rapidsai/cudf into io-mbspl…
cwharris Feb 8, 2022
13564ad
fix style issues
cwharris Feb 8, 2022
cd51c6e
move byte_range_info static function definitions out of header
cwharris Feb 8, 2022
08b1bab
fix formatting issues
cwharris Feb 9, 2022
b4313a0
Merge branch 'branch-22.04' of github.com:rapidsai/cudf into io-mbspl…
cwharris Feb 9, 2022
36e7d26
fix styles
cwharris Feb 9, 2022
d23413d
use =default destructor instead of explicit empty destructor
cwharris Feb 17, 2022
addc5ed
fix copyright year
cwharris Feb 22, 2022
ce9d4a7
fix flake8 errors
cwharris Feb 22, 2022
03742ff
fix flake8 errors
cwharris Feb 22, 2022
55c0e47
Merge branch 'branch-22.04' of github.com:rapidsai/cudf into io-mbspl…
cwharris Feb 23, 2022
e4ee1d3
Merge branch 'branch-22.04' of github.com:rapidsai/cudf into io-mbspl…
cwharris Feb 24, 2022
e5a352b
address various pr style comments
cwharris Feb 24, 2022
caa2ed2
add comment to multibyte_split and fix off-by-one error in related to…
cwharris Feb 24, 2022
351760c
remove std::cout call
cwharris Feb 24, 2022
3f6f27c
fix formatting issues
cwharris Feb 24, 2022
b9e3e66
Merge branch 'branch-22.04' of github.com:rapidsai/cudf into io-mbspl…
cwharris Feb 25, 2022
ba71127
Merge branch 'branch-22.04' of github.com:rapidsai/cudf into io-mbspl…
cwharris Feb 28, 2022
ee6932e
clearify multibyte_split comment and add examples
cwharris Feb 28, 2022
2b945a0
fix gramatically error
cwharris Feb 28, 2022
ef0fa38
add doxygen wrapper around pseudo code
cwharris Feb 28, 2022
cbab111
make byte_range_info a class, address pr comments
cwharris Feb 28, 2022
0d32bbe
Merge branch 'branch-22.04' of github.com:rapidsai/cudf into io-mbspl…
cwharris Feb 28, 2022
0f52185
address pr feedback
cwharris Feb 28, 2022
ada0013
remove unnecessary synchronize, use emplace_back instead of push_back
cwharris Mar 1, 2022
74647f2
Merge branch 'branch-22.04' of github.com:rapidsai/cudf into io-mbspl…
cwharris Mar 1, 2022
1 change: 1 addition & 0 deletions conda/recipes/libcudf/meta.yaml
Expand Up @@ -137,6 +137,7 @@ test:
- test -f $PREFIX/include/cudf/io/orc_metadata.hpp
- test -f $PREFIX/include/cudf/io/orc.hpp
- test -f $PREFIX/include/cudf/io/parquet.hpp
- test -f $PREFIX/include/cudf/io/text/byte_range_info.hpp
- test -f $PREFIX/include/cudf/io/text/data_chunk_source_factories.hpp
- test -f $PREFIX/include/cudf/io/text/data_chunk_source.hpp
- test -f $PREFIX/include/cudf/io/text/detail/multistate.hpp
Expand Down
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
Expand Up @@ -311,6 +311,7 @@ add_library(
src/io/parquet/writer_impl.cu
src/io/statistics/orc_column_statistics.cu
src/io/statistics/parquet_column_statistics.cu
src/io/text/byte_range_info.cpp
src/io/text/multibyte_split.cu
src/io/utilities/column_buffer.cpp
src/io/utilities/config_utils.cpp
Expand Down
53 changes: 53 additions & 0 deletions cpp/include/cudf/io/text/byte_range_info.hpp
@@ -0,0 +1,53 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#pragma once

#include <cstdint>
#include <vector>

namespace cudf {
namespace io {
namespace text {

/**
* @brief stores offset and size used to indicate a byte range
*/
struct byte_range_info {
public:
int64_t offset;
int64_t size;

byte_range_info() : offset(0), size(0) {}
byte_range_info(int64_t offset, int64_t size) : offset(offset), size(size) {}

static byte_range_info whole_source();
/**
* @brief Create a collection of consecutive ranges between [0, total_bytes).
*
* Each range will be the same size except if `total_bytes` is not evenly divisible by
* `range_count`, in which case the last range size will be the remainder.
*
* @param total_bytes total number of bytes in all ranges
* @param range_count total number of ranges in which to divide bytes
* @return Vector of range objects
*/
static std::vector<byte_range_info> create_consecutive(int64_t total_bytes, int64_t range_count);
};

} // namespace text
} // namespace io
} // namespace cudf
7 changes: 6 additions & 1 deletion cpp/include/cudf/io/text/data_chunk_source.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2021, NVIDIA CORPORATION.
* Copyright (c) 2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -36,6 +36,7 @@ namespace text {
*/
class device_data_chunk {
public:
virtual ~device_data_chunk() = default;
[[nodiscard]] virtual char const* data() const = 0;
[[nodiscard]] virtual std::size_t size() const = 0;
virtual operator device_span<char const>() const = 0;
Expand All @@ -52,6 +53,9 @@ class device_data_chunk {
*/
class data_chunk_reader {
public:
virtual ~data_chunk_reader() = default;
virtual void skip_bytes(std::size_t size) = 0;

/**
* @brief Get the next chunk of bytes from the data source
*
Expand All @@ -76,6 +80,7 @@ class data_chunk_reader {
*/
class data_chunk_source {
public:
virtual ~data_chunk_source() = default;
[[nodiscard]] virtual std::unique_ptr<data_chunk_reader> create_reader() const = 0;
};

Expand Down
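The hunk above adds defaulted virtual destructors to `device_data_chunk`, `data_chunk_reader` and `data_chunk_source`. Since `create_reader()` returns a `std::unique_ptr` to the base type, the object is always deleted through a base pointer; without a virtual destructor that is undefined behavior and in practice tends to skip the derived cleanup. A minimal sketch of the effect, using hypothetical stand-in classes (`reader`, `counting_reader`) rather than the real cudf types:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical interface mirroring the shape of data_chunk_reader.
class reader {
 public:
  // Virtual destructor: deleting through reader* reaches the derived destructor.
  virtual ~reader() = default;
  virtual void skip_bytes(std::size_t size) = 0;
};

// Hypothetical implementation; the counter records that cleanup ran.
class counting_reader : public reader {
 public:
  explicit counting_reader(int& destroyed) : _destroyed(destroyed) {}
  ~counting_reader() override { ++_destroyed; }
  void skip_bytes(std::size_t) override {}

 private:
  int& _destroyed;
};

// Owns a counting_reader through the base type, lets it go out of scope,
// and reports how many derived destructors ran.
inline int destroy_through_base()
{
  int destroyed = 0;
  {
    std::unique_ptr<reader> r = std::make_unique<counting_reader>(destroyed);
    assert(r != nullptr);
    r->skip_bytes(0);
  }  // unique_ptr<reader> deletes via reader*; the virtual destructor dispatches
  return destroyed;
}
```

Calling `destroy_through_base()` shows the derived destructor running exactly once, which is the behavior the added `= default` destructors make well defined.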
10 changes: 9 additions & 1 deletion cpp/include/cudf/io/text/data_chunk_source_factories.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2021, NVIDIA CORPORATION.
* Copyright (c) 2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -89,6 +89,8 @@ class istream_data_chunk_reader : public data_chunk_reader {
}
}

void skip_bytes(std::size_t size) override { _datastream->ignore(size); };

std::unique_ptr<device_data_chunk> get_next_chunk(std::size_t read_size,
rmm::cuda_stream_view stream) override
{
Expand Down Expand Up @@ -143,6 +145,12 @@ class device_span_data_chunk_reader : public data_chunk_reader {
public:
device_span_data_chunk_reader(device_span<char const> data) : _data(data) {}

void skip_bytes(std::size_t read_size) override
{
if (read_size > _data.size() - _position) { read_size = _data.size() - _position; }
_position += read_size;
};

std::unique_ptr<device_data_chunk> get_next_chunk(std::size_t read_size,
rmm::cuda_stream_view stream) override
{
Expand Down
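The `skip_bytes` override for `device_span_data_chunk_reader` clamps the requested skip to the bytes that remain, so the position can never move past the end of the span. The sketch below reproduces that clamp on a host-side cursor. `span_cursor` and `position()` are hypothetical names used only for illustration; the real reader works on a `device_span<char const>`.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical host-side cursor standing in for device_span_data_chunk_reader.
class span_cursor {
 public:
  explicit span_cursor(std::string_view data) : _data(data) {}

  // Advance by at most `size` bytes, never past the end of the data.
  void skip_bytes(std::size_t size)
  {
    assert(_position <= _data.size());
    // Compare against the remaining byte count instead of computing
    // `_position + size`, which could wrap around for very large `size`.
    auto const remaining = _data.size() - _position;
    _position += (size > remaining) ? remaining : size;
  }

  std::size_t position() const { return _position; }

 private:
  std::string_view _data;
  std::size_t _position = 0;
};
```

Skipping 4 bytes and then 10 bytes on a 6-byte input leaves the cursor at position 6, and further skips are no-ops.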
16 changes: 1 addition & 15 deletions cpp/include/cudf/io/text/detail/trie.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2021, NVIDIA CORPORATION.
* Copyright (c) 2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -89,20 +89,6 @@ struct trie_device_view {
*/
constexpr uint8_t get_match_length(uint16_t idx) { return _nodes[idx].match_length; }

/**
* @brief returns the longest matching state of any state in the multistate.
*/
template <uint32_t N>
constexpr uint8_t get_match_length(multistate const& states)
{
int8_t val = 0;
for (uint8_t i = 0; i < states.size(); i++) {
auto match_length = get_match_length(states.get_tail(i));
if (match_length > val) { val = match_length; }
}
return val;
}

private:
constexpr void transition_enqueue_all( //
char c,
Expand Down
53 changes: 51 additions & 2 deletions cpp/include/cudf/io/text/multibyte_split.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2021, NVIDIA CORPORATION.
* Copyright (c) 2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand All @@ -17,6 +17,7 @@
#pragma once

#include <cudf/column/column.hpp>
#include <cudf/io/text/byte_range_info.hpp>
#include <cudf/io/text/data_chunk_source.hpp>

#include <rmm/mr/device/device_memory_resource.hpp>
Expand All @@ -27,10 +28,58 @@ namespace cudf {
namespace io {
namespace text {

/**
* @brief Splits the source text into a strings column using a multiple byte delimiter.
*
* Providing a byte range allows multibyte_split to read a whole file, but only return the offsets
* of delimiters which begin within the range. If thinking in terms of "records", where each
* delimiter dictates the end of a record, all records which begin within the byte range
* provided will be returned, including any record which may begin in the range but end outside of
* the range. Records which begin outside of the range will be ignored, even if those records end
* inside the range.
*
* @code{.pseudo}
* Examples:
* source: "abc..def..ghi..jkl.."
* delimiter: ".."
*
* byte_range: nullopt
* return: ["abc..", "def..", "ghi..", "jkl..", ""]
*
* byte_range: [0, 2)
* return: ["abc.."]
*
* byte_range: [2, 9)
* return: ["def..", "ghi.."]
*
* byte_range: [11, 2)
* return: []
*
* byte_range: [13, 7)
* return: ["jkl..", ""]
* @endcode
*
* @param source The source string
* @param delimiter UTF-8 encoded string for which to find offsets in the source
* @param byte_range range in which to consider offsets relevant
* @param mr Memory resource to use for the device memory allocation
* @return The strings found by splitting the source by the delimiter within the relevant byte
* range.
*/
std::unique_ptr<cudf::column> multibyte_split(
data_chunk_source const& source,
std::string const& delimiter,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
std::optional<byte_range_info> byte_range = std::nullopt,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

std::unique_ptr<cudf::column> multibyte_split(data_chunk_source const& source,
std::string const& delimiter,
byte_range_info byte_range,
rmm::mr::device_memory_resource* mr);

std::unique_ptr<cudf::column> multibyte_split(data_chunk_source const& source,
std::string const& delimiter,
rmm::mr::device_memory_resource* mr);

} // namespace text
} // namespace io
Expand Down
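The doc comment describes the byte range in terms of records: a record is returned if it begins inside the range, even when it ends outside. The following host-side model is a sketch of that rule only, not of the GPU implementation. It assumes the `[a, b)` notation in the examples means `offset = a, size = b`, matching the `byte_range_info(offset, size)` constructor, and it keeps a record when its first byte lies in `[offset, offset + size)`. `byte_range` and `split_records` are hypothetical names. The last documented example also lists a trailing empty string, which this simple rule does not reproduce, so that case is left out here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for byte_range_info.
struct byte_range {
  int64_t offset;
  int64_t size;
};

// Host-side model: split `source` after every occurrence of `delimiter` and
// keep the records whose first byte lies in [offset, offset + size).
std::vector<std::string> split_records(std::string const& source,
                                       std::string const& delimiter,
                                       std::optional<byte_range> range = std::nullopt)
{
  assert(!delimiter.empty());
  std::vector<std::string> records;
  std::size_t begin = 0;
  while (true) {
    auto const found = source.find(delimiter, begin);
    // A record runs up to and including its delimiter, or to the end of the source.
    auto const end   = (found == std::string::npos) ? source.size() : found + delimiter.size();
    auto const first = static_cast<int64_t>(begin);
    // Written as a difference so a very large `size` cannot overflow.
    bool const keep = !range.has_value() ||
                      (first >= range->offset && first - range->offset < range->size);
    if (keep) { records.push_back(source.substr(begin, end - begin)); }
    if (found == std::string::npos) { break; }
    begin = end;
  }
  return records;
}
```

For example, `split_records("abc..def..ghi..jkl..", "..", byte_range{2, 9})` keeps the records starting at bytes 5 and 10, which corresponds to the documented `[2, 9)` case.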
46 changes: 46 additions & 0 deletions cpp/src/io/text/byte_range_info.cpp
@@ -0,0 +1,46 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <cudf/detail/utilities/integer_utils.hpp>
#include <cudf/io/text/byte_range_info.hpp>

#include <limits>

namespace cudf {
namespace io {
namespace text {

byte_range_info byte_range_info::whole_source() { return {0, std::numeric_limits<int64_t>::max()}; }

std::vector<byte_range_info> byte_range_info::create_consecutive(int64_t total_bytes,
int64_t range_count)
{
auto range_size = util::div_rounding_up_safe(total_bytes, range_count);

std::vector<byte_range_info> ranges;

for (int64_t i = 0; i < range_count; i++) {
auto offset = i * range_size;
auto size = std::min(range_size, total_bytes - offset);
ranges.push_back(byte_range_info{offset, size});
}
Contributor

Consider reserving the vector size then using emplace_back. And while you're at it, maybe use std::for_each?

Contributor Author

I added reserve(). What's the advantage of using std::for_each here?

Contributor

None really, just following the general propensity of the cudf team to prefer algorithms over raw loops. I don't personally see a reason to change this and only suggested it since you'd be changing the code anyway.


return ranges;
}

} // namespace text
} // namespace io
} // namespace cudf
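The review thread in this file suggests reserving the vector and using `emplace_back`. A standalone sketch of `create_consecutive` with both changes applied follows. It is not the merged code: `byte_range` and `div_round_up` are hypothetical stand-ins for `byte_range_info` and `util::div_rounding_up_safe`, and the clamp on `offset` is an addition of this sketch so that trailing ranges come out empty rather than negative-sized when `range_count` is large relative to `total_bytes`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for cudf::io::text::byte_range_info.
struct byte_range {
  int64_t offset;
  int64_t size;
  byte_range(int64_t offset, int64_t size) : offset(offset), size(size) {}
};

// Hypothetical local replacement for util::div_rounding_up_safe;
// assumes dividend >= 0 and divisor > 0.
inline int64_t div_round_up(int64_t dividend, int64_t divisor)
{
  return dividend / divisor + (dividend % divisor != 0 ? 1 : 0);
}

std::vector<byte_range> create_consecutive(int64_t total_bytes, int64_t range_count)
{
  assert(total_bytes >= 0 && range_count > 0);
  auto const range_size = div_round_up(total_bytes, range_count);

  std::vector<byte_range> ranges;
  ranges.reserve(static_cast<std::size_t>(range_count));  // single allocation up front

  for (int64_t i = 0; i < range_count; i++) {
    // Clamp (not in the diff above): keeps offset within [0, total_bytes].
    auto const offset = std::min(i * range_size, total_bytes);
    auto const size   = std::min(range_size, total_bytes - offset);
    ranges.emplace_back(offset, size);  // construct the element in place
  }
  return ranges;
}
```

With `total_bytes = 10` and `range_count = 3`, the range size rounds up to 4 and the result is `(0, 4)`, `(4, 4)`, `(8, 2)`: every range is the same size except the last, which holds the remainder, as the header's doc comment describes.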