Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multibyte_split #8702

Merged
merged 93 commits into from
Aug 24, 2021
Merged
Show file tree
Hide file tree
Changes from 87 commits
Commits
Show all changes
93 commits
Select commit Hold shift + click to select a range
e1b71e6
multibyte-split scaffolding
cwharris Jun 27, 2021
836773a
cudf::io::text::input_stream
cwharris Jun 27, 2021
3e06c18
trie test scaffolding
cwharris Jul 2, 2021
ac14dbd
superstate + tests
cwharris Jul 7, 2021
ea8cee2
added device trie
cwharris Jul 7, 2021
a4a8dd0
add superstate to multibyte_split
cwharris Jul 7, 2021
094d2d2
cub block scan superstates
cwharris Jul 8, 2021
1117ab8
block-wide superstate matching
cwharris Jul 9, 2021
51b1444
fix superstate constructor bug where only the first 8 states were ini…
cwharris Jul 9, 2021
d1f7eb3
multibyte_split multiple delimeter support
cwharris Jul 9, 2021
a628d73
scan output-offsets in multibyte_split
cwharris Jul 9, 2021
e1cc84d
printf offsets in multibyte_split
cwharris Jul 9, 2021
c7177bc
add match-length to trie to adjust for output offset in multibyte_split
cwharris Jul 9, 2021
42dc014
adjust multibyte_split test case to expect delimiters to be retained …
cwharris Jul 9, 2021
5171711
printf match_begin and match_end for multibyte_split
cwharris Jul 9, 2021
6b62ceb
multibyte_split test passing
cwharris Jul 10, 2021
a2c9756
add multibyte_split comments, break test intentionally to work on mul…
cwharris Jul 12, 2021
21b8b25
multibyte_split add multi-block support
cwharris Jul 13, 2021
f59a93e
rename BYTES_PER_TILE to ITEMS_PER_TILE
cwharris Jul 13, 2021
5fa112a
add bounds check to multibyte_split load and flag
cwharris Jul 14, 2021
cf42fd0
multibyte_split benchmark scaffolding
cwharris Jul 14, 2021
e6e9741
multibyte_split increase threads per block and adjust test case.
cwharris Jul 14, 2021
b5c2e05
use circular buffer in multibyte_split to allow for stream inputs
cwharris Jul 16, 2021
738af48
update multibyte_split to work with streaming inputs
cwharris Jul 16, 2021
0121b22
consolidate two passes of stream-scanning to a single function
cwharris Jul 16, 2021
a233ca2
add tile_state partial to multibyte_split but dont use yet
cwharris Jul 16, 2021
4946058
add reusable tilestate callback to `multibyte_split`
cwharris Jul 16, 2021
d69aeca
begin working on warp-reduce window aggregation of tile state in mult…
cwharris Jul 16, 2021
079d1ea
fix multibyte_split bug where non-streaming approach would hang
cwharris Jul 17, 2021
970aac2
interleaved streaming io for multibyte_split
cwharris Jul 18, 2021
fee7ebb
use no-copy string column construction in multibyte_split
cwharris Jul 19, 2021
e5a5204
document multibyte_split minimum tile count requirements
cwharris Jul 19, 2021
216d620
Merge branch 'branch-21.10' into multibyte-split
cwharris Jul 19, 2021
65af4de
multibyte_split tunable concurrency via stream pool
cwharris Jul 22, 2021
a4fe128
multibyte_split remove device_istream replace with data_chunk_reader
cwharris Jul 23, 2021
9bc6c89
add data_chunk_source factories, nvtx ranges to multibyte_split, use …
cwharris Jul 23, 2021
08b3069
use make_device_uvector_async in trie.hpp
cwharris Jul 23, 2021
7088791
rm device_istream
cwharris Jul 23, 2021
b61c14f
multibyte_split add some docs, add more test cases
cwharris Jul 23, 2021
017f05d
revert CMakeLists ordering
cwharris Jul 23, 2021
f432e68
convert trie storage from SOA to AOS
cwharris Jul 25, 2021
59a70a9
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Jul 26, 2021
f1d3b4a
fix spelling mistakes
cwharris Jul 26, 2021
51ac35c
break multibyte_split by adding queue/multistate support
cwharris Jul 27, 2021
3d04556
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Jul 28, 2021
1fb36ee
fix `abac` pattern matching test, introduce new bug :(
cwharris Jul 29, 2021
ecf440a
fix multibyte_split aggregation strategy to avoid assuming T{} is an …
cwharris Jul 29, 2021
9e34efb
Merge branch 'multibyte-split-queue' into multibyte-split
cwharris Jul 29, 2021
fc014e5
add second host buffer to istream_data_chunk_reader to facilitate ove…
cwharris Jul 29, 2021
896ed31
actually add second buffer to istream_data_chunk_reader
cwharris Jul 29, 2021
7792521
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Jul 30, 2021
2f75b50
clean up multibyte_split code
cwharris Jul 30, 2021
162e9cf
adjust copyright
cwharris Jul 30, 2021
ade1150
remove confusing test case in multibyte_split
cwharris Jul 30, 2021
8e08012
limit multibyte_split to 32 threads, because of a bug that needs fixi…
cwharris Jul 30, 2021
5ad2148
fix emoji bits documentation
cwharris Jul 31, 2021
511ab9f
style adjustments and documentation update to multibyte_split
cwharris Aug 2, 2021
69280e8
move tile-scanning utilites to detail namespace
cwharris Aug 2, 2021
2d37dc9
remove "inline" from constexpr members in cudf::io::text
cwharris Aug 2, 2021
9c6bf2a
fix large input bug in multibyte_split where offsets were not account…
cwharris Aug 3, 2021
ee817b1
improve data_chunk_reader docs
cwharris Aug 3, 2021
4cdbee5
make multibyte_split accept data_chunk_source as a const& arg
cwharris Aug 3, 2021
c3783db
add tile_state.hpp to meta.yaml
cwharris Aug 3, 2021
432399c
create bad-case scenario benchmark
cwharris Aug 3, 2021
ad21c4f
remove data_chunk in favor of device_span until it becomes clear an r…
cwharris Aug 4, 2021
18e0863
use std::vector<cuda_stream_view> instread of stream_pool
cwharris Aug 4, 2021
45e5b65
rename ticket to h_ticket
cwharris Aug 4, 2021
ee122a8
adjust `scan_tile_state_view::get_prefix` to make the purpose of thre…
cwharris Aug 4, 2021
c9d2889
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Aug 5, 2021
ca6bbac
fix UB in multibyte_split concurrent kernel execution, improve perf
cwharris Aug 6, 2021
d68d951
add error messages to multibyte_split to indicate unsupported use cases
cwharris Aug 6, 2021
9684646
remove __threadfence() in favor of cuda::atomic
cwharris Aug 9, 2021
d3de062
improve multibyte_split benchmarks
cwharris Aug 13, 2021
d392140
provide explicit memory_order for tile state status stores.
cwharris Aug 13, 2021
42b8c88
improve multibyte_split benchmarks
cwharris Aug 13, 2021
b976525
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Aug 13, 2021
d50f815
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Aug 13, 2021
40d81e8
add file and host benchmarks for multibyte_split
cwharris Aug 14, 2021
3171339
make use of div_rounding_up_safe
cwharris Aug 14, 2021
63c4bb0
remove unused temp storage from tile state callback
cwharris Aug 14, 2021
eda265b
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Aug 17, 2021
05cdecf
simplify multibyte_split api to accept only a single delimiter
cwharris Aug 17, 2021
a4d4d79
add strings column factory which takes device_uvectors
cwharris Aug 19, 2021
cef897d
add docs to cudf::io::text::detail::trie
cwharris Aug 19, 2021
097cadd
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Aug 23, 2021
89ce0aa
add more documentation and comments to multibyte_split related code
cwharris Aug 23, 2021
d2735dd
adjust multibyte_split benchmark deviation math to be representative …
cwharris Aug 23, 2021
5a1e4d6
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Aug 24, 2021
c15b5d2
Merge branch 'multibyte-split' of github.com:cwharris/cudf into multi…
cwharris Aug 24, 2021
615534d
multibyte_split: replace typedef with using and replace uint32_t with…
cwharris Aug 24, 2021
bd67026
make data_chunk_reader::get_next_chunk docs more informative.
cwharris Aug 24, 2021
a61fd09
fix style
cwharris Aug 24, 2021
b0d4135
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Aug 24, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 8 additions & 2 deletions conda/recipes/libcudf/meta.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -118,19 +118,25 @@ test:
- test -f $PREFIX/include/cudf/hashing.hpp
- test -f $PREFIX/include/cudf/interop.hpp
- test -f $PREFIX/include/cudf/io/avro.hpp
- test -f $PREFIX/include/cudf/io/csv.hpp
- test -f $PREFIX/include/cudf/io/data_sink.hpp
- test -f $PREFIX/include/cudf/io/datasource.hpp
- test -f $PREFIX/include/cudf/io/orc_metadata.hpp
- test -f $PREFIX/include/cudf/io/csv.hpp
- test -f $PREFIX/include/cudf/io/detail/avro.hpp
- test -f $PREFIX/include/cudf/io/detail/csv.hpp
- test -f $PREFIX/include/cudf/io/detail/json.hpp
- test -f $PREFIX/include/cudf/io/detail/orc.hpp
- test -f $PREFIX/include/cudf/io/detail/parquet.hpp
- test -f $PREFIX/include/cudf/io/detail/utils.hpp
- test -f $PREFIX/include/cudf/io/json.hpp
- test -f $PREFIX/include/cudf/io/orc_metadata.hpp
- test -f $PREFIX/include/cudf/io/orc.hpp
- test -f $PREFIX/include/cudf/io/parquet.hpp
- test -f $PREFIX/include/cudf/io/text/data_chunk_source_factories.hpp
- test -f $PREFIX/include/cudf/io/text/data_chunk_source.hpp
- test -f $PREFIX/include/cudf/io/text/detail/multistate.hpp
- test -f $PREFIX/include/cudf/io/text/detail/tile_state.hpp
- test -f $PREFIX/include/cudf/io/text/detail/trie.hpp
- test -f $PREFIX/include/cudf/io/text/multibyte_split.hpp
- test -f $PREFIX/include/cudf/io/types.hpp
- test -f $PREFIX/include/cudf/ipc.hpp
- test -f $PREFIX/include/cudf/join.hpp
Expand Down
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -307,6 +307,7 @@ add_library(cudf
src/io/parquet/writer_impl.cu
src/io/statistics/orc_column_statistics.cu
src/io/statistics/parquet_column_statistics.cu
src/io/text/multibyte_split.cu
src/io/utilities/column_buffer.cpp
src/io/utilities/data_sink.cpp
src/io/utilities/datasource.cpp
Expand Down
5 changes: 5 additions & 0 deletions cpp/benchmarks/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -245,3 +245,8 @@ ConfigureBench(STRINGS_BENCH
# - json benchmark -------------------------------------------------------------------
ConfigureBench(JSON_BENCH
string/json_benchmark.cpp)

###################################################################################################
# - io benchmark ---------------------------------------------------------------------
ConfigureBench(MULTIBYTE_SPLIT_BENCHMARK
io/text/multibyte_split_benchmark.cpp)
2 changes: 2 additions & 0 deletions cpp/benchmarks/io/cuio_benchmark_common.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,8 @@ using cudf::io::io_type;
benchmark(name##_buffer_output, type_or_group, static_cast<uint32_t>(io_type::HOST_BUFFER)); \
benchmark(name##_void_output, type_or_group, static_cast<uint32_t>(io_type::VOID));

std::string random_file_in_dir(std::string const& dir_path);

/**
* @brief Class to create a coupled `source_info` and `sink_info` of given type.
*/
Expand Down
164 changes: 164 additions & 0 deletions cpp/benchmarks/io/text/multibyte_split_benchmark.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,164 @@
/*
* Copyright (c) 2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <benchmarks/common/generate_benchmark_input.hpp>
#include <benchmarks/fixture/benchmark_fixture.hpp>
#include <benchmarks/io/cuio_benchmark_common.hpp>
#include <benchmarks/synchronization/synchronization.hpp>

#include <cudf_test/column_wrapper.hpp>

#include <cudf_test/file_utilities.hpp>

#include <cudf/io/text/data_chunk_source_factories.hpp>
#include <cudf/io/text/multibyte_split.hpp>
#include <cudf/scalar/scalar_factories.hpp>
#include <cudf/strings/combine.hpp>
#include <cudf/types.hpp>

#include <rmm/cuda_stream_view.hpp>

#include <thrust/transform.h>

#include <cstdio>
#include <fstream>
#include <memory>

using cudf::test::fixed_width_column_wrapper;

temp_directory const temp_dir("cudf_gbench");

enum data_chunk_source_type {
device,
file,
host,
};

static cudf::string_scalar create_random_input(int32_t num_chars,
double delim_factor,
double deviation,
std::string delim)
{
auto const num_delims = static_cast<int32_t>((num_chars * delim_factor) / delim.size());
auto const num_delim_chars = num_delims * delim.size();
auto const num_value_chars = num_chars - num_delim_chars;
auto const num_rows = num_delims;
auto const value_size_avg = static_cast<int32_t>(num_value_chars / num_rows);
auto const value_size_min = static_cast<int32_t>(value_size_avg * (1 - deviation));
auto const value_size_max = static_cast<int32_t>(value_size_avg * (1 + deviation));

data_profile table_profile;

table_profile.set_distribution_params( //
cudf::type_id::STRING,
distribution_id::NORMAL,
value_size_min,
value_size_max);

auto const values_table = create_random_table( //
{cudf::type_id::STRING},
1,
row_count{num_rows},
table_profile);

auto delim_scalar = cudf::make_string_scalar(delim);
auto delims_column = cudf::make_column_from_scalar(*delim_scalar, num_rows);
auto input_table = cudf::table_view({values_table->get_column(0).view(), delims_column->view()});
auto input_column = cudf::strings::concatenate(input_table);

// extract the chars from the returned strings column.
auto input_column_contents = input_column->release();
auto chars_column_contents = input_column_contents.children[1]->release();
auto chars_buffer = chars_column_contents.data.release();

// turn the chars in to a string scalar.
return cudf::string_scalar(std::move(*chars_buffer));
}

static void BM_multibyte_split(benchmark::State& state)
{
auto source_type = static_cast<data_chunk_source_type>(state.range(0));
auto delim_size = state.range(1);
auto delim_percent = state.range(2);
auto file_size_approx = state.range(3);

CUDF_EXPECTS(delim_percent >= 1, "delimiter percent must be at least 1");
CUDF_EXPECTS(delim_percent <= 50, "delimiter percent must be at most 50");

auto delim = std::string(":", delim_size);

auto delim_factor = static_cast<double>(delim_percent) / 100;
auto device_input = create_random_input(file_size_approx, delim_factor, 0.05, delim);
auto host_input = thrust::host_vector<char>(device_input.size());
auto host_string = std::string(host_input.data(), host_input.size());

cudaMemcpyAsync(host_input.data(),
device_input.data(),
device_input.size() * sizeof(char),
cudaMemcpyDeviceToHost,
rmm::cuda_stream_default);

auto temp_file_name = random_file_in_dir(temp_dir.path());

{
auto temp_fostream = std::ofstream(temp_file_name, std::ofstream::out);
temp_fostream.write(host_input.data(), host_input.size());
}

cudaDeviceSynchronize();

auto source = std::unique_ptr<cudf::io::text::data_chunk_source>(nullptr);

switch (source_type) {
case data_chunk_source_type::file: //
source = cudf::io::text::make_source_from_file(temp_file_name);
break;
case data_chunk_source_type::host: //
source = cudf::io::text::make_source(host_string);
break;
case data_chunk_source_type::device: //
source = cudf::io::text::make_source(device_input);
break;
default: CUDF_FAIL();
}

for (auto _ : state) {
cuda_event_timer raii(state, true);
auto output = cudf::io::text::multibyte_split(*source, delim);
}

state.SetBytesProcessed(state.iterations() * device_input.size());
}

class MultibyteSplitBenchmark : public cudf::benchmark {
};

#define TRANSPOSE_BM_BENCHMARK_DEFINE(name) \
BENCHMARK_DEFINE_F(MultibyteSplitBenchmark, name)(::benchmark::State & state) \
{ \
BM_multibyte_split(state); \
} \
BENCHMARK_REGISTER_F(MultibyteSplitBenchmark, name) \
->ArgsProduct({{data_chunk_source_type::device, \
data_chunk_source_type::file, \
data_chunk_source_type::host}, \
{1, 4, 7}, \
{1, 25}, \
{1 << 15, 1 << 30}}) \
->UseManualTime() \
->Unit(::benchmark::kMillisecond);

TRANSPOSE_BM_BENCHMARK_DEFINE(multibyte_split_simple);
20 changes: 20 additions & 0 deletions cpp/include/cudf/column/column_factories.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -442,6 +442,26 @@ std::unique_ptr<column> make_strings_column(
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Construct a STRING type column given offsets, columns, and optional null count and null
* mask.
*
* @param[in] num_strings The number of strings the column represents.
* @param[in] offsets The offset values for this column. The number of elements is one more than the
* total number of strings so the `offset[last] - offset[0]` is the total number of bytes in the
* strings vector.
* @param[in] chars The char bytes for all the strings for this column. Individual strings are
* identified by the offsets and the nullmask.
* @param[in] null_mask The bits specifying the null strings in device memory. Arrow format for
* nulls is used for interpreting this bitmask.
* @param[in] null_count The number of null string entries.
*/
std::unique_ptr<column> make_strings_column(size_type num_strings,
rmm::device_uvector<size_type>&& offsets,
rmm::device_uvector<char>&& chars,
rmm::device_buffer&& null_mask = {},
size_type null_count = cudf::UNKNOWN_NULL_COUNT);

/**
* @brief Construct a LIST type column given offsets column, child column, null mask and null
* count.
Expand Down
67 changes: 67 additions & 0 deletions cpp/include/cudf/io/text/data_chunk_source.hpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
/*
* Copyright (c) 2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#pragma once

#include <cudf/utilities/span.hpp>

#include <rmm/cuda_stream_pool.hpp>
#include <rmm/device_buffer.hpp>

namespace cudf {
namespace io {
namespace text {

/**
* @brief a reader capable of producing views over device memory.
*
* The data chunk reader API encapsulates the idea of statefully traversing and loading a data
* source. A data source may be a file, a region of device memory, or a region of host memory.
* Reading data from these data sources efficiently requires different strategies dependings on the
* type of data source, type of compression, capabilities of the host and device, the data's
* destination. Whole-file decompression should be hidden behind this interface
*
*/
class data_chunk_reader {
cwharris marked this conversation as resolved.
Show resolved Hide resolved
public:
/**
* @brief Get the next chunk of bytes from the data source
*
* Performs any necessary work to read and prepare the underlying data source for consumption as a
* view over device memory. Common implementations may read from a file, copy data from host
* memory, allocate temporary memory, perform iterative decompression, or even launch device
* kernels.
*
* @param size desired number of bytes
* @param stream stream to associate allocations or perform work required to obtain chunk
* @return a chunk of data up to @param size bytes, or less if no more data is avaialable
cwharris marked this conversation as resolved.
Show resolved Hide resolved
*/
virtual device_span<char const> get_next_chunk(uint32_t size, rmm::cuda_stream_view stream) = 0;
cwharris marked this conversation as resolved.
Show resolved Hide resolved
};

/**
* @brief a data source capable of creating a reader which can produce views of the data source in
* device memory.
*
*/
class data_chunk_source {
cwharris marked this conversation as resolved.
Show resolved Hide resolved
public:
virtual std::unique_ptr<data_chunk_reader> create_reader() const = 0;
};

} // namespace text
} // namespace io
} // namespace cudf
Loading