multibyte_split #8702

Merged Aug 24, 2021 (93 commits)
Changes from 41 commits
Commits
e1b71e6
multibyte-split scaffolding
cwharris Jun 27, 2021
836773a
cudf::io::text::input_stream
cwharris Jun 27, 2021
3e06c18
trie test scaffolding
cwharris Jul 2, 2021
ac14dbd
superstate + tests
cwharris Jul 7, 2021
ea8cee2
added device trie
cwharris Jul 7, 2021
a4a8dd0
add superstate to multibyte_split
cwharris Jul 7, 2021
094d2d2
cub block scan superstates
cwharris Jul 8, 2021
1117ab8
block-wide superstate matching
cwharris Jul 9, 2021
51b1444
fix superstate constructor bug where only the first 8 states were ini…
cwharris Jul 9, 2021
d1f7eb3
multibyte_split multiple delimeter support
cwharris Jul 9, 2021
a628d73
scan output-offsets in multibyte_split
cwharris Jul 9, 2021
e1cc84d
printf offsets in multibyte_split
cwharris Jul 9, 2021
c7177bc
add match-length to trie to adjust for output offset in multibyte_split
cwharris Jul 9, 2021
42dc014
adjust multibyte_split test case to expect delimiters to be retained …
cwharris Jul 9, 2021
5171711
printf match_begin and match_end for multibyte_split
cwharris Jul 9, 2021
6b62ceb
multibyte_split test passing
cwharris Jul 10, 2021
a2c9756
add multibyte_split comments, break test intentionally to work on mul…
cwharris Jul 12, 2021
21b8b25
multibyte_split add multi-block support
cwharris Jul 13, 2021
f59a93e
rename BYTES_PER_TILE to ITEMS_PER_TILE
cwharris Jul 13, 2021
5fa112a
add bounds check to multibyte_split load and flag
cwharris Jul 14, 2021
cf42fd0
multibyte_split benchmark scaffolding
cwharris Jul 14, 2021
e6e9741
multibyte_split increase threads per block and adjust test case.
cwharris Jul 14, 2021
b5c2e05
use circular buffer in multibyte_split to allow for stream inputs
cwharris Jul 16, 2021
738af48
update multibyte_split to work with streaming inputs
cwharris Jul 16, 2021
0121b22
consolidate two passes of stream-scanning to a single function
cwharris Jul 16, 2021
a233ca2
add tile_state partial to multibyte_split but dont use yet
cwharris Jul 16, 2021
4946058
add reusable tilestate callback to `multibyte_split`
cwharris Jul 16, 2021
d69aeca
begin working on warp-reduce window aggregation of tile state in mult…
cwharris Jul 16, 2021
079d1ea
fix multibyte_split bug where non-streaming approach would hang
cwharris Jul 17, 2021
970aac2
interleaved streaming io for multibyte_split
cwharris Jul 18, 2021
fee7ebb
use no-copy string column construction in multibyte_split
cwharris Jul 19, 2021
e5a5204
document multibyte_split minimum tile count requirements
cwharris Jul 19, 2021
216d620
Merge branch 'branch-21.10' into multibyte-split
cwharris Jul 19, 2021
65af4de
multibyte_split tunable concurrency via stream pool
cwharris Jul 22, 2021
a4fe128
multibyte_split remove device_istream replace with data_chunk_reader
cwharris Jul 23, 2021
9bc6c89
add data_chunk_source factories, nvtx ranges to multibyte_split, use …
cwharris Jul 23, 2021
08b3069
use make_device_uvector_async in trie.hpp
cwharris Jul 23, 2021
7088791
rm device_istream
cwharris Jul 23, 2021
b61c14f
multibyte_split add some docs, add more test cases
cwharris Jul 23, 2021
017f05d
revert CMakeLists ordering
cwharris Jul 23, 2021
f432e68
convert trie storage from SOA to AOS
cwharris Jul 25, 2021
59a70a9
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Jul 26, 2021
f1d3b4a
fix spelling mistakes
cwharris Jul 26, 2021
51ac35c
break multibyte_split by adding queue/multistate support
cwharris Jul 27, 2021
3d04556
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Jul 28, 2021
1fb36ee
fix `abac` pattern matching test, introduce new bug :(
cwharris Jul 29, 2021
ecf440a
fix multibyte_split aggregation strategy to avoid assuming T{} is an …
cwharris Jul 29, 2021
9e34efb
Merge branch 'multibyte-split-queue' into multibyte-split
cwharris Jul 29, 2021
fc014e5
add second host buffer to istream_data_chunk_reader to facilitate ove…
cwharris Jul 29, 2021
896ed31
actually add second buffer to istream_data_chunk_reader
cwharris Jul 29, 2021
7792521
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Jul 30, 2021
2f75b50
clean up multibyte_split code
cwharris Jul 30, 2021
162e9cf
adjust copyright
cwharris Jul 30, 2021
ade1150
remove confusing test case in multibyte_split
cwharris Jul 30, 2021
8e08012
limit multibyte_split to 32 threads, because of a bug that needs fixi…
cwharris Jul 30, 2021
5ad2148
fix emoji bits documentation
cwharris Jul 31, 2021
511ab9f
style adjustments and documentation update to multibyte_split
cwharris Aug 2, 2021
69280e8
move tile-scanning utilites to detail namespace
cwharris Aug 2, 2021
2d37dc9
remove "inline" from constexpr members in cudf::io::text
cwharris Aug 2, 2021
9c6bf2a
fix large input bug in multibyte_split where offsets were not account…
cwharris Aug 3, 2021
ee817b1
improve data_chunk_reader docs
cwharris Aug 3, 2021
4cdbee5
make multibyte_split accept data_chunk_source as a const& arg
cwharris Aug 3, 2021
c3783db
add tile_state.hpp to meta.yaml
cwharris Aug 3, 2021
432399c
create bad-case scenario benchmark
cwharris Aug 3, 2021
ad21c4f
remove data_chunk in favor of device_span until it becomes clear an r…
cwharris Aug 4, 2021
18e0863
use std::vector<cuda_stream_view> instread of stream_pool
cwharris Aug 4, 2021
45e5b65
rename ticket to h_ticket
cwharris Aug 4, 2021
ee122a8
adjust `scan_tile_state_view::get_prefix` to make the purpose of thre…
cwharris Aug 4, 2021
c9d2889
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Aug 5, 2021
ca6bbac
fix UB in multibyte_split concurrent kernel execution, improve perf
cwharris Aug 6, 2021
d68d951
add error messages to multibyte_split to indicate unsupported use cases
cwharris Aug 6, 2021
9684646
remove __threadfence() in favor of cuda::atomic
cwharris Aug 9, 2021
d3de062
improve multibyte_split benchmarks
cwharris Aug 13, 2021
d392140
provide explicit memory_order for tile state status stores.
cwharris Aug 13, 2021
42b8c88
improve multibyte_split benchmarks
cwharris Aug 13, 2021
b976525
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Aug 13, 2021
d50f815
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Aug 13, 2021
40d81e8
add file and host benchmarks for multibyte_split
cwharris Aug 14, 2021
3171339
make use of div_rounding_up_safe
cwharris Aug 14, 2021
63c4bb0
remove unused temp storage from tile state callback
cwharris Aug 14, 2021
eda265b
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Aug 17, 2021
05cdecf
simplify multibyte_split api to accept only a single delimiter
cwharris Aug 17, 2021
a4d4d79
add strings column factory which takes device_uvectors
cwharris Aug 19, 2021
cef897d
add docs to cudf::io::text::detail::trie
cwharris Aug 19, 2021
097cadd
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Aug 23, 2021
89ce0aa
add more documentation and comments to multibyte_split related code
cwharris Aug 23, 2021
d2735dd
adjust multibyte_split benchmark deviation math to be representative …
cwharris Aug 23, 2021
5a1e4d6
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Aug 24, 2021
c15b5d2
Merge branch 'multibyte-split' of github.com:cwharris/cudf into multi…
cwharris Aug 24, 2021
615534d
multibyte_split: replace typedef with using and replace uint32_t with…
cwharris Aug 24, 2021
bd67026
make data_chunk_reader::get_next_chunk docs more informative.
cwharris Aug 24, 2021
a61fd09
fix style
cwharris Aug 24, 2021
b0d4135
Merge branch 'branch-21.10' of github.com:rapidsai/cudf into multibyt…
cwharris Aug 24, 2021
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -289,6 +289,7 @@ add_library(cudf
src/io/parquet/writer_impl.cu
src/io/statistics/orc_column_statistics.cu
src/io/statistics/parquet_column_statistics.cu
src/io/text/multibyte_split.cu
src/io/utilities/column_buffer.cpp
src/io/utilities/data_sink.cpp
src/io/utilities/datasource.cpp
5 changes: 5 additions & 0 deletions cpp/benchmarks/CMakeLists.txt
@@ -242,3 +242,8 @@ ConfigureBench(STRINGS_BENCH
# - json benchmark -------------------------------------------------------------------
ConfigureBench(JSON_BENCH
string/json_benchmark.cpp)

###################################################################################################
# - io benchmark ---------------------------------------------------------------------
ConfigureBench(MULTIBYTE_SPLIT_BENCHMARK
io/text/multibyte_split_benchmark.cpp)
2 changes: 2 additions & 0 deletions cpp/benchmarks/io/cuio_benchmark_common.hpp
@@ -33,6 +33,8 @@ using cudf::io::io_type;
benchmark(name##_buffer_output, type_or_group, static_cast<uint32_t>(io_type::HOST_BUFFER)); \
benchmark(name##_void_output, type_or_group, static_cast<uint32_t>(io_type::VOID));

std::string random_file_in_dir(std::string const& dir_path);

/**
* @brief Class to create a coupled `source_info` and `sink_info` of given type.
*/
82 changes: 82 additions & 0 deletions cpp/benchmarks/io/text/multibyte_split_benchmark.cpp
@@ -0,0 +1,82 @@
/*
* Copyright (c) 2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <benchmarks/fixture/benchmark_fixture.hpp>
#include <benchmarks/io/cuio_benchmark_common.hpp>
#include <benchmarks/synchronization/synchronization.hpp>

#include <cudf_test/column_wrapper.hpp>
#include <cudf_test/file_utilities.hpp>

#include <cudf/io/text/data_chunk_source_factories.hpp>
#include <cudf/io/text/multibyte_split.hpp>
#include <cudf/types.hpp>

#include <thrust/transform.h>

#include <cstdio>
#include <fstream>
#include <memory>

using cudf::test::fixed_width_column_wrapper;

temp_directory const temp_dir("cudf_gbench");

static void BM_multibyte_split(benchmark::State& state)
{
auto delimiters = std::vector<std::string>({"😀", "😎", ",", "::"});

int32_t num_chars = state.range(0);
auto host_input = std::string(num_chars, 'x');
auto device_input = cudf::string_scalar(host_input);

auto temp_file_name = random_file_in_dir(temp_dir.path());

close(mkstemp(const_cast<char*>(temp_file_name.data())));
{
auto temp_fostream = std::ofstream(temp_file_name, std::ofstream::out);
temp_fostream << host_input;
temp_fostream.close();
}

cudaDeviceSynchronize();

auto source = cudf::io::text::make_source_from_file(temp_file_name);
// auto source = cudf::io::text::make_source(device_input);
// auto source = cudf::io::text::make_source(host_input);

for (auto _ : state) {
cuda_event_timer raii(state, true);
auto output = cudf::io::text::multibyte_split(*source, delimiters);
}

state.SetBytesProcessed(state.iterations() * num_chars);
}

class MultibyteSplitBenchmark : public cudf::benchmark {
};

#define TRANSPOSE_BM_BENCHMARK_DEFINE(name) \
BENCHMARK_DEFINE_F(MultibyteSplitBenchmark, name)(::benchmark::State & state) \
{ \
BM_multibyte_split(state); \
} \
BENCHMARK_REGISTER_F(MultibyteSplitBenchmark, name) \
->Range(1 << 30, 1 << 30) \
Contributor:
Should we add a parameter dimension that specifies the ratio of delimiters per character (e.g., 1:8, 1:16, 1:32) and test performance as a function of it?

cwharris (Contributor Author), Jul 26, 2021:

Yes, the benchmark is not up to snuff. If there are a significant number of delimiters, performance could be lower, but only because it would increase the number of offsets that need to be written. Otherwise there's not a lot of divergence in the implementation.

cwharris (Contributor Author), Aug 4, 2021:
Going to talk to @davidwendt about how to generate data for this benchmark.

->UseManualTime() \
->Unit(benchmark::kMillisecond);

TRANSPOSE_BM_BENCHMARK_DEFINE(multibyte_split_simple);
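
The review thread above asks about benchmarking as a function of delimiter density, while the benchmark input as written is all `'x'` and contains no delimiters. A minimal sketch of how such input could be generated is below. `make_delimited_input` and `count_delimiters` are hypothetical helpers, not part of this PR, and the mapping from a "1:8" ratio to `period == 8` is an assumption of the sketch.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the PR): builds `num_chars` bytes of filler
// and places `delimiter` so that one delimiter ends every `period` bytes.
// A ratio of 1:8 in the review comment corresponds to period == 8 here.
std::string make_delimited_input(std::size_t num_chars,
                                 std::string const& delimiter,
                                 std::size_t period)
{
  std::string out(num_chars, 'x');
  // Nothing to place if the delimiter is empty or does not fit in one period.
  if (delimiter.empty() || period < delimiter.size()) { return out; }
  // Overwrite the tail of each period-sized window with the delimiter.
  for (std::size_t end = period; end <= num_chars; end += period) {
    out.replace(end - delimiter.size(), delimiter.size(), delimiter);
  }
  return out;
}

// Counts non-overlapping occurrences of `delimiter` in `text`.
std::size_t count_delimiters(std::string const& text, std::string const& delimiter)
{
  if (delimiter.empty()) { return 0; }
  std::size_t count = 0;
  for (auto pos = text.find(delimiter); pos != std::string::npos;
       pos = text.find(delimiter, pos + delimiter.size())) {
    ++count;
  }
  return count;
}
```

In the benchmark, the period could plausibly come from a second benchmark argument (for example `state.range(1)`), so that throughput can be plotted against delimiter density. That wiring is left out here because it depends on how the benchmark is ultimately registered.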
47 changes: 47 additions & 0 deletions cpp/include/cudf/io/text/data_chunk_source.hpp
@@ -0,0 +1,47 @@
#pragma once

#include <cudf/utilities/span.hpp>

#include <rmm/cuda_stream_pool.hpp>
#include <rmm/device_buffer.hpp>

#include <cstdint>
#include <memory>

namespace cudf {
namespace io {
namespace text {

/**
* @brief represents a possibly-shared view over device memory.
*/
struct data_chunk {
data_chunk(device_span<char const> data) : _data(data) {}

operator cudf::device_span<char const>() { return _data; }

uint32_t size() const { return _data.size(); }

private:
device_span<char const> _data;
};

/**
* @brief a reader capable of producing views over device memory
*
*/
class data_chunk_reader {
public:
// readers are destroyed through std::unique_ptr<data_chunk_reader>, so the destructor is virtual.
virtual ~data_chunk_reader() = default;
virtual data_chunk get_next_chunk(uint32_t size, rmm::cuda_stream_view stream) = 0;
};

/**
* @brief a data source capable of creating a reader which can produce views of the data source in
* device memory.
*
*/
class data_chunk_source {
public:
// sources are destroyed through std::unique_ptr<data_chunk_source>, so the destructor is virtual.
virtual ~data_chunk_source() = default;
virtual std::unique_ptr<data_chunk_reader> create_reader() = 0;
};

} // namespace text
} // namespace io
} // namespace cudf
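
To make the source/reader split in this header concrete, here is a host-only sketch of the same shape: a source creates independent readers, and a reader hands back chunks of at most the requested size until the input is exhausted. All names (`host_chunk_reader`, `host_chunk_source`, `istream_host_reader`, `string_host_source`, `read_all_chunks`) are hypothetical. The real interfaces return views over device memory and take a CUDA stream, which this sketch deliberately leaves out. Treating an empty chunk as end of input is an assumption of the sketch.

```cpp
#include <cstddef>
#include <istream>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Host-only stand-in for data_chunk_reader (hypothetical name).
class host_chunk_reader {
 public:
  virtual ~host_chunk_reader() = default;
  // Returns up to `size` bytes; an empty chunk signals end of input.
  virtual std::string get_next_chunk(std::size_t size) = 0;
};

// Host-only stand-in for data_chunk_source (hypothetical name).
class host_chunk_source {
 public:
  virtual ~host_chunk_source() = default;
  virtual std::unique_ptr<host_chunk_reader> create_reader() const = 0;
};

class istream_host_reader : public host_chunk_reader {
 public:
  explicit istream_host_reader(std::unique_ptr<std::istream> in) : _in(std::move(in)) {}

  std::string get_next_chunk(std::size_t size) override
  {
    std::string buffer(size, '\0');
    _in->read(&buffer[0], static_cast<std::streamsize>(size));
    // gcount() reports how many bytes the last read actually produced,
    // which is smaller than `size` for the final chunk.
    buffer.resize(static_cast<std::size_t>(_in->gcount()));
    return buffer;
  }

 private:
  std::unique_ptr<std::istream> _in;
};

class string_host_source : public host_chunk_source {
 public:
  explicit string_host_source(std::string data) : _data(std::move(data)) {}

  std::unique_ptr<host_chunk_reader> create_reader() const override
  {
    return std::make_unique<istream_host_reader>(std::make_unique<std::istringstream>(_data));
  }

 private:
  std::string _data;  // stored by value so the source owns its bytes
};

// Drains a fresh reader in fixed-size requests and collects the chunks.
std::vector<std::string> read_all_chunks(host_chunk_source const& source, std::size_t chunk_size)
{
  std::vector<std::string> chunks;
  auto reader = source.create_reader();
  for (auto chunk = reader->get_next_chunk(chunk_size); !chunk.empty();
       chunk = reader->get_next_chunk(chunk_size)) {
    chunks.push_back(std::move(chunk));
  }
  return chunks;
}
```

Because each call to `create_reader` starts from the beginning of the data, the same source can be read more than once, which is the property a benchmark loop relies on.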
201 changes: 201 additions & 0 deletions cpp/include/cudf/io/text/data_chunk_source_factories.hpp
@@ -0,0 +1,201 @@
#pragma once

#include <cudf/detail/nvtx/ranges.hpp>
#include <cudf/io/text/data_chunk_source.hpp>
#include <cudf/scalar/scalar.hpp>

#include <rmm/device_buffer.hpp>

#include <thrust/host_vector.h>
#include <thrust/system/cuda/experimental/pinned_allocator.h>

#include <fstream>
#include <memory>
#include <sstream>
#include <string>
#include <unordered_map>

namespace cudf {
namespace io {
namespace text {

namespace {

/**
* @brief a reader which produces views of device memory which contain a copy of the data from an
* istream.
*
*/
class istream_data_chunk_reader : public data_chunk_reader {
public:
istream_data_chunk_reader(std::unique_ptr<std::istream> datastream)
Contributor:
Is this intentionally taking ownership of the istream?

Contributor Author:
Yes it is.

jrhemstad (Contributor), Aug 24, 2021:
Shouldn't it just be like this then?

Suggested change:
- istream_data_chunk_reader(std::unique_ptr<std::istream> datastream)
+ istream_data_chunk_reader(std::istream&& datastream)

There's no reason to mandate how the lifetime of the istream object is managed, right?

: _datastream(std::move(datastream)), _buffers()
{
// create an event to track the completion of the last host-to-device copy.
CUDA_TRY(cudaEventCreate(&prev_host_copy_event)); //
}

~istream_data_chunk_reader()
{
CUDA_TRY(cudaEventDestroy(prev_host_copy_event)); //
}

device_span<char> find_or_create_data(uint32_t size, rmm::cuda_stream_view stream)
{
auto search = _buffers.find(stream.value());
Contributor:
This can silently break in at least two scenarios. First, if find_or_create_data is called from two threads with their per-thread default stream. In both cases the cudaStream_t's value will be 0, but they refer to different streams.

Second, if a stream is created, used in find_or_create_data, destroyed, and a new stream is created and reuses the same stream ID, then once again you'll have a handle to the wrong stream.

Any time you're trying to use a stream ID as a key you run into these problems.

Contributor Author:
I understand that this is an important thing to get right, especially if we decide to re-use it in other places in cuio. Right now the scope of this cache is limited to a single reading of the file, so the only invocations of this function will always see non-default streams from a stream pool, and once the work is done both the stream pool and cache will be destroyed.

How important is it to get this code reusable in this PR vs get it right the next time we try to use it? Would it be better to document the caveats and make a more general solution later?

Contributor Author:
I don't see these being a problem in the current use case, and won't have a better solution until there are more use cases. Ok to resolve?

Contributor Author:

I can get rid of the thread-id based lookup and use a concurrent collection to store unused tickets. If no tickets are available, one is created (up to a certain number). When a ticket is no longer in use, it can be returned to the collection. This way it doesn't matter what stream is used to read the chunk.

cwharris (Contributor Author), Aug 9, 2021:
This is impractical to address differently unless we change rmm's sync-and-steal behavior or use cudaStreamAddCallback to indicate when a buffer is ready for re-use. :(

Contributor:
It can be resolved by mapping the user specified stream to an internal side stream and synchronizing the user specified stream with the internal stream. That way you know the lifetime of the internal streams.


if (search == _buffers.end() || search->second.size() < size) {
_buffers[stream.value()] = rmm::device_buffer(size, stream);
}

return device_span<char>(static_cast<char*>(_buffers[stream.value()].data()), size);
}

data_chunk get_next_chunk(uint32_t read_size, rmm::cuda_stream_view stream) override
{
CUDF_FUNC_RANGE();

// synchronize on the last host-to-device copy, so we don't clobber the host buffer.
CUDA_TRY(cudaEventSynchronize(prev_host_copy_event));

// resize the host buffer as necessary to contain the requested number of bytes
if (_host_buffer.size() < read_size) { _host_buffer.resize(read_size); }

// read data from the host istream into the pinned host memory buffer
_datastream->read(_host_buffer.data(), read_size);

// adjust the read size to reflect how many bytes were actually read from the data stream
read_size = _datastream->gcount();

// get a view over some device memory we can use to buffer the read data on the device.
auto chunk_span = find_or_create_data(read_size, stream);

// copy the host-pinned data onto the device
CUDA_TRY(cudaMemcpyAsync( //
chunk_span.data(),
_host_buffer.data(),
read_size,
cudaMemcpyHostToDevice,
stream.value()));

// record the host-to-device copy.
CUDA_TRY(cudaEventRecord(prev_host_copy_event, stream.value()));

// return the view over device memory so it can be processed.
return data_chunk(chunk_span);
}

private:
std::unique_ptr<std::istream> _datastream;
std::unordered_map<cudaStream_t, rmm::device_buffer> _buffers;
cudaEvent_t prev_host_copy_event;
thrust::host_vector<char, thrust::system::cuda::experimental::pinned_allocator<char>>
_host_buffer{};
};
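
The review thread on `find_or_create_data` objects to keying buffers by stream handle, and the author's reply floats a pool of reusable "tickets" instead. The following is a host-only sketch of that idea under stated assumptions: `ticket` and `ticket_pool` are hypothetical names, the ticket holds plain bytes rather than pinned or device memory, and a mutex-guarded vector stands in for the "concurrent collection" mentioned in the thread.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical host-side ticket: in the reader it might hold a pinned host
// buffer, a device buffer, and an event; here it only holds bytes.
struct ticket {
  std::vector<char> buffer;
};

// Hypothetical pool of reusable tickets, handed out regardless of which
// stream (or thread) asks, so no stream handle is ever used as a key.
class ticket_pool {
 public:
  explicit ticket_pool(std::size_t max_tickets) : _max_tickets(max_tickets) {}

  // Reuses an idle ticket if one exists, otherwise creates one up to the cap.
  std::unique_ptr<ticket> acquire(std::size_t min_size)
  {
    std::lock_guard<std::mutex> lock(_mutex);
    std::unique_ptr<ticket> t;
    if (!_idle.empty()) {
      t = std::move(_idle.back());
      _idle.pop_back();
    } else if (_created < _max_tickets) {
      t = std::make_unique<ticket>();
      ++_created;
    } else {
      throw std::runtime_error("ticket_pool exhausted");
    }
    // Grow the buffer only when the request is larger than what is cached.
    if (t->buffer.size() < min_size) { t->buffer.resize(min_size); }
    return t;
  }

  // Returns a ticket once the work that used it has completed.
  void release(std::unique_ptr<ticket> t)
  {
    std::lock_guard<std::mutex> lock(_mutex);
    _idle.push_back(std::move(t));
  }

  // Number of tickets created so far (idle or in use).
  std::size_t created() const
  {
    std::lock_guard<std::mutex> lock(_mutex);
    return _created;
  }

 private:
  mutable std::mutex _mutex;
  std::vector<std::unique_ptr<ticket>> _idle;
  std::size_t _created = 0;
  std::size_t _max_tickets;
};
```

The sketch does not solve the hard part raised later in the thread: on the GPU, `release` may only be called after the asynchronous work using the ticket has finished, which needs an event or a stream callback to detect.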

/**
* @brief a reader which produces views of device memory which represent a subset of the input
* device span
*
*/
class device_span_data_chunk_reader : public data_chunk_reader {
public:
device_span_data_chunk_reader(device_span<char const> data) : _data(data) {}

data_chunk get_next_chunk(uint32_t read_size, rmm::cuda_stream_view stream) override
{
// limit the read size to the number of bytes remaining in the device_span.
if (read_size > _data.size() - _position) { read_size = _data.size() - _position; }

// create a view over the device span
auto chunk_span = _data.subspan(_position, read_size);

// increment position
_position += read_size;

// return the view over device memory so it can be processed.
return data_chunk(chunk_span);
}

private:
device_span<char const> _data;
uint64_t _position = 0;
};

/**
* @brief a file data source which creates an istream_data_chunk_reader
*
*/
class file_data_chunk_source : public data_chunk_source {
public:
file_data_chunk_source(std::string filename) : _filename(filename) {}
std::unique_ptr<data_chunk_reader> create_reader() override
{
return std::make_unique<istream_data_chunk_reader>(
std::make_unique<std::ifstream>(_filename, std::ifstream::in));
}

private:
std::string _filename;
};

/**
* @brief a host string data source which creates an istream_data_chunk_reader
*/
class string_data_chunk_source : public data_chunk_source {
public:
string_data_chunk_source(std::string const& data) : _data(data) {}
std::unique_ptr<data_chunk_reader> create_reader() override
{
return std::make_unique<istream_data_chunk_reader>(std::make_unique<std::istringstream>(_data));
}

private:
std::string const& _data;
};

/**
* @brief a device span data source which creates a device_span_data_chunk_reader
*/
class device_span_data_chunk_source : public data_chunk_source {
public:
device_span_data_chunk_source(device_span<char const> data) : _data(data) {}
std::unique_ptr<data_chunk_reader> create_reader() override
{
return std::make_unique<device_span_data_chunk_reader>(_data);
}

private:
device_span<char const> _data;
};

} // namespace

/**
* @brief Creates a data source capable of producing device-buffered views of the given string.
*/
std::unique_ptr<data_chunk_source> make_source(std::string const& data)
{
return std::make_unique<string_data_chunk_source>(data);
}

/**
* @brief Creates a data source capable of producing device-buffered views of the file
*/
std::unique_ptr<data_chunk_source> make_source_from_file(std::string const& filename)
{
return std::make_unique<file_data_chunk_source>(filename);
}

/**
* @brief Creates a data source capable of producing views of the given device string scalar
*/
std::unique_ptr<data_chunk_source> make_source(cudf::string_scalar& data)
{
auto data_span = device_span<char const>(data.data(), data.size());
return std::make_unique<device_span_data_chunk_source>(data_span);
}

} // namespace text
} // namespace io
} // namespace cudf
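
The factories above pass a `std::ifstream` or a `std::istringstream` to the same reader through `std::unique_ptr<std::istream>`, and the review thread on the reader's constructor asks whether an rvalue reference would do instead. One relevant detail is that `std::istream` itself has a protected move constructor, so a reader cannot store the base type by value. The sketch below illustrates this with hypothetical names (`owned_stream_reader`, `make_reader`). It is one possible way to reconcile the two positions, not what the PR does.

```cpp
#include <istream>
#include <memory>
#include <sstream>
#include <string>
#include <type_traits>
#include <utility>

// std::istream's move constructor is protected, so a reader cannot store a
// std::istream by value from an rvalue parameter; concrete streams can move.
static_assert(!std::is_move_constructible<std::istream>::value, "base stream is not movable");
static_assert(std::is_move_constructible<std::istringstream>::value, "concrete stream is movable");

// Hypothetical owner: holds any concrete stream through the base interface.
class owned_stream_reader {
 public:
  explicit owned_stream_reader(std::unique_ptr<std::istream> in) : _in(std::move(in)) {}

  // Reads one line, without its trailing newline.
  std::string read_line()
  {
    std::string line;
    std::getline(*_in, line);
    return line;
  }

 private:
  std::unique_ptr<std::istream> _in;
};

// Hypothetical adapter for callers that hold a concrete stream by value:
// moves it to the heap so the reader can own it through the base class.
template <typename Stream>
owned_stream_reader make_reader(Stream stream)
{
  static_assert(std::is_base_of<std::istream, Stream>::value, "expected an input stream");
  return owned_stream_reader(std::make_unique<Stream>(std::move(stream)));
}
```

With an adapter like this, the caller is free to create the stream however it likes, while the reader still has a single owning member whose lifetime it controls.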
22 changes: 22 additions & 0 deletions cpp/include/cudf/io/text/multibyte_split.hpp
@@ -0,0 +1,22 @@
#pragma once

#include <cudf/io/text/data_chunk_source.hpp>

#include <cudf/column/column.hpp>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

#include <memory>
#include <string>
#include <vector>

namespace cudf {
namespace io {
namespace text {

std::unique_ptr<cudf::column> multibyte_split(
data_chunk_source& source,
std::vector<std::string> const& delimiters,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

} // namespace text
} // namespace io
} // namespace cudf
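
The declaration above carries no documentation in this snapshot, so a host-side reference for the intended row semantics may help when reading the tests. The sketch below is a plain CPU loop, not the GPU algorithm. It assumes, based on the commit title about delimiters being retained, that each output row ends with the delimiter that terminated it. `split_retaining_delimiters` is a hypothetical name, and preferring the longest delimiter when several match at one position is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical host-side reference (not the GPU implementation): splits
// `input` after every delimiter occurrence and keeps the delimiter at the
// end of its row. Ties at one position go to the longest delimiter, which
// is an assumption of this sketch.
std::vector<std::string> split_retaining_delimiters(std::string const& input,
                                                    std::vector<std::string> const& delimiters)
{
  std::vector<std::string> rows;
  std::size_t row_begin = 0;
  std::size_t pos       = 0;
  while (pos < input.size()) {
    // Find the longest delimiter that matches at the current byte position.
    std::size_t match_size = 0;
    for (auto const& d : delimiters) {
      if (d.size() > match_size && input.compare(pos, d.size(), d) == 0) { match_size = d.size(); }
    }
    if (match_size == 0) {
      ++pos;
      continue;
    }
    // Close the current row just past the delimiter.
    pos += match_size;
    rows.push_back(input.substr(row_begin, pos - row_begin));
    row_begin = pos;
  }
  // Emit the trailing row if the input does not end with a delimiter.
  if (row_begin < input.size()) { rows.push_back(input.substr(row_begin)); }
  return rows;
}
```

Matching is done on bytes, so multi-byte UTF-8 delimiters such as the emoji used in the benchmark are handled the same way as `","` or `"::"`. How the real function treats input that ends exactly on a delimiter is not visible in this snapshot, and the sketch simply emits no extra row in that case.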