Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a chunked_pack API #13260

Merged
merged 45 commits into from
May 18, 2023
Merged
Show file tree
Hide file tree
Changes from 22 commits
Commits
Show all changes
45 commits
Select commit Hold shift + click to select a range
a8ae8cc
Chunked pack implementation
abellina Mar 10, 2023
3ea2844
Fix code style
abellina May 1, 2023
2201de5
Merge branch 'branch-23.06' into chunked_pack
abellina May 1, 2023
fe3bd77
Const fixes. Structured binding. Use buffer instead of bounce buffer
abellina May 6, 2023
4a4da2f
Use rmm::device_uvector instead of rmm::device_buffer in benchmark
abellina May 6, 2023
bd87dd5
Address review comments
abellina May 7, 2023
4aae464
Merge branch 'chunked_pack' of github.com:abellina/cudf into chunked_…
abellina May 7, 2023
bd389ae
Fix conflicts in upmerge to 23.06
abellina May 7, 2023
3b83b03
Add more comments in the contiguous_split_state constructor
abellina May 7, 2023
425a313
Use std::get<> and remove commented out code
abellina May 9, 2023
f7b18d3
Merge branch 'branch-23.06' into chunked_pack
abellina May 9, 2023
cc9acb4
fix styles
abellina May 9, 2023
9b3a591
chunked_pack::create and address other reviewer feedback
abellina May 10, 2023
60a29b0
Merge branch 'branch-23.06' into chunked_pack
abellina May 10, 2023
8c02d02
Merge branch 'branch-23.06' into chunked_pack
abellina May 11, 2023
0480ef1
Merge branch 'branch-23.06' into chunked_pack
abellina May 11, 2023
8f9dbbd
Update due to code review comments
abellina May 12, 2023
e2ca459
Const fixes as suggested in review
abellina May 12, 2023
394ab62
Const changes
abellina May 12, 2023
c4864f1
Style fixes
abellina May 12, 2023
3922d48
Upmerge + fix conflicts
abellina May 12, 2023
78361a8
Merge branch 'branch-23.06' into chunked_pack
abellina May 15, 2023
be5fb01
Remove header that is not neededfrom split_tests.cpp
abellina May 16, 2023
e0686ff
Changed to contiguous_split.hpp/.cu per review
abellina May 16, 2023
2a822e6
Remove stream from API
abellina May 16, 2023
8cf0afa
Merge branch 'branch-23.06' into chunked_pack
abellina May 16, 2023
2ebb68b
Update return documentation
abellina May 16, 2023
0e10fdc
Initializer list + const fixes
abellina May 16, 2023
8c344a4
Rely on free standing column_views for edge case + remove extra slash…
abellina May 16, 2023
a9dfbc7
Merge branch 'branch-23.06' into chunked_pack
abellina May 16, 2023
7a73d57
Remove extra header in split_tests.cpp
abellina May 17, 2023
dae5ceb
Introduce temp_mr. Comment edits and some more initializer list and c…
abellina May 17, 2023
ff6ce21
bif_shift -> bit_shift
abellina May 17, 2023
b47f2c7
Leverage make_device_uvector_async
abellina May 17, 2023
b1fba4e
Remove extra brackets and use range-based for loop
abellina May 17, 2023
5ef8bd3
Merge branch 'chunked_pack' of github.com:abellina/cudf into chunked_…
abellina May 17, 2023
ab98d70
Remove extra cast and use .front
abellina May 17, 2023
8ec29fa
Fix styles
abellina May 17, 2023
f13156e
Merge branch 'branch-23.06' into chunked_pack
abellina May 17, 2023
d959039
Move temp_mr comment to chunked_pack::create and make sure temp_mr de…
abellina May 18, 2023
e16306c
Make contiguous_split_state constructor private and along with it oth…
abellina May 18, 2023
b8dcff8
Fix comment as per review
abellina May 18, 2023
a41db75
Add doxygen coments in contiguous_split_state and chunk_iteration_state
abellina May 18, 2023
bf208b4
Make packed_split_indices_and_src_buf_info not explicit
abellina May 18, 2023
3a3e9cd
Merge branch 'branch-23.06' into chunked_pack
abellina May 18, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
90 changes: 81 additions & 9 deletions cpp/benchmarks/copying/contiguous_split.cu
Original file line number Diff line number Diff line change
Expand Up @@ -25,12 +25,30 @@

#include <thrust/iterator/counting_iterator.h>

template <typename T>
void contiguous_split(cudf::table_view const& src_table, std::vector<cudf::size_type> const& splits)
{
auto result = cudf::contiguous_split(src_table, splits);
}

void chunked_pack(cudf::table_view const& src_table, std::vector<cudf::size_type> const&)
{
auto const mr = rmm::mr::get_current_device_resource();
auto const stream = cudf::get_default_stream();
auto user_buffer = rmm::device_uvector<std::uint8_t>(100L * 1024 * 1024, stream, mr);
auto chunked_pack = cudf::chunked_pack::create(src_table, user_buffer.size(), mr);
while (chunked_pack->has_next()) {
auto iter_size = chunked_pack->next(user_buffer);
}
stream.synchronize();
}

template <typename T, typename ContigSplitImpl>
void BM_contiguous_split_common(benchmark::State& state,
std::vector<T>& src_cols,
int64_t num_rows,
int64_t num_splits,
int64_t bytes_total)
int64_t bytes_total,
ContigSplitImpl& impl)
{
// generate splits
std::vector<cudf::size_type> splits;
Expand All @@ -57,16 +75,18 @@ void BM_contiguous_split_common(benchmark::State& state,

for (auto _ : state) {
cuda_event_timer raii(state, true); // flush_l2_cache = true, stream = 0
auto result = cudf::contiguous_split(src_table, splits);
impl(src_table, splits);
}

// it's 2x bytes_total because we're both reading and writing.
state.SetBytesProcessed(static_cast<int64_t>(state.iterations()) * bytes_total * 2);
}

class ContiguousSplit : public cudf::benchmark {};
class ChunkedPack : public cudf::benchmark {};

void BM_contiguous_split(benchmark::State& state)
template <typename ContiguousSplitImpl>
void BM_contiguous_split(benchmark::State& state, ContiguousSplitImpl& impl)
{
int64_t const total_desired_bytes = state.range(0);
cudf::size_type const num_cols = state.range(1);
Expand All @@ -91,12 +111,14 @@ void BM_contiguous_split(benchmark::State& state)
(include_validity ? (max(int64_t{1}, (num_rows / 32)) * sizeof(cudf::bitmask_type) * num_cols)
: 0);

BM_contiguous_split_common(state, src_cols, num_rows, num_splits, total_bytes);
BM_contiguous_split_common(state, src_cols, num_rows, num_splits, total_bytes, impl);
}

class ContiguousSplitStrings : public cudf::benchmark {};
class ChunkedPackStrings : public cudf::benchmark {};

void BM_contiguous_split_strings(benchmark::State& state)
template <typename ContiguousSplitImpl>
void BM_contiguous_split_strings(benchmark::State& state, ContiguousSplitImpl& impl)
{
int64_t const total_desired_bytes = state.range(0);
cudf::size_type const num_cols = state.range(1);
Expand Down Expand Up @@ -133,13 +155,13 @@ void BM_contiguous_split_strings(benchmark::State& state)
(include_validity ? (max(int64_t{1}, (num_rows / 32)) * sizeof(cudf::bitmask_type) * num_cols)
: 0);

BM_contiguous_split_common(state, src_cols, num_rows, num_splits, total_bytes);
BM_contiguous_split_common(state, src_cols, num_rows, num_splits, total_bytes, impl);
}

#define CSBM_BENCHMARK_DEFINE(name, size, num_columns, num_splits, validity) \
BENCHMARK_DEFINE_F(ContiguousSplit, name)(::benchmark::State & state) \
{ \
BM_contiguous_split(state); \
BM_contiguous_split(state, contiguous_split); \
} \
BENCHMARK_REGISTER_F(ContiguousSplit, name) \
->Args({size, num_columns, num_splits, validity}) \
Expand Down Expand Up @@ -168,7 +190,7 @@ CSBM_BENCHMARK_DEFINE(1Gb1ColValidityNoSplits, (int64_t)1 * 1024 * 1024 * 1024,
#define CSBM_STRINGS_BENCHMARK_DEFINE(name, size, num_columns, num_splits, validity) \
BENCHMARK_DEFINE_F(ContiguousSplitStrings, name)(::benchmark::State & state) \
{ \
BM_contiguous_split_strings(state); \
BM_contiguous_split_strings(state, contiguous_split); \
} \
BENCHMARK_REGISTER_F(ContiguousSplitStrings, name) \
->Args({size, num_columns, num_splits, validity}) \
Expand All @@ -189,3 +211,53 @@ CSBM_STRINGS_BENCHMARK_DEFINE(1Gb10ColsNoValidity, (int64_t)1 * 1024 * 1024 * 10
CSBM_STRINGS_BENCHMARK_DEFINE(1Gb10ColsValidity, (int64_t)1 * 1024 * 1024 * 1024, 10, 256, 1);
CSBM_STRINGS_BENCHMARK_DEFINE(1Gb1ColNoSplits, (int64_t)1 * 1024 * 1024 * 1024, 1, 0, 0);
CSBM_STRINGS_BENCHMARK_DEFINE(1Gb1ColValidityNoSplits, (int64_t)1 * 1024 * 1024 * 1024, 1, 0, 1);

#define CCSBM_BENCHMARK_DEFINE(name, size, num_columns, num_splits, validity) \
BENCHMARK_DEFINE_F(ChunkedPack, name)(::benchmark::State & state) \
{ \
BM_contiguous_split(state, chunked_pack); \
} \
BENCHMARK_REGISTER_F(ChunkedPack, name) \
->Args({size, num_columns, num_splits, validity}) \
->Unit(benchmark::kMillisecond) \
->UseManualTime() \
->Iterations(8)
CCSBM_BENCHMARK_DEFINE(6Gb512ColsNoValidity, (int64_t)6 * 1024 * 1024 * 1024, 512, 0, 0);
CCSBM_BENCHMARK_DEFINE(6Gb512ColsValidity, (int64_t)6 * 1024 * 1024 * 1024, 512, 0, 1);
CCSBM_BENCHMARK_DEFINE(6Gb10ColsNoValidity, (int64_t)6 * 1024 * 1024 * 1024, 10, 0, 0);
CCSBM_BENCHMARK_DEFINE(6Gb10ColsValidity, (int64_t)6 * 1024 * 1024 * 1024, 10, 0, 1);

CCSBM_BENCHMARK_DEFINE(4Gb512ColsNoValidity, (int64_t)4 * 1024 * 1024 * 1024, 512, 0, 0);
CCSBM_BENCHMARK_DEFINE(4Gb512ColsValidity, (int64_t)4 * 1024 * 1024 * 1024, 512, 0, 1);
CCSBM_BENCHMARK_DEFINE(4Gb10ColsNoValidity, (int64_t)4 * 1024 * 1024 * 1024, 10, 0, 0);
CCSBM_BENCHMARK_DEFINE(4Gb10ColsValidity, (int64_t)4 * 1024 * 1024 * 1024, 10, 0, 1);
CCSBM_BENCHMARK_DEFINE(4Gb4ColsValidity, (int64_t)1 * 1024 * 1024 * 1024, 4, 0, 1);

CCSBM_BENCHMARK_DEFINE(1Gb512ColsNoValidity, (int64_t)1 * 1024 * 1024 * 1024, 512, 0, 0);
CCSBM_BENCHMARK_DEFINE(1Gb512ColsValidity, (int64_t)1 * 1024 * 1024 * 1024, 512, 0, 1);
CCSBM_BENCHMARK_DEFINE(1Gb10ColsNoValidity, (int64_t)1 * 1024 * 1024 * 1024, 10, 0, 0);
CCSBM_BENCHMARK_DEFINE(1Gb10ColsValidity, (int64_t)1 * 1024 * 1024 * 1024, 10, 0, 1);
CCSBM_BENCHMARK_DEFINE(1Gb1ColValidity, (int64_t)1 * 1024 * 1024 * 1024, 1, 0, 1);

#define CCSBM_STRINGS_BENCHMARK_DEFINE(name, size, num_columns, num_splits, validity) \
BENCHMARK_DEFINE_F(ChunkedPackStrings, name)(::benchmark::State & state) \
{ \
BM_contiguous_split_strings(state, chunked_pack); \
} \
BENCHMARK_REGISTER_F(ChunkedPackStrings, name) \
->Args({size, num_columns, num_splits, validity}) \
->Unit(benchmark::kMillisecond) \
->UseManualTime() \
->Iterations(8)

CCSBM_STRINGS_BENCHMARK_DEFINE(4Gb512ColsNoValidity, (int64_t)4 * 1024 * 1024 * 1024, 512, 0, 0);
CCSBM_STRINGS_BENCHMARK_DEFINE(4Gb512ColsValidity, (int64_t)4 * 1024 * 1024 * 1024, 512, 0, 1);
CCSBM_STRINGS_BENCHMARK_DEFINE(4Gb10ColsNoValidity, (int64_t)4 * 1024 * 1024 * 1024, 10, 0, 0);
CCSBM_STRINGS_BENCHMARK_DEFINE(4Gb10ColsValidity, (int64_t)4 * 1024 * 1024 * 1024, 10, 0, 1);
CCSBM_STRINGS_BENCHMARK_DEFINE(4Gb4ColsValidity, (int64_t)1 * 1024 * 1024 * 1024, 4, 0, 1);

CCSBM_STRINGS_BENCHMARK_DEFINE(1Gb512ColsNoValidity, (int64_t)1 * 1024 * 1024 * 1024, 512, 0, 0);
CCSBM_STRINGS_BENCHMARK_DEFINE(1Gb512ColsValidity, (int64_t)1 * 1024 * 1024 * 1024, 512, 0, 1);
CCSBM_STRINGS_BENCHMARK_DEFINE(1Gb10ColsNoValidity, (int64_t)1 * 1024 * 1024 * 1024, 10, 0, 0);
CCSBM_STRINGS_BENCHMARK_DEFINE(1Gb10ColsValidity, (int64_t)1 * 1024 * 1024 * 1024, 10, 0, 1);
CCSBM_STRINGS_BENCHMARK_DEFINE(1Gb1ColValidity, (int64_t)1 * 1024 * 1024 * 1024, 1, 0, 1);
ttnghia marked this conversation as resolved.
Show resolved Hide resolved
141 changes: 141 additions & 0 deletions cpp/include/cudf/contiguous_split.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -127,6 +127,147 @@ std::vector<packed_table> contiguous_split(
std::vector<size_type> const& splits,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

namespace detail {
struct contiguous_split_state;
};

/**
* @brief Perform a chunked "pack" operation of the input `table_view` using a user provided
* buffer of size `user_buffer_size`.
*
* The intent of this operation is to be used in a streamed fashion at times of GPU
* out-of-memory, where we want to minimize the number of small cudaMemcpy calls and
* tracking of all the metadata associated with cudf tables. Because of the memory constraints,
* all thrust and scratch memory allocations are using the passed-in memory resource exclusively,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with documentation, I think this behavior is unexpected given the general approach that libcudf takes. I understand why we might need to support this, but I would prefer it be communicated very clearly in the API and not rely solely on documentation. Can we modify the struct to have two memory resources, the usual mr and an extra temp_mr or so? They can both default to the per-device resource. That way the caller gets a very clear indication that the memory resource used for temp allocations is also controllable in this particular instance.

If all allocations made by this function are temporary because it's just splitting an existing buffer, then maybe just renaming the mr makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chunked_pack never allocates result buffers. So I am going to replace mr in this api with temp_mr to make it clear, as you suggest

* not a per-device memory resource.
*
* The caller has two methods it can use to carry out the chunked_pack: has_next and next.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* The caller has two methods it can use to carry out the chunked_pack: has_next and next.
* This class defines two methods that must be used in concert to carry out the chunked_pack: has_next and next.

* Here is an example:
*
davidwendt marked this conversation as resolved.
Show resolved Hide resolved
* // Create a table_view
* cudf::table_view tv = ...;
*
* // Choose a memory resource. This memory resource is used for scratch/thrust temporary
* // data. In memory constrained cases, this can be used to set aside scratch memory
* // for `chunked_pack` at the beginning of a program.
* auto mr = rmm::mr::get_current_device_resource();
*
* auto stream = cudf::get_default_stream();
*
* // Define a buffer size for each chunk: the larger the buffer is, the more SMs can be
* // occupied by this algorithm.
* //
* // Internally, the GPU unit-of-work is a 1MB batch. When we instantiate `cudf::chunked_pack`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* // Internally, the GPU unit-of-work is a 1MB batch. When we instantiate `cudf::chunked_pack`,
* // Internally, the GPU unit of work is a 1MB batch. When we instantiate `cudf::chunked_pack`,

* // all the 1MB batches for the source table_view are computed up front. Additionally,
* // chunked_pack calculates the number of iterations that are required to go through all those
* // batches given a `user_buffer_size` buffer. The number of 1MB batches in each iteration (chunk)
* // equals the number of CUDA blocks that will be used for the main kernel launch.
* //
* std::size_t user_buffer_size = 128*1024*1024;
*
* auto chunked_packer = cudf::chunked_pack::create(tv, user_buffer_size, stream, mr);
*
* std::size_t host_offset = 0;
* auto host_buffer = ...; // obtain a host buffer you would like to copy to
*
* while (chunked_packer->has_next()) {
* // get a user buffer of size `user_buffer_size`
* cudf::device_span<uint8_t> user_buffer = ...;
* std::size_t bytes_copied = chunked_packer->next(user_buffer);
*
* // buffer will hold the contents of at most `user_buffer_size` bytes
* // of the contiguously packed input `table_view`. You are now free to copy
* // this memory somewhere else, for example, to host.
* cudaMemcpyAsync(
* host_buffer.data() + host_offset,
* user_buffer.data(),
* bytes_copied,
* cudaMemcpyDefault,
* stream);
*
* host_offset += bytes_copied;
* }
davidwendt marked this conversation as resolved.
Show resolved Hide resolved
*/
class chunked_pack {
public:
/**
* @brief Construct a `chunked_pack` class.
*
* @param input source `table_view` to pack
* @param user_buffer_size buffer size (in bytes) that will be passed on `next`. Must be
* at least 1MB
* @param stream CUDA stream used for device memory operations and kernel launches
* @param mr RMM memory resource to be used for temporary and scratch allocations only
*/
explicit chunked_pack(cudf::table_view const& input,
std::size_t user_buffer_size,
ttnghia marked this conversation as resolved.
Show resolved Hide resolved
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr);

/**
* @brief Destructor that will be implemented as default, required because
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @brief Destructor that will be implemented as default, required because
* @brief Destructor that will be implemented as default. Declared with definition here because

* contiguous_split_state is incomplete at this stage.
*/
~chunked_pack();

/**
* @brief Obtain the total size of the contiguously packed `table_view`.
*
* @return total size (in bytes) of all the chunks
*/
[[nodiscard]] std::size_t get_total_contiguous_size() const;

/**
* @brief Function to check if there are chunks left to be copied.
*
* @return true if there are chunks left to be copied, and false otherwise
*/
[[nodiscard]] bool has_next() const;

/**
* @brief Packs the next chunk into `user_buffer`. This should be call as long as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @brief Packs the next chunk into `user_buffer`. This should be call as long as
* @brief Packs the next chunk into `user_buffer`. This should be called as long as

* `has_next` returns true. If `next` is called when `has_next` is false, an exception
* is thrown.
*
* @throws cudf::logic_error If the size of `user_buffer` is different than `user_buffer_size`
ttnghia marked this conversation as resolved.
Show resolved Hide resolved
* @throws cudf::logic_error If called after all chunks have been copied
*
* @param user_buffer device span target for the chunk. The size of this span must equal
* the `user_buffer_size` parameter passed at construction
* @return The number of bytes that were written to `user_buffer` (at most
* `user_buffer_size`)
*/
[[nodiscard]] std::size_t next(cudf::device_span<uint8_t> const& user_buffer);

/**
* @brief Build the opaque metadata for all added columns.
*
* @return A vector containing the serialized column metadata
*/
[[nodiscard]] std::unique_ptr<std::vector<uint8_t>> build_metadata() const;

/**
* @brief Creates a `chunked_pack` instance to perform a "pack" of the `table_view`
* "input", where a buffer of `user_buffer_size` is filled with chunks of the
* overall operation. This operation can be used in cases where GPU memory is constrained.
*
* @throws cudf::logic_error When user_buffer_size is less than 1MB
*
* @param input source `table_view` to pack
* @param user_buffer_size buffer size (in bytes) that will be passed on `next`. Must be
* at least 1MB
* @param mr RMM memory resource to be used for temporary and scratch allocations only
* @return a unique_ptr of chunked_pack
*/
[[nodiscard]] static std::unique_ptr<chunked_pack> create(cudf::table_view const& input,
std::size_t user_buffer_size,
rmm::mr::device_memory_resource* mr);

private:
// internal state of contiguous split
std::unique_ptr<detail::contiguous_split_state> state;
};

/**
* @brief Deep-copy a `table_view` into a serialized contiguous memory format.
*
Expand Down
Loading