Implement a chunked_pack API #13260

abellina · 2023-05-01T21:40:55Z

Description

This PR introduces a "chunked pack" API built on top of contiguous_split. This API is used when we want to copy a cuDF table_view over the wire or to the host in a contiguous layout (aka contiguous_split), but without user provided memory. As a result this API does not allocate any buffers for GPU data and instead it uses a provided user buffer to perform the contiguous_split in chunks.

Luckily, contiguous_split already had a subdivision of work that we are now calling "batches". Each batch is up to 1MB of data from the source table. As such, one can think of this function as copying as many batches as will fit in a user buffer (or "chunking the batches"). The API follows other chunked interfaces in cuDF with a has_next and a next, with the difference that in this case next takes a device_span, and the user can provide any device_span as long as the size is the same as the size specified during construction.

When thrust and scratch space is required on the GPU, this PR makes use of the memory resource passed, even to the point of using the second argument of exec_policy which is the memory resource. I found this while testing that exec_policy defaults to the per-device resource, and in this case I really wanted to pass a pooled memory resource, outside of our normal async memory resource to set aside this memory ahead of time.

Most of the changes are about moving things around. You'll see 3 structs with the name "packed" (e.g. packed_split_indices_and_src_buf_info). These are here to group together state that contiguous_split needs to work, but now because of chunked_pack we need to keep around as well for the subsequent calls to next. These structs are also packed in memory, which is an optimization contiguous_split had already done to reduce the number of d2h/h2d copies. This PR did need to add a state object (contiguous_split_state) that now contiguous_split leverages. It also makes use of the metadata_builder which we added in a prior PR.

This PR does not include the JNI changes needed for this to work on the java side, I'll post that separately. I figured this was too big already (and if people have suggestions on "chunking" this PR up, I am happy to do that).

@nvdbaranec spent a great deal of time documenting contiguous_split for me and he suggested a path to get this done that I just followed (thank you!!)

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

Co-authored-by: Dave Baranec <[email protected]> Signed-off-by: Alessandro Bellina <[email protected]>

cpp/include/cudf/detail/contiguous_split.hpp

abellina · 2023-05-01T22:16:58Z

I don't see major differences when running CONTIGUOUS_SPLIT_BENCH benchmark. Here are both new and old results:

Benchmark results for this PR, with the `chunked_pack` ones at the end

----------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                          Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------------------------------------
ContiguousSplit/6Gb512ColsNoValidity/6442450944/512/256/0/iterations:8/manual_time              57.8 ms         57.9 ms            8 bytes_per_second=207.691G/s
ContiguousSplit/6Gb512ColsValidity/6442450944/512/256/1/iterations:8/manual_time                57.3 ms         57.3 ms            8 bytes_per_second=216.064G/s
ContiguousSplit/6Gb10ColsNoValidity/6442450944/10/256/0/iterations:8/manual_time                27.9 ms         28.0 ms            8 bytes_per_second=429.477G/s
ContiguousSplit/6Gb10ColsValidity/6442450944/10/256/1/iterations:8/manual_time                  29.2 ms         29.2 ms            8 bytes_per_second=423.943G/s
ContiguousSplit/4Gb512ColsNoValidity/4294967296/512/256/0/iterations:8/manual_time              44.5 ms         44.5 ms            8 bytes_per_second=179.893G/s
ContiguousSplit/4Gb512ColsValidity/4294967296/512/256/1/iterations:8/manual_time                47.4 ms         47.5 ms            8 bytes_per_second=173.998G/s
ContiguousSplit/4Gb10ColsNoValidity/4294967296/10/256/0/iterations:8/manual_time                19.5 ms         19.5 ms            8 bytes_per_second=410.685G/s
ContiguousSplit/4Gb10ColsValidity/4294967296/10/256/1/iterations:8/manual_time                  19.9 ms         19.9 ms            8 bytes_per_second=414.661G/s
ContiguousSplit/4Gb4ColsNoSplits/1073741824/4/0/1/iterations:8/manual_time                      4.40 ms         4.42 ms            8 bytes_per_second=468.835G/s
ContiguousSplit/4Gb4ColsValidityNoSplits/1073741824/4/0/1/iterations:8/manual_time              4.38 ms         4.40 ms            8 bytes_per_second=470.987G/s
ContiguousSplit/1Gb512ColsNoValidity/1073741824/512/256/0/iterations:8/manual_time              32.6 ms         32.6 ms            8 bytes_per_second=61.4436G/s
ContiguousSplit/1Gb512ColsValidity/1073741824/512/256/1/iterations:8/manual_time                35.8 ms         35.8 ms            8 bytes_per_second=57.6267G/s
ContiguousSplit/1Gb10ColsNoValidity/1073741824/10/256/0/iterations:8/manual_time                5.70 ms         5.72 ms            8 bytes_per_second=351.129G/s
ContiguousSplit/1Gb10ColsValidity/1073741824/10/256/1/iterations:8/manual_time                  5.90 ms         5.92 ms            8 bytes_per_second=349.677G/s
ContiguousSplit/1Gb1ColNoSplits/1073741824/1/0/1/iterations:8/manual_time                       4.40 ms         4.42 ms            8 bytes_per_second=469.279G/s
ContiguousSplit/1Gb1ColValidityNoSplits/1073741824/1/0/1/iterations:8/manual_time               4.39 ms         4.41 ms            8 bytes_per_second=470.14G/s
ContiguousSplitStrings/4Gb512ColsNoValidity/4294967296/512/256/0/iterations:8/manual_time        158 ms          158 ms            8 bytes_per_second=50.6568G/s
ContiguousSplitStrings/4Gb512ColsValidity/4294967296/512/256/1/iterations:8/manual_time          146 ms          146 ms            8 bytes_per_second=55.5446G/s
ContiguousSplitStrings/4Gb10ColsNoValidity/4294967296/10/256/0/iterations:8/manual_time         30.6 ms         31.6 ms            8 bytes_per_second=274.595G/s
ContiguousSplitStrings/4Gb10ColsValidity/4294967296/10/256/1/iterations:8/manual_time           21.0 ms         21.0 ms            8 bytes_per_second=406.154G/s
ContiguousSplitStrings/4Gb4ColsNoSplits/1073741824/4/0/0/iterations:8/manual_time               6.71 ms         7.36 ms            8 bytes_per_second=335.519G/s
ContiguousSplitStrings/4Gb4ColsValidityNoSplits/1073741824/4/0/1/iterations:8/manual_time       4.35 ms         4.37 ms            8 bytes_per_second=524.718G/s
ContiguousSplitStrings/1Gb512ColsNoValidity/1073741824/512/256/0/iterations:8/manual_time        133 ms          133 ms            8 bytes_per_second=15.0416G/s
ContiguousSplitStrings/1Gb512ColsValidity/1073741824/512/256/1/iterations:8/manual_time          134 ms          134 ms            8 bytes_per_second=15.1684G/s
ContiguousSplitStrings/1Gb10ColsNoValidity/1073741824/10/256/0/iterations:8/manual_time         9.99 ms         10.3 ms            8 bytes_per_second=210.255G/s
ContiguousSplitStrings/1Gb10ColsValidity/1073741824/10/256/1/iterations:8/manual_time           7.71 ms         7.74 ms            8 bytes_per_second=276.398G/s
ContiguousSplitStrings/1Gb1ColNoSplits/1073741824/1/0/0/iterations:8/manual_time                6.67 ms         9.23 ms            8 bytes_per_second=449.574G/s
ContiguousSplitStrings/1Gb1ColValidityNoSplits/1073741824/1/0/1/iterations:8/manual_time        4.35 ms         4.37 ms            8 bytes_per_second=696.746G/s
ChunkedPack/6Gb512ColsNoValidity/6442450944/512/0/0/iterations:8/manual_time                    23.9 ms         23.9 ms            8 bytes_per_second=501.71G/s
ChunkedPack/6Gb512ColsValidity/6442450944/512/0/1/iterations:8/manual_time                      24.7 ms         24.7 ms            8 bytes_per_second=500.825G/s
ChunkedPack/6Gb10ColsNoValidity/6442450944/10/0/0/iterations:8/manual_time                      23.8 ms         23.9 ms            8 bytes_per_second=503.552G/s
ChunkedPack/6Gb10ColsValidity/6442450944/10/0/1/iterations:8/manual_time                        26.7 ms         26.8 ms            8 bytes_per_second=463.174G/s
ChunkedPack/4Gb512ColsNoValidity/4294967296/512/0/0/iterations:8/manual_time                    16.0 ms         16.0 ms            8 bytes_per_second=500.369G/s
ChunkedPack/4Gb512ColsValidity/4294967296/512/0/1/iterations:8/manual_time                      17.9 ms         17.9 ms            8 bytes_per_second=461.314G/s
ChunkedPack/4Gb10ColsNoValidity/4294967296/10/0/0/iterations:8/manual_time                      17.7 ms         17.7 ms            8 bytes_per_second=451.413G/s
ChunkedPack/4Gb10ColsValidity/4294967296/10/0/1/iterations:8/manual_time                        16.4 ms         16.5 ms            8 bytes_per_second=501.532G/s
ChunkedPack/4Gb4ColsValidity/1073741824/4/0/1/iterations:8/manual_time                          4.22 ms         4.24 ms            8 bytes_per_second=489.245G/s
ChunkedPack/1Gb512ColsNoValidity/1073741824/512/0/0/iterations:8/manual_time                    4.86 ms         4.90 ms            8 bytes_per_second=411.202G/s
ChunkedPack/1Gb512ColsValidity/1073741824/512/0/1/iterations:8/manual_time                      4.66 ms         4.68 ms            8 bytes_per_second=442.465G/s
ChunkedPack/1Gb10ColsNoValidity/1073741824/10/0/0/iterations:8/manual_time                      4.18 ms         4.20 ms            8 bytes_per_second=478.992G/s
ChunkedPack/1Gb10ColsValidity/1073741824/10/0/1/iterations:8/manual_time                        4.25 ms         4.27 ms            8 bytes_per_second=485.494G/s
ChunkedPack/1Gb1ColValidity/1073741824/1/0/1/iterations:8/manual_time                           4.22 ms         4.24 ms            8 bytes_per_second=488.715G/s
ChunkedPackStrings/4Gb512ColsNoValidity/4294967296/512/0/0/iterations:8/manual_time             24.9 ms         25.0 ms            8 bytes_per_second=321.246G/s
ChunkedPackStrings/4Gb512ColsValidity/4294967296/512/0/1/iterations:8/manual_time               16.4 ms         16.5 ms            8 bytes_per_second=494.505G/s
ChunkedPackStrings/4Gb10ColsNoValidity/4294967296/10/0/0/iterations:8/manual_time               26.0 ms         27.1 ms            8 bytes_per_second=322.617G/s
ChunkedPackStrings/4Gb10ColsValidity/4294967296/10/0/1/iterations:8/manual_time                 16.8 ms         16.9 ms            8 bytes_per_second=506.475G/s
ChunkedPackStrings/4Gb4ColsValidity/1073741824/4/0/1/iterations:8/manual_time                   4.42 ms         4.44 ms            8 bytes_per_second=516.444G/s
ChunkedPackStrings/1Gb512ColsNoValidity/1073741824/512/0/0/iterations:8/manual_time             6.72 ms         6.74 ms            8 bytes_per_second=298.103G/s
ChunkedPackStrings/1Gb512ColsValidity/1073741824/512/0/1/iterations:8/manual_time               4.75 ms         4.77 ms            8 bytes_per_second=427.759G/s
ChunkedPackStrings/1Gb10ColsNoValidity/1073741824/10/0/0/iterations:8/manual_time               6.15 ms         6.43 ms            8 bytes_per_second=341.407G/s
ChunkedPackStrings/1Gb10ColsValidity/1073741824/10/0/1/iterations:8/manual_time                 4.31 ms         4.35 ms            8 bytes_per_second=494.412G/s
ChunkedPackStrings/1Gb1ColValidity/1073741824/1/0/1/iterations:8/manual_time                    4.15 ms         4.18 ms            8 bytes_per_second=730.982G/s

Benchmark results before this PR

----------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                          Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------------------------------------
ContiguousSplit/6Gb512ColsNoValidity/6442450944/512/256/0/iterations:8/manual_time              57.4 ms         57.4 ms            8 bytes_per_second=209.192G/s
ContiguousSplit/6Gb512ColsValidity/6442450944/512/256/1/iterations:8/manual_time                59.3 ms         59.4 ms            8 bytes_per_second=208.633G/s
ContiguousSplit/6Gb10ColsNoValidity/6442450944/10/256/0/iterations:8/manual_time                29.7 ms         29.7 ms            8 bytes_per_second=404.623G/s
ContiguousSplit/6Gb10ColsValidity/6442450944/10/256/1/iterations:8/manual_time                  29.1 ms         29.1 ms            8 bytes_per_second=425.948G/s
ContiguousSplit/4Gb512ColsNoValidity/4294967296/512/256/0/iterations:8/manual_time              47.5 ms         47.6 ms            8 bytes_per_second=168.256G/s
ContiguousSplit/4Gb512ColsValidity/4294967296/512/256/1/iterations:8/manual_time                50.2 ms         50.3 ms            8 bytes_per_second=164.325G/s
ContiguousSplit/4Gb10ColsNoValidity/4294967296/10/256/0/iterations:8/manual_time                21.2 ms         21.3 ms            8 bytes_per_second=376.781G/s
ContiguousSplit/4Gb10ColsValidity/4294967296/10/256/1/iterations:8/manual_time                  21.8 ms         21.9 ms            8 bytes_per_second=378.616G/s
ContiguousSplit/4Gb4ColsNoSplits/1073741824/4/0/1/iterations:8/manual_time                      4.73 ms         4.81 ms            8 bytes_per_second=436.355G/s
ContiguousSplit/4Gb4ColsValidityNoSplits/1073741824/4/0/1/iterations:8/manual_time              4.69 ms         4.75 ms            8 bytes_per_second=440.121G/s
ContiguousSplit/1Gb512ColsNoValidity/1073741824/512/256/0/iterations:8/manual_time              32.7 ms         32.8 ms            8 bytes_per_second=61.1839G/s
ContiguousSplit/1Gb512ColsValidity/1073741824/512/256/1/iterations:8/manual_time                35.1 ms         35.2 ms            8 bytes_per_second=58.6784G/s
ContiguousSplit/1Gb10ColsNoValidity/1073741824/10/256/0/iterations:8/manual_time                5.65 ms         5.68 ms            8 bytes_per_second=353.678G/s
ContiguousSplit/1Gb10ColsValidity/1073741824/10/256/1/iterations:8/manual_time                  5.93 ms         5.96 ms            8 bytes_per_second=347.818G/s
ContiguousSplit/1Gb1ColNoSplits/1073741824/1/0/1/iterations:8/manual_time                       4.39 ms         4.42 ms            8 bytes_per_second=469.332G/s
ContiguousSplit/1Gb1ColValidityNoSplits/1073741824/1/0/1/iterations:8/manual_time               4.39 ms         4.41 ms            8 bytes_per_second=469.621G/s
ContiguousSplitStrings/4Gb512ColsNoValidity/4294967296/512/256/0/iterations:8/manual_time        153 ms          153 ms            8 bytes_per_second=52.3405G/s
ContiguousSplitStrings/4Gb512ColsValidity/4294967296/512/256/1/iterations:8/manual_time          145 ms          145 ms            8 bytes_per_second=56.2481G/s
ContiguousSplitStrings/4Gb10ColsNoValidity/4294967296/10/256/0/iterations:8/manual_time         33.2 ms         34.3 ms            8 bytes_per_second=252.873G/s
ContiguousSplitStrings/4Gb10ColsValidity/4294967296/10/256/1/iterations:8/manual_time           22.8 ms         22.9 ms            8 bytes_per_second=373.402G/s
ContiguousSplitStrings/4Gb4ColsNoSplits/1073741824/4/0/0/iterations:8/manual_time               6.66 ms         7.32 ms            8 bytes_per_second=337.598G/s
ContiguousSplitStrings/4Gb4ColsValidityNoSplits/1073741824/4/0/1/iterations:8/manual_time       4.35 ms         4.38 ms            8 bytes_per_second=523.833G/s
ContiguousSplitStrings/1Gb512ColsNoValidity/1073741824/512/256/0/iterations:8/manual_time        138 ms          138 ms            8 bytes_per_second=14.4633G/s
ContiguousSplitStrings/1Gb512ColsValidity/1073741824/512/256/1/iterations:8/manual_time          131 ms          131 ms            8 bytes_per_second=15.5614G/s
ContiguousSplitStrings/1Gb10ColsNoValidity/1073741824/10/256/0/iterations:8/manual_time         10.1 ms         10.4 ms            8 bytes_per_second=207.945G/s
ContiguousSplitStrings/1Gb10ColsValidity/1073741824/10/256/1/iterations:8/manual_time           8.31 ms         8.36 ms            8 bytes_per_second=256.333G/s
ContiguousSplitStrings/1Gb1ColNoSplits/1073741824/1/0/0/iterations:8/manual_time                6.96 ms         9.71 ms            8 bytes_per_second=430.994G/s
ContiguousSplitStrings/1Gb1ColValidityNoSplits/1073741824/1/0/1/iterations:8/manual_time        4.40 ms         4.43 ms            8 bytes_per_second=688.462G/s

cpp/benchmarks/copying/contiguous_split.cu

nvdbaranec

First wave of comments.

cpp/include/cudf/contiguous_split.hpp

cpp/src/copying/contiguous_split.cu

abellina · 2023-05-05T15:51:30Z

Thanks @nvdbaranec @ttnghia, will address this afternoon :)

…pack

cpp/src/copying/contiguous_split.cu

abellina · 2023-05-09T17:25:10Z

@ttnghia @nvdbaranec @davidwendt let me know if you have further comments. Thanks!

vyasr

First pass of review, mostly high-level comments and minor editorial fixes. Will take another pass tomorrow to fully wrap my head around the algorithm.

vyasr · 2023-05-16T21:48:49Z

cpp/include/cudf/contiguous_split.hpp

+  [[nodiscard]] bool has_next() const;
+
+  /**
+   * @brief Packs the next chunk into `user_buffer`. This should be call as long as


Suggested change

* @brief Packs the next chunk into `user_buffer`. This should be call as long as

* @brief Packs the next chunk into `user_buffer`. This should be called as long as

vyasr · 2023-05-16T21:51:00Z

cpp/include/cudf/contiguous_split.hpp

+ * The intent of this operation is to be used in a streamed fashion at times of GPU
+ * out-of-memory, where we want to minimize the number of small cudaMemcpy calls and
+ * tracking of all the metadata associated with cudf tables. Because of the memory constraints,
+ * all thrust and scratch memory allocations are using the passed-in memory resource exclusively,


Even with documentation, I think this behavior is unexpected given the general approach that libcudf takes. I understand why we might need to support this, but I would prefer it be communicated very clearly in the API and not rely solely on documentation. Can we modify the struct to have two memory resources, the usual mr and an extra temp_mr or so? They can both default to the per-device resource. That way the caller gets a very clear indication that the memory resource used for temp allocations is also controllable in this particular instance.

If all allocations made by this function are temporary because it's just splitting an existing buffer, then maybe just renaming the mr makes sense.

chunked_pack never allocates result buffers. So I am going to replace mr in this api with temp_mr to make it clear, as you suggest

vyasr · 2023-05-16T21:52:14Z

cpp/include/cudf/contiguous_split.hpp

+    rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
+
+  /**
+   * @brief Destructor that will be implemented as default, required because


Suggested change

* @brief Destructor that will be implemented as default, required because

* @brief Destructor that will be implemented as default. Declared with definition here because

vyasr · 2023-05-16T22:33:35Z

cpp/src/copying/contiguous_split.cu

+  CUDF_EXPECTS(user_buffer_size >= desired_batch_size,
+               "The output buffer size must be at least 1MB in size");


Why isn't this error in the constructor?

vyasr · 2023-05-16T22:37:49Z

cpp/src/copying/contiguous_split.cu

+ * return true when there is work to be done, and false otherwise.
+ *
+ * contiguous_split_state::contiguous_split() performs a single-pass contiguous_split
+ * and is only valid iff contiguous_split_state is instantiated with 0 for the user_buffer_size.


Only is redundant with "iff".

Suggested change

* and is only valid iff contiguous_split_state is instantiated with 0 for the user_buffer_size.

* and is valid iff contiguous_split_state is instantiated with 0 for the user_buffer_size.

Unless "iff" is a typo, in which case use

Suggested change

* and is only valid iff contiguous_split_state is instantiated with 0 for the user_buffer_size.

* and is only valid if contiguous_split_state is instantiated with 0 for the user_buffer_size.

vyasr · 2023-05-16T22:59:25Z

cpp/src/copying/contiguous_split.cu

+      current_iteration(0),
+      starting_batch(0),


Minor nit

Suggested change

current_iteration(0),

starting_batch(0),

current_iteration{0},

starting_batch{0},

vyasr · 2023-05-16T23:01:11Z

cpp/src/copying/contiguous_split.cu

+    CUDF_EXPECTS(current_iteration < num_iterations,
+                 "current_iteration cannot exceed num_iterations");


Can an external caller ever trigger this exception, or would this be purely an internal error? If the latter, maybe not something to throw an exception for, but more like an assert?

Hmm I guess this could happen if someone tried to advance the iteration on the state object without first checking if it has any chunks left?

Yeap, that's what it is protecting against.

vyasr · 2023-05-16T23:02:11Z

cpp/src/copying/contiguous_split.cu

+    return std::make_pair(starting_batch, count_for_current);
+  }
+
+  std::size_t advance_iteration()


Could we add docs for these functions (this one and those below)? Don't need full doxygen, a simple description of inputs/output.

vyasr · 2023-05-16T23:05:32Z

cpp/src/copying/contiguous_split.cu

    auto keys = cudf::detail::make_counting_transform_iterator(
      0, split_key_functor{static_cast<int>(num_src_bufs)});
    auto values =
      cudf::detail::make_counting_transform_iterator(0, buf_size_functor{d_dst_buf_info});

-    thrust::reduce_by_key(rmm::exec_policy(stream),
+    thrust::reduce_by_key(rmm::exec_policy(stream, mr),


I made exactly the same comment independently in the header file. Adding a separate temp_mr was my proposal.

vyasr · 2023-05-16T23:09:37Z

cpp/src/copying/contiguous_split.cu

+  std::vector<std::size_t> const h_size_of_buffs_per_iteration;
+};
+
+std::unique_ptr<chunk_iteration_state> make_chunk_iteration_state(


An alternate solution here that would satisfy both requests is to define the function out of band, i.e.

class chunk_iteration_state { ... static auto create(...); ... } static auto chunk_iteration_state::create(...) {...}

WDYT?

…onstness issues

abellina · 2023-05-17T13:32:21Z

@vyasr I took care of all the comments with: dae5ceb

cpp/src/copying/contiguous_split.cu

davidwendt

I know some of this code is just moved around so these are just suggestions for improvement.

cpp/src/copying/contiguous_split.cu

Co-authored-by: David Wendt <[email protected]>

…pack

abellina · 2023-05-17T19:38:22Z

I think I have addressed all issues known to me @vyasr and @davidwendt let me know if you have other things you want me to tackle.

vyasr

Some suggestions for improvement and a couple of questions. I didn't appreciate at first how much of the code is simply moved around rather than new, so I won't push for too many further changes here. I think we can merge this and iterate further later if desired.

I personally find some of the organization confusing. For instance, it's not clear when constructors do a lot of work, when they just set up pointers to preexisting allocations, or when they do nearly nothing. It's also not immediately obvious when things are implemented as methods or not, and which pieces of state are getting passed around (there are lots of struct members being accessed directly then handed off to some other algorithm). I believe I've made sense of it after a longer read-through, but I'm sure I'm missing lots of details. Would be good to come back and try to clarify more of the structure.

cpp/src/copying/contiguous_split.cu

vyasr · 2023-05-17T23:28:44Z

cpp/src/copying/contiguous_split.cu

+  std::unique_ptr<chunk_iteration_state> chunk_iter_state;
+
+  // Two modes are allowed:
+  //  - user provided buffer: as the name implies, the user has provided a buffer that must be at


In this context a user is not a libcudf user, but rather libcudf code making use of this struct, right? In particular, contiguous_split vs chunked_pack?

That is correct. I will clarify this comment, it was written before we started calling this chunked_pack.. before this it was chunked_contiguous_split which was all kinds of confusing :)

vyasr · 2023-05-17T23:30:37Z

cpp/src/copying/contiguous_split.cu

+    return result;
+  }
+
+  cudf::table_view const input;


These should be documented using the standard doxygen style for members, see the example in our dev docs. For the members with very long descriptions maybe use a doxygen-style short description and a separate comment for the longer explanation.

Added to chunked_iteration_state and to contiguous_split_state. I didn't add for the other structs in the anonymous namespace for now, but can go back to it if all member should get this treatment.

cpp/src/copying/contiguous_split.cu

vyasr · 2023-05-17T23:43:29Z

cpp/src/copying/contiguous_split.cu

-        max(std::size_t{1}, util::round_up_unsafe(bytes, desired_chunk_size) / desired_chunk_size);
+// packed block of memory 1: split indices and src_buf_info structs
+struct packed_split_indices_and_src_buf_info {
+  explicit packed_split_indices_and_src_buf_info(cudf::table_view const& input,


Is there a reason this constructor is marked explicit while none of the other structs are?

No reason. I think this was me not understanding explicit.

What are the rules for explicit in cuDF?

As far as I know, in general, the idea is to make sure the constructor is invoked, well, explicitly, rather than via a conversion. With multiple-argument constructors, I think the only conversion one would "protect" against is that via the initializer list. But I think initializer lists are preferred, so by marking constructors explicit I am probably going against that, forcing the constructor to be invoked directly. Is my understanding correct?

I removed the explicit here to make them consistent, but I am curious on the above for my own edification.

You're understanding seems correct to me: https://en.cppreference.com/w/cpp/language/explicit
Given that this an internal class I don't think there will be much chance of ambiguity between the ctor and an arbitrary init list.

…faults to the default memory resource

…er private functions. Make compute_batches a free function

abellina · 2023-05-18T18:15:36Z

Thanks @nvdbaranec @ttnghia @davidwendt @vyasr for all the reviews!

I'll be able to revisit this code base soon with #13332, so I am going to continue to iterate to handle the feedback that was not addressed in this PR.

abellina · 2023-05-18T18:17:54Z

/merge

This PR is the standalone JNI side of #13260. Therefore it doesn't build in as is, but I am putting this up as draft for the Java reviewers to start taking a look. This implements a `ChunkedPack` java class mirroring the interface of `cudf::chunked_pack`. This is an iterator-like class that can be used to invoke `cudf::pack` (aka `cudf::contigous_split` without splits) over several iterations against a bounce buffer. In order to create a `ChunkedPack`, the user calls `makeChunkedPack` from a `Table` instance. During this call the user can also pass an `RmmDeviceMemoryResource` to be used internally by `cudf::chunked_pack` exclusively for scratch/temporary data. Authors: - Alessandro Bellina (https://github.com/abellina) Approvers: - Robert (Bobby) Evans (https://github.com/revans2) - Nghia Truong (https://github.com/ttnghia) URL: #13278

Chunked pack implementation

a8ae8cc

Co-authored-by: Dave Baranec <[email protected]> Signed-off-by: Alessandro Bellina <[email protected]>

abellina requested a review from a team as a code owner May 1, 2023 21:40

abellina requested review from vyasr and davidwendt May 1, 2023 21:40

github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label May 1, 2023

abellina changed the title ~~Chunked pack implementation~~ Implement a chunked_pack API May 1, 2023

abellina added feature request New feature or request Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels May 1, 2023

abellina mentioned this pull request May 1, 2023

[FEA] Implement chunked_pack #13180

Closed

4 tasks

Fix code style

3ea2844

abellina commented May 1, 2023

View reviewed changes

cpp/include/cudf/detail/contiguous_split.hpp Outdated Show resolved Hide resolved

Merge branch 'branch-23.06' into chunked_pack

2201de5

abellina mentioned this pull request May 2, 2023

JNI api for cudf::chunked_pack #13278

Merged

3 tasks

ttnghia reviewed May 2, 2023

View reviewed changes

cpp/benchmarks/copying/contiguous_split.cu Outdated Show resolved Hide resolved

ttnghia reviewed May 2, 2023

View reviewed changes

cpp/benchmarks/copying/contiguous_split.cu Show resolved Hide resolved

nvdbaranec reviewed May 4, 2023

View reviewed changes

abellina added 6 commits May 6, 2023 06:19

Const fixes. Structured binding. Use buffer instead of bounce buffer

fe3bd77

Use rmm::device_uvector instead of rmm::device_buffer in benchmark

4a4da2f

Address review comments

bd87dd5

Merge branch 'chunked_pack' of github.com:abellina/cudf into chunked_…

4aae464

…pack

Fix conflicts in upmerge to 23.06

bd389ae

Add more comments in the contiguous_split_state constructor

3b83b03

abellina commented May 8, 2023

View reviewed changes

cpp/src/copying/contiguous_split.cu Outdated Show resolved Hide resolved

abellina added 3 commits May 8, 2023 22:53

Use std::get<> and remove commented out code

425a313

Merge branch 'branch-23.06' into chunked_pack

f7b18d3

fix styles

cc9acb4

GregoryKimball assigned davidwendt May 16, 2023

davidwendt assigned abellina and unassigned davidwendt May 16, 2023

Merge branch 'branch-23.06' into chunked_pack

a9dfbc7

vyasr requested changes May 16, 2023

View reviewed changes

abellina added 2 commits May 16, 2023 23:03

Remove extra header in split_tests.cpp

7a73d57

Introduce temp_mr. Comment edits and some more initializer list and c…

dae5ceb

…onstness issues

bif_shift -> bit_shift

ff6ce21

davidwendt reviewed May 17, 2023

View reviewed changes

cpp/src/copying/contiguous_split.cu Outdated Show resolved Hide resolved

davidwendt reviewed May 17, 2023

View reviewed changes

cpp/src/copying/contiguous_split.cu Outdated Show resolved Hide resolved

cpp/src/copying/contiguous_split.cu Outdated Show resolved Hide resolved

cpp/src/copying/contiguous_split.cu Outdated Show resolved Hide resolved

abellina and others added 6 commits May 17, 2023 12:52

Leverage make_device_uvector_async

b47f2c7

Remove extra brackets and use range-based for loop

b1fba4e

Co-authored-by: David Wendt <[email protected]>

Merge branch 'chunked_pack' of github.com:abellina/cudf into chunked_…

5ef8bd3

…pack

Remove extra cast and use .front

ab98d70

Fix styles

8ec29fa

Merge branch 'branch-23.06' into chunked_pack

f13156e

davidwendt approved these changes May 17, 2023

View reviewed changes

vyasr approved these changes May 17, 2023

View reviewed changes

abellina added 6 commits May 18, 2023 10:07

Move temp_mr comment to chunked_pack::create and make sure temp_mr de…

d959039

…faults to the default memory resource

Make contiguous_split_state constructor private and along with it oth…

e16306c

…er private functions. Make compute_batches a free function

Fix comment as per review

b8dcff8

Add doxygen coments in contiguous_split_state and chunk_iteration_state

a41db75

Make packed_split_indices_and_src_buf_info not explicit

bf208b4

Merge branch 'branch-23.06' into chunked_pack

3a3e9cd

rapids-bot bot merged commit c601b83 into rapidsai:branch-23.06 May 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement a chunked_pack API #13260

Implement a chunked_pack API #13260

abellina commented May 1, 2023 •

edited

Loading

abellina commented May 1, 2023

nvdbaranec left a comment

abellina commented May 5, 2023

abellina commented May 9, 2023

vyasr left a comment

vyasr May 16, 2023

vyasr May 16, 2023

abellina May 17, 2023

vyasr May 16, 2023

vyasr May 16, 2023

vyasr May 16, 2023

vyasr May 16, 2023

vyasr May 16, 2023

vyasr May 16, 2023

abellina May 17, 2023

vyasr May 16, 2023

vyasr May 16, 2023

vyasr May 16, 2023

abellina commented May 17, 2023

davidwendt left a comment

abellina commented May 17, 2023

vyasr left a comment

vyasr May 17, 2023

abellina May 18, 2023

vyasr May 17, 2023

abellina May 18, 2023

vyasr May 17, 2023

abellina May 18, 2023

abellina May 18, 2023 •

edited

Loading

davidwendt May 18, 2023

abellina commented May 18, 2023

abellina commented May 18, 2023

	* @brief Packs the next chunk into `user_buffer`. This should be call as long as
	* @brief Packs the next chunk into `user_buffer`. This should be called as long as

	* @brief Destructor that will be implemented as default, required because
	* @brief Destructor that will be implemented as default. Declared with definition here because

		CUDF_EXPECTS(user_buffer_size >= desired_batch_size,
		"The output buffer size must be at least 1MB in size");

	* and is only valid iff contiguous_split_state is instantiated with 0 for the user_buffer_size.
	* and is valid iff contiguous_split_state is instantiated with 0 for the user_buffer_size.

	* and is only valid iff contiguous_split_state is instantiated with 0 for the user_buffer_size.
	* and is only valid if contiguous_split_state is instantiated with 0 for the user_buffer_size.

		CUDF_EXPECTS(current_iteration < num_iterations,
		"current_iteration cannot exceed num_iterations");

Implement a chunked_pack API #13260

Implement a chunked_pack API #13260

Conversation

abellina commented May 1, 2023 • edited Loading

Description

Checklist

abellina commented May 1, 2023

nvdbaranec left a comment

Choose a reason for hiding this comment

abellina commented May 5, 2023

abellina commented May 9, 2023

vyasr left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

abellina commented May 17, 2023

davidwendt left a comment

Choose a reason for hiding this comment

abellina commented May 17, 2023

vyasr left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

abellina May 18, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

abellina commented May 18, 2023

abellina commented May 18, 2023

abellina commented May 1, 2023 •

edited

Loading

abellina May 18, 2023 •

edited

Loading