Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor contiguous_split API into contiguous_split.hpp #13186

Merged
Merged
Show file tree
Hide file tree
Changes from 7 commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
fe4e1ca
Move contiguous split declarations to contiguous_split.hpp
abellina Apr 20, 2023
1d03252
Fix codestyle issues
abellina Apr 20, 2023
edfdb07
Add new headers to conda/recipes/libcudf/meta.yaml
abellina Apr 20, 2023
36b837c
Forgot to add contiguous_split.pxd
abellina Apr 20, 2023
055ad37
Fix copyright range
abellina Apr 20, 2023
8f896f6
Add contiguous_split.hpp to doxygen group
abellina Apr 20, 2023
892d947
Remove unnecesary imports in contiguous_split.pxd
abellina Apr 20, 2023
22cb00f
Merge branch 'branch-23.06' into refactor_contig_split_headers
ttnghia Apr 20, 2023
37da022
Merge branch 'branch-23.06' into refactor_contig_split_headers
abellina Apr 21, 2023
96f36d9
Favor metadata struct wrapper
abellina Apr 22, 2023
a03907c
Fixes to address Param/function header feedback
abellina Apr 24, 2023
5619bc0
Merge branch 'refactor_contig_split_headers' of github.com:abellina/c…
abellina Apr 24, 2023
6dca7c0
@param[in] -> @param
abellina Apr 24, 2023
f0cbdfd
Merge branch 'branch-23.06' of github.com:rapidsai/cudf into refactor…
abellina Apr 24, 2023
71d4645
Fix python build after removal of struct metadata
abellina Apr 24, 2023
d8dd746
Remove metadata type in favor of just std::vector<uint8_t>
abellina Apr 24, 2023
cc24e58
Commit stylechecks/copyright fixes
abellina Apr 24, 2023
7a495e8
Fix more clang-format issues
abellina Apr 24, 2023
17940a4
metadata_ -> metadata
abellina Apr 25, 2023
3722a80
Merge branch 'branch-23.06' of github.com:rapidsai/cudf into refactor…
abellina Apr 25, 2023
777d9f9
Merge branch 'branch-23.06' into refactor_contig_split_headers
abellina Apr 25, 2023
5e1ed6b
Merge branch 'branch-23.06' into refactor_contig_split_headers
abellina Apr 25, 2023
aabf6a9
Merge branch 'branch-23.06' into refactor_contig_split_headers
abellina Apr 25, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions conda/recipes/libcudf/meta.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -91,6 +91,7 @@ outputs:
- test -f $PREFIX/include/cudf/column/column_factories.hpp
- test -f $PREFIX/include/cudf/column/column_view.hpp
- test -f $PREFIX/include/cudf/concatenate.hpp
- test -f $PREFIX/include/cudf/contiguous_split.hpp
- test -f $PREFIX/include/cudf/copying.hpp
- test -f $PREFIX/include/cudf/datetime.hpp
- test -f $PREFIX/include/cudf/timezone.hpp
Expand All @@ -99,6 +100,7 @@ outputs:
- test -f $PREFIX/include/cudf/detail/binaryop.hpp
- test -f $PREFIX/include/cudf/detail/calendrical_month_sequence.cuh
- test -f $PREFIX/include/cudf/detail/concatenate.hpp
- test -f $PREFIX/include/cudf/detail/contiguous_split.hpp
- test -f $PREFIX/include/cudf/detail/copy.hpp
- test -f $PREFIX/include/cudf/detail/datetime.hpp
- test -f $PREFIX/include/cudf/detail/fill.hpp
Expand Down
2 changes: 1 addition & 1 deletion cpp/benchmarks/copying/contiguous_split.cu
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@
#include <cudf_test/column_wrapper.hpp>

#include <cudf/column/column.hpp>
#include <cudf/copying.hpp>
#include <cudf/contiguous_split.hpp>

#include <thrust/iterator/counting_iterator.h>

Expand Down
229 changes: 229 additions & 0 deletions cpp/include/cudf/contiguous_split.hpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,229 @@
/*
* Copyright (c) 2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#pragma once

#include <cudf/table/table.hpp>
#include <cudf/types.hpp>

#include <memory>
#include <vector>

namespace cudf {

/**
* @addtogroup column_copy
* @{
* @file
* @brief Table APIs for contiguous_split, pack, unpack, and metadadata
*/

/**
* @brief Column data in a serialized format
*
* @ingroup copy_split
ttnghia marked this conversation as resolved.
Show resolved Hide resolved
*
* Contains data from an array of columns in two contiguous buffers: one on host, which contains
* table metadata and one on device which contains the table data.
*/
struct packed_columns {
/**
* @brief Host-side metadata buffer used for reconstructing columns via unpack.
*
* @ingroup copy_split
ttnghia marked this conversation as resolved.
Show resolved Hide resolved
*/
struct metadata {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no reason to have this struct. It just wraps around a std::vector<uint8_t> so can we just use std::vector<uint8_t> directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I just moved this code, but I see the point. I'll push a commit to remove the struct in favor of a using, which means the rest of the code that wants to use metadata as a type can still do so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I broke the python build with this. I think all I need to do is define the type using ctypedef, I am working on it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just remove this type completely and fix the python side, such as removing cdef struct metadata, replacing metadata with just vector[uint8_t].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be handled now with d8dd746

metadata() = default;

/**
* @brief Construct a new metadata object
*
* @param v Host-side buffer containing metadata
*/
metadata(std::vector<uint8_t>&& v) : data_(std::move(v)) {}

/**
* @brief Returns pointer to the host-side metadata buffer data
*
* @return Pointer to the host-side metadata buffer
*/
[[nodiscard]] uint8_t const* data() const { return data_.data(); }

/**
* @brief Returns size of the metadata buffer
*
* @return Size of the metadata buffer
*/
[[nodiscard]] size_t size() const { return data_.size(); }

private:
std::vector<uint8_t> data_;
};

packed_columns()
: metadata_(std::make_unique<metadata>()), gpu_data(std::make_unique<rmm::device_buffer>())
{
}

/**
* @brief Construct a new packed columns object
*
* @param md Host-side metadata buffer
* @param gd Device-side data buffer
*/
packed_columns(std::unique_ptr<metadata>&& md, std::unique_ptr<rmm::device_buffer>&& gd)
: metadata_(std::move(md)), gpu_data(std::move(gd))
{
}

std::unique_ptr<metadata> metadata_; ///< Host-side metadata buffer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::unique_ptr<metadata> metadata_; ///< Host-side metadata buffer
std::vector<uint8_t> metadata_; ///< Host-side metadata buffer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, recommend to use std::vector<std::byte> instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ttnghia you marked this as resolved, but I haven't pushed it. Do you want me stick with uint8_t? I have been making the change locally and the main thing is I have to change other files in java/python and other source files on the cpp side. If we are OK with the bigger diff, I can spend time on moving us to this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry, accidentally marked as resolved 😄 . Yes, please address this.

Copy link
Contributor Author

@abellina abellina Apr 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried this, but it doesn't seem std::byte is defined in cython:

cdef extern from "cudf/contiguous_split.hpp" namespace \
        "cudf::packed_columns" nogil:
    ctypedef vector[byte] metadata 
                       ^
------------------------------------------------------------

cudf/python/cudf/cudf/_lib/cpp/contiguous_split.pxd:16:24: unknown type in template argument

Looking at stddef in cython, I see they didn't exposed that type: https://github.com/cython/cython/blob/1aef71764cd50e4cd2f638eac8e69fff3ced2224/Cython/Includes/libc/stddef.pxd

Should we leave as uint8_t? Or should I handle this differently? @ttnghia

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ttnghia I did remove the typedef for metadata and went directly with the std::vector<uint8_t> here: d8dd746

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping uint8_t is fine. I want to move forward with a more standard way that libcudf devs prefer (std::byte instead of other types) but understand that it is may be out of scope of this PR.

std::unique_ptr<rmm::device_buffer> gpu_data; ///< Device-side data buffer
abellina marked this conversation as resolved.
Show resolved Hide resolved
};

/**
* @brief The result(s) of a cudf::contiguous_split
*
* @ingroup copy_split
ttnghia marked this conversation as resolved.
Show resolved Hide resolved
*
* Each table_view resulting from a split operation performed by contiguous_split,
* will be returned wrapped in a `packed_table`. The table_view and internal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* will be returned wrapped in a `packed_table`. The table_view and internal
* will be returned wrapped in a `packed_table`. The table_view and internal

* column_views in this struct are not owned by a top level cudf::table or cudf::column.
* The backing memory and metadata is instead owned by the `data` field and is in one
* contiguous block.
*
* The user is responsible for assuring that the `table` or any derived table_views do
* not outlive the memory owned by `data`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* not outlive the memory owned by `data`
* not outlive the memory owned by `data`.

*/
struct packed_table {
cudf::table_view table; ///< Result table_view of a cudf::contiguous_split
packed_columns data; ///< Column data owned
};

/**
* @brief Performs a deep-copy split of a `table_view` into a set of `table_view`s into a single
Copy link
Contributor

@ttnghia ttnghia Apr 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo?

Suggested change
* @brief Performs a deep-copy split of a `table_view` into a set of `table_view`s into a single
* @brief Performs a deep-copy split of a `table_view` into a single

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this was not clear. I rewarded it (will push shortly)

* contiguous block of memory.
*
* @ingroup copy_split
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @ingroup copy_split

*
* The memory for the output views is allocated in a single contiguous `rmm::device_buffer` returned
* in the `packed_table`. There is no top-level owning table.
*
* The returned views of `input` are constructed from a vector of indices, that indicate
* where each split should occur. The `i`th returned `table_view` is sliced as
* `[0, splits[i])` if `i`=0, else `[splits[i], input.size())` if `i` is the last view and
* `[splits[i-1], splits[i]]` otherwise.
*
* For all `i` it is expected `splits[i] <= splits[i+1] <= input.size()`
* For a `splits` size N, there will always be N+1 splits in the output
*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* For all `i` it is expected `splits[i] <= splits[i+1] <= input.size()`
* For a `splits` size N, there will always be N+1 splits in the output
*
* For all `i` it is expected `splits[i] <= splits[i+1] <= input.size()`.
* For a `splits` size N, there will always be N+1 splits in the output.
*

* @note It is the caller's responsibility to ensure that the returned views
* do not outlive the viewed device memory contained in the `all_data` field of the
* returned packed_table.
*
* @code{.pseudo}
* Example:
* input: [{10, 12, 14, 16, 18, 20, 22, 24, 26, 28},
* {50, 52, 54, 56, 58, 60, 62, 64, 66, 68}]
* splits: {2, 5, 9}
* output: [{{10, 12}, {14, 16, 18}, {20, 22, 24, 26}, {28}},
* {{50, 52}, {54, 56, 58}, {60, 62, 64, 66}, {68}}]
* @endcode
*
*
* @throws cudf::logic_error if `splits` has end index > size of `input`.
* @throws cudf::logic_error When the value in `splits` is not in the range [0, input.size()).
* @throws cudf::logic_error When the values in the `splits` are 'strictly decreasing'.
*
* @param input View of a table to split
* @param splits A vector of indices where the view will be split
* @param[in] mr Device memory resource used to allocate the returned result's device memory
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @param[in] mr Device memory resource used to allocate the returned result's device memory
* @param mr Device memory resource used to allocate the returned result's device memory

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious about @param[in]. In this case, it is inconsistently used (e.g. input and splits should also be @param[in]). The reason I bring this up is because I see lots of usage of @param[in] in cuDF, and was wondering if this is going out of favor (being replaced by just @param)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, we drop the [in] especially when all parameters are [in]
The [out] is relatively rare and should be used to clear up any ambiguity.
https://github.com/rapidsai/cudf/blob/branch-23.06/cpp/doxygen/developer_guide/DOCUMENTATION.md#param

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds reasonable

* @return The set of requested views of `input` indicated by the `splits` and the viewed memory
* buffer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use period . inconsistently.

Suggested change
* buffer.
* buffer

*/
std::vector<packed_table> contiguous_split(
cudf::table_view const& input,
std::vector<size_type> const& splits,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Deep-copy a `table_view` into a serialized contiguous memory format
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @brief Deep-copy a `table_view` into a serialized contiguous memory format
* @brief Deep-copy a `table_view` into a serialized contiguous memory format.

*
* The metadata from the `table_view` is copied into a host vector of bytes and the data from the
* `table_view` is copied into a `device_buffer`. Pass the output of this function into
* `cudf::unpack` to deserialize.
*
* @param input View of the table to pack
* @param[in] mr Optional, The resource to use for all returned device allocations
Copy link
Contributor

@ttnghia ttnghia Apr 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @param[in] mr Optional, The resource to use for all returned device allocations
* @param mr Optional, the resource to use for all returned device allocations

* @return packed_columns A struct containing the serialized metadata and data in contiguous host
* and device memory respectively
*/
packed_columns pack(cudf::table_view const& input,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Produce the metadata used for packing a table stored in a contiguous buffer.
*
* The metadata from the `table_view` is copied into a host vector of bytes which can be used to
* construct a `packed_columns` or `packed_table` structure. The caller is responsible for
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* construct a `packed_columns` or `packed_table` structure. The caller is responsible for
* construct a `packed_columns` or `packed_table` structure. The caller is responsible for

* guaranteeing that that all of the columns in the table point into `contiguous_buffer`.
*
* @param table View of the table to pack
* @param contiguous_buffer A contiguous buffer of device memory which contains the data referenced
* by the columns in `table`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @param contiguous_buffer A contiguous buffer of device memory which contains the data referenced
* by the columns in `table`
* @param contiguous_buffer A contiguous buffer of device memory which contains the data referenced
* by the columns in `table`

* @param buffer_size The size of `contiguous_buffer`
* @return Vector of bytes representing the metadata used to `unpack` a packed_columns struct
*/
packed_columns::metadata pack_metadata(table_view const& table,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to my comment above, just use std::vector<uint8_t/std::byte> directly.

uint8_t const* contiguous_buffer,
size_t buffer_size);

/**
* @brief Deserialize the result of `cudf::pack`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @brief Deserialize the result of `cudf::pack`
* @brief Deserialize the result of `cudf::pack`.

*
* Converts the result of a serialized table into a `table_view` that points to the data stored in
* the contiguous device buffer contained in `input`.
*
* It is the caller's responsibility to ensure that the `table_view` in the output does not outlive
* the data in the input.
*
* No new device memory is allocated in this function.
*
* @param input The packed columns to unpack
* @return The unpacked `table_view`
*/
table_view unpack(packed_columns const& input);

/**
* @brief Deserialize the result of `cudf::pack`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @brief Deserialize the result of `cudf::pack`
* @brief Deserialize the result of `cudf::pack`.

*
* Converts the result of a serialized table into a `table_view` that points to the data stored in
* the contiguous device buffer contained in `gpu_data` using the metadata contained in the host
* buffer `metadata`.
*
* It is the caller's responsibility to ensure that the `table_view` in the output does not outlive
* the data in the input.
*
* No new device memory is allocated in this function.
*
* @param metadata The host-side metadata buffer resulting from the initial pack() call
* @param gpu_data The device-side contiguous buffer storing the data that will be referenced by
* the resulting `table_view`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @param gpu_data The device-side contiguous buffer storing the data that will be referenced by
* the resulting `table_view`
* @param gpu_data The device-side contiguous buffer storing the data that will be referenced by
* the resulting `table_view`

* @return The unpacked `table_view`
*/
table_view unpack(uint8_t const* metadata, uint8_t const* gpu_data);

/** @} */
} // namespace cudf
Loading