cudf::row_bit_count() support. #7534

Merged
merged 17 commits into from
Mar 30, 2021
Changes from 10 commits
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -383,6 +383,7 @@ add_library(cudf
src/transform/jit/code/kernel.cpp
src/transform/mask_to_bools.cu
src/transform/nans_to_nulls.cu
src/transform/row_bit_count.cu
src/transform/transform.cpp
src/transpose/transpose.cu
src/unary/cast_ops.cu
12 changes: 11 additions & 1 deletion cpp/include/cudf/detail/transform.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2020, NVIDIA CORPORATION.
* Copyright (c) 2019-2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -77,5 +77,15 @@ std::unique_ptr<column> mask_to_bools(
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @copydoc cudf::row_bit_count
*
* @param stream CUDA stream used for device memory operations and kernel launches.
*/
std::unique_ptr<column> row_bit_count(
table_view const& t,
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

} // namespace detail
} // namespace cudf
1 change: 0 additions & 1 deletion cpp/include/cudf/lists/lists_column_view.hpp
@@ -56,7 +56,6 @@ class lists_column_view : private column_view {
using column_view::null_mask;
using column_view::offset;
using column_view::size;
using offset_type = int32_t;
static_assert(std::is_same<offset_type, size_type>::value,
"offset_type is expected to be the same as size_type.");
using offset_iterator = offset_type const*;
37 changes: 36 additions & 1 deletion cpp/include/cudf/transform.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2020, NVIDIA CORPORATION.
* Copyright (c) 2019-2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -142,5 +142,40 @@ std::unique_ptr<column> mask_to_bools(
size_type end_bit,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns an approximate cumulative size in bits of all columns in the `table_view` for
* each row.
*
* This function counts bits instead of bytes to account for the null mask which only has one
* bit per row.
*
* Each row in the returned column is the sum of the per-row size for each column in
* the table.
*
* In some cases, this is an inexact approximation. Specifically, with
* lists or strings, the cost of a row includes 32 bits for a single offset. However, two
* offsets are required to represent an entire row. But this presents a problem, because to
* represent 2 rows, you need 3 offsets. 3 rows 4 offsets, etc. Therefore it would not
* be accurate to say each row of a string column costs 2 offsets because summing multiple row
* sizes would give you a number too large. It is up to the caller to understand the schema
* of the input column to be able to calculate the small additional overhead of the
* terminating offset for any group of rows being considered.
*
* This function returns the per-row sizes as the columns are currently formed. This can
* end up being different than the number you would get by gathering the rows under
* certain circumstances. Specifically, the pushdown of validity masks by struct
* columns can nullify rows that actually contain underlying data for string or list
* columns. In these cases, the size returned will satisfy:
*
* row_bit_count(column(x)) >= row_bit_count(gather(column(x)))
*
* @param t The table view to perform the computation on.
* @param mr Device memory resource used to allocate the returned column's device memory
* @return A 32-bit integer column containing the per-row bit counts.
*/
std::unique_ptr<column> row_bit_count(
table_view const& t,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/** @} */ // end of group
} // namespace cudf
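
The offset accounting described in the `row_bit_count` doc comment can be illustrated with a small host-side sketch. This is not the cudf implementation and does not call cudf; the names `string_row_bits`, `group_bits`, and the `k...` constants are hypothetical. It assumes a nullable string column where each row is charged one 32-bit offset, one validity bit, and 8 bits per character, and it shows how a caller might add the single terminating offset when sizing a group of rows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical constants for this sketch; they are not part of the cudf API.
constexpr std::int64_t kOffsetBits   = 32;  // one int32 offset charged per row
constexpr std::int64_t kValidityBits = 1;   // one null-mask bit per row (column assumed nullable)
constexpr std::int64_t kBitsPerChar  = 8;

// Per-row cost of a nullable string column, following the doc comment:
// each row is charged a single offset, not two.
std::vector<std::int64_t> string_row_bits(std::vector<std::string> const& rows)
{
  std::vector<std::int64_t> bits;
  bits.reserve(rows.size());
  for (auto const& s : rows) {
    bits.push_back(kOffsetBits + kValidityBits +
                   kBitsPerChar * static_cast<std::int64_t>(s.size()));
  }
  return bits;
}

// Cost of the row range [first, last): the sum of the per-row costs plus
// one terminating offset, since N rows need N + 1 offsets.
std::int64_t group_bits(std::vector<std::int64_t> const& row_bits,
                        std::size_t first,
                        std::size_t last)
{
  assert(first <= last && last <= row_bits.size());
  if (first == last) { return 0; }  // an empty group is charged nothing in this sketch
  std::int64_t total = 0;
  for (std::size_t i = first; i < last; ++i) { total += row_bits[i]; }
  return total + kOffsetBits;
}
```

Under these assumptions, the rows `"a"`, `"bcd"`, and `""` cost 41, 57, and 33 bits, and the three rows together cost 163 bits once the terminating offset is added. Charging two offsets per row instead would overcount every group of more than one row, which is the problem the doc comment points out.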
1 change: 1 addition & 0 deletions cpp/include/cudf/types.hpp
@@ -91,6 +91,7 @@ class mutable_table_view;
using size_type = int32_t;
using bitmask_type = uint32_t;
using valid_type = uint8_t;
using offset_type = int32_t;

/**
* @brief Similar to `std::distance` but returns `cudf::size_type` and performs `static_cast`
1 change: 1 addition & 0 deletions cpp/src/jit/type.cpp
@@ -71,6 +71,7 @@ std::string get_type_name(data_type type)
// TODO: Remove in JIT type utils PR
switch (type.id()) {
case type_id::LIST: return CUDF_STRINGIFY(List);
case type_id::STRUCT: return CUDF_STRINGIFY(Struct);
case type_id::DECIMAL32: return CUDF_STRINGIFY(int32_t);
case type_id::DECIMAL64: return CUDF_STRINGIFY(int64_t);

2 changes: 1 addition & 1 deletion cpp/src/lists/drop_list_duplicates.cu
@@ -34,7 +34,7 @@ namespace cudf {
namespace lists {
namespace detail {
namespace {
using offset_type = lists_column_view::offset_type;

/**
* @brief Copy list entries and entry list offsets ignoring duplicates
*