Update libcudf counting functions to specify cudf::size_type (#12904)
Adds section to developer guide about `cudf::size_type` and adds links to it from other relevant parts of the document.
The fundamental nature of this type seems important enough to mention in the developer guide since it is the basis for how much of the code is designed and implemented.
Also updates some doxygen for public APIs that return `size_type` column values but had cited `INT32` specifically.

Reference: #12779 (comment)

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Lawrence Mitchell (https://github.com/wence-)
  - Yunsong Wang (https://github.com/PointKernel)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #12904
davidwendt authored Mar 15, 2023
1 parent ced3fdf commit 7776e0e
Showing 13 changed files with 95 additions and 106 deletions.
23 changes: 15 additions & 8 deletions cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
@@ -269,6 +269,13 @@ An *immutable*, non-owning view of a table.
A *mutable*, non-owning view of a table.
## cudf::size_type
The `cudf::size_type` is the type used for the number of elements in a column, offsets to elements within a column, indices to address specific elements, segments for subsets of column elements, etc.
It is equivalent to a signed, 32-bit integer type and therefore has a maximum value of 2147483647.
Some APIs also accept negative index values and those functions support a minimum value of -2147483648.
This fundamental type also influences output values not just for column size limits but for counting elements as well.
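
The limits described above can be checked with a minimal, self-contained sketch. It does not include any libcudf header: the local `size_type` alias is an assumed stand-in for `cudf::size_type`, and `fits_in_size_type` is a hypothetical helper that is not part of libcudf.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

// Assumption: local stand-in for cudf::size_type (a signed, 32-bit integer).
using size_type = std::int32_t;

static_assert(std::is_signed<size_type>::value && sizeof(size_type) == 4,
              "size_type is expected to be a signed 32-bit integer");
static_assert(std::numeric_limits<size_type>::max() == 2147483647,
              "maximum element count or index");
static_assert(std::numeric_limits<size_type>::min() == -2147483647 - 1,
              "minimum value for APIs that accept negative indices");

// Hypothetical helper: true if a 64-bit element count can be stored in size_type.
constexpr bool fits_in_size_type(std::int64_t count)
{
  return count >= 0 && count <= std::numeric_limits<size_type>::max();
}
```

A caller that computes an output size in 64-bit arithmetic could use a check like this before narrowing the value to `size_type`.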
## Spans
libcudf provides `span` classes that mimic C++20 `std::span`, which is a lightweight
@@ -370,16 +377,16 @@ libcudf APIs should still perform any validation that does not require introspec
To give some idea of what should or should not be validated, here are (non-exhaustive) lists of examples.

**Things that libcudf should validate**:
- Input column/table sizes or dtypes
- Input column/table sizes or data types

**Things that libcudf should not validate**:
- Integer overflow
- Ensuring that outputs will not exceed the 2GB size limit for a given set of inputs
- Ensuring that outputs will not exceed the [2GB size](#cudfsize_type) limit for a given set of inputs


## libcudf expects nested types to have sanitized null masks

Various libcudf APIs accepting columns of nested dtypes (such as `LIST` or `STRUCT`) may assume that these columns have been sanitized.
Various libcudf APIs accepting columns of nested data types (such as `LIST` or `STRUCT`) may assume that these columns have been sanitized.
In this context, sanitization refers to ensuring that the null elements in a column with a nested dtype are compatible with the elements of nested columns.
Specifically:
- Null elements of list columns should also be empty. The starting offset of a null element should be equal to the ending offset.
@@ -746,8 +753,8 @@ where compile time was a problem is in types used to store indices, which can be
The "Indexalator", or index-normalizing iterator (`include/cudf/detail/indexalator.cuh`), can be
used for index types (integers) without requiring a type-specific instance. It can be used for any
iterator interface for reading an array of integer values of type `int8`, `int16`, `int32`,
`int64`, `uint8`, `uint16`, `uint32`, or `uint64`. Reading specific elements always return a
`cudf::size_type` integer.
`int64`, `uint8`, `uint16`, `uint32`, or `uint64`. Reading specific elements always returns a
[`cudf::size_type`](#cudfsize_type) integer.
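
The idea of index normalization can be illustrated with a host-only sketch. This is not the libcudf indexalator: `index_kind` and `normalized_index_reader` are hypothetical names, the `size_type` alias is an assumed stand-in for `cudf::size_type`, and the real implementation is a device iterator rather than a class with `operator[]`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using size_type = std::int32_t;  // assumption: stand-in for cudf::size_type

// Hypothetical tag describing how the underlying integers are stored.
enum class index_kind { INT8, INT16, INT32, INT64, UINT8, UINT16, UINT32, UINT64 };

// Hypothetical reader: one non-templated type that reads any of the
// supported integer widths and always returns a size_type value.
class normalized_index_reader {
 public:
  normalized_index_reader(void const* data, index_kind kind) : data_{data}, kind_{kind} {}

  size_type operator[](size_type idx) const
  {
    switch (kind_) {
      case index_kind::INT8: return read<std::int8_t>(idx);
      case index_kind::INT16: return read<std::int16_t>(idx);
      case index_kind::INT32: return read<std::int32_t>(idx);
      case index_kind::INT64: return read<std::int64_t>(idx);
      case index_kind::UINT8: return read<std::uint8_t>(idx);
      case index_kind::UINT16: return read<std::uint16_t>(idx);
      case index_kind::UINT32: return read<std::uint32_t>(idx);
      case index_kind::UINT64: return read<std::uint64_t>(idx);
    }
    return 0;  // unreachable for valid index_kind values
  }

 private:
  // Reinterpret the buffer as T and narrow (or widen) the element to size_type.
  template <typename T>
  size_type read(size_type idx) const
  {
    return static_cast<size_type>(static_cast<T const*>(data_)[idx]);
  }

  void const* data_;
  index_kind kind_;
};
```

Because the stored type is selected at run time, code that consumes the reader is compiled once instead of once per index type, which is the compile-time saving the surrounding text describes.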
Use the `indexalator_factory` to create an appropriate input iterator from a column_view. Example
input iterator usage:
@@ -1104,7 +1111,7 @@ For list columns, the parent column's type is `LIST` and contains no data, but i
the number of lists in the column, and its null mask represents the validity of each list element.
The parent has two children.
1. A non-nullable column of `INT32` elements that indicates the offset to the beginning of each list
1. A non-nullable column of [`size_type`](#cudfsize_type) elements that indicates the offset to the beginning of each list
in a dense column of elements.
2. A column containing the actual data and optional null mask for all elements of all the lists
packed together.
@@ -1152,7 +1159,7 @@ a non-nullable column of `INT8` data. The parent column's type is `STRING` and c
but its size represents the number of strings in the column, and its null mask represents the
validity of each string. To summarize, the strings column children are:
1. A non-nullable column of `INT32` elements that indicates the offset to the beginning of each
1. A non-nullable column of [`size_type`](#cudfsize_type) elements that indicates the offset to the beginning of each
string in a dense column of all characters.
2. A non-nullable column of `INT8` elements of all the characters across all the strings packed
together.
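
The two-child layout summarized above can be modeled on the host as a rough sketch. `host_strings_column` and its helper functions are hypothetical and exist only for illustration; a real strings column keeps both children in device memory, and the `size_type` alias is an assumed stand-in for `cudf::size_type`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using size_type = std::int32_t;  // assumption: stand-in for cudf::size_type

// Hypothetical host-side model of the two children of a strings column.
struct host_strings_column {
  std::vector<size_type> offsets;  // number of strings + 1 entries
  std::vector<char> chars;         // characters of all strings packed together
};

// The offsets child has one more entry than there are strings.
size_type num_strings(host_strings_column const& col)
{
  return col.offsets.empty() ? 0 : static_cast<size_type>(col.offsets.size()) - 1;
}

// Length in bytes of string idx: the difference of two adjacent offsets.
size_type byte_length(host_strings_column const& col, size_type idx)
{
  return col.offsets[idx + 1] - col.offsets[idx];
}

// Copy of string idx: the characters in [offsets[idx], offsets[idx + 1]).
std::string string_at(host_strings_column const& col, size_type idx)
{
  auto const begin = col.chars.data() + col.offsets[idx];
  return std::string(begin, begin + byte_length(col, idx));
}
```

For the strings `"hi"`, `""`, and `"world"`, the offsets would be `{0, 2, 2, 7}` and the characters `hiworld`; the empty string is represented by two equal adjacent offsets.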
@@ -1264,7 +1271,7 @@ libcudf provides view types for nested column types as well as for the data elem
`cudf::strings_column_view` is a view of a strings column, like `cudf::column_view` is a view of
any `cudf::column`. `cudf::string_view` is a view of a single string, and therefore
`cudf::string_view` is the data type of a `cudf::column` of type `STRING` just like `int32_t` is the
data type for a `cudf::column` of type `INT32`. As it's name implies, this is a read-only object
data type for a `cudf::column` of type [`size_type`](#cudfsize_type). As its name implies, this is a read-only object
instance that points to device memory inside the strings column. It's lifespan is the same (or less)
as the column it views.
28 changes: 14 additions & 14 deletions cpp/include/cudf/lists/contains.hpp
@@ -42,7 +42,7 @@ namespace lists {
*
* @param lists Lists column whose `n` rows are to be searched
* @param search_key The scalar key to be looked up in each list row
* @param mr Device memory resource used to allocate the returned column's device memory.
* @param mr Device memory resource used to allocate the returned column's device memory
* @return BOOL8 column of `n` rows with the result of the lookup
*/
std::unique_ptr<column> contains(
@@ -64,7 +64,7 @@ std::unique_ptr<column> contains(
*
* @param lists Lists column whose `n` rows are to be searched
* @param search_keys Column of elements to be looked up in each list row
* @param mr Device memory resource used to allocate the returned column's device memory.
* @param mr Device memory resource used to allocate the returned column's device memory
* @return BOOL8 column of `n` rows with the result of the lookup
*/
std::unique_ptr<column> contains(
@@ -85,7 +85,7 @@ std::unique_ptr<column> contains(
* Nulls inside non-null nested elements (such as lists or structs) are not considered.
*
* @param lists Lists column whose `n` rows are to be searched
* @param mr Device memory resource used to allocate the returned column's device memory.
* @param mr Device memory resource used to allocate the returned column's device memory
* @return BOOL8 column of `n` rows with the result of the lookup
*/
std::unique_ptr<column> contains_nulls(
@@ -102,7 +102,7 @@ enum class duplicate_find_option : int32_t {
};

/**
* @brief Create a column of `size_type` values indicating the position of a search key
* @brief Create a column of values indicating the position of a search key
* within each list row in the `lists` column
*
* The output column has as many elements as there are rows in the input `lists` column.
@@ -119,14 +119,14 @@ enum class duplicate_find_option : int32_t {
* If `find_option == FIND_LAST`, the position of the last match in the list row is
* returned.
*
* @throw cudf::data_type_error If `search_keys` type does not match the element type in `lists`
*
* @param lists Lists column whose `n` rows are to be searched
* @param search_key The scalar key to be looked up in each list row
* @param find_option Whether to return the position of the first match (`FIND_FIRST`) or
* last (`FIND_LAST`)
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return INT32 column of `n` rows with the location of the `search_key`
*
* @throw cudf::data_type_error If `search_keys` type does not match the element type in `lists`
* @param mr Device memory resource used to allocate the returned column's device memory
* @return column of `n` rows with the location of the `search_key`
*/
std::unique_ptr<column> index_of(
cudf::lists_column_view const& lists,
@@ -135,7 +135,7 @@ std::unique_ptr<column> index_of(
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Create a column of `size_type` values indicating the position of a search key
* @brief Create a column of values indicating the position of a search key
* row within the corresponding list row in the `lists` column
*
* The output column has as many elements as there are rows in the input `lists` column.
@@ -152,16 +152,16 @@ std::unique_ptr<column> index_of(
* If `find_option == FIND_LAST`, the position of the last match in the list row is
* returned.
*
* @throw cudf::logic_error If `search_keys` does not match `lists` in its number of rows
* @throw cudf::data_type_error If `search_keys` type does not match the element type in `lists`
*
* @param lists Lists column whose `n` rows are to be searched
* @param search_keys A column of search keys to be looked up in each corresponding row of
* `lists`
* @param find_option Whether to return the position of the first match (`FIND_FIRST`) or
* last (`FIND_LAST`)
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return INT32 column of `n` rows with the location of the `search_key`
*
* @throw cudf::logic_error If `search_keys` does not match `lists` in its number of rows
* @throw cudf::data_type_error If `search_keys` type does not match the element type in `lists`
* @param mr Device memory resource used to allocate the returned column's device memory
* @return column of `n` rows with the location of the `search_key`
*/
std::unique_ptr<column> index_of(
cudf::lists_column_view const& lists,
8 changes: 4 additions & 4 deletions cpp/include/cudf/lists/count_elements.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2021-2022, NVIDIA CORPORATION.
* Copyright (c) 2021-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -44,9 +44,9 @@ namespace lists {
* Any null input element will result in a corresponding null entry
* in the output column.
*
* @param input Input lists column.
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return New INT32 column with the number of elements for each row.
* @param input Input lists column
* @param mr Device memory resource used to allocate the returned column's device memory
* @return New column with the number of elements for each row
*/
std::unique_ptr<column> count_elements(
lists_column_view const& input,
6 changes: 3 additions & 3 deletions cpp/include/cudf/search.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2022, NVIDIA CORPORATION.
* Copyright (c) 2019-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -64,7 +64,7 @@ namespace cudf {
* @param column_order Vector of column sort order
* @param null_precedence Vector of null_precedence enums needles
* @param mr Device memory resource used to allocate the returned column's device memory
* @return A non-nullable column of cudf::size_type elements containing the insertion points
* @return A non-nullable column of elements containing the insertion points
*/
std::unique_ptr<column> lower_bound(
table_view const& haystack,
@@ -104,7 +104,7 @@ std::unique_ptr<column> lower_bound(
* @param column_order Vector of column sort order
* @param null_precedence Vector of null_precedence enums needles
* @param mr Device memory resource used to allocate the returned column's device memory
* @return A non-nullable column of cudf::size_type elements containing the insertion points
* @return A non-nullable column of elements containing the insertion points
*/
std::unique_ptr<column> upper_bound(
table_view const& haystack,
2 changes: 1 addition & 1 deletion cpp/include/cudf/sorting.hpp
@@ -44,7 +44,7 @@ namespace cudf {
* for each column. Size must be equal to `input.num_columns()` or empty.
* If empty, all columns will be sorted in `null_order::BEFORE`.
* @param mr Device memory resource used to allocate the returned column's device memory
* @return A non-nullable column of `size_type` elements containing the permuted row indices of
* @return A non-nullable column of elements containing the permuted row indices of
* `input` if it were sorted
*/
std::unique_ptr<column> sorted_order(
36 changes: 18 additions & 18 deletions cpp/include/cudf/strings/attributes.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2022, NVIDIA CORPORATION.
* Copyright (c) 2019-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -32,44 +32,44 @@ namespace strings {
*/

/**
* @brief Returns an integer numeric column containing the length of each string in
* characters.
* @brief Returns a column containing character lengths
* of each string in the given column
*
* The output column will have the same number of rows as the
* specified strings column. Each row value will be the number of
* characters in the corresponding string.
*
* Any null string will result in a null entry for that row in the output column.
*
* @param strings Strings instance for this operation.
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return New INT32 column with lengths for each string.
* @param input Strings instance for this operation
* @param mr Device memory resource used to allocate the returned column's device memory
* @return New column with lengths for each string
*/
std::unique_ptr<column> count_characters(
strings_column_view const& strings,
strings_column_view const& input,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns a numeric column containing the length of each string in
* bytes.
* @brief Returns a column containing byte lengths
* of each string in the given column
*
* The output column will have the same number of rows as the
* specified strings column. Each row value will be the number of
* bytes in the corresponding string.
*
* Any null string will result in a null entry for that row in the output column.
*
* @param strings Strings instance for this operation.
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return New INT32 column with the number of bytes for each string.
* @param input Strings instance for this operation
* @param mr Device memory resource used to allocate the returned column's device memory
* @return New column with the number of bytes for each string
*/
std::unique_ptr<column> count_bytes(
strings_column_view const& strings,
strings_column_view const& input,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Creates a numeric column with code point values (integers) for each
* character of each string.
* character of each string
*
* A code point is the integer value representation of a character.
* For example, the code point value for the character 'A' in UTF-8 is 65.
@@ -79,12 +79,12 @@ std::unique_ptr<column> count_bytes(
*
* Any null string is ignored. No null entries will appear in the output column.
*
* @param strings Strings instance for this operation.
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return New INT32 column with code point integer values for each character.
* @param input Strings instance for this operation
* @param mr Device memory resource used to allocate the returned column's device memory
* @return New INT32 column with code point integer values for each character
*/
std::unique_ptr<column> code_points(
strings_column_view const& strings,
strings_column_view const& input,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/** @} */ // end of strings_apis group
2 changes: 1 addition & 1 deletion cpp/include/cudf/strings/contains.hpp
@@ -165,7 +165,7 @@ std::unique_ptr<column> matches_re(
* @param strings Strings instance for this operation
* @param prog Regex program instance
* @param mr Device memory resource used to allocate the returned column's device memory
* @return New INT32 column with counts for each string
* @return New column of match counts for each string
*/
std::unique_ptr<column> count_re(
strings_column_view const& strings,
2 changes: 1 addition & 1 deletion cpp/include/cudf/strings/detail/strings_children.cuh
@@ -59,7 +59,7 @@ auto make_strings_children(SizeAndExecuteFunction size_and_exec_fn,
rmm::mr::device_memory_resource* mr)
{
auto offsets_column = make_numeric_column(
data_type{type_id::INT32}, strings_count + 1, mask_state::UNALLOCATED, stream, mr);
data_type{type_to_id<size_type>()}, strings_count + 1, mask_state::UNALLOCATED, stream, mr);
auto offsets_view = offsets_column->mutable_view();
auto d_offsets = offsets_view.template data<int32_t>();
size_and_exec_fn.d_offsets = d_offsets;
Original file line number Diff line number Diff line change
@@ -175,7 +175,7 @@ std::unique_ptr<column> make_strings_column(CharIterator chars_begin,

// build offsets column -- this is the number of strings + 1
auto offsets_column = make_numeric_column(
data_type{type_id::INT32}, strings_count + 1, mask_state::UNALLOCATED, stream, mr);
data_type{type_to_id<size_type>()}, strings_count + 1, mask_state::UNALLOCATED, stream, mr);
auto offsets_view = offsets_column->mutable_view();
thrust::transform(rmm::exec_policy(stream),
offsets_begin,