Update libcudf counting functions to specify cudf::size_type #12904

Merged
merged 14 commits into from
Mar 15, 2023
21 changes: 14 additions & 7 deletions cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
Expand Up @@ -269,6 +269,13 @@ An *immutable*, non-owning view of a table.

A *mutable*, non-owning view of a table.

## cudf::size_type

The `cudf::size_type` is the type used for the number of elements in a column, offsets to elements within a column, indices to address specific elements, segments for subsets of column elements, etc.
It is equivalent to a signed, 32-bit integer type and therefore has a maximum value of 2147483647. Some APIs also accept negative index values, and those
functions support a minimum value of -2147483648.
This fundamental type also influences output values not just for column size limits but for counting elements as well.
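
As a rough illustration of these limits, the sketch below uses a stand-in alias rather than the real libcudf header. The namespace `sketch`, the constants, and the helper `resolve_index` are hypothetical names chosen for this example, and counting a negative index back from the end is only one possible interpretation.

```cpp
#include <cstdint>
#include <limits>

namespace sketch {

// Stand-in for cudf::size_type (assumption: a plain signed 32-bit integer alias).
using size_type = std::int32_t;

constexpr size_type max_elements = std::numeric_limits<size_type>::max();
constexpr size_type min_index    = std::numeric_limits<size_type>::min();

static_assert(max_elements == 2147483647, "largest element count or offset");
static_assert(min_index == -2147483647 - 1, "smallest negative index value");

// Hypothetical helper: treat a negative index as counting back from the end.
constexpr size_type resolve_index(size_type index, size_type size)
{
  return index < 0 ? index + size : index;
}

}  // namespace sketch
```

Under these assumptions, `sketch::resolve_index(-1, 10)` addresses the last of ten elements.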

## Spans

libcudf provides `span` classes that mimic C++20 `std::span`, which is a lightweight
Expand Down Expand Up @@ -370,16 +377,16 @@ libcudf APIs should still perform any validation that does not require introspec
To give some idea of what should or should not be validated, here are (non-exhaustive) lists of examples.

**Things that libcudf should validate**:
- Input column/table sizes or dtypes
- Input column/table sizes or data types

**Things that libcudf should not validate**:
- Integer overflow
- Ensuring that outputs will not exceed the 2GB size limit for a given set of inputs
- Ensuring that outputs will not exceed the [2GB size](#cudfsize_type) limit for a given set of inputs
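
Because libcudf does not check the output size limit itself, a caller that cares may want to do so before invoking an API. The following is a minimal host-side sketch of such a check; `fits_in_size_type` and the `sketch` namespace are hypothetical and not part of libcudf.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

namespace sketch {

// Stand-in for cudf::size_type (assumption: signed 32-bit integer).
using size_type = std::int32_t;

// Hypothetical caller-side check: would the combined sizes still be
// addressable with size_type? The sum is accumulated in 64 bits so the
// check itself cannot overflow for realistic inputs.
inline bool fits_in_size_type(std::vector<size_type> const& sizes)
{
  std::int64_t total = 0;
  for (auto const s : sizes) { total += s; }
  return total <= std::numeric_limits<size_type>::max();
}

}  // namespace sketch
```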


## libcudf expects nested types to have sanitized null masks

Various libcudf APIs accepting columns of nested dtypes (such as `LIST` or `STRUCT`) may assume that these columns have been sanitized.
Various libcudf APIs accepting columns of nested data types (such as `LIST` or `STRUCT`) may assume that these columns have been sanitized.
In this context, sanitization refers to ensuring that the null elements in a column with a nested dtype are compatible with the elements of nested columns.
Specifically:
- Null elements of list columns should also be empty. The starting offset of a null element should be equal to the ending offset.
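
The rule for list columns can be pictured with a small host-side check. This is only a sketch over plain vectors, not libcudf code: `nulls_are_empty` is a hypothetical name, and `offsets` is assumed to hold one more entry than `valid`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

namespace sketch {

// Stand-in for cudf::size_type (assumption: signed 32-bit integer).
using size_type = std::int32_t;

// Returns true when every null row is also empty, i.e. its starting offset
// equals its ending offset. Precondition: offsets.size() == valid.size() + 1.
inline bool nulls_are_empty(std::vector<size_type> const& offsets,
                            std::vector<bool> const& valid)
{
  for (std::size_t i = 0; i < valid.size(); ++i) {
    if (!valid[i] && offsets[i] != offsets[i + 1]) { return false; }
  }
  return true;
}

}  // namespace sketch
```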
Expand Down Expand Up @@ -747,7 +754,7 @@ The "Indexalator", or index-normalizing iterator (`include/cudf/detail/indexalat
used for index types (integers) without requiring a type-specific instance. It can be used for any
iterator interface for reading an array of integer values of type `int8`, `int16`, `int32`,
`int64`, `uint8`, `uint16`, `uint32`, or `uint64`. Reading specific elements always returns a
`cudf::size_type` integer.
[`cudf::size_type`](#cudfsize_type) integer.
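
The idea of index normalization can be sketched on the host as follows. This is not the indexalator implementation: the `index_type` enum, `read_as`, and `normalized_read` are hypothetical names, and the sketch only shows how one reader can serve every integer width while always yielding a 32-bit result.

```cpp
#include <cstdint>

namespace sketch {

// Stand-in for cudf::size_type (assumption: signed 32-bit integer).
using size_type = std::int32_t;

// Hypothetical runtime tag for the stored integer type.
enum class index_type { INT8, INT16, INT32, INT64, UINT8, UINT16, UINT32, UINT64 };

// Read element idx as type T and convert it to size_type.
// Values outside the size_type range are not representable and are not handled here.
template <typename T>
size_type read_as(void const* data, size_type idx)
{
  return static_cast<size_type>(static_cast<T const*>(data)[idx]);
}

// Dispatch on the runtime tag so callers need no type-specific instance.
inline size_type normalized_read(void const* data, index_type type, size_type idx)
{
  switch (type) {
    case index_type::INT8: return read_as<std::int8_t>(data, idx);
    case index_type::INT16: return read_as<std::int16_t>(data, idx);
    case index_type::INT32: return read_as<std::int32_t>(data, idx);
    case index_type::INT64: return read_as<std::int64_t>(data, idx);
    case index_type::UINT8: return read_as<std::uint8_t>(data, idx);
    case index_type::UINT16: return read_as<std::uint16_t>(data, idx);
    case index_type::UINT32: return read_as<std::uint32_t>(data, idx);
    case index_type::UINT64: return read_as<std::uint64_t>(data, idx);
  }
  return 0;  // not reached for valid tags
}

}  // namespace sketch
```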

Use the `indexalator_factory` to create an appropriate input iterator from a column_view. Example
input iterator usage:
Expand Down Expand Up @@ -1064,7 +1071,7 @@ For list columns, the parent column's type is `LIST` and contains no data, but i
the number of lists in the column, and its null mask represents the validity of each list element.
The parent has two children.

1. A non-nullable column of `INT32` elements that indicates the offset to the beginning of each list
1. A non-nullable column of [`size_type`](#cudfsize_type) elements that indicates the offset to the beginning of each list
in a dense column of elements.
2. A column containing the actual data and optional null mask for all elements of all the lists
packed together.
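
The two children described above can be modeled with plain vectors to show how offsets address the packed elements. This is a host-side sketch only; the `list_column` struct and its members are hypothetical, and the null mask is left out.

```cpp
#include <cstdint>
#include <vector>

namespace sketch {

// Stand-in for cudf::size_type (assumption: signed 32-bit integer).
using size_type = std::int32_t;

// Hypothetical model of a list column of int values (no null mask).
// offsets must hold number-of-lists + 1 entries.
struct list_column {
  std::vector<size_type> offsets;  // child 1: start of each list
  std::vector<int> elements;       // child 2: all elements packed together

  // Number of lists in the column.
  size_type size() const { return static_cast<size_type>(offsets.size()) - 1; }

  // Number of elements in list i.
  size_type count_elements(size_type i) const { return offsets[i + 1] - offsets[i]; }

  // Copy of the elements that belong to list i.
  std::vector<int> row(size_type i) const
  {
    return std::vector<int>(elements.begin() + offsets[i], elements.begin() + offsets[i + 1]);
  }
};

}  // namespace sketch
```

For example, offsets `{0, 2, 2, 5}` over elements `{1, 2, 3, 4, 5}` describe the three lists `[1, 2]`, `[]`, and `[3, 4, 5]`.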
Expand Down Expand Up @@ -1112,7 +1119,7 @@ a non-nullable column of `INT8` data. The parent column's type is `STRING` and c
but its size represents the number of strings in the column, and its null mask represents the
validity of each string. To summarize, the strings column children are:

1. A non-nullable column of `INT32` elements that indicates the offset to the beginning of each
1. A non-nullable column of [`size_type`](#cudfsize_type) elements that indicates the offset to the beginning of each
string in a dense column of all characters.
2. A non-nullable column of `INT8` elements of all the characters across all the strings packed
together.
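
A similar host-side sketch shows how the offsets and character children of a strings column relate. The `strings_column` struct, `make_strings`, and `row` are hypothetical names for illustration, and the null mask is omitted.

```cpp
#include <cstdint>
#include <string>
#include <vector>

namespace sketch {

// Stand-in for cudf::size_type (assumption: signed 32-bit integer).
using size_type = std::int32_t;

// Hypothetical model of the two children of a strings column.
struct strings_column {
  std::vector<size_type> offsets;  // child 1: start of each string
  std::vector<char> chars;         // child 2: all characters packed together
};

// Pack host strings into the offsets and chars layout.
inline strings_column make_strings(std::vector<std::string> const& rows)
{
  strings_column out;
  out.offsets.push_back(0);
  for (auto const& s : rows) {
    out.chars.insert(out.chars.end(), s.begin(), s.end());
    out.offsets.push_back(static_cast<size_type>(out.chars.size()));
  }
  return out;
}

// Recover string i from the packed characters.
inline std::string row(strings_column const& col, size_type i)
{
  return std::string(col.chars.begin() + col.offsets[i], col.chars.begin() + col.offsets[i + 1]);
}

}  // namespace sketch
```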
Expand Down Expand Up @@ -1224,7 +1231,7 @@ libcudf provides view types for nested column types as well as for the data elem
`cudf::strings_column_view` is a view of a strings column, like `cudf::column_view` is a view of
any `cudf::column`. `cudf::string_view` is a view of a single string, and therefore
`cudf::string_view` is the data type of a `cudf::column` of type `STRING` just like `int32_t` is the
data type for a `cudf::column` of type `INT32`. As its name implies, this is a read-only object
data type for a `cudf::column` of type [`size_type`](#cudfsize_type). As its name implies, this is a read-only object
instance that points to device memory inside the strings column. Its lifespan is the same (or less)
as the column it views.

24 changes: 12 additions & 12 deletions cpp/include/cudf/lists/contains.hpp
Expand Up @@ -43,7 +43,7 @@ namespace lists {
* @param lists Lists column whose `n` rows are to be searched
* @param search_key The scalar key to be looked up in each list row
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return std::unique_ptr<column> BOOL8 column of `n` rows with the result of the lookup
* @return BOOL8 column of `n` rows with the result of the lookup
*/
std::unique_ptr<column> contains(
cudf::lists_column_view const& lists,
Expand All @@ -65,7 +65,7 @@ std::unique_ptr<column> contains(
* @param lists Lists column whose `n` rows are to be searched
* @param search_keys Column of elements to be looked up in each list row
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return std::unique_ptr<column> BOOL8 column of `n` rows with the result of the lookup
* @return BOOL8 column of `n` rows with the result of the lookup
*/
std::unique_ptr<column> contains(
cudf::lists_column_view const& lists,
Expand All @@ -86,7 +86,7 @@ std::unique_ptr<column> contains(
*
* @param lists Lists column whose `n` rows are to be searched
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return std::unique_ptr<column> BOOL8 column of `n` rows with the result of the lookup
* @return BOOL8 column of `n` rows with the result of the lookup
*/
std::unique_ptr<column> contains_nulls(
cudf::lists_column_view const& lists,
Expand All @@ -102,7 +102,7 @@ enum class duplicate_find_option : int32_t {
};

/**
* @brief Create a column of `size_type` values indicating the position of a search key
* @brief Create a column of cudf::size_type values indicating the position of a search key
* within each list row in the `lists` column
*
* The output column has as many elements as there are rows in the input `lists` column.
Expand All @@ -119,14 +119,14 @@ enum class duplicate_find_option : int32_t {
* If `find_option == FIND_LAST`, the position of the last match in the list row is
* returned.
*
* @throw cudf::data_type_error If `search_keys` type does not match the element type in `lists`
*
* @param lists Lists column whose `n` rows are to be searched
* @param search_key The scalar key to be looked up in each list row
* @param find_option Whether to return the position of the first match (`FIND_FIRST`) or
* last (`FIND_LAST`)
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return std::unique_ptr<column> INT32 column of `n` rows with the location of the `search_key`
*
* @throw cudf::data_type_error If `search_keys` type does not match the element type in `lists`
* @return column of `n` rows with the location of the `search_key`
*/
std::unique_ptr<column> index_of(
cudf::lists_column_view const& lists,
Expand All @@ -135,7 +135,7 @@ std::unique_ptr<column> index_of(
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
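
The `FIND_FIRST` and `FIND_LAST` behavior documented above can be sketched for a single list row on the host. This is not the libcudf implementation: `index_of_row` is a hypothetical helper, the enum is redeclared locally, and returning -1 for a missing key is an assumption of this sketch.

```cpp
#include <cstdint>
#include <vector>

namespace sketch {

// Stand-in for cudf::size_type (assumption: signed 32-bit integer).
using size_type = std::int32_t;

// Local copy of the option names used in the documentation above.
enum class duplicate_find_option : std::int32_t { FIND_FIRST, FIND_LAST };

// Hypothetical per-row search: position of the first or last match of key.
// Assumption: -1 marks a key that does not occur in the row.
inline size_type index_of_row(std::vector<int> const& row,
                              int key,
                              duplicate_find_option find_option)
{
  size_type found = -1;
  for (size_type i = 0; i < static_cast<size_type>(row.size()); ++i) {
    if (row[i] == key) {
      found = i;
      if (find_option == duplicate_find_option::FIND_FIRST) { break; }
    }
  }
  return found;
}

}  // namespace sketch
```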

/**
* @brief Create a column of `size_type` values indicating the position of a search key
* @brief Create a column of cudf::size_type values indicating the position of a search key
* row within the corresponding list row in the `lists` column
*
* The output column has as many elements as there are rows in the input `lists` column.
Expand All @@ -152,16 +152,16 @@ std::unique_ptr<column> index_of(
* If `find_option == FIND_LAST`, the position of the last match in the list row is
* returned.
*
* @throw cudf::logic_error If `search_keys` does not match `lists` in its number of rows
* @throw cudf::data_type_error If `search_keys` type does not match the element type in `lists`
*
* @param lists Lists column whose `n` rows are to be searched
* @param search_keys A column of search keys to be looked up in each corresponding row of
* `lists`
* @param find_option Whether to return the position of the first match (`FIND_FIRST`) or
* last (`FIND_LAST`)
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return std::unique_ptr<column> INT32 column of `n` rows with the location of the `search_key`
*
* @throw cudf::logic_error If `search_keys` does not match `lists` in its number of rows
* @throw cudf::data_type_error If `search_keys` type does not match the element type in `lists`
* @return column of `n` rows with the location of the `search_key`
*/
std::unique_ptr<column> index_of(
cudf::lists_column_view const& lists,
4 changes: 2 additions & 2 deletions cpp/include/cudf/lists/count_elements.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2021-2022, NVIDIA CORPORATION.
* Copyright (c) 2021-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -46,7 +46,7 @@ namespace lists {
*
* @param input Input lists column.
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return New INT32 column with the number of elements for each row.
* @return New column with the number of elements for each row.
*/
std::unique_ptr<column> count_elements(
lists_column_view const& input,
30 changes: 15 additions & 15 deletions cpp/include/cudf/strings/attributes.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2022, NVIDIA CORPORATION.
* Copyright (c) 2019-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -32,39 +32,39 @@ namespace strings {
*/

/**
* @brief Returns an integer numeric column containing the length of each string in
* characters.
* @brief Returns a column containing cudf::size_type of character lengths
* of each string in the given column
*
* The output column will have the same number of rows as the
* specified strings column. Each row value will be the number of
* characters in the corresponding string.
*
* Any null string will result in a null entry for that row in the output column.
*
* @param strings Strings instance for this operation.
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return New INT32 column with lengths for each string.
* @param input Strings instance for this operation
* @param mr Device memory resource used to allocate the returned column's device memory
* @return New column with lengths for each string
*/
std::unique_ptr<column> count_characters(
strings_column_view const& strings,
strings_column_view const& input,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns a numeric column containing the length of each string in
* bytes.
* @brief Returns a column containing cudf::size_type of byte lengths
* of each string in the given column
*
* The output column will have the same number of rows as the
* specified strings column. Each row value will be the number of
* bytes in the corresponding string.
*
* Any null string will result in a null entry for that row in the output column.
*
* @param strings Strings instance for this operation.
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return New INT32 column with the number of bytes for each string.
* @param input Strings instance for this operation
* @param mr Device memory resource used to allocate the returned column's device memory
* @return New column with the number of bytes for each string
*/
std::unique_ptr<column> count_bytes(
strings_column_view const& strings,
strings_column_view const& input,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
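
The difference between character and byte lengths only shows up for multi-byte characters. The host-side sketch below assumes UTF-8 encoded input and handles a single string; `count_characters` and `count_bytes` here are hypothetical stand-ins that share only their names with the declarations above, and null handling is omitted.

```cpp
#include <cstdint>
#include <string>

namespace sketch {

// Stand-in for cudf::size_type (assumption: signed 32-bit integer).
using size_type = std::int32_t;

// Number of bytes in the string.
inline size_type count_bytes(std::string const& s) { return static_cast<size_type>(s.size()); }

// Number of UTF-8 characters: every byte that is not a continuation byte
// (bit pattern 10xxxxxx) starts a new character.
inline size_type count_characters(std::string const& s)
{
  size_type count = 0;
  for (unsigned char const byte : s) {
    if ((byte & 0xC0) != 0x80) { ++count; }
  }
  return count;
}

}  // namespace sketch
```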

/**
Expand All @@ -79,12 +79,12 @@ std::unique_ptr<column> count_bytes(
*
* Any null string is ignored. No null entries will appear in the output column.
*
* @param strings Strings instance for this operation.
* @param input Strings instance for this operation.
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return New INT32 column with code point integer values for each character.
*/
std::unique_ptr<column> code_points(
strings_column_view const& strings,
strings_column_view const& input,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/** @} */ // end of strings_apis group
2 changes: 1 addition & 1 deletion cpp/include/cudf/strings/contains.hpp
Expand Up @@ -165,7 +165,7 @@ std::unique_ptr<column> matches_re(
* @param strings Strings instance for this operation
* @param prog Regex program instance
* @param mr Device memory resource used to allocate the returned column's device memory
* @return New INT32 column with counts for each string
* @return New column of type cudf::size_type with counts for each string
*/
std::unique_ptr<column> count_re(
strings_column_view const& strings,
2 changes: 1 addition & 1 deletion cpp/include/cudf/strings/detail/strings_children.cuh
Expand Up @@ -59,7 +59,7 @@ auto make_strings_children(SizeAndExecuteFunction size_and_exec_fn,
rmm::mr::device_memory_resource* mr)
{
auto offsets_column = make_numeric_column(
data_type{type_id::INT32}, strings_count + 1, mask_state::UNALLOCATED, stream, mr);
data_type{type_to_id<size_type>()}, strings_count + 1, mask_state::UNALLOCATED, stream, mr);
auto offsets_view = offsets_column->mutable_view();
auto d_offsets = offsets_view.template data<int32_t>();
size_and_exec_fn.d_offsets = d_offsets;
Expand Up @@ -175,7 +175,7 @@ std::unique_ptr<column> make_strings_column(CharIterator chars_begin,

// build offsets column -- this is the number of strings + 1
auto offsets_column = make_numeric_column(
data_type{type_id::INT32}, strings_count + 1, mask_state::UNALLOCATED, stream, mr);
data_type{type_to_id<size_type>()}, strings_count + 1, mask_state::UNALLOCATED, stream, mr);
auto offsets_view = offsets_column->mutable_view();
thrust::transform(rmm::exec_policy(stream),
offsets_begin,
6 changes: 3 additions & 3 deletions cpp/include/nvtext/detail/tokenize.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2020-2022, NVIDIA CORPORATION.
* Copyright (c) 2020-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -64,7 +64,7 @@ std::unique_ptr<cudf::column> tokenize(cudf::strings_column_view const& strings,
* The default of empty string will separate tokens using whitespace.
* @param stream CUDA stream used for device memory operations and kernel launches.
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return New INT32 column of token counts.
* @return New column of token counts
*/
std::unique_ptr<cudf::column> count_tokens(cudf::strings_column_view const& strings,
cudf::string_scalar const& delimiter,
Expand All @@ -79,7 +79,7 @@ std::unique_ptr<cudf::column> count_tokens(cudf::strings_column_view const& stri
* @param delimiters Strings used to separate each string into tokens.
* @param stream CUDA stream used for device memory operations and kernel launches.
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return New INT32 column of token counts.
* @return New column of token counts
*/
std::unique_ptr<cudf::column> count_tokens(cudf::strings_column_view const& strings,
cudf::strings_column_view const& delimiters,
4 changes: 2 additions & 2 deletions cpp/include/nvtext/tokenize.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2020, NVIDIA CORPORATION.
* Copyright (c) 2020-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -116,7 +116,7 @@ std::unique_ptr<cudf::column> tokenize(
* @param delimiter Strings used to separate each string into tokens.
* The default of empty string will separate tokens using whitespace.
* @param mr Device memory resource used to allocate the returned column's device memory.
* @return New INT32 column of token counts.
* @return New column of cudf::size_type of token counts
*/
std::unique_ptr<cudf::column> count_tokens(
cudf::strings_column_view const& strings,