rapidsai · rapids-bot · Mar 15, 2023 · Mar 8, 2023 · Mar 8, 2023 · Mar 8, 2023
@@ -269,6 +269,13 @@ An *immutable*, non-owning view of a table.
 
 A *mutable*, non-owning view of a table.
 
+## cudf::size_type
+
+`cudf::size_type` is the type used for the number of elements in a column, offsets to elements within a column, indices to address specific elements, segments for subsets of column elements, etc.
+It is equivalent to a signed, 32-bit integer type and therefore has a maximum value of 2147483647. Some APIs also accept negative index values, and those
+functions support a minimum value of -2147483648.
+This fundamental type therefore constrains not only column sizes but also output values, such as element counts.
+
 ## Spans
 
 libcudf provides `span` classes that mimic C++20 `std::span`, which is a lightweight
@@ -370,16 +377,16 @@ libcudf APIs should still perform any validation that does not require introspec
 To give some idea of what should or should not be validated, here are (non-exhaustive) lists of examples.
 
 **Things that libcudf should validate**:
-- Input column/table sizes or dtypes
+- Input column/table sizes or data types
 
 **Things that libcudf should not validate**:
 - Integer overflow
-- Ensuring that outputs will not exceed the 2GB size limit for a given set of inputs
+- Ensuring that outputs will not exceed the [2GB size](#cudfsize_type) limit for a given set of inputs
 
 
 ## libcudf expects nested types to have sanitized null masks
 
-Various libcudf APIs accepting columns of nested dtypes (such as `LIST` or `STRUCT`) may assume that these columns have been sanitized.
+Various libcudf APIs accepting columns of nested data types (such as `LIST` or `STRUCT`) may assume that these columns have been sanitized.
 In this context, sanitization refers to ensuring that the null elements in a column with a nested dtype are compatible with the elements of nested columns.
 Specifically:
 - Null elements of list columns should also be empty. The starting offset of a null element should be equal to the ending offset.
@@ -747,7 +754,7 @@ The "Indexalator", or index-normalizing iterator (`include/cudf/detail/indexalat
 used for index types (integers) without requiring a type-specific instance. It can be used for any
 iterator interface for reading an array of integer values of type `int8`, `int16`, `int32`,
 `int64`, `uint8`, `uint16`, `uint32`, or `uint64`. Reading specific elements always return a
-`cudf::size_type` integer.
+[`cudf::size_type`](#cudfsize_type) integer.
 
 Use the `indexalator_factory` to create an appropriate input iterator from a column_view. Example
 input iterator usage:
@@ -1064,7 +1071,7 @@ For list columns, the parent column's type is `LIST` and contains no data, but i
 the number of lists in the column, and its null mask represents the validity of each list element.
 The parent has two children.
 
-1. A non-nullable column of `INT32` elements that indicates the offset to the beginning of each list
+1. A non-nullable column of [`size_type`](#cudfsize_type) elements that indicates the offset to the beginning of each list
    in a dense column of elements.
 2. A column containing the actual data and optional null mask for all elements of all the lists
    packed together.
@@ -1112,7 +1119,7 @@ a non-nullable column of `INT8` data. The parent column's type is `STRING` and c
 but its size represents the number of strings in the column, and its null mask represents the
 validity of each string. To summarize, the strings column children are:
 
-1. A non-nullable column of `INT32` elements that indicates the offset to the beginning of each
+1. A non-nullable column of [`size_type`](#cudfsize_type) elements that indicates the offset to the beginning of each
    string in a dense column of all characters.
 2. A non-nullable column of `INT8` elements of all the characters across all the strings packed
    together.
@@ -1224,7 +1231,7 @@ libcudf provides view types for nested column types as well as for the data elem
 `cudf::strings_column_view` is a view of a strings column, like `cudf::column_view` is a view of
 any `cudf::column`. `cudf::string_view` is a view of a single string, and therefore
 `cudf::string_view` is the data type of a `cudf::column` of type `STRING` just like `int32_t` is the
-data type for a `cudf::column` of type `INT32`. As it's name implies, this is a read-only object
+data type for a `cudf::column` of type [`size_type`](#cudfsize_type). As its name implies, this is a read-only object
 instance that points to device memory inside the strings column. It's lifespan is the same (or less)
 as the column it views.
 

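The `cudf::size_type` section in the hunk above implies that element counts and offsets must stay within the signed 32-bit range. A minimal host-side sketch of such a range check is shown here. It assumes a stand-in alias `size_type` for `cudf::size_type` and a hypothetical helper `fits_in_size_type`; neither is taken from libcudf itself.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Stand-in for cudf::size_type (assumption: a signed 32-bit integer, as the guide states).
using size_type = std::int32_t;

// Hypothetical helper: returns true if the sum of all counts can be represented
// as a size_type. Each addition is checked before it happens, so the running
// total never exceeds the limit and the check itself cannot overflow.
inline bool fits_in_size_type(std::vector<std::int64_t> const& counts)
{
  constexpr std::int64_t limit = std::numeric_limits<size_type>::max();  // 2147483647
  std::int64_t total           = 0;
  for (auto const c : counts) {
    if (c < 0 || c > limit - total) { return false; }
    total += c;
  }
  return true;
}
```

For example, `fits_in_size_type({2147483647, 1})` returns `false` because the total would exceed the maximum value. Since the guide lists this output-size check among the things libcudf should not validate, a check of this kind would fall to the caller.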
@@ -43,7 +43,7 @@ namespace lists {
  * @param lists Lists column whose `n` rows are to be searched
  * @param search_key The scalar key to be looked up in each list row
  * @param mr Device memory resource used to allocate the returned column's device memory.
- * @return std::unique_ptr<column> BOOL8 column of `n` rows with the result of the lookup
+ * @return BOOL8 column of `n` rows with the result of the lookup
  */
 std::unique_ptr<column> contains(
   cudf::lists_column_view const& lists,
@@ -65,7 +65,7 @@ std::unique_ptr<column> contains(
  * @param lists Lists column whose `n` rows are to be searched
  * @param search_keys Column of elements to be looked up in each list row
  * @param mr Device memory resource used to allocate the returned column's device memory.
- * @return std::unique_ptr<column> BOOL8 column of `n` rows with the result of the lookup
+ * @return BOOL8 column of `n` rows with the result of the lookup
  */
 std::unique_ptr<column> contains(
   cudf::lists_column_view const& lists,
@@ -86,7 +86,7 @@ std::unique_ptr<column> contains(
  *
  * @param lists Lists column whose `n` rows are to be searched
  * @param mr Device memory resource used to allocate the returned column's device memory.
- * @return std::unique_ptr<column> BOOL8 column of `n` rows with the result of the lookup
+ * @return BOOL8 column of `n` rows with the result of the lookup
  */
 std::unique_ptr<column> contains_nulls(
   cudf::lists_column_view const& lists,
@@ -102,7 +102,7 @@ enum class duplicate_find_option : int32_t {
 };
 
 /**
- * @brief Create a column of `size_type` values indicating the position of a search key
+ * @brief Create a column of cudf::size_type values indicating the position of a search key
  * within each list row in the `lists` column
  *
  * The output column has as many elements as there are rows in the input `lists` column.
@@ -119,14 +119,14 @@ enum class duplicate_find_option : int32_t {
  * If `find_option == FIND_LAST`, the position of the last match in the list row is
  * returned.
  *
+ * @throw cudf::data_type_error If `search_key` type does not match the element type in `lists`
+ *
  * @param lists Lists column whose `n` rows are to be searched
  * @param search_key The scalar key to be looked up in each list row
  * @param find_option Whether to return the position of the first match (`FIND_FIRST`) or
  * last (`FIND_LAST`)
  * @param mr Device memory resource used to allocate the returned column's device memory.
- * @return std::unique_ptr<column> INT32 column of `n` rows with the location of the `search_key`
- *
- * @throw cudf::data_type_error If `search_keys` type does not match the element type in `lists`
+ * @return column of `n` rows with the location of the `search_key`
  */
 std::unique_ptr<column> index_of(
   cudf::lists_column_view const& lists,
@@ -135,7 +135,7 @@ std::unique_ptr<column> index_of(
   rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
 
 /**
- * @brief Create a column of `size_type` values indicating the position of a search key
+ * @brief Create a column of cudf::size_type values indicating the position of a search key
  * row within the corresponding list row in the `lists` column
  *
  * The output column has as many elements as there are rows in the input `lists` column.
@@ -152,16 +152,16 @@ std::unique_ptr<column> index_of(
  * If `find_option == FIND_LAST`, the position of the last match in the list row is
  * returned.
  *
+ * @throw cudf::logic_error If `search_keys` does not match `lists` in its number of rows
+ * @throw cudf::data_type_error If `search_keys` type does not match the element type in `lists`
+ *
  * @param lists Lists column whose `n` rows are to be searched
  * @param search_keys A column of search keys to be looked up in each corresponding row of
  * `lists`
  * @param find_option Whether to return the position of the first match (`FIND_FIRST`) or
  * last (`FIND_LAST`)
  * @param mr Device memory resource used to allocate the returned column's device memory.
- * @return std::unique_ptr<column> INT32 column of `n` rows with the location of the `search_key`
- *
- * @throw cudf::logic_error If `search_keys` does not match `lists` in its number of rows
- * @throw cudf::data_type_error If `search_keys` type does not match the element type in `lists`
+ * @return column of `n` rows with the location of each key in `search_keys`
  */
 std::unique_ptr<column> index_of(
   cudf::lists_column_view const& lists,

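The `FIND_FIRST` / `FIND_LAST` behavior documented for `index_of` can be illustrated with a host-side sketch over plain vectors. All names here (`find_option`, `index_of_sketch`) are hypothetical, and the use of `-1` for "not found" is an assumption, since the relevant part of the documentation is not shown in this diff. The real API operates on device columns.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for cudf::size_type (assumption: signed 32-bit integer).
using size_type = std::int32_t;

// Hypothetical mirror of the two find options named in the documentation.
enum class find_option : std::int32_t { FIND_FIRST, FIND_LAST };

// For each list row, returns the position of `key` in that row.
// Assumption: -1 marks a row that does not contain the key.
inline std::vector<size_type> index_of_sketch(std::vector<std::vector<int>> const& lists,
                                              int key,
                                              find_option option)
{
  std::vector<size_type> positions;
  positions.reserve(lists.size());
  for (auto const& row : lists) {
    size_type pos = -1;
    for (std::size_t i = 0; i < row.size(); ++i) {
      if (row[i] == key) {
        pos = static_cast<size_type>(i);
        // FIND_FIRST stops at the first match; FIND_LAST keeps scanning.
        if (option == find_option::FIND_FIRST) { break; }
      }
    }
    positions.push_back(pos);
  }
  return positions;
}
```

The output has one entry per input row, matching the statement that the output column has as many elements as there are rows in `lists`.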
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2021-2022, NVIDIA CORPORATION.
+ * Copyright (c) 2021-2023, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -46,7 +46,7 @@ namespace lists {
  *
  * @param input Input lists column.
  * @param mr Device memory resource used to allocate the returned column's device memory.
- * @return New INT32 column with the number of elements for each row.
+ * @return New column with the number of elements for each row.
  */
 std::unique_ptr<column> count_elements(
   lists_column_view const& input,

@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2019-2022, NVIDIA CORPORATION.
+ * Copyright (c) 2019-2023, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -32,39 +32,39 @@ namespace strings {
  */
 
 /**
- * @brief Returns an integer numeric column containing the length of each string in
- * characters.
+ * @brief Returns a column of cudf::size_type values containing the character length
+ * of each string in the given column
  *
  * The output column will have the same number of rows as the
  * specified strings column. Each row value will be the number of
  * characters in the corresponding string.
  *
  * Any null string will result in a null entry for that row in the output column.
  *
- * @param strings Strings instance for this operation.
- * @param mr Device memory resource used to allocate the returned column's device memory.
- * @return New INT32 column with lengths for each string.
+ * @param input Strings instance for this operation
+ * @param mr Device memory resource used to allocate the returned column's device memory
+ * @return New column with lengths for each string
  */
 std::unique_ptr<column> count_characters(
-  strings_column_view const& strings,
+  strings_column_view const& input,
   rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
 
 /**
- * @brief Returns a numeric column containing the length of each string in
- * bytes.
+ * @brief Returns a column of cudf::size_type values containing the byte length
+ * of each string in the given column
  *
  * The output column will have the same number of rows as the
  * specified strings column. Each row value will be the number of
  * bytes in the corresponding string.
  *
  * Any null string will result in a null entry for that row in the output column.
  *
- * @param strings Strings instance for this operation.
- * @param mr Device memory resource used to allocate the returned column's device memory.
- * @return New INT32 column with the number of bytes for each string.
+ * @param input Strings instance for this operation
+ * @param mr Device memory resource used to allocate the returned column's device memory
+ * @return New column with the number of bytes for each string
  */
 std::unique_ptr<column> count_bytes(
-  strings_column_view const& strings,
+  strings_column_view const& input,
   rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
 
 /**
@@ -79,12 +79,12 @@ std::unique_ptr<column> count_bytes(
  *
  * Any null string is ignored. No null entries will appear in the output column.
  *
- * @param strings Strings instance for this operation.
+ * @param input Strings instance for this operation.
  * @param mr Device memory resource used to allocate the returned column's device memory.
  * @return New INT32 column with code point integer values for each character.
  */
 std::unique_ptr<column> code_points(
-  strings_column_view const& strings,
+  strings_column_view const& input,
   rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
 
 /** @} */  // end of strings_apis group

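The difference between the character count and the byte count documented above only shows up for multi-byte characters. The sketch below works on a single host `std::string` and assumes the data is UTF-8 encoded. The helper names `utf8_bytes` and `utf8_characters` are hypothetical and are not the libcudf functions.

```cpp
#include <cstdint>
#include <string>

// Stand-in for cudf::size_type (assumption: signed 32-bit integer).
using size_type = std::int32_t;

// Number of bytes in the string.
inline size_type utf8_bytes(std::string const& s) { return static_cast<size_type>(s.size()); }

// Number of characters, assuming valid UTF-8: every byte that is not a
// continuation byte (bit pattern 10xxxxxx) starts a new character.
inline size_type utf8_characters(std::string const& s)
{
  size_type count = 0;
  for (unsigned char const byte : s) {
    if ((byte & 0xC0) != 0x80) { ++count; }
  }
  return count;
}
```

For the string `"h\xC3\xA9llo"` ("héllo" in UTF-8), `utf8_bytes` gives 6 and `utf8_characters` gives 5, because `é` occupies two bytes.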
@@ -165,7 +165,7 @@ std::unique_ptr<column> matches_re(
  * @param strings Strings instance for this operation
  * @param prog Regex program instance
  * @param mr Device memory resource used to allocate the returned column's device memory
- * @return New INT32 column with counts for each string
+ * @return New column of type cudf::size_type with counts for each string
  */
 std::unique_ptr<column> count_re(
   strings_column_view const& strings,

@@ -59,7 +59,7 @@ auto make_strings_children(SizeAndExecuteFunction size_and_exec_fn,
                            rmm::mr::device_memory_resource* mr)
 {
   auto offsets_column = make_numeric_column(
-    data_type{type_id::INT32}, strings_count + 1, mask_state::UNALLOCATED, stream, mr);
+    data_type{type_to_id<size_type>()}, strings_count + 1, mask_state::UNALLOCATED, stream, mr);
   auto offsets_view          = offsets_column->mutable_view();
   auto d_offsets             = offsets_view.template data<int32_t>();
   size_and_exec_fn.d_offsets = d_offsets;

@@ -175,7 +175,7 @@ std::unique_ptr<column> make_strings_column(CharIterator chars_begin,
 
   // build offsets column -- this is the number of strings + 1
   auto offsets_column = make_numeric_column(
-    data_type{type_id::INT32}, strings_count + 1, mask_state::UNALLOCATED, stream, mr);
+    data_type{type_to_id<size_type>()}, strings_count + 1, mask_state::UNALLOCATED, stream, mr);
   auto offsets_view = offsets_column->mutable_view();
   thrust::transform(rmm::exec_policy(stream),
                     offsets_begin,

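The comment "this is the number of strings + 1" in the hunk above refers to the offsets layout: entry `i` holds the starting byte of string `i`, and the final entry holds the total byte count. A host-side sketch of that layout follows, assuming a stand-in `size_type` alias and a hypothetical helper `make_offsets_sketch`. The real code fills the offsets on the device with `thrust::transform`.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for cudf::size_type (assumption: signed 32-bit integer).
using size_type = std::int32_t;

// Hypothetical helper: builds strings.size() + 1 offsets, where offsets[i] is
// the starting byte of string i and offsets.back() is the total byte count.
// No overflow check is done here; inputs are assumed to fit in size_type.
inline std::vector<size_type> make_offsets_sketch(std::vector<std::string> const& strings)
{
  std::vector<size_type> offsets(strings.size() + 1, 0);
  for (std::size_t i = 0; i < strings.size(); ++i) {
    offsets[i + 1] = offsets[i] + static_cast<size_type>(strings[i].size());
  }
  return offsets;
}
```

With this layout the length of string `i` is `offsets[i + 1] - offsets[i]`, so an empty string appears as two equal adjacent offsets.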
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2020-2022, NVIDIA CORPORATION.
+ * Copyright (c) 2020-2023, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -64,7 +64,7 @@ std::unique_ptr<cudf::column> tokenize(cudf::strings_column_view const& strings,
  *                  The default of empty string will separate tokens using whitespace.
  * @param stream CUDA stream used for device memory operations and kernel launches.
  * @param mr Device memory resource used to allocate the returned column's device memory.
- * @return New INT32 column of token counts.
+ * @return New column of token counts
  */
 std::unique_ptr<cudf::column> count_tokens(cudf::strings_column_view const& strings,
                                            cudf::string_scalar const& delimiter,
@@ -79,7 +79,7 @@ std::unique_ptr<cudf::column> count_tokens(cudf::strings_column_view const& stri
  * @param delimiters Strings used to separate each string into tokens.
  * @param stream CUDA stream used for device memory operations and kernel launches.
  * @param mr Device memory resource used to allocate the returned column's device memory.
- * @return New INT32 column of token counts.
+ * @return New column of token counts
  */
 std::unique_ptr<cudf::column> count_tokens(cudf::strings_column_view const& strings,
                                            cudf::strings_column_view const& delimiters,

@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2020, NVIDIA CORPORATION.
+ * Copyright (c) 2020-2023, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -116,7 +116,7 @@ std::unique_ptr<cudf::column> tokenize(
  * @param delimiter Strings used to separate each string into tokens.
  *                  The default of empty string will separate tokens using whitespace.
  * @param mr Device memory resource used to allocate the returned column's device memory.
- * @return New INT32 column of token counts.
+ * @return New column of cudf::size_type token counts
  */
 std::unique_ptr<cudf::column> count_tokens(
   cudf::strings_column_view const& strings,