Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Require null counts on column construction #11968

Closed
vyasr opened this issue Oct 21, 2022 · 1 comment · Fixed by #13372
Closed

[FEA] Require null counts on column construction #11968

vyasr opened this issue Oct 21, 2022 · 1 comment · Fixed by #13372
Labels
1 - On Deck To be worked on next feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@vyasr
Copy link
Contributor

vyasr commented Oct 21, 2022

Is your feature request related to a problem? Please describe.
Currently the null count may be provided when constructing a column, or it may be left unknown to be computed lazily when it is needed. This approach has performance benefits in some cases, but it is problematic for exposing a stream-ordered API because null counts may be used in contexts where a stream is not available and requesting the null count instead triggers a kernel launch on cudf's default stream. Additionally, the current approach sometimes leads to calling code not providing a null count even though it is known, resulting in an unnecessary extra computation.

Describe the solution you'd like
We should change the column constructor to require the null count on construction. This change will guarantee that requesting the null count does not result in a kernel launch, which removes the most problematic known blocker for implementing a stream-ordered API in libcudf. We will also need to make the same change to column_view. When constructed from a column, the column_view's null count will now be trivially available provided, but when constructed from an external data pointer it will again be the caller's responsibility to provide the null count.

To support these new requirements, we should expose public utility functions for computing the null count given a null mask.

Describe alternatives you've considered
We could instead have the column/column_view constructors accept the null count optionally and then compute the null count if one is not provided. However, that would also necessitate accepting a stream in the constructor that may be used. The resulting API is more confusing and requires more logic, whereas the calling code already has all the information available to make the best decision here.

Additional context
There are certain cases that will be made slower by this requirement. Two cases in particular come to mind that we will need to be cognizant of and attempt to mitigate the costs:

  1. Places where null counts are not needed. In such cases, we will now be requiring a computation that could previously be omitted entirely.
  2. The creation of slices. Slices will also now need to always know their null counts, which could add significant costs to a number of APIs relying on slices. These functions may need to be rewritten to more intelligently precompute the null counts, or be otherwise modified to reduce and/or hide the costs of this extra work.
@vyasr vyasr added feature request New feature or request 1 - On Deck To be worked on next libcudf Affects libcudf (C++/CUDA) code. labels Oct 21, 2022
@vyasr vyasr added this to the Enable streams milestone Oct 21, 2022
@jrhemstad
Copy link
Contributor

Places where null counts are not needed. In such cases, we will now be requiring a computation that could previously be omitted entirely.

For completeness, one other idea for what we could do here is still allow the null count to be unknown, but change the behavior of the null_count() member function to throw an exception if invoked when the underlying null count is unknown.

This would make its behavior more akin to a std::optional than a lazily computed and memoized member.

I'm not exactly advocating for this idea, but I wanted to at least mention it.

rapids-bot bot pushed a commit that referenced this issue Apr 10, 2023
This PR replaces uses of `cudf::UNKNOWN_NULL_COUNT` where the null count is either already known or trivially computed.

Contributes to #11968

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13102
rapids-bot bot pushed a commit that referenced this issue Apr 13, 2023
rapids-bot bot pushed a commit that referenced this issue Apr 13, 2023
The total number of nulls in the output can be computed by summing the nulls in the input columns.

Contributes to #11968

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - David Wendt (https://github.com/davidwendt)

URL: #13104
rapids-bot bot pushed a commit that referenced this issue Apr 14, 2023
Calculates the null-count in the `cudf::detail::slice()` function. This requires adding a stream parameter to the function and updating the callers to pass the stream. Also moved the function definition to the `slice.cu` file since there are only two possible values for the template parameter.

Labeling this with non-breaking since it is a detail function.

Contributes to: #11968

Authors:
  - David Wendt (https://github.com/davidwendt)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Divye Gala (https://github.com/divyegala)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #13124
rapids-bot bot pushed a commit that referenced this issue Apr 14, 2023
Removes the `UNKNOWN_NULL_COUNT` usage in the `linked_column_view::column_view()` conversion operator. The null-count is copied from the parent instance. The `linked_column_view` class was reworked to move the C++ function definitions from the header file to a new .cpp file.

Contributes to: #11968

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13121
rapids-bot bot pushed a commit that referenced this issue Apr 14, 2023
Change the `cudf::test::make_null_mask` to return both the null-mask and the null-count. Callers can then use this null-count instead of `UNKNOWN_NULL_COUNT`. These changes include removing `UNKNOWN_NULL_COUNT` usage from the libcudf C++ test source code.

One side-effect found that strings column with all nulls can technically have no children but using `UNKNOWN_NULL_COUNT` allowed the check for this to be bypassed. Therefore many utilities started to fail when `UNKNOWN_NULL_COUNT` was removed. The factory was modified to remove the check which results in an offsets column and an empty chars column as children.

More code will likely need to be change when the `UNKNOWN_NULL_COUNT` is no longer used as a default parameter for factories and other column functions.

No behavior is changed. Since the `cudf::test::make_null_mask` is technically a public API, this PR could be marked as a breaking change as well.

Contributes to: #11968

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #13081
rapids-bot bot pushed a commit that referenced this issue Apr 17, 2023
)

Add `null_count` parameter to the `cudf::io::json::experimental::detail::parse_data` function which already accepts a `null_mask`. Normally, the callers already know the count. This unction can use the parameter to help build the output column.

Found while working on #13081
Contributes to: #11968

Authors:
  - David Wendt (https://github.com/davidwendt)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13107
shwina pushed a commit to shwina/cudf that referenced this issue Apr 18, 2023
…idsai#13107)

Add `null_count` parameter to the `cudf::io::json::experimental::detail::parse_data` function which already accepts a `null_mask`. Normally, the callers already know the count. This unction can use the parameter to help build the output column.

Found while working on rapidsai#13081
Contributes to: rapidsai#11968

Authors:
  - David Wendt (https://github.com/davidwendt)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Nghia Truong (https://github.com/ttnghia)

URL: rapidsai#13107
rapids-bot bot pushed a commit that referenced this issue Apr 20, 2023
Fix calls to the `column::set_null_mask()` function to always pass the null-count value in gtests source files.

Contributes to: #11968

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #13148
rapids-bot bot pushed a commit that referenced this issue Apr 20, 2023
Fix calls to the `column::set_null_mask()` function to always pass the null-count value in libcudf source files.

Contributes to: #11968

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Mark Harris (https://github.com/harrism)

URL: #13149
rapids-bot bot pushed a commit that referenced this issue Apr 26, 2023
rapids-bot bot pushed a commit that referenced this issue May 4, 2023
…ctory (#13227)

Removes the `UNKNOWN_NULL_COUNT` default parameter from `cudf::make_strings_column` factory functions and updates the callers to always pass the null-count.

Labeling this with breaking since it is a public function.

Contributes to: #11968

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #13227
rapids-bot bot pushed a commit that referenced this issue May 5, 2023
…ns (#13258)

Removes the `UNKNOWN_NULL_COUNT` as the default value for the null-count parameter for `cudf::make_fixed_point_column` and `cudf::make_numeric_column` factory functions.
Also fixes some places where the `stream` parameter was not being passed along.

Labeling this with breaking since it is a public function.

Contributes to: #11968

Authors:
  - David Wendt (https://github.com/davidwendt)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13258
rapids-bot bot pushed a commit that referenced this issue May 15, 2023
)

Changes `cudf::state_null_count` to throw a `std::invalid_argument` error which `mask_state::UNINITIALIZED` is passed. This change is part of removing `UNKNOWN_NULL_COUNT` from cudf/libcudf.

Reference: #11968

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)
  - Bradley Dice (https://github.com/bdice)

URL: #13292
rapids-bot bot pushed a commit that referenced this issue May 15, 2023
…#13331)

Fixes a few more places where the null count was not being set in a libcudf function before returning a column. A couple of places were setting the null count value into the mutable-view object. So the value was calculated correctly but not set into the actual column object.

Reference: #11968

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13331
rapids-bot bot pushed a commit that referenced this issue May 15, 2023
Changes `cudf::detail::concatenate_masks` to also return the null count. This saves computing the null count in a separate kernel launch where this function is used. Also changes the `detail/concatenate.cuh` header to `detail/concatenate_masks.hpp` since contains no device code and only includes the `detail::concatenate_masks()` functions. The `detail::concatenate()` functions are already declared in `detail/concatenate.hpp`.
This is marked as non-breaking since it only effects a detail function.
Reference: #11968

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Bradley Dice (https://github.com/bdice)
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: #13330
mythrocks added a commit to mythrocks/cudf that referenced this issue May 16, 2023
Fixes rapidsai#13353.

In preparation for rapidsai#11968, this change ensures that columns constructed from CUDF JNI do not have their null counts set to `UNKNOWN_NULL_COUNT` (i.e. `-1`). In cases where the caller invokes JNI functions with `UNKNOWN_NULL_COUNT`, the JNI layer computes the concrete null count from the validity mask, and sets this value in the column.

The option to specify an optional null count through the Java API will likely be removed at a later date.

Signed-off-by: MithunR <[email protected]>
rapids-bot bot pushed a commit that referenced this issue May 16, 2023
This is a breaking change that removes default values. Removing the default null count value of `UNKNOWN_NULL_COUNT` is necessary since we are removing `UNKNOWN_NULL_COUNT` altogether. A potential alternative was setting the default to 0, which would correspond to a default null mask of `nullptr` (the current default). However, that change is potentially more dangerous if callers were previously setting the null mask to something other than `nullptr` without setting the null count and relying on the behavior of `UNKNOWN_NULL_COUNT`. Removing the default parameter altogether is therefore the safer option here.

Contributes to #11968.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Nghia Truong (https://github.com/ttnghia)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #13311
rapids-bot bot pushed a commit that referenced this issue May 16, 2023
@vyasr vyasr mentioned this issue May 17, 2023
3 tasks
mythrocks added a commit to mythrocks/spark-rapids-jni that referenced this issue May 17, 2023
This is in prep for rapidsai/cudf#11968 and
rapidsai/cudf#13372.

`libcudf` will soon require that all CUDF columns are created with a
known null-count. `UNKNOWN_NULL_COUNT` will no longer be supported,
or even available as a code constant.

This change replicates part of rapidsai/cudf#13355,
as it applies to `row_conversion.cu`. The (single) reference to
the unknown-null-count is replaced with a pre-calculated value.

Signed-off-by: MithunR <[email protected]>
rapids-bot bot pushed a commit that referenced this issue May 18, 2023
Fixes #13353.
Depends on #13345.
    
In preparation for #11968, this change ensures that columns constructed from CUDF JNI do not have their null counts set to `UNKNOWN_NULL_COUNT` (i.e. `-1`). In cases where the caller invokes JNI functions with `UNKNOWN_NULL_COUNT`, the JNI layer computes the concrete null count from the validity mask, and sets this value in the column.

The current Java API remains unchanged; there should be no impact to user code.
    
The option to specify an optional null count through the Java API will likely be removed at a later date.
    
Signed-off-by: MithunR <[email protected]>

Authors:
  - MithunR (https://github.com/mythrocks)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13355
mythrocks added a commit to NVIDIA/spark-rapids-jni that referenced this issue May 18, 2023
This is in prep for rapidsai/cudf#11968 and
rapidsai/cudf#13372.

`libcudf` will soon require that all CUDF columns are created with a
known null-count. `UNKNOWN_NULL_COUNT` will no longer be supported,
or even available as a code constant.

This change replicates part of rapidsai/cudf#13355,
as it applies to `row_conversion.cu`. The (single) reference to
the unknown-null-count is replaced with a pre-calculated value.

Signed-off-by: MithunR <[email protected]>
rapids-bot bot pushed a commit that referenced this issue May 24, 2023
This is the final PR for removing `UNKNOWN_NULL_COUNT` and the implicit kernel launch in the `null_count` methods of `column` and `column_view`. 

Depends on #13355 and #13341.

Closes #11968

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #13372
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1 - On Deck To be worked on next feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants