-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Require null counts on column construction #11968
Comments
For completeness, one other idea for what we could do here is still allow the null count to be unknown, but change the behavior of the This would make its behavior more akin to a I'm not exactly advocating for this idea, but I wanted to at least mention it. |
This PR replaces uses of `cudf::UNKNOWN_NULL_COUNT` where the null count is either already known or trivially computed. Contributes to #11968 Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - David Wendt (https://github.com/davidwendt) - Nghia Truong (https://github.com/ttnghia) URL: #13102
Contributes to #11968. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Jason Lowe (https://github.com/jlowe) - Nghia Truong (https://github.com/ttnghia) - David Wendt (https://github.com/davidwendt) - Bradley Dice (https://github.com/bdice) URL: #13134
The total number of nulls in the output can be computed by summing the nulls in the input columns. Contributes to #11968 Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Nghia Truong (https://github.com/ttnghia) - David Wendt (https://github.com/davidwendt) URL: #13104
Calculates the null-count in the `cudf::detail::slice()` function. This requires adding a stream parameter to the function and updating the callers to pass the stream. Also moved the function definition to the `slice.cu` file since there are only two possible values for the template parameter. Labeling this with non-breaking since it is a detail function. Contributes to: #11968 Authors: - David Wendt (https://github.com/davidwendt) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Nghia Truong (https://github.com/ttnghia) - Divye Gala (https://github.com/divyegala) - Vyas Ramasubramani (https://github.com/vyasr) URL: #13124
Removes the `UNKNOWN_NULL_COUNT` usage in the `linked_column_view::column_view()` conversion operator. The null-count is copied from the parent instance. The `linked_column_view` class was reworked to move the C++ function definitions from the header file to a new .cpp file. Contributes to: #11968 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Nghia Truong (https://github.com/ttnghia) URL: #13121
Change the `cudf::test::make_null_mask` to return both the null-mask and the null-count. Callers can then use this null-count instead of `UNKNOWN_NULL_COUNT`. These changes include removing `UNKNOWN_NULL_COUNT` usage from the libcudf C++ test source code. One side-effect found that strings column with all nulls can technically have no children but using `UNKNOWN_NULL_COUNT` allowed the check for this to be bypassed. Therefore many utilities started to fail when `UNKNOWN_NULL_COUNT` was removed. The factory was modified to remove the check which results in an offsets column and an empty chars column as children. More code will likely need to be change when the `UNKNOWN_NULL_COUNT` is no longer used as a default parameter for factories and other column functions. No behavior is changed. Since the `cudf::test::make_null_mask` is technically a public API, this PR could be marked as a breaking change as well. Contributes to: #11968 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - MithunR (https://github.com/mythrocks) - Vyas Ramasubramani (https://github.com/vyasr) URL: #13081
) Add `null_count` parameter to the `cudf::io::json::experimental::detail::parse_data` function which already accepts a `null_mask`. Normally, the callers already know the count. This unction can use the parameter to help build the output column. Found while working on #13081 Contributes to: #11968 Authors: - David Wendt (https://github.com/davidwendt) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Nghia Truong (https://github.com/ttnghia) URL: #13107
…idsai#13107) Add `null_count` parameter to the `cudf::io::json::experimental::detail::parse_data` function which already accepts a `null_mask`. Normally, the callers already know the count. This unction can use the parameter to help build the output column. Found while working on rapidsai#13081 Contributes to: rapidsai#11968 Authors: - David Wendt (https://github.com/davidwendt) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Nghia Truong (https://github.com/ttnghia) URL: rapidsai#13107
Fix calls to the `column::set_null_mask()` function to always pass the null-count value in gtests source files. Contributes to: #11968 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Vyas Ramasubramani (https://github.com/vyasr) URL: #13148
Fix calls to the `column::set_null_mask()` function to always pass the null-count value in libcudf source files. Contributes to: #11968 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Vyas Ramasubramani (https://github.com/vyasr) - Mark Harris (https://github.com/harrism) URL: #13149
Contributes to #11968 Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - David Wendt (https://github.com/davidwendt) - Divye Gala (https://github.com/divyegala) - Nghia Truong (https://github.com/ttnghia) URL: #13205
…ctory (#13227) Removes the `UNKNOWN_NULL_COUNT` default parameter from `cudf::make_strings_column` factory functions and updates the callers to always pass the null-count. Labeling this with breaking since it is a public function. Contributes to: #11968 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Mark Harris (https://github.com/harrism) - Karthikeyan (https://github.com/karthikeyann) URL: #13227
…ns (#13258) Removes the `UNKNOWN_NULL_COUNT` as the default value for the null-count parameter for `cudf::make_fixed_point_column` and `cudf::make_numeric_column` factory functions. Also fixes some places where the `stream` parameter was not being passed along. Labeling this with breaking since it is a public function. Contributes to: #11968 Authors: - David Wendt (https://github.com/davidwendt) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Karthikeyan (https://github.com/karthikeyann) - Nghia Truong (https://github.com/ttnghia) URL: #13258
) Changes `cudf::state_null_count` to throw a `std::invalid_argument` error which `mask_state::UNINITIALIZED` is passed. This change is part of removing `UNKNOWN_NULL_COUNT` from cudf/libcudf. Reference: #11968 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Yunsong Wang (https://github.com/PointKernel) - Bradley Dice (https://github.com/bdice) URL: #13292
…#13331) Fixes a few more places where the null count was not being set in a libcudf function before returning a column. A couple of places were setting the null count value into the mutable-view object. So the value was calculated correctly but not set into the actual column object. Reference: #11968 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Nghia Truong (https://github.com/ttnghia) URL: #13331
Changes `cudf::detail::concatenate_masks` to also return the null count. This saves computing the null count in a separate kernel launch where this function is used. Also changes the `detail/concatenate.cuh` header to `detail/concatenate_masks.hpp` since contains no device code and only includes the `detail::concatenate_masks()` functions. The `detail::concatenate()` functions are already declared in `detail/concatenate.hpp`. This is marked as non-breaking since it only effects a detail function. Reference: #11968 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Mike Wilson (https://github.com/hyperbolic2346) - Bradley Dice (https://github.com/bdice) - AJ Schmidt (https://github.com/ajschmidt8) URL: #13330
Fixes rapidsai#13353. In preparation for rapidsai#11968, this change ensures that columns constructed from CUDF JNI do not have their null counts set to `UNKNOWN_NULL_COUNT` (i.e. `-1`). In cases where the caller invokes JNI functions with `UNKNOWN_NULL_COUNT`, the JNI layer computes the concrete null count from the validity mask, and sets this value in the column. The option to specify an optional null count through the Java API will likely be removed at a later date. Signed-off-by: MithunR <[email protected]>
This is a breaking change that removes default values. Removing the default null count value of `UNKNOWN_NULL_COUNT` is necessary since we are removing `UNKNOWN_NULL_COUNT` altogether. A potential alternative was setting the default to 0, which would correspond to a default null mask of `nullptr` (the current default). However, that change is potentially more dangerous if callers were previously setting the null mask to something other than `nullptr` without setting the null count and relying on the behavior of `UNKNOWN_NULL_COUNT`. Removing the default parameter altogether is therefore the safer option here. Contributes to #11968. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - David Wendt (https://github.com/davidwendt) - Nghia Truong (https://github.com/ttnghia) - Vukasin Milovanovic (https://github.com/vuule) URL: #13311
Part of #11968 Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Nghia Truong (https://github.com/ttnghia) - Ashwin Srinath (https://github.com/shwina) - Bradley Dice (https://github.com/bdice) URL: #13345
This is in prep for rapidsai/cudf#11968 and rapidsai/cudf#13372. `libcudf` will soon require that all CUDF columns are created with a known null-count. `UNKNOWN_NULL_COUNT` will no longer be supported, or even available as a code constant. This change replicates part of rapidsai/cudf#13355, as it applies to `row_conversion.cu`. The (single) reference to the unknown-null-count is replaced with a pre-calculated value. Signed-off-by: MithunR <[email protected]>
Fixes #13353. Depends on #13345. In preparation for #11968, this change ensures that columns constructed from CUDF JNI do not have their null counts set to `UNKNOWN_NULL_COUNT` (i.e. `-1`). In cases where the caller invokes JNI functions with `UNKNOWN_NULL_COUNT`, the JNI layer computes the concrete null count from the validity mask, and sets this value in the column. The current Java API remains unchanged; there should be no impact to user code. The option to specify an optional null count through the Java API will likely be removed at a later date. Signed-off-by: MithunR <[email protected]> Authors: - MithunR (https://github.com/mythrocks) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Jason Lowe (https://github.com/jlowe) - Nghia Truong (https://github.com/ttnghia) URL: #13355
This is in prep for rapidsai/cudf#11968 and rapidsai/cudf#13372. `libcudf` will soon require that all CUDF columns are created with a known null-count. `UNKNOWN_NULL_COUNT` will no longer be supported, or even available as a code constant. This change replicates part of rapidsai/cudf#13355, as it applies to `row_conversion.cu`. The (single) reference to the unknown-null-count is replaced with a pre-calculated value. Signed-off-by: MithunR <[email protected]>
This is the final PR for removing `UNKNOWN_NULL_COUNT` and the implicit kernel launch in the `null_count` methods of `column` and `column_view`. Depends on #13355 and #13341. Closes #11968 Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - MithunR (https://github.com/mythrocks) - Nghia Truong (https://github.com/ttnghia) - Karthikeyan (https://github.com/karthikeyann) URL: #13372
Is your feature request related to a problem? Please describe.
Currently the null count may be provided when constructing a column, or it may be left unknown to be computed lazily when it is needed. This approach has performance benefits in some cases, but it is problematic for exposing a stream-ordered API because null counts may be used in contexts where a stream is not available and requesting the null count instead triggers a kernel launch on cudf's default stream. Additionally, the current approach sometimes leads to calling code not providing a null count even though it is known, resulting in an unnecessary extra computation.
Describe the solution you'd like
We should change the column constructor to require the null count on construction. This change will guarantee that requesting the null count does not result in a kernel launch, which removes the most problematic known blocker for implementing a stream-ordered API in libcudf. We will also need to make the same change to
column_view
. When constructed from acolumn
, thecolumn_view
's null count will now be trivially available provided, but when constructed from an external data pointer it will again be the caller's responsibility to provide the null count.To support these new requirements, we should expose public utility functions for computing the null count given a null mask.
Describe alternatives you've considered
We could instead have the column/column_view constructors accept the null count optionally and then compute the null count if one is not provided. However, that would also necessitate accepting a stream in the constructor that may be used. The resulting API is more confusing and requires more logic, whereas the calling code already has all the information available to make the best decision here.
Additional context
There are certain cases that will be made slower by this requirement. Two cases in particular come to mind that we will need to be cognizant of and attempt to mitigate the costs:
The text was updated successfully, but these errors were encountered: