[BUG] corruption in string column after contig split + partition #10717
Comments
@davidwendt graciously offered to take a look at this issue, so I've assigned it to him.
This seems like another case of: #4480, where This patch fixes this particular instance, but I am not 100% sure if this is the correct place, but essentially I believe we should detect that the element is the empty string, and return If this is the correct place (
@abellina I don't agree with this solution. I want to keep this function as efficient as possible. I think there is a better way to fix this at the places where the specific factory functions are used (which is rare). I will work on this tomorrow.
Closes #10717

Fixes bug introduced with changes in #10673 which uses the `cudf::make_strings_column` that accepts a span of `string_view` objects with a null-placeholder. The placeholder can be unintentionally created in `create_string_vector_from_column` when given a strings column where all the rows are empty. The utility is fixed to prevent creating the placeholder for empty strings. A gtest was added to scatter from/to an all-empty strings column to verify this behavior.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Bradley Dice (https://github.com/bdice)
- Robert Maynard (https://github.com/robertmaynard)

URL: #10724
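The failure mode described in this fix can be modeled on the host without cuDF. The sketch below is only an illustration, not cuDF code: `StrView`, `StringsColumn`, `to_views`, `from_views`, and `make_all_empty_column` are hypothetical names. It assumes that the null placeholder is a view whose data pointer is null, and that an all-empty column can carry a null character buffer. Under those assumptions, a valid empty row produces a view that is indistinguishable from the placeholder unless empty strings are given a non-null pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a string view: a pointer plus a length.
struct StrView {
  const char* data;
  std::size_t size;
};

// Hypothetical strings column: one character buffer, row offsets, validity.
struct StringsColumn {
  const char* chars;                 // assumed to be nullptr when no characters exist
  std::vector<std::size_t> offsets;  // rows + 1 entries
  std::vector<bool> valid;           // true = row is not null
};

// One row of the rebuilt column.
struct Row {
  bool is_null;
  std::string value;
};

// Build one view per row. Null rows become the placeholder {nullptr, 0}.
// With guard_empty == false, a valid empty row in a column whose chars
// pointer is nullptr also ends up as {nullptr, 0}.
std::vector<StrView> to_views(const StringsColumn& col, bool guard_empty) {
  assert(col.offsets.size() == col.valid.size() + 1);
  static const char empty_chars[] = "";
  std::vector<StrView> views;
  for (std::size_t i = 0; i < col.valid.size(); ++i) {
    if (!col.valid[i]) {
      views.push_back({nullptr, 0});
      continue;
    }
    const std::size_t begin = col.offsets[i];
    const std::size_t len   = col.offsets[i + 1] - begin;
    if (len == 0 && guard_empty) {
      // Give empty strings a non-null pointer so they never match the placeholder.
      views.push_back({empty_chars, 0});
      continue;
    }
    views.push_back({col.chars + begin, len});
  }
  return views;
}

// Rebuild rows from views, treating a null data pointer as "null row".
std::vector<Row> from_views(const std::vector<StrView>& views) {
  std::vector<Row> rows;
  for (const StrView& v : views) {
    if (v.data == nullptr) {
      rows.push_back({true, std::string()});
    } else {
      rows.push_back({false, std::string(v.data, v.size)});
    }
  }
  return rows;
}

// A column of `rows` valid, empty strings with no character buffer at all.
StringsColumn make_all_empty_column(std::size_t rows) {
  return StringsColumn{nullptr,
                       std::vector<std::size_t>(rows + 1, 0),
                       std::vector<bool>(rows, true)};
}
```

Round-tripping `make_all_empty_column(3)` through `to_views(col, false)` and `from_views` turns every row into a null in this model, while `to_views(col, true)` preserves the empty strings. Guarding only at the point where views are created leaves the rebuild path untouched, which matches the stated preference for fixing the utility rather than the factory.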
This issue appears after a specific combination of cuDF calls and started occurring after this PR was merged: #10673.
If we first call `contiguous_split` with 0 splits (an operation that should return the original data but packed in a single buffer) and later call `partition` on a row that has an empty string `""`, the result is a null string instead of the original empty string.

Before the linked PR this wasn't an issue (we noticed it in tests that broke, see NVIDIA/spark-rapids#5286). If `contiguous_split` is removed from the sequence of calls, the issue doesn't manifest.

Here's a repro case, and thanks to @davidwendt for taking a cuDF Java example I had and converting it to the C++ equivalent.
In this case, the last print, `cudf::test::print(result.first->view().column(0));`, prints `NULL` when it should be the empty string.

Thanks to @jlowe and @revans2 for pointing me in the right direction.