Fix `__setitem__` on string columns when the scalar value ends in a null byte #12991

wence- · 2023-03-22T15:03:37Z

Description

Since numpy strings are fixed width and use a null byte as an
indicator of the end of the string, there is no way to distinguish
between `numpy.str_("abc\x00").item()` and `numpy.str_("abc").item()`.
This has consequences for the scalar preprocessing we do when
constructing a `cudf.Scalar`, since that usually goes through a numpy
`astype(...).item()` round trip, which drops trailing null bytes. So,
when preprocessing a scalar, if we notice that it is a string with
trailing null bytes, we keep it as is.
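
As a rough illustration of the problem and of the kind of guard described above, here is a minimal sketch using only numpy. The helper names `roundtrip` and `preprocess_scalar` are hypothetical and are not the functions changed in this PR; the actual cuDF preprocessing does more than this.

```python
import numpy as np


def roundtrip(value):
    """Pass a Python scalar through a numpy array and back.

    For strings this goes through a fixed-width unicode dtype, so any
    trailing null bytes are dropped on the way out.
    """
    return np.array(value).item()


def preprocess_scalar(value):
    """Hypothetical guard (not cuDF's real code): skip the numpy round
    trip for strings that end in a null byte, keeping them as is."""
    if isinstance(value, str) and value.endswith("\x00"):
        return value
    return roundtrip(value)


# The plain round trip cannot tell these two inputs apart.
lossy = roundtrip("abc\x00")
plain = roundtrip("abc")

# The guarded version preserves the trailing null byte.
kept = preprocess_scalar("abc\x00")
```

With the guard in place, a value such as `"abc\x00"` reaches the scalar constructor unchanged, while every other input still takes the usual numpy path.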

Closes #12990.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

galipremsagar · 2023-03-23T16:18:54Z

/merge

wence- added 2 commits March 22, 2023 12:53

Add tests of setitem with trailing null in strings

3891d33

wence- requested a review from a team as a code owner March 22, 2023 15:03

wence- requested review from shwina and brandon-b-miller March 22, 2023 15:03

github-actions bot added the Python (Affects Python cuDF API.) label Mar 22, 2023

wence- self-assigned this Mar 22, 2023

wence- added the bug (Something isn't working) and non-breaking (Non-breaking change) labels Mar 22, 2023

wence- added this to the Pandas API Alignment and Coverage milestone Mar 22, 2023

wence- mentioned this pull request Mar 22, 2023

Fix sort_values when column is all empty strings #12988

Merged

3 tasks

wence- changed the title ~~Fix __setitem__ on string columns when the scalar value is a null byte~~ Fix __setitem__ on string columns when the scalar value ends in a null byte Mar 22, 2023

More isinstance

cb48b52

galipremsagar approved these changes Mar 23, 2023

galipremsagar added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label Mar 23, 2023

rapids-bot bot merged commit 3a2609b into rapidsai:branch-23.04 Mar 23, 2023

wence- deleted the wence/fix/12990 branch March 23, 2023 16:24
