-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Fix data corruption in string columns #7746
Conversation
def test_string_slice_with_mask(): | ||
actual = cudf.Series(["hi", "hello", None]) | ||
expected = actual[0:3] | ||
|
||
assert actual._column.base_size == 3 | ||
assert_eq(actual._column.base_size, expected._column.base_size) | ||
assert_eq(actual._column.null_count, expected._column.null_count) | ||
assert_eq( | ||
actual._column.mask.to_host_array(), | ||
expected._column.mask.to_host_array(), | ||
) | ||
assert_eq(actual, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we convert this to something that tests user-facing behaviour rather than internal behaviour?
In other words, did this bug manifest in a way that affected end-users? If so, can we test that we fixed that instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I had this similar thought initially and thought we could check with isnull
public API, but since this goes to libcudf call and that returns the correct result without interacting with Column.mask
we cannot validate using Series.isnull
.
The closest user-facing behavior where this issue would surface is when we round-trip(when it goes through serialize
) a series with string dtype & having nulls like in test_distributed::test_str_series_roundtrip
.
Had to add both user-facing & internal test because we don't seem to validate base_size
anywhere except for this test where we only test against an empty column:
cudf/python/cudf/cudf/tests/test_string.py
Lines 1315 to 1318 in ccc28d5
def test_string_no_children_properties(): | |
empty_col = StringColumn(children=()) | |
assert empty_col.base_children == () | |
assert empty_col.base_size == 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like we should definitely add a test for that then (maybe in test_serialize.py
?).
We can keep this test in addition, if you prefer. Personally, I'm not a fan of testing internals, but that could be just me :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a serialization test where we would still have to validate an internal component(i.e., the frames) and removed checking for mask and retained base_size
checks in test_string.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand -- what internal attribute are we testing in the new serialize test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Serialize returns a dict & Frames as buffers. The internal attribute we are testing here is Frames([index_frame, offset_frame, chars_frame, mask_frame])
, to be specific we want to validate the mask_frame
at last index which would be the right validation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Thanks!
Codecov Report
@@ Coverage Diff @@
## branch-0.19 #7746 +/- ##
===============================================
+ Coverage 82.13% 82.54% +0.40%
===============================================
Files 101 101
Lines 17096 17461 +365
===============================================
+ Hits 14042 14413 +371
+ Misses 3054 3048 -6
Continue to review full report at Codecov.
|
@gpucibot merge |
Fixes: #7735
Minimal repro of the above issue is:
Incorrect mask calculation in
Column.from_column_view
because of incorrectbase_size
calculation inStringColumn
:So in this PR I have fixed the calculation of
StringColumn.base_size
and introduced tests to have a check for the same.