Correctly set bitmask size in from_column_view
#13315
An empty struct column (dtype of StructDtype({})) has no children, and hence a base_size of zero. However, it may still have a non-zero size and non-empty null mask. When slicing such a column, the mask size must be transferred over correctly by inspecting the size and offset of the owning column. Previously, we incorrectly determined the sliced column to have a mask buffer of zero bytes in this case. Closes rapidsai#13305.
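To make the failure mode concrete, here is a minimal plain-Python sketch (not cudf code). The helper `mask_bytes` is hypothetical, and it assumes the usual convention of one validity bit per row with the allocation padded to a multiple of 64 bytes. It only shows why deriving the mask size from a `base_size` of zero yields an empty buffer, while `size + offset` does not.

```python
def mask_bytes(num_rows: int, padding: int = 64) -> int:
    """Bytes for a validity bitmask of num_rows bits, rounded up to `padding`.

    Hypothetical helper; the 64-byte padding is an assumption of this sketch.
    """
    raw = (num_rows + 7) // 8  # one bit per row, rounded up to whole bytes
    return ((raw + padding - 1) // padding) * padding


# An empty-struct column such as [None, {}, {}, None]: four rows, no children.
size, offset = 4, 0
base_size_from_children = 0  # nothing to measure, so the old code saw zero

old_mask_bytes = mask_bytes(base_size_from_children)  # mask wrongly sized as empty
new_mask_bytes = mask_bytes(size + offset)            # mask covers all four rows
```

With these numbers the old computation gives an empty mask buffer even though two of the four rows are null.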
Another self-review, please check my working here...
I worry that in the future someone may use base_size when they should be using size + offset, and I really would like for those both to be equal always.
Would redefining the base_size of a struct column as follows fix this bug? (and would it break anything else?)
class StructColumn:
    ...
    @property
    def base_size(self):
        if not self.base_children:
            # this can happen if the column is empty
            # OR if the column is all nulls
            return self.null_count
        else:
            return len(self.base_children[0])
This branch is wrong, it should be
I think this branch is right. So perhaps this would do the trick:
diff --git a/python/cudf/cudf/core/column/struct.py b/python/cudf/cudf/core/column/struct.py
index 6838d71164..f8ae4ab846 100644
--- a/python/cudf/cudf/core/column/struct.py
+++ b/python/cudf/cudf/core/column/struct.py
@@ -29,9 +29,11 @@ class StructColumn(ColumnBase):
     @property
     def base_size(self):
         if not self.base_children:
-            return 0
+            return self.size + self.offset
         else:
-            return len(self.base_children[0])
+            size = len(self.base_children[0])
+            assert size == self.size + self.offset, "Unpossible"
+            return size
 
     def to_arrow(self):
         children = [
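The following toy model (plain Python, not the cudf class) may help when reasoning about the two branches. `ToyStructColumn` and its `slice` method are hypothetical names, and the sketch assumes that slicing shares the full-length children and only adjusts `size` and `offset`. Under that assumption the childless branch recovers the right row count, while the equality in the proposed assertion does not have to hold for a sliced column that has children.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyStructColumn:
    """Hypothetical stand-in for a struct column; not the cudf implementation."""

    size: int
    offset: int = 0
    base_children: List[list] = field(default_factory=list)

    @property
    def base_size(self) -> int:
        if not self.base_children:
            # No child to measure: fall back to the rows this view spans.
            return self.size + self.offset
        return len(self.base_children[0])

    def slice(self, start: int, stop: int) -> "ToyStructColumn":
        # Assumed behaviour: children are shared, only offset and size change.
        return ToyStructColumn(stop - start, self.offset + start, self.base_children)


empty = ToyStructColumn(size=4)  # models [None, {}, {}, None]
tail = empty.slice(2, 4)         # size 2, offset 2, still no children

with_child = ToyStructColumn(size=4, base_children=[[1, 2, 3, 4]])
head = with_child.slice(0, 2)    # size 2, offset 0, child still has 4 rows
```

Here `tail.base_size` is 4, which matches the mask of the owning column, whereas `head.base_size` is 4 but `head.size + head.offset` is only 2.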
def test_struct_empty_children_nulls_slice(indices):
    values = [None, {}, {}, None]

    s = cudf.Series([None, {}, {}, None])
Reuse the input from the previous line if it’s equivalent.
-    s = cudf.Series([None, {}, {}, None])
+    s = cudf.Series(values)
This looks fine to me! Do you need to cover test cases with non-contiguous indices for iloc or is it only affecting “proper slices” with step=1?
Turns out this is regularly possible (if one slices a column,
Hmm, that's right. Sorry, I got this very wrong. That being said, would you agree that the
When does this branch return the wrong answer?
Yes.
The column doesn't have to be all nulls:
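One way to see this, sketched in plain Python rather than cudf and using the values from the test in this PR: the null count of a childless struct column can be smaller than the number of rows it spans, so it is not a substitute for size + offset. The variable names below are illustrative only.

```python
# Values from the test: two nulls and two valid (empty) structs.
values = [None, {}, {}, None]

null_count = sum(v is None for v in values)  # rows that are null
size, offset = len(values), 0                # rows spanned by the unsliced column

# Returning null_count for base_size would under-report the row count here.
rows_missed = (size + offset) - null_count
```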
It is only "proper slices" where the result of the slice shares data with the sliced column, so yes, only "proper" indices need covered.
I added coverage of non-stride-1 slices just to be sure.
/merge