-
Notifications
You must be signed in to change notification settings - Fork 866
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preserve empty list array elements in take kernel #3473
Conversation
comment says: [[0,null,0], [-1,-2,3], null, [5,null]] which implies null buffer of: 0b11111011
comment says: [[0,null,0], [-1,-2,3], [null], [5,null]] which has not null values at the list level, or a null buffer of 0b11111011
this way we can distinguish empty list elements from null elements.
@@ -1519,12 +1524,12 @@ mod tests { | |||
|
|||
macro_rules! test_take_list { | |||
($offset_type:ty, $list_data_type:ident, $list_array_type:ident) => {{ | |||
// Construct a value array, [[0,0,0], [-1,-2,-1], [2,3]] | |||
// Construct a value array, [[0,0,0], [-1,-2,-1], [], [2,3]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test was updated to add an extra empty array element and to make sure it's preserved in the output of take
@@ -1609,7 +1612,7 @@ mod tests { | |||
let list_data = ArrayData::builder(list_data_type.clone()) | |||
.len(4) | |||
.add_buffer(value_offsets) | |||
.null_bit_buffer(Some(Buffer::from([0b10111101, 0b00000000]))) | |||
.null_bit_buffer(Some(Buffer::from([0b11111111]))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intended array here is [[0,null,0], [-1,-2,3], [null], [5,null]]
, which has no null elements in the outer list array.
@@ -1682,7 +1685,7 @@ mod tests { | |||
let list_data = ArrayData::builder(list_data_type.clone()) | |||
.len(4) | |||
.add_buffer(value_offsets) | |||
.null_bit_buffer(Some(Buffer::from([0b01111101]))) | |||
.null_bit_buffer(Some(Buffer::from([0b11111011]))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intended array here is [[0,null,0], [-1,-2,3], null, [5,null]]
, where only the third element is null
.
@@ -831,10 +815,18 @@ where | |||
/// Where a list array has indices `[0,2,5,10]`, taking indices of `[2,0]` returns | |||
/// an array of the indices `[5..10, 0..2]` and offsets `[0,5,7]` (5 elements and 2 | |||
/// elements) | |||
#[allow(clippy::type_complexity)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if you'd like the return type here refactored with type
definitions instead of ignoring the clippy lint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be helpful to at least have some documentation of what the various fields are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this, this looks good to me. I left some minor comments on how it could be improved further, but I'm happy for this to go in as is
@@ -831,10 +815,18 @@ where | |||
/// Where a list array has indices `[0,2,5,10]`, taking indices of `[2,0]` returns | |||
/// an array of the indices `[5..10, 0..2]` and offsets `[0,5,7]` (5 elements and 2 | |||
/// elements) | |||
#[allow(clippy::type_complexity)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be helpful to at least have some documentation of what the various fields are.
|
||
// Initialize null buffer | ||
let num_bytes = bit_util::ceil(indices.len(), 8); | ||
let mut null_buf = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true); | ||
let null_slice = null_buf.as_slice_mut(); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than moving this into this function, I think it might be faster and simpler to use take_bits
on the list null buffer in the outer function. This would also resolve the type complexity issue above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this, but we also need to handle the indices being null. So I think we'd need to perform an AND operation between the result of take_bits
on the list's null buffer and the indices null buffer. And that seemed a bit more complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that is what take_bits did, but my memory is a little hazy
); | ||
} | ||
let value_offsets = Buffer::from_slice_ref(&offsets); | ||
let value_offsets = Buffer::from_slice_ref(offsets); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not something from this PR, but it occurs to me that it would be better for take_value_indices_from_list to simply build and return this buffer, e.g. using BufferBuilder, instead of this additional copy.
Thanks for looking it over and for the comments @tustvold, I'll give it another pass in a few hours |
I added a comment on the return types, but I wasn't confident in making the other changes. Feel free to commit onto this branch if you want to do any other cleanup. Thanks again! |
Benchmark runs are scheduled for baseline = ca7ea59 and contender = c28d69a. c28d69a is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Thanks again, if I have some spare cycles I'll take a stab at using take_bits, but I think regardless this PR represents a major improvement 👍 |
* Update test_take_list to fail with empty list * Fix null_bit_buffer to match intended array comment says: [[0,null,0], [-1,-2,3], null, [5,null]] which implies null buffer of: 0b11111011 * Fix null_bit_buffer to match intended array comment says: [[0,null,0], [-1,-2,3], [null], [5,null]] which has not null values at the list level, or a null buffer of 0b11111011 * Compute null buffer in take_value_indices_from_list this way we can distinguish empty list elements from null elements. * clippy (cherry picked from commit c28d69a)
Which issue does this PR close?
Closes #3471.
What changes are included in this PR?
74bad75 update the
test_take_slice_string
test to include an extra empty list element. This fails on its own as the empty list is converted to anull
intake
290d6cc and b01f9e5 fix the null buffer for the input data in
test_take_list_with_nulls
andtest_take_list_with_value_nulls
respectively. The previous implementation oftake
for lists actually ignores the null mask of the input list array, so my guess is that it why it wasn't noticed that these masks don't match the intended arrays expressed in the surrounding comments. Without these fixes the change below causes these tests to fail, but I think that's expected given the null masks that were previously provided.2ff14fa moves the calculation of the null buffer into
take_value_indices_from_list
where we have access to the null info of both the indices and list array.Are there any user-facing changes?
This is a bug fix, but could be considered breaking as it's been around for a while and downstream code has likely adapted to the behavior.