Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve empty list array elements in take kernel #3473

Merged
merged 5 commits into from
Jan 7, 2023

Conversation

jonmmease
Copy link
Contributor

Which issue does this PR close?

Closes #3471.

What changes are included in this PR?

74bad75 update the test_take_slice_string test to include an extra empty list element. This fails on its own as the empty list is converted to a null in take

290d6cc and b01f9e5 fix the null buffer for the input data in test_take_list_with_nulls and test_take_list_with_value_nulls respectively. The previous implementation of take for lists actually ignores the null mask of the input list array, so my guess is that it why it wasn't noticed that these masks don't match the intended arrays expressed in the surrounding comments. Without these fixes the change below causes these tests to fail, but I think that's expected given the null masks that were previously provided.

2ff14fa moves the calculation of the null buffer into take_value_indices_from_list where we have access to the null info of both the indices and list array.

Are there any user-facing changes?

This is a bug fix, but could be considered breaking as it's been around for a while and downstream code has likely adapted to the behavior.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
comment says: [[0,null,0], [-1,-2,3], null, [5,null]]
which implies null buffer of: 0b11111011
comment says: [[0,null,0], [-1,-2,3], [null], [5,null]]
which has not null values at the list level, or a null
buffer of 0b11111011
this way we can distinguish empty list elements from null elements.
@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 6, 2023
@@ -1519,12 +1524,12 @@ mod tests {

macro_rules! test_take_list {
($offset_type:ty, $list_data_type:ident, $list_array_type:ident) => {{
// Construct a value array, [[0,0,0], [-1,-2,-1], [2,3]]
// Construct a value array, [[0,0,0], [-1,-2,-1], [], [2,3]]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was updated to add an extra empty array element and to make sure it's preserved in the output of take

@@ -1609,7 +1612,7 @@ mod tests {
let list_data = ArrayData::builder(list_data_type.clone())
.len(4)
.add_buffer(value_offsets)
.null_bit_buffer(Some(Buffer::from([0b10111101, 0b00000000])))
.null_bit_buffer(Some(Buffer::from([0b11111111])))
Copy link
Contributor Author

@jonmmease jonmmease Jan 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intended array here is [[0,null,0], [-1,-2,3], [null], [5,null]], which has no null elements in the outer list array.

@@ -1682,7 +1685,7 @@ mod tests {
let list_data = ArrayData::builder(list_data_type.clone())
.len(4)
.add_buffer(value_offsets)
.null_bit_buffer(Some(Buffer::from([0b01111101])))
.null_bit_buffer(Some(Buffer::from([0b11111011])))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intended array here is [[0,null,0], [-1,-2,3], null, [5,null]], where only the third element is null.

@@ -831,10 +815,18 @@ where
/// Where a list array has indices `[0,2,5,10]`, taking indices of `[2,0]` returns
/// an array of the indices `[5..10, 0..2]` and offsets `[0,5,7]` (5 elements and 2
/// elements)
#[allow(clippy::type_complexity)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if you'd like the return type here refactored with type definitions instead of ignoring the clippy lint

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be helpful to at least have some documentation of what the various fields are.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this, this looks good to me. I left some minor comments on how it could be improved further, but I'm happy for this to go in as is

@@ -831,10 +815,18 @@ where
/// Where a list array has indices `[0,2,5,10]`, taking indices of `[2,0]` returns
/// an array of the indices `[5..10, 0..2]` and offsets `[0,5,7]` (5 elements and 2
/// elements)
#[allow(clippy::type_complexity)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be helpful to at least have some documentation of what the various fields are.

Comment on lines +845 to +850

// Initialize null buffer
let num_bytes = bit_util::ceil(indices.len(), 8);
let mut null_buf = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
let null_slice = null_buf.as_slice_mut();

Copy link
Contributor

@tustvold tustvold Jan 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than moving this into this function, I think it might be faster and simpler to use take_bits on the list null buffer in the outer function. This would also resolve the type complexity issue above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this, but we also need to handle the indices being null. So I think we'd need to perform an AND operation between the result of take_bits on the list's null buffer and the indices null buffer. And that seemed a bit more complex.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that is what take_bits did, but my memory is a little hazy

);
}
let value_offsets = Buffer::from_slice_ref(&offsets);
let value_offsets = Buffer::from_slice_ref(offsets);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not something from this PR, but it occurs to me that it would be better for take_value_indices_from_list to simply build and return this buffer, e.g. using BufferBuilder, instead of this additional copy.

@jonmmease
Copy link
Contributor Author

Thanks for looking it over and for the comments @tustvold, I'll give it another pass in a few hours

@jonmmease
Copy link
Contributor Author

I added a comment on the return types, but I wasn't confident in making the other changes. Feel free to commit onto this branch if you want to do any other cleanup. Thanks again!

@tustvold tustvold merged commit c28d69a into apache:master Jan 7, 2023
@ursabot
Copy link

ursabot commented Jan 7, 2023

Benchmark runs are scheduled for baseline = ca7ea59 and contender = c28d69a. c28d69a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@tustvold
Copy link
Contributor

tustvold commented Jan 7, 2023

Thanks again, if I have some spare cycles I'll take a stab at using take_bits, but I think regardless this PR represents a major improvement 👍

jonmmease added a commit to jonmmease/arrow-rs that referenced this pull request Jan 8, 2023
* Update test_take_list to fail with empty list

* Fix null_bit_buffer to match intended array

comment says: [[0,null,0], [-1,-2,3], null, [5,null]]
which implies null buffer of: 0b11111011

* Fix null_bit_buffer to match intended array

comment says: [[0,null,0], [-1,-2,3], [null], [5,null]]
which has not null values at the list level, or a null
buffer of 0b11111011

* Compute null buffer in take_value_indices_from_list

this way we can distinguish empty list elements from null elements.

* clippy

(cherry picked from commit c28d69a)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

take kernel on List array introduces nulls instead of empty lists
3 participants