Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: binary_mut should work if only one input array has null buffer #6396

Merged
merged 4 commits into from
Sep 18, 2024
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
64 changes: 60 additions & 4 deletions arrow-arith/src/arity.rs
Original file line number Diff line number Diff line change
Expand Up @@ -313,8 +313,6 @@ where
))));
}

let nulls = NullBuffer::union(a.logical_nulls().as_ref(), b.logical_nulls().as_ref());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If only a has null buffer, NullBuffer.union will clone it which shares the underlying buffer that fails into_mutable call on the null buffer later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have a by value, perhaps we could add a PrimitiveArray::into_parts that would deconstruct the PrimitiveArray into its underlying NullBuffer and values 🤔

Following the model of https://docs.rs/arrow/latest/arrow/array/struct.GenericByteArray.html#method.into_parts

That way we could directly use the NullBuffer and avoid the copy in create_union_null_buffer

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, let me see if it works.

Copy link
Member Author

@viirya viirya Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, for binary_mut, it is okay to defer the computation of union of two null buffers. It is because that we don't need to own the null buffers before invoking op on the two arrays.

Since the null buffer is not changed on the builder, we can compute the union after finishing the builder. So for binary_mut, we can avoid copying null buffer.

But for try_binary_mut, because we need to get the union of two null buffers before invoking op on the two arrays, we still need to copy it as this PR does.

PrimitiveArray::into_parts (it already exists) also cannot help on it because the builder needs to own the null buffer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I only avoid copying null buffer in binary_mut.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still feels to me that we could remove the copy on try_binary_mut, but this seems like an improvement to me 🚀


let mut builder = a.into_builder()?;

builder
Expand All @@ -323,7 +321,12 @@ where
.zip(b.values())
.for_each(|(l, r)| *l = op(*l, *r));

let array_builder = builder.finish().into_data().into_builder().nulls(nulls);
let array = builder.finish();

// The builder has the null buffer from `a`, it is not changed.
let nulls = NullBuffer::union(array.logical_nulls().as_ref(), b.logical_nulls().as_ref());

let array_builder = array.into_data().into_builder().nulls(nulls);

let array_data = unsafe { array_builder.build_unchecked() };
Ok(Ok(PrimitiveArray::<T>::from(array_data)))
Expand Down Expand Up @@ -413,7 +416,8 @@ where
try_binary_no_nulls_mut(len, a, b, op)
} else {
let nulls =
NullBuffer::union(a.logical_nulls().as_ref(), b.logical_nulls().as_ref()).unwrap();
create_union_null_buffer(a.logical_nulls().as_ref(), b.logical_nulls().as_ref())
.unwrap();

let mut builder = a.into_builder()?;

Expand All @@ -435,6 +439,22 @@ where
}
}

/// Computes the union of the nulls in two optional [`NullBuffer`] which
/// is not shared with the input buffers.
///
/// The union of the nulls is the same as `NullBuffer::union(lhs, rhs)` but
/// it does not increase the reference count of the null buffer.
fn create_union_null_buffer(
lhs: Option<&NullBuffer>,
rhs: Option<&NullBuffer>,
) -> Option<NullBuffer> {
match (lhs, rhs) {
(Some(lhs), Some(rhs)) => Some(NullBuffer::new(lhs.inner() & rhs.inner())),
(Some(n), None) | (None, Some(n)) => Some(NullBuffer::new(n.inner() & n.inner())),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand this correctly, it forces a copy of the null buffer's contents?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Previously it did a clone that shares the underlying buffer.

(None, None) => None,
}
}

/// This intentional inline(never) attribute helps LLVM optimize the loop.
#[inline(never)]
fn try_binary_no_nulls<A: ArrayAccessor, B: ArrayAccessor, F, O>(
Expand Down Expand Up @@ -557,6 +577,24 @@ mod tests {
assert_eq!(c, expected);
}

#[test]
fn test_binary_mut_null_buffer() {
let a = Int32Array::from(vec![Some(3), Some(4), Some(5), Some(6), None]);

let b = Int32Array::from(vec![Some(10), Some(11), Some(12), Some(13), Some(14)]);

let r1 = binary_mut(a, &b, |a, b| a + b).unwrap();

let a = Int32Array::from(vec![Some(3), Some(4), Some(5), Some(6), None]);
let b = Int32Array::new(
vec![10, 11, 12, 13, 14].into(),
Some(vec![true, true, true, true, true].into()),
);

let r2 = binary_mut(a, &b, |a, b| a + b).unwrap();
viirya marked this conversation as resolved.
Show resolved Hide resolved
assert_eq!(r1.unwrap(), r2.unwrap());
}

#[test]
fn test_try_binary_mut() {
let a = Int32Array::from(vec![15, 14, 9, 8, 1]);
Expand Down Expand Up @@ -587,6 +625,24 @@ mod tests {
.expect_err("should got error");
}

#[test]
fn test_try_binary_mut_null_buffer() {
let a = Int32Array::from(vec![Some(3), Some(4), Some(5), Some(6), None]);

let b = Int32Array::from(vec![Some(10), Some(11), Some(12), Some(13), Some(14)]);

let r1 = try_binary_mut(a, &b, |a, b| Ok(a + b)).unwrap();

let a = Int32Array::from(vec![Some(3), Some(4), Some(5), Some(6), None]);
let b = Int32Array::new(
vec![10, 11, 12, 13, 14].into(),
Some(vec![true, true, true, true, true].into()),
);

let r2 = try_binary_mut(a, &b, |a, b| Ok(a + b)).unwrap();
viirya marked this conversation as resolved.
Show resolved Hide resolved
assert_eq!(r1.unwrap(), r2.unwrap());
}

#[test]
fn test_unary_dict_mut() {
let values = Int32Array::from(vec![Some(10), Some(20), None]);
Expand Down
Loading