feat(ord): Support equality of StructArray #5217

my-vegetable-has-exploded
Contributor

Which issue does this PR close?

Closes #5199

Rationale for this change

What changes are included in this PR?

  • refactor closure values() to function compare_op_struct_values()
  • compare struct arrays by recursively checking each field

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 17, 2023
@my-vegetable-has-exploded
Contributor Author

There are a few points I am not sure about:

  • For StructArray, a struct array has its own validity bitmap that is independent of its child arrays’ validity bitmaps. So I don't handle the null buffer for each field.
  • I couldn't find a way to construct a struct scalar, so I haven't tested the scalar case yet.

One more question: how can I make this test code shorter?

@my-vegetable-has-exploded my-vegetable-has-exploded changed the title feat: Support equality of StructArray feat(ord): Support equality of StructArray Dec 17, 2023
Contributor

@alamb alamb left a comment

Thank you @my-vegetable-has-exploded -- this is looking pretty close to me

Can you please add tests for

  1. distinct / not_distinct
  2. A negative test that some operation like lt or lt_eq returns an error (not a panic) for struct arrays?
  3. A negative test that a struct array like `{a: int, b: int}` doesn't return `true` when compared to a struct array with a prefix like `{a: int}`
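
As a rough illustration of what the two negative tests would exercise, here is a minimal sketch in plain Rust. The `DataType` and `Op` enums and the `check_comparable` function are simplified, hypothetical stand-ins written for this example; they are not the arrow-rs types or the code in this PR.

```rust
/// Simplified stand-ins for the arrow-rs types; names are illustrative only.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Struct(Vec<(String, DataType)>),
}

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Equal,
    NotEqual,
    Lt,
    LtEq,
    Distinct,
    NotDistinct,
}

/// Build the error value used for every rejected comparison.
fn invalid(op: Op, l: &DataType, r: &DataType) -> Result<(), String> {
    Err(format!("Invalid comparison operation: {l:?} {op:?} {r:?}"))
}

/// Hypothetical validation step: unsupported comparisons are reported as an
/// error value, so callers never hit a panic.
fn check_comparable(op: Op, l: &DataType, r: &DataType) -> Result<(), String> {
    if l != r {
        // Covers the prefix case: {a, b} and {a} are different types.
        return invalid(op, l, r);
    }
    match (l, op) {
        // Ordering comparisons are rejected for structs in this sketch.
        (DataType::Struct(_), Op::Lt | Op::LtEq) => invalid(op, l, r),
        _ => Ok(()),
    }
}
```

With this model, `lt` on two identical struct types and `eq` between `{a, b}` and `{a}` both come back as `Err`, while `eq` on identical struct types is accepted.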

Also @tustvold do you have any suggestions for what benchmarks to run this on?

@tustvold
Contributor

So I don't handle the null buffer for each field.

I think the output should be the union of all the null buffers.

I'll try to review this in the next couple of days

Co-authored-by: Andrew Lamb <[email protected]>
@my-vegetable-has-exploded
Contributor Author

my-vegetable-has-exploded commented Dec 19, 2023

I think the output should be the union of all the null buffers.

I think the null buffer of a child array applies only to that child array.

Take the example layout in the documentation (https://arrow.apache.org/docs/format/Columnar.html#struct-layout): if we use the union of all the null buffers, the second slot also becomes null, which differs from my understanding.

[{'joe', 1}, {null, 2}, null, {'mark', 4}]

* Length: 4, Null count: 1
* Validity bitmap buffer:

  | Byte 0 (validity bitmap) | Bytes 1-63            |
  |--------------------------|-----------------------|
  | 00001011                 | 0 (padding)           |

* Children arrays:
  * field-0 array (`VarBinary`):
    * Length: 4, Null count: 2
    * Validity bitmap buffer:

      | Byte 0 (validity bitmap) | Bytes 1-63            |
      |--------------------------|-----------------------|
      | 00001001                 | 0 (padding)           |

    * Offsets buffer:

      | Bytes 0-19     | Bytes 20-63           |
      |----------------|-----------------------|
      | 0, 3, 3, 3, 7  | unspecified (padding) |

    * Value buffer:

      | Bytes 0-6      | Bytes 7-63            |
      |----------------|-----------------------|
      | joemark        | unspecified (padding) |

  * field-1 array (int32 array):
    * Length: 4, Null count: 1
    * Validity bitmap buffer:

      | Byte 0 (validity bitmap) | Bytes 1-63            |
      |--------------------------|-----------------------|
      | 00001011                 | 0 (padding)           |

    * Value Buffer:

      | Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-63           |
      |-------------|-------------|-------------|-------------|-----------------------|
      | 1           | 2           | unspecified | 4           | unspecified (padding) |
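
To make the example concrete, the sketch below decodes the three validity bitmaps from the layout above (least-significant bit first, as in the Arrow format) and compares the struct's own validity with what combining every bitmap would give. The helper names are hypothetical, and this is a plain-Rust model rather than the arrow-rs API.

```rust
/// Decode the low `len` bits of a validity byte, least-significant bit first.
/// `true` means the slot is valid (non-null).
fn decode_validity(byte: u8, len: usize) -> Vec<bool> {
    (0..len).map(|i| (byte >> i) & 1 == 1).collect()
}

/// Slot-wise AND of several validity vectors: a slot stays valid only if it
/// is valid in every input, which is the union of the nulls.
/// Assumes all masks have the same length.
fn combine_validity(masks: &[Vec<bool>]) -> Vec<bool> {
    let len = masks.first().map_or(0, |m| m.len());
    (0..len).map(|i| masks.iter().all(|m| m[i])).collect()
}
```

Decoding `00001011` for the struct gives valid, valid, null, valid, while combining it with the `field-0` bitmap `00001001` additionally marks the second slot as null.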

thanks, @tustvold @alamb

@tustvold
Contributor

Correct, but the semantics of these kernels are that any comparison against a null results in a null output for that position
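
A minimal sketch of that rule, using `Option<i32>` as a stand-in for a nullable slot. The function names are made up for this example and are not part of arrow-rs.

```rust
/// SQL-style equality: if either side is null (`None`), the result is null.
fn sql_eq(l: Option<i32>, r: Option<i32>) -> Option<bool> {
    match (l, r) {
        (Some(a), Some(b)) => Some(a == b),
        _ => None,
    }
}

/// Element-wise SQL-style equality over two columns.
/// Assumes both columns have the same length.
fn sql_eq_column(l: &[Option<i32>], r: &[Option<i32>]) -> Vec<Option<bool>> {
    l.iter().zip(r).map(|(a, b)| sql_eq(*a, *b)).collect()
}
```

Note that even `null = null` yields null under this rule, which is where it departs from the distinct / not_distinct kernels.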

@my-vegetable-has-exploded
Contributor Author

Can you please add tests for

1. `distinct` / `not_distinct`

2. A negative test that some operation like `lt` or `lt_eq` returns an error (not a panic) for struct arrays?

3. A negative test that a struct array like `{a: int, b: int}` doesn't return `true` when compared to a struct array with a prefix like `{a: int}`

Sure.

Correct, but the semantics of these kernels are that any comparison against a null results in a null output for that position

I was wondering if it would be better to use Op::NotDistinct to check each field? More precisely, we need to go through the process in

(Some(l), true, Some(r), true) | (Some(l), false, Some(r), false) => {
    // Either both sides are scalar or neither side is scalar
    match op {
        Op::Distinct => {
            let values = values();
            let l = l.inner().bit_chunks().iter_padded();
            let r = r.inner().bit_chunks().iter_padded();
            let ne = values.bit_chunks().iter_padded();
            let c = |((l, r), n)| ((l ^ r) | (l & r & n));
            let buffer = l.zip(r).zip(ne).map(c).collect();
            BooleanBuffer::new(buffer, 0, len).into()
        }
        Op::NotDistinct => {
            let values = values();
            let l = l.inner().bit_chunks().iter_padded();
            let r = r.inner().bit_chunks().iter_padded();
            let e = values.bit_chunks().iter_padded();
            let c = |((l, r), e)| u64::not(l | r) | (l & r & e);
            let buffer = l.zip(r).zip(e).map(c).collect();
            BooleanBuffer::new(buffer, 0, len).into()
        }
        _ => BooleanArray::new(values(), NullBuffer::union(Some(&l), Some(&r))),
    }
}
(Some(_), true, Some(a), false) | (Some(a), false, Some(_), true) => {
    // Scalar is null, other side is non-scalar and nullable
    match op {
        Op::Distinct => a.into_inner().into(),
        Op::NotDistinct => a.into_inner().not().into(),
        _ => BooleanArray::new_null(len),
    }
}
(Some(nulls), is_scalar, None, _) | (None, _, Some(nulls), is_scalar) => {
    // Only one side is nullable
    match is_scalar {
        true => match op {
            // Scalar is null, other side is not nullable
            Op::Distinct => BooleanBuffer::new_set(len).into(),
            Op::NotDistinct => BooleanBuffer::new_unset(len).into(),
            _ => BooleanArray::new_null(len),
        },
        false => match op {
            Op::Distinct => {
                let values = values();
                let l = nulls.inner().bit_chunks().iter_padded();
                let ne = values.bit_chunks().iter_padded();
                let c = |(l, n)| u64::not(l) | n;
                let buffer = l.zip(ne).map(c).collect();
                BooleanBuffer::new(buffer, 0, len).into()
            }
            Op::NotDistinct => (nulls.inner() & &values()).into(),
            _ => BooleanArray::new(values(), Some(nulls)),
        },
    }
}
// Neither side is nullable
(None, _, None, _) => BooleanArray::new(values(), None),
after getting the BooleanBuffer.
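
The two bitwise closures in the first arm can be read in isolation. The sketch below lifts them into standalone functions over a single 64-slot word; the function names are made up for this example, and bits past the array length are assumed to be ignored by the caller, as the `len` argument suggests in the snippet above.

```rust
/// One word of `Distinct` when both sides are nullable.
/// `l` and `r` are validity words (1 = valid); `ne` has a 1 where the
/// underlying values differ. Bits of `ne` under a null slot are ignored.
fn distinct_word(l: u64, r: u64, ne: u64) -> u64 {
    // Exactly one side is null, or both are valid and the values differ.
    (l ^ r) | (l & r & ne)
}

/// One word of `NotDistinct` when both sides are nullable.
/// `eq` has a 1 where the underlying values are equal.
fn not_distinct_word(l: u64, r: u64, eq: u64) -> u64 {
    // Both sides are null, or both are valid and the values are equal.
    !(l | r) | (l & r & eq)
}
```

For four slots (equal values, different values, left null only, both null), the low four bits come out as `0110` for `Distinct` and `1001` for `NotDistinct`, reading bit 0 as the first slot.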

@tustvold
Contributor

tustvold commented Dec 19, 2023

I was wondering if it would be better to use Op::NotDistinct to check each field?

That would be a different kernel then. We definitely could/should support distinct/not_distinct for StructArray as well; the difference from standard equality is how nulls are handled. Distinct follows the intuitive notion of equality, whereas the equality kernels follow the SQL formulation of equality and the somewhat perverse null semantics it has 😅

https://learn.microsoft.com/en-us/sql/t-sql/queries/is-distinct-from-transact-sql?view=sql-server-ver16#remarks

@my-vegetable-has-exploded
Contributor Author

I was wondering if it would be better to use Op::NotDistinct to check each field?

That would be a different kernel then. We definitely could/should support distinct/not_distinct for StructArray as well; the difference from standard equality is how nulls are handled. Distinct follows the intuitive notion of equality, whereas the equality kernels follow the SQL formulation of equality and the somewhat perverse null semantics it has 😅

https://learn.microsoft.com/en-us/sql/t-sql/queries/is-distinct-from-transact-sql?view=sql-server-ver16#remarks

I think I have caught your drift this time. Because a comparison between null and any value is unknown, {null, 2} is also not comparable. Thanks, I will change my code based on this suggestion.

Contributor

@tustvold tustvold left a comment

Had a review, I like where this is headed but I don't think the null mask handling is quite right yet.

FWIW I'm not sure that separating out the null mask and values comparison makes sense; instead I would expect the logic to just recurse across the fields and union the null masks of the results (if any), with a little bit of extra logic to handle any null mask in the struct array proper.
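
One possible reading of that suggestion, as a plain-Rust sketch. The `Column` enum and `compare_eq` function are hypothetical models invented for this example, not arrow-rs types, and the treatment of a false field next to a null field (the slot becomes null) is an assumption that follows the "union the null masks" wording.

```rust
/// A tiny stand-in for an Arrow array: either nullable integers or a struct
/// with its own validity and child columns. Not the arrow-rs types.
enum Column {
    Int(Vec<Option<i32>>),
    Struct { validity: Vec<bool>, children: Vec<Column> },
}

/// Equality with null propagation; `None` marks a null output slot.
/// Assumes both sides have the same length and number of children.
fn compare_eq(l: &Column, r: &Column) -> Result<Vec<Option<bool>>, String> {
    match (l, r) {
        (Column::Int(a), Column::Int(b)) => Ok(a
            .iter()
            .zip(b)
            .map(|(x, y)| match (x, y) {
                (Some(x), Some(y)) => Some(x == y),
                _ => None,
            })
            .collect()),
        (
            Column::Struct { validity: lv, children: lc },
            Column::Struct { validity: rv, children: rc },
        ) => {
            // Start from the null mask of the struct arrays themselves.
            let mut out: Vec<Option<bool>> = lv
                .iter()
                .zip(rv)
                .map(|(a, b)| if *a && *b { Some(true) } else { None })
                .collect();
            // Recurse into each pair of children and fold the results in.
            for (cl, cr) in lc.iter().zip(rc) {
                let field = compare_eq(cl, cr)?;
                for (slot, f) in out.iter_mut().zip(field) {
                    *slot = match (*slot, f) {
                        (Some(a), Some(b)) => Some(a && b),
                        _ => None, // a null in any field nulls the slot
                    };
                }
            }
            Ok(out)
        }
        // Mismatched shapes are reported as an error, not a panic.
        _ => Err("mismatched column types".to_string()),
    }
}
```

Because a struct child can itself be a `Column::Struct`, the same function handles nested structs without any extra cases.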

Comment on lines 270 to 288
if l_t.is_nested() {
    if !l_t.equals_datatype(r_t) {
        return Err(ArrowError::InvalidArgumentError(format!(
            "Invalid comparison operation: {l_t} {op} {r_t}"
        )));
    }
    match (l_t, op) {
        (Struct(_), Op::Equal | Op::NotEqual | Op::Distinct | Op::NotDistinct) => {}
        _ => {
            return Err(ArrowError::InvalidArgumentError(format!(
                "Invalid comparison operation: {l_t} {op} {r_t}"
            )));
        }
    }
} else if r_t != l_t {
    return Err(ArrowError::InvalidArgumentError(format!(
        "Invalid comparison operation: {l_t} {op} {r_t}"
    )));
}

Suggested change
- if l_t.is_nested() {
-     if !l_t.equals_datatype(r_t) {
-         return Err(ArrowError::InvalidArgumentError(format!(
-             "Invalid comparison operation: {l_t} {op} {r_t}"
-         )));
-     }
-     match (l_t, op) {
-         (Struct(_), Op::Equal | Op::NotEqual | Op::Distinct | Op::NotDistinct) => {}
-         _ => {
-             return Err(ArrowError::InvalidArgumentError(format!(
-                 "Invalid comparison operation: {l_t} {op} {r_t}"
-             )));
-         }
-     }
- } else if r_t != l_t {
-     return Err(ArrowError::InvalidArgumentError(format!(
-         "Invalid comparison operation: {l_t} {op} {r_t}"
-     )));
- }
+ if !l_t.equals_datatype(r_t) {
+     return Err(ArrowError::InvalidArgumentError(format!(
+         "Invalid comparison operation: {l_t} {op} {r_t}"
+     )));
+ }

Comment on lines 204 to 214
let l_t = l.data_type();
let r_t = r.data_type();
let l_nulls = l.logical_nulls().filter(|n| n.null_count() > 0);
let r_nulls = r.logical_nulls().filter(|n| n.null_count() > 0);
// for [not]Distinct, the result is never null
match op {
    Op::Distinct | Op::NotDistinct => {
        return Ok(None);
    }
    _ => {}
}

Suggested change
- let l_t = l.data_type();
- let r_t = r.data_type();
- let l_nulls = l.logical_nulls().filter(|n| n.null_count() > 0);
- let r_nulls = r.logical_nulls().filter(|n| n.null_count() > 0);
- // for [not]Distinct, the result is never null
- match op {
-     Op::Distinct | Op::NotDistinct => {
-         return Ok(None);
-     }
-     _ => {}
- }
+ if matches!(op, Op::Distinct | Op::NotDistinct) {
+     // for [not]Distinct, the result is never null
+     return Ok(None)
+ }
+ let l_t = l.data_type();
+ let r_t = r.data_type();
+ let l_nulls = l.logical_nulls().filter(|n| n.null_count() > 0);
+ let r_nulls = r.logical_nulls().filter(|n| n.null_count() > 0);

Comment on lines 383 to 384
// when one of field is equal, the result is false for not equal
// so we use neg to reverse the result of equal when handle not equal

Why not just pass the operator into compare_op_values?

.columns()
.iter()
.zip(r.columns().iter())
.map(|(col_l, col_r)| compare_op_values(Op::Equal, col_l, l_s, col_r, r_s, len))

I don't think this will correctly handle the null masks for a Distinct?

Some(vec![true, false, true, true].into()),
));
let right_a = Arc::new(Int32Array::new(
vec![0, 1, 2, 3].into(),
Contributor

@tustvold tustvold Dec 27, 2023

Suggested change
- vec![0, 1, 2, 3].into(),
+ vec![0, 72, 2, 3].into(),

This helps verify the null mask comparison is correct, and not relying on the values comparison
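
A small model of why the changed value matters, assuming the slot holding 72 is null on both sides and the operation under test is not_distinct. `Int32Column` and both functions are hypothetical stand-ins written for this example; the second function deliberately contains the bug the test is meant to catch.

```rust
/// Values and validity stored separately, loosely like a primitive array.
struct Int32Column {
    values: Vec<i32>,
    validity: Vec<bool>,
}

/// Null-aware `not_distinct`: two nulls match, and two valid slots match
/// when their values are equal. Assumes equal lengths.
fn not_distinct(l: &Int32Column, r: &Int32Column) -> Vec<bool> {
    (0..l.values.len())
        .map(|i| match (l.validity[i], r.validity[i]) {
            (true, true) => l.values[i] == r.values[i],
            (false, false) => true,
            _ => false,
        })
        .collect()
}

/// Deliberately buggy variant that ignores validity and compares raw values.
fn not_distinct_values_only(l: &Int32Column, r: &Int32Column) -> Vec<bool> {
    l.values.iter().zip(&r.values).map(|(a, b)| a == b).collect()
}
```

When the value hidden under the null slot happens to be the same on both sides, the buggy variant returns the same answer as the correct one; using 72 on one side makes the two disagree, so the test can tell them apart.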

],
Buffer::from([0b0111]),
));
let right_struct = StructArray::from((

Suggested change
- let right_struct = StructArray::from((
+ // right [{a: 0, b: 0}, {a: NULL, b: 1}, {a: 2, b: 2}, {a: 3, b: 3} ]
+ let right_struct = StructArray::from((

));
let field_a = Arc::new(Field::new("a", DataType::Int32, true));
let field_b = Arc::new(Field::new("b", DataType::Int32, true));
let left_struct = StructArray::from((

Suggested change
- let left_struct = StructArray::from((
+ // [{a: 0, b: 0}, {a: NULL, b: 1}, {a: 2, b: 20}, {a: 3, b: 3}]
+ let left_struct = StructArray::from((

@my-vegetable-has-exploded
Contributor Author

Had a review, I like where this is headed but I don't think the null mask handling is quite right yet.

FWIW I'm not sure that separating out the null mask and values comparison makes sense; instead I would expect the logic to just recurse across the fields and union the null masks of the results (if any), with a little bit of extra logic to handle any null mask in the struct array proper.

I wanted to do that at first. The main reason I didn't is that I found it hard to reuse this code:

Ok(match (l_nulls, l_s, r_nulls, r_s) {
    (Some(l), true, Some(r), true) | (Some(l), false, Some(r), false) => {
        // Either both sides are scalar or neither side is scalar
        match op {
            Op::Distinct => {
                let values = values();
                let l = l.inner().bit_chunks().iter_padded();
                let r = r.inner().bit_chunks().iter_padded();
                let ne = values.bit_chunks().iter_padded();
                let c = |((l, r), n)| ((l ^ r) | (l & r & n));
                let buffer = l.zip(r).zip(ne).map(c).collect();
                BooleanBuffer::new(buffer, 0, len).into()
            }
            Op::NotDistinct => {
                let values = values();
                let l = l.inner().bit_chunks().iter_padded();
                let r = r.inner().bit_chunks().iter_padded();
                let e = values.bit_chunks().iter_padded();
                let c = |((l, r), e)| u64::not(l | r) | (l & r & e);
                let buffer = l.zip(r).zip(e).map(c).collect();
                BooleanBuffer::new(buffer, 0, len).into()
            }
            _ => BooleanArray::new(values(), NullBuffer::union(Some(&l), Some(&r))),
        }
    }
    (Some(_), true, Some(a), false) | (Some(a), false, Some(_), true) => {
        // Scalar is null, other side is non-scalar and nullable
        match op {
            Op::Distinct => a.into_inner().into(),
            Op::NotDistinct => a.into_inner().not().into(),
            _ => BooleanArray::new_null(len),
        }
    }
    (Some(nulls), is_scalar, None, _) | (None, _, Some(nulls), is_scalar) => {
        // Only one side is nullable
        match is_scalar {
            true => match op {
                // Scalar is null, other side is not nullable
                Op::Distinct => BooleanBuffer::new_set(len).into(),
                Op::NotDistinct => BooleanBuffer::new_unset(len).into(),
                _ => BooleanArray::new_null(len),
            },
            false => match op {
                Op::Distinct => {
                    let values = values();
                    let l = nulls.inner().bit_chunks().iter_padded();
                    let ne = values.bit_chunks().iter_padded();
                    let c = |(l, n)| u64::not(l) | n;
                    let buffer = l.zip(ne).map(c).collect();
                    BooleanBuffer::new(buffer, 0, len).into()
                }
                Op::NotDistinct => (nulls.inner() & &values()).into(),
                _ => BooleanArray::new(values(), Some(nulls)),
            },
        }
    }
    // Neither side is nullable
    (None, _, None, _) => BooleanArray::new(values(), None),
})

If there is a better way to organize this code, I'd be happy to give it a try. Thanks a lot!

@tustvold
Contributor

It should be possible to just call compare_op recursively

@tustvold
Contributor

I'll have a play later today/tomorrow and see if I can't simplify this a bit

@my-vegetable-has-exploded
Contributor Author

I'll have a play later today/tomorrow and see if I can't simplify this a bit

Thanks a lot, and sorry to add to your workload.

@Jefffrey
Contributor

Hey @tustvold & @my-vegetable-has-exploded, do we know the status of this PR now? It's been open for a while, and there has been another PR for the same issue in the meantime, #5423, so I'm wondering whether efforts should be focused on a single PR. Otherwise we can keep both open but mark this one as a draft, since there hasn't been any movement for a bit?

@tustvold
Contributor

Sorry this is partly on me, I'm somewhat struggling to keep up with all the various things going on. I think my preference is towards something along the lines of #5672 which would allow us to handle StructArray more comprehensively in the comparison kernels, instead of having non-trivial logic just for the case of equality.

I think let's mark this as a draft and I will try to find some time next week to sort something out in this space

@tustvold tustvold marked this pull request as draft April 26, 2024 13:46
@tustvold tustvold closed this May 29, 2024