-
Notifications
You must be signed in to change notification settings - Fork 855
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add substring
support for FixedSizeBinaryArray
#1633
Conversation
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
let new_len = match length { | ||
Some(len) => len.min(old_len - new_start), | ||
None => old_len - new_start, | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible that if length
is negative? DataType::FixedSizeBinary(negative)
seems invalid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the answer is no for several reasons:
- the definition of the public function makes sure that
length
is always >=0
pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<ArrayRef> {
- the definition of the data type of
FixedSizeBinary
seems to allow negative value, although it is somewhat weird:
/// Opaque binary data of fixed size.
/// Enum parameter specifies the number of bytes per value.
FixedSizeBinary(i32),
Although it may be more reasonable to use unsigned int to type the length
, in Apache Arrow specification, the length
must be i32
. https://arrow.apache.org/docs/format/Columnar.html#fixed-size-list-layout (For nested arrays, we always use signed integer to represent the length or offsets.)
A fixed size list type is specified like FixedSizeList<T>[N], where T is any type (primitive or nested) and N is a 32-bit signed integer representing the length of the lists.
What would happen if we give a negative length or negative offsets buffer?
This is a fun game!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any discussion or explanation of why we use signed integer to represent length?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that's right, I didn't notice that length
is u64
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since unsigned integers can be more difficult to work with in some cases (e.g. in the JVM), we recommend preferring signed integers over unsigned integers for representing dictionary indices. Additionally, we recommend avoiding using 64-bit unsigned integer indices unless they are required by an application.
This is some wordings I read in the spec. It is not talking about array length, but I think it might be somehow related to the question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me in general. Thanks @HaoYang670.
// all-nulls array is always identical | ||
(vec![None, None, None], -1, Some(1), vec![None, None, None]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding test coverage.
array.data_ref().null_buffer().cloned(), | ||
0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0
offset
works for the new array value new_values
. But null_buffer
is cloned, so maybe it has different offset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for finding the bug. I have fixed it by getting the bit_slice
of null_buffer
. Also a corresponding test is added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for other arrays' implementations we have the same bug, I will file a follow-up issue to fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @HaoYang670
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #1633 +/- ##
==========================================
+ Coverage 83.02% 83.08% +0.05%
==========================================
Files 193 193
Lines 55577 55772 +195
==========================================
+ Hits 46145 46338 +193
- Misses 9432 9434 +2
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have any more comments other than thank you @HaoYang670 and @viirya 👍
Thank you @HaoYang670 @alamb ! 😄 |
Which issue does this PR close?
Closes #1618.
What changes are included in this PR?
substring
kernel supportFixedSizeBinaryArray
FixedSizeBinaryArray
Are there any user-facing changes?
Rename some benchmarks.
Bench Result