fix: Encoding of List offsets was incorrect when slice offsets begin with zero #6805
…with zero When encoding offsets, the code had an optimization that reused the offsets buffer if the first offset was zero, assuming the slice already pointed to the first element. But the first offset can also be zero if all previous lists were empty. When this occurred, it would make all lists in the slice empty, even if they shouldn't be.
arrow-ipc/src/writer.rs
Outdated
- let offsets = match start_offset.as_usize() {
-     0 => offsets.clone(),
+ let offsets: Buffer = match start_offset.as_usize() {
+     0 => offset_slice.iter().copied().collect(),
Would something like 0 => Buffer::from_slice_ref(offset_slice) be better performing? Just curious as I'm still learning. 😄
Good call, that would be more efficient than the iterating approach.
… through the slice.
arrow-ipc/src/writer.rs
Outdated
- let offsets = match start_offset.as_usize() {
-     0 => offsets.clone(),
+ let offsets: Buffer = match start_offset.as_usize() {
+     0 => Buffer::from_slice_ref(offset_slice),
Could we instead slice the Buffer passed in? This would avoid copying.
Something like
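- 0 => Buffer::from_slice_ref(offset_slice),
+ 0 => {
+     let size = std::mem::size_of::<O>();
+     // First argument: byte offset of the slice's first offset.
+     // Second argument: byte length of the slice's `len + 1` offsets.
+     offsets.slice_with_length(data.offset() * size, (data.len() + 1) * size)
+ }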
Good recommendation. The code is changed.
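The byte arithmetic behind the slicing suggestion can be sketched without the arrow crate. This is a minimal illustration, not the code from the PR: `offsets_byte_range` and `slice_bytes` are hypothetical helpers, a plain `&[u8]` stands in for the offsets `Buffer`, and the second value is assumed to be a byte length (not an end position), covering the `len + 1` offsets that a slice of `len` lists needs.

```rust
use std::mem::size_of;

/// Hypothetical helper: byte range `(start, length)` covering the offsets of a
/// slice that starts at element `slice_offset` and contains `slice_len` lists.
fn offsets_byte_range<O>(slice_offset: usize, slice_len: usize) -> (usize, usize) {
    let size = size_of::<O>();
    // A slice of `slice_len` lists needs `slice_len + 1` offsets.
    (slice_offset * size, (slice_len + 1) * size)
}

/// Hypothetical stand-in for slicing a buffer: borrows the bytes, no copy.
fn slice_bytes(bytes: &[u8], start: usize, length: usize) -> &[u8] {
    &bytes[start..start + length]
}

fn main() {
    // Offsets for four lists: [], [], [a, b, c], [d, e]
    let offsets: [i32; 5] = [0, 0, 0, 3, 5];
    let bytes: Vec<u8> = offsets.iter().flat_map(|o| o.to_ne_bytes()).collect();

    // Slice covering the last two lists.
    let (start, length) = offsets_byte_range::<i32>(2, 2);
    let view = slice_bytes(&bytes, start, length);

    // Decode the borrowed bytes back into offsets to check the range.
    let decoded: Vec<i32> = view
        .chunks_exact(size_of::<i32>())
        .map(|c| i32::from_ne_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    println!("{:?}", decoded); // prints [0, 3, 5]
}
```

Because the view only borrows from the original bytes, nothing is copied, which is the point of slicing the buffer instead of building a new one.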
CI seems to be legitimately failing
missed that. fixed
Looks good to me, thank you
…with zero (apache#6805)
* fix: Encoding of List offsets was incorrect when slice offsets begin with zero. When encoding offsets, the code had an optimization that reused the offsets buffer if the first offset was zero, assuming the slice already pointed to the first element. But the first offset can also be zero if all previous lists were empty. When this occurred, it would make all lists in the slice empty, even if they shouldn't be.
* Use Buffer::from_slice_ref, which will be faster as it doesn't iterate through the slice.
* Avoid copying
* Explicitly reference std::mem::size_of
…with zero (apache#6805): same commit message as above (cherry picked from commit fe7e71a)
…with zero (#6805) (#6943): same commit message as above. Co-authored-by: Michael Maletich <[email protected]>
When encoding offsets, the code had an optimization that reused the offsets buffer if the first offset was zero, assuming the slice already pointed to the first element. But the first offset can also be zero if all previous lists were empty. When this occurred, it would make all lists in the slice empty, even if they shouldn't be.
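A minimal sketch of the failure mode, using plain `i32` slices in place of Arrow buffers. The helper names (`reuse_whole_buffer`, `take_slice`, `list_lengths`) are hypothetical, and the sketch assumes a reader that interprets the first `len + 1` entries of whatever offsets buffer it receives.

```rust
/// Hypothetical model of the old shortcut: the whole offsets buffer is reused,
/// so a reader of a `slice_len`-element array sees its first `slice_len + 1` entries.
fn reuse_whole_buffer(offsets: &[i32], _slice_offset: usize, slice_len: usize) -> Vec<i32> {
    offsets[..slice_len + 1].to_vec()
}

/// Hypothetical model of the fix: only the offsets belonging to the slice are kept.
fn take_slice(offsets: &[i32], slice_offset: usize, slice_len: usize) -> Vec<i32> {
    offsets[slice_offset..slice_offset + slice_len + 1].to_vec()
}

/// Lengths of the individual lists described by a sequence of offsets.
fn list_lengths(offsets: &[i32]) -> Vec<i32> {
    offsets.windows(2).map(|w| w[1] - w[0]).collect()
}

fn main() {
    // Four lists: [], [], [a, b, c], [d, e]
    let offsets = [0, 0, 0, 3, 5];
    // Slice covering the last two lists. Its first offset is 0 only because
    // the preceding lists are empty, not because the slice starts at element 0.
    let (slice_offset, slice_len) = (2, 2);
    assert_eq!(offsets[slice_offset], 0);

    let buggy = reuse_whole_buffer(&offsets, slice_offset, slice_len);
    let fixed = take_slice(&offsets, slice_offset, slice_len);

    println!("{:?}", list_lengths(&buggy)); // prints [0, 0]
    println!("{:?}", list_lengths(&fixed)); // prints [3, 2]
}
```

In this model the reused buffer reports both lists in the slice as empty, while taking only the slice's own offsets preserves lengths 3 and 2.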
Which issue does this PR close?
Closes #6803.
Rationale for this change
Fixing a bug that can lead to loss of information when encoding record batches.
What changes are included in this PR?
Are there any user-facing changes?
No, though some clients will experience smaller encoded messages since only the relevant slice will be sent.