Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Encoding of List offsets was incorrect when slice offsets begin with zero #6805

Merged
merged 4 commits into from
Nov 29, 2024

Conversation

HawaiianSpork
Copy link
Contributor

When encoding offsets the code had an optimization to reuse the offsets if the first offset was zero assuming the slice already pointed to first element. But the offset can also be zero if all previous lists were empty. When this occured it mold make all lists in the slice as empty, even if they shouldn't be.

Which issue does this PR close?

Closes #6803 .

Rationale for this change

Fixing a bug that can lead to loss of information when encoding record batches.

What changes are included in this PR?

  • Change the encoding and a test to demonstrate the fix.

Are there any user-facing changes?

No, though some clients will experience smaller encoded messages since only the relevant slice will be sent.

…with zero

When encoding offsets the code had an optimization to reuse the offsets if the first offset was zero assuming the slice already pointed
 to first element. But the offset can also be zero if all previous lists were empty. When this occured it mold make all lists in the
slice as empty, even if they shouldn't be.
let offsets = match start_offset.as_usize() {
0 => offsets.clone(),
let offsets: Buffer = match start_offset.as_usize() {
0 => offset_slice.iter().copied().collect(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would something like 0 => Buffer::from_slice_ref(offset_slice) be better performing? Just curious as I'm still learning. 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, that would be more efficient than the iterating approach.

let offsets = match start_offset.as_usize() {
0 => offsets.clone(),
let offsets: Buffer = match start_offset.as_usize() {
0 => Buffer::from_slice_ref(offset_slice),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we instead slice the Buffer passed in, this would avoid copying?

let offsets = match start_offset.as_usize() {
0 => offsets.clone(),
let offsets: Buffer = match start_offset.as_usize() {
0 => Buffer::from_slice_ref(offset_slice),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like

Suggested change
0 => Buffer::from_slice_ref(offset_slice),
0 => {
let size = std::mem::sizeof::<O>();
offsets.slice_with_length(data.offset()*size, (data.offset() + data.length() + 1) * size)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good recommendation. code is changed.

@tustvold
Copy link
Contributor

CI seems to be legitimately failing

@HawaiianSpork
Copy link
Contributor Author

CI seems to be legitimately failing

missed that. fixed

@tustvold
Copy link
Contributor

Looks good to me, thank you

@tustvold tustvold merged commit fe7e71a into apache:main Nov 29, 2024
26 checks passed
@HawaiianSpork HawaiianSpork deleted the fix-list-offsets-encoding branch December 17, 2024 23:26
alamb pushed a commit to alamb/arrow-rs that referenced this pull request Jan 5, 2025
…with zero (apache#6805)

* fix: Encoding of List offsets was incorrect when slice offsets begin with zero

When encoding offsets the code had an optimization to reuse the offsets if the first offset was zero assuming the slice already pointed
 to first element. But the offset can also be zero if all previous lists were empty. When this occured it mold make all lists in the
slice as empty, even if they shouldn't be.

* Use Buffer::from_slice_ref which will be faster as it doesn't iterate through the slice.

* Avoid copying

* Explicitly reference std::mem::size_of
alexwilcoxson-rel pushed a commit to relativityone/arrow-rs that referenced this pull request Jan 8, 2025
…with zero (apache#6805)

* fix: Encoding of List offsets was incorrect when slice offsets begin with zero

When encoding offsets the code had an optimization to reuse the offsets if the first offset was zero assuming the slice already pointed
 to first element. But the offset can also be zero if all previous lists were empty. When this occured it mold make all lists in the
slice as empty, even if they shouldn't be.

* Use Buffer::from_slice_ref which will be faster as it doesn't iterate through the slice.

* Avoid copying

* Explicitly reference std::mem::size_of

(cherry picked from commit fe7e71a)
alamb added a commit that referenced this pull request Jan 8, 2025
…with zero (#6805) (#6943)

* fix: Encoding of List offsets was incorrect when slice offsets begin with zero

When encoding offsets the code had an optimization to reuse the offsets if the first offset was zero assuming the slice already pointed
 to first element. But the offset can also be zero if all previous lists were empty. When this occured it mold make all lists in the
slice as empty, even if they shouldn't be.

* Use Buffer::from_slice_ref which will be faster as it doesn't iterate through the slice.

* Avoid copying

* Explicitly reference std::mem::size_of

Co-authored-by: Michael Maletich <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Arrow Flight Encodes a Slice's List Offsets If the slice offset is starts with zero
3 participants