-
Notifications
You must be signed in to change notification settings - Fork 848
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Panic on reencoding offsets in arrow-ipc with sliced nested arrays #6998
Conversation
Code was incorrectly defining the end of the offset slice to be the start + slice_length * 2 because slice_with_length adds the start to the end. This caused the encoded batches to be larger than they needed to be and would result in a panic for certain slices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution @HawaiianSpork
THis code makes sense to me. The one thing I would like t explore is if we can get an end to end test (aka read/writing an IPC file or stream)
Maybe by making a large array and writing array.slice(len-2, 1)
or something 🤔
@@ -2615,6 +2615,35 @@ mod tests { | |||
builder.finish() | |||
} | |||
|
|||
#[test] | |||
fn reencode_offsets_when_first_offset_is_not_zero() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I verified that this test fials without the changes in this PR
the offset of the new Buffer cannot exceed the existing length: slice offset=4 length=12 selflen=12
thread 'writer::tests::reencode_offsets_when_first_offset_is_zero' panicked at arrow-buffer/src/buffer/immutable.rs:281:9:
the offset of the new Buffer cannot exceed the existing length: slice offset=4 length=12 selflen=12
stack backtrace:
0: rust_begin_unwind
at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/std/src/panicking.rs:665:5
1: core::panicking::panic_fmt
at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/core/src/panicking.rs:76:14
2: arrow_buffer::buffer::immutable::Buffer::slice_with_length
at /Users/andrewlamb/Software/arrow-rs/arrow-buffer/src/buffer/immutable.rs:281:9
3: arrow_ipc::writer::reencode_offsets
at ./src/writer.rs:1476:13
4: arrow_ipc::writer::tests::reencode_offsets_when_first_offset_is_zero
at ./src/writer.rs:2641:53
5: arrow_ipc::writer::tests::reencode_offsets_when_first_offset_is_zero::{{closure}}
at ./src/writer.rs:2630:52
arrow-ipc/src/writer.rs
Outdated
let original_list = generate_list_data::<i32>(); | ||
let original_data = original_list.into_data(); | ||
let slice_data = original_data.slice(75, 7); | ||
let (new_offsets, original_start, length) = reencode_offsets::<i32>(&slice_data.buffers()[0], &slice_data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this bug possible to hit using IPC files, or is it only hittable if reencde_offsets
is called directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I tried to write some tests here: Add tests for writing larger sliced arrays to IPC files/streams #7004 but was not successful reproducing the issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing that. I initially had written larger integration tests but found it difficult to recreate the scenario. I can try again tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the test code in #7004 . I merged it in (I hope you don't mind) and then added an end to end test that failed before the fix but passed afterwards.
arrow-ipc/src/writer.rs
Outdated
@@ -1475,7 +1475,7 @@ fn reencode_offsets<O: OffsetSizeTrait>( | |||
let size = size_of::<O>(); | |||
offsets.slice_with_length( | |||
data.offset() * size, | |||
(data.offset() + data.len() + 1) * size, | |||
(data.len() + 1) * size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea here is that the we are re-encode the offsets to have data.len length? That makes sense to me (I had t re-read the code to convince myself -- the comments on the function aren't super helpful)
Looks like |
BTW I think this issue is important enough to hold the release for so I will be keeping an eye on it / trying to help out. I added it to the list on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @HawaiianSpork -- looks great to me.
fn test_large_slice_string_list_of_lists() { | ||
// The reason for the special test is to verify reencode_offsets which looks both at | ||
// the starting offset and the data offset. So need a dataset where the starting_offset | ||
// is zero but the data offset is not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Ah, Lists of Lists 🤯
I verified this fails without the tests in this PR like this:
the offset of the new Buffer cannot exceed the existing length: slice offset=4 length=24004 selflen=24004
thread 'writer::tests::test_large_slice_string_list_of_lists' panicked at arrow-buffer/src/buffer/immutable.rs:281:9:
the offset of the new Buffer cannot exceed the existing length: slice offset=4 length=24004 selflen=24004
💯
Thanks again @HawaiianSpork |
Code was incorrectly defining the end of the offset slice to be the start + slice_length * 2 because slice_with_length adds the start to the end. This caused the encoded batches to be larger than they needed to be and would result in a panic for certain slices.
Which issue does this PR close?
Rationale for this change
Prevent code panic during IPC write for legitimate input.
What changes are included in this PR?
Corrects the slice length and adds unit tests that previously would cause a panic.
Are there any user-facing changes?
No