
fix: Panic on reencoding offsets in arrow-ipc with sliced nested arrays #6998

Merged · 5 commits · Jan 25, 2025

Conversation

@HawaiianSpork (Contributor) commented Jan 20, 2025

Code was incorrectly defining the end of the offset slice to be the start + slice_length * 2 because slice_with_length adds the start to the end. This caused the encoded batches to be larger than they needed to be and would result in a panic for certain slices.
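The root cause is easiest to see in a small model. The sketch below is plain Rust, not the actual arrow-buffer API (the helper name and sizes are illustrative): the second argument is a length, not an end index, so passing `start + length` over-reads and trips the bounds assertion, the same shape as the reported panic.

```rust
// Illustrative stand-in for a (offset, length) slice API.
// The real arrow-buffer method adds `offset` internally, so the caller
// must pass only the length, never `offset + length`.
fn slice_with_length(buf: &[u8], offset: usize, length: usize) -> &[u8] {
    assert!(
        offset + length <= buf.len(),
        "the offset of the new slice cannot exceed the existing length"
    );
    &buf[offset..offset + length]
}

fn main() {
    // 12-byte buffer standing in for an offsets buffer.
    let offsets = [0u8; 12];
    let (offset, len) = (4, 8);

    // Correct: pass just the length.
    assert_eq!(slice_with_length(&offsets, offset, len).len(), 8);

    // Buggy pattern: passing `offset + len` as the length would assert,
    // because 4 + 12 > 12 -- the same arithmetic as the panic in this PR.
    // slice_with_length(&offsets, offset, offset + len);
}
```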

Which issue does this PR close?

Rationale for this change

Prevent code panic during IPC write for legitimate input.

What changes are included in this PR?

Corrects the slice length and adds unit tests that previously would cause a panic.

Are there any user-facing changes?

No

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 20, 2025
@alamb (Contributor) left a comment

Thank you for the contribution @HawaiianSpork

This code makes sense to me. The one thing I would like to explore is whether we can get an end-to-end test (i.e. reading/writing an IPC file or stream).

Maybe by making a large array and writing array.slice(len-2, 1) or something 🤔

@@ -2615,6 +2615,35 @@ mod tests {
builder.finish()
}

#[test]
fn reencode_offsets_when_first_offset_is_not_zero() {

I verified that this test fails without the changes in this PR:

thread 'writer::tests::reencode_offsets_when_first_offset_is_zero' panicked at arrow-buffer/src/buffer/immutable.rs:281:9:
the offset of the new Buffer cannot exceed the existing length: slice offset=4 length=12 selflen=12
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/std/src/panicking.rs:665:5
   1: core::panicking::panic_fmt
             at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/core/src/panicking.rs:76:14
   2: arrow_buffer::buffer::immutable::Buffer::slice_with_length
             at /Users/andrewlamb/Software/arrow-rs/arrow-buffer/src/buffer/immutable.rs:281:9
   3: arrow_ipc::writer::reencode_offsets
             at ./src/writer.rs:1476:13
   4: arrow_ipc::writer::tests::reencode_offsets_when_first_offset_is_zero
             at ./src/writer.rs:2641:53
   5: arrow_ipc::writer::tests::reencode_offsets_when_first_offset_is_zero::{{closure}}
             at ./src/writer.rs:2630:52

let original_list = generate_list_data::<i32>();
let original_data = original_list.into_data();
let slice_data = original_data.slice(75, 7);
let (new_offsets, original_start, length) = reencode_offsets::<i32>(&slice_data.buffers()[0], &slice_data);

Is this bug possible to hit using IPC files, or is it only hittable if reencode_offsets is called directly?

@alamb (Contributor) commented Jan 21, 2025
@HawaiianSpork (Author) replied:

Thanks for doing that. I initially had written larger integration tests but found it difficult to recreate the scenario. I can try again tomorrow.

@HawaiianSpork (Author) replied:

Thanks for the test code in #7004. I merged it in (I hope you don't mind) and then added an end-to-end test that failed before the fix and passed afterwards.

@@ -1475,7 +1475,7 @@ fn reencode_offsets<O: OffsetSizeTrait>(
let size = size_of::<O>();
offsets.slice_with_length(
data.offset() * size,
(data.offset() + data.len() + 1) * size,
(data.len() + 1) * size,

The idea here is that we are re-encoding the offsets to have data.len() length? That makes sense to me (I had to re-read the code to convince myself -- the comments on the function aren't super helpful).
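That reading matches the fixed arithmetic. As a hedged sketch (a hypothetical standalone helper, not the actual `arrow_ipc::writer::reencode_offsets` signature), re-encoding takes the `len + 1` offsets that cover the slice and rebases them to start at zero:

```rust
// Hypothetical model of the fixed computation: the window into the offsets
// buffer has length (data_len + 1), not (data_offset + data_len + 1).
fn reencode_offsets(offsets: &[i32], data_offset: usize, data_len: usize) -> Vec<i32> {
    // Take exactly the offsets covering the slice...
    let window = &offsets[data_offset..data_offset + data_len + 1];
    // ...and rebase them so the first offset is 0.
    let first = window[0];
    window.iter().map(|o| o - first).collect()
}

fn main() {
    let offsets = [0, 2, 5, 9, 9, 12];
    // A slice of 2 list elements starting at logical offset 2:
    // window is [5, 9, 9], rebased to [0, 4, 4].
    assert_eq!(reencode_offsets(&offsets, 2, 2), vec![0, 4, 4]);
}
```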

@alamb (Contributor) commented Jan 20, 2025

Looks like cargo fmt needs to be run to get CI to pass:

@alamb (Contributor) commented Jan 22, 2025

BTW I think this issue is important enough to hold the release for so I will be keeping an eye on it / trying to help out. I added it to the list on

@alamb (Contributor) left a comment

Thank you very much @HawaiianSpork -- looks great to me.

fn test_large_slice_string_list_of_lists() {
// The reason for the special test is to verify reencode_offsets which looks both at
// the starting offset and the data offset. So need a dataset where the starting_offset
// is zero but the data offset is not.
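The comment's distinction between the starting offset *value* and the data *offset* can be modeled in a few lines (a hypothetical struct, not arrow's actual `ArrayData`): when the child lists before the slice are all empty, the first offset value stays 0 even though the slice begins partway into the offsets buffer, so a fix that only checks one of the two misses this case.

```rust
// Hypothetical model of a sliced list array's offsets view.
struct SlicedOffsets<'a> {
    offsets: &'a [i32], // full offsets buffer of the child array
    data_offset: usize, // logical start of the slice within `offsets`
    len: usize,         // number of list elements in the slice
}

impl SlicedOffsets<'_> {
    /// The first offset *value* visible to the slice.
    fn first_offset(&self) -> i32 {
        self.offsets[self.data_offset]
    }

    /// The offset value just past the last element of the slice.
    fn last_offset(&self) -> i32 {
        self.offsets[self.data_offset + self.len]
    }
}

fn main() {
    // Earlier lists are empty, so offset values remain 0 for a while.
    let s = SlicedOffsets {
        offsets: &[0, 0, 0, 3, 7],
        data_offset: 2,
        len: 2,
    };
    assert_eq!(s.first_offset(), 0); // first offset value is zero...
    assert_ne!(s.data_offset, 0);    // ...but the data offset is not
    assert_eq!(s.last_offset(), 7);
}
```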

👍 Ah, Lists of Lists 🤯

I verified this test fails without the fix in this PR:

thread 'writer::tests::test_large_slice_string_list_of_lists' panicked at arrow-buffer/src/buffer/immutable.rs:281:9:
the offset of the new Buffer cannot exceed the existing length: slice offset=4 length=24004 selflen=24004

💯

@alamb alamb changed the title fix: Panic on reencoding offsets fix: Panic on reencoding offsets in arrow-ipc with sliced nested arrays Jan 23, 2025
@alamb alamb merged commit 0e40460 into apache:main Jan 25, 2025
28 checks passed
@alamb (Contributor) commented Jan 25, 2025

Thanks again @HawaiianSpork

Labels: arrow (Changes to the arrow crate)

Successfully merging this pull request may close these issues:

Arrow IPC Writer Panics for sliced nested arrays